    AI-enhanced iterative solvers for accelerating the solution of large scale parametrized linear systems of equations. (arXiv:2207.02543v1 [math.NA])
    Recent advances in the field of machine learning open a new era in high performance computing. Applications of machine learning algorithms for the development of accurate and cost-efficient surrogates of complex problems have already attracted major attention from scientists. Despite their powerful approximation capabilities, however, surrogates cannot produce the "exact" solution to the problem. To address this issue, this paper exploits up-to-date ML tools and delivers customized iterative solvers of linear equation systems, capable of solving large-scale parametrized problems at any desired level of accuracy. Specifically, the proposed approach consists of the following two steps. First, a reduced set of model evaluations is performed and the corresponding solutions are used to establish an approximate mapping from the problem's parametric space to its solution space using deep feedforward neural networks and convolutional autoencoders. This mapping serves as a means to obtain very accurate initial predictions of the system's response to new query points at negligible computational cost. Subsequently, an iterative solver inspired by the Algebraic Multigrid method in combination with Proper Orthogonal Decomposition, termed POD-2G, is developed that successively refines the initial predictions towards the exact system solutions. The application of POD-2G as a standalone solver or as a preconditioner in the context of preconditioned conjugate gradient methods is demonstrated on several numerical examples of large scale systems, with the results indicating its superiority over conventional iterative solution schemes.
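The benefit of a good initial prediction for an iterative solver can be illustrated with a small sketch. This is not the paper's POD-2G method: it is a plain conjugate gradient solver in pure Python, and the "surrogate prediction" is simulated by perturbing the known solution of a toy system. The matrix, the perturbation size, and all function names are assumptions made for illustration.

```python
import random

def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, starting from x0.

    Stops when ||r|| <= tol * ||b||. Returns (x, number_of_iterations).
    """
    x = list(x0)
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
    p = list(r)
    rs = dot(r, r)
    target = (tol ** 2) * dot(b, b)
    it = 0
    while rs > target and it < max_iter:
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
        it += 1
    return x, it

# Toy SPD system: A = M^T M + n I (hypothetical test matrix).
rng = random.Random(0)
n = 30
M = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)]
A = [[sum(M[k][i] * M[k][j] for k in range(n)) + (n if i == j else 0.0)
      for j in range(n)] for i in range(n)]
x_true = [rng.gauss(0, 1) for _ in range(n)]
b = matvec(A, x_true)

# Cold start from zero versus warm start from a simulated surrogate prediction.
x_cold, it_cold = conjugate_gradient(A, b, [0.0] * n)
x_guess = [xi + 1e-4 * rng.gauss(0, 1) for xi in x_true]
x_warm, it_warm = conjugate_gradient(A, b, x_guess)
print("cold-start iterations:", it_cold)
print("warm-start iterations:", it_warm)
```

Both runs reach the same tolerance; the warm start simply begins with a much smaller residual, so it has less error left to remove.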
    DIWIFT: Discovering Instance-wise Influential Features for Tabular Data. (arXiv:2207.02773v1 [cs.LG])
    Tabular data is one of the most common data storage formats in business applications, ranging from retail and banking to e-commerce. These applications rely heavily on machine learning models to achieve business success. One of the critical problems in learning from tabular data is to distinguish influential features from all the predetermined features. Global feature selection has been well studied for quite some time, assuming that all instances have the same influential feature subsets. However, different instances rely on different feature subsets in practice, which is why instance-wise feature selection has received increasing attention in recent studies. In this paper, we first propose a novel method for discovering instance-wise influential features for tabular data (DIWIFT), the core of which is to introduce the influence function to measure the importance of an instance-wise feature. DIWIFT is capable of automatically discovering influential feature subsets of different sizes in different instances, unlike global feature selection, which assigns the same influential feature subset to all instances. Moreover, unlike previous instance-wise feature selection methods, DIWIFT minimizes the validation loss on the validation set and is thus more robust to the distribution shift between the training dataset and the test dataset, which is important in tabular data. Finally, we conduct extensive experiments on both synthetic and real-world datasets to validate the effectiveness of DIWIFT, comparing it with baseline methods. We also demonstrate the robustness of our method via ablation experiments.
    Clustering with Semidefinite Programming and Fixed Point Iteration. (arXiv:2012.09202v3 [math.OC] UPDATED)
    We introduce a novel method for clustering using a semidefinite programming (SDP) relaxation of the Max k-Cut problem. The approach is based on a new methodology for rounding the solution of an SDP relaxation using iterated linear optimization. We show the vertices of the Max k-Cut relaxation correspond to partitions of the data into at most k sets. We also show the vertices are attractive fixed points of iterated linear optimization. Each step of this iterative process solves a relaxation of the closest vertex problem and leads to a new clustering problem where the underlying clusters are more clearly defined. Our experiments show that using fixed point iteration for rounding the Max k-Cut SDP relaxation leads to significantly better results when compared to randomized rounding.  ( 2 min )
    When does Bias Transfer in Transfer Learning? (arXiv:2207.02842v1 [cs.LG])
    Using transfer learning to adapt a pre-trained "source model" to a downstream "target task" can dramatically increase performance with seemingly no downside. In this work, we demonstrate that there can exist a downside after all: bias transfer, or the tendency for biases of the source model to persist even after adapting the model to the target class. Through a combination of synthetic and natural experiments, we show that bias transfer both (a) arises in realistic settings (such as when pre-training on ImageNet or other standard datasets) and (b) can occur even when the target dataset is explicitly de-biased. As transfer-learned models are increasingly deployed in the real world, our work highlights the importance of understanding the limitations of pre-trained source models. Code is available at https://github.com/MadryLab/bias-transfer  ( 2 min )
    A Tutorial on the Spectral Theory of Markov Chains. (arXiv:2207.02296v1 [cs.LG])
    Markov chains are a class of probabilistic models that have achieved widespread application in the quantitative sciences. This is in part due to their versatility, but is compounded by the ease with which they can be probed analytically. This tutorial provides an in-depth introduction to Markov chains, and explores their connection to graphs and random walks. We utilize tools from linear algebra and graph theory to describe the transition matrices of different types of Markov chains, with a particular focus on exploring properties of the eigenvalues and eigenvectors corresponding to these matrices. The results presented are relevant to a number of methods in machine learning and data mining, which we describe at various stages. Rather than being a novel academic study in its own right, this text presents a collection of known results, together with some new concepts. Moreover, the tutorial focuses on offering intuition to readers rather than formal understanding, and only assumes basic exposure to concepts from linear algebra and probability theory. It is therefore accessible to students and researchers from a wide variety of disciplines.  ( 2 min )
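As a small illustration of the link between a transition matrix and its eigenvectors, the stationary distribution of a chain is the left eigenvector of the transition matrix for eigenvalue 1. The sketch below finds it by power iteration in pure Python. The 3-state matrix `P` is a made-up example, not taken from the tutorial.

```python
def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Power iteration for the left eigenvector of P with eigenvalue 1.

    P is a row-stochastic matrix given as a list of rows. Convergence is
    assumed (for example, the chain is irreducible and aperiodic).
    """
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        # One step of the chain: new_j = sum_i pi_i * P[i][j]
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi

# Hypothetical 3-state chain; its eigenvalues are 1, 0.8 and 0.5,
# so the error shrinks by roughly a factor of 0.8 per step.
P = [[0.9, 0.1, 0.0],
     [0.2, 0.7, 0.1],
     [0.0, 0.3, 0.7]]
pi = stationary_distribution(P)
print([round(v, 6) for v in pi])  # → [0.6, 0.3, 0.1]
```

The second-largest eigenvalue modulus controls how quickly the iteration settles, which is one reason the spectrum of the transition matrix matters in practice.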
    A Deep Model for Partial Multi-Label Image Classification with Curriculum Based Disambiguation. (arXiv:2207.02410v1 [cs.CV])
    In this paper, we study the partial multi-label (PML) image classification problem, where each image is annotated with a candidate label set consisting of multiple relevant labels and other noisy labels. Existing PML methods typically design a disambiguation strategy to filter out noisy labels by utilizing prior knowledge with extra assumptions, which unfortunately is unavailable in many real tasks. Furthermore, because the objective function for disambiguation is usually elaborately designed on the whole training set, it can hardly be optimized in a deep model with SGD on mini-batches. In this paper, for the first time we propose a deep model for PML to enhance the representation and discrimination ability. On one hand, we propose a novel curriculum based disambiguation strategy to progressively identify ground-truth labels by incorporating the varied difficulties of different classes. On the other hand, a consistency regularization is introduced for model retraining to balance fitting identified easy labels and exploiting potential relevant labels. Extensive experimental results on the commonly used benchmark datasets show the proposed method significantly outperforms the SOTA methods.
    Scaling Private Deep Learning with Low-Rank and Sparse Gradients. (arXiv:2207.02699v1 [cs.LG])
    Applying Differentially Private Stochastic Gradient Descent (DPSGD) to training modern, large-scale neural networks such as transformer-based models is a challenging task, as the magnitude of noise added to the gradients at each iteration scales with model dimension, hindering the learning capability significantly. We propose a unified framework, $\textsf{LSG}$, that fully exploits the low-rank and sparse structure of neural networks to reduce the dimension of gradient updates, and hence alleviate the negative impacts of DPSGD. The gradient updates are first approximated with a pair of low-rank matrices. Then, a novel strategy is utilized to sparsify the gradients, resulting in low-dimensional, less noisy updates that are yet capable of retaining the performance of neural networks. Empirical evaluation on natural language processing and computer vision tasks shows that our method outperforms other state-of-the-art baselines.  ( 2 min )
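To see why the noise grows with model dimension, consider the standard DP-SGD gradient step: each per-example gradient is clipped to norm at most C, the clipped gradients are summed, and independent Gaussian noise with standard deviation sigma * C is added to every coordinate. The sketch below implements only this generic step in pure Python; it is not the LSG framework, and the names `clip`, `dp_sgd_gradient`, and `noise_norm` are illustrative.

```python
import math
import random

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def clip(g, C):
    """Scale g so that its L2 norm is at most C."""
    norm = l2_norm(g)
    scale = min(1.0, C / norm) if norm > 0 else 1.0
    return [x * scale for x in g]

def dp_sgd_gradient(per_example_grads, C, sigma, rng):
    """Clip each gradient, sum, add N(0, (sigma*C)^2) noise per coordinate, average."""
    d = len(per_example_grads[0])
    clipped = [clip(g, C) for g in per_example_grads]
    summed = [sum(g[j] for g in clipped) for j in range(d)]
    noisy = [s + rng.gauss(0.0, sigma * C) for s in summed]
    return [x / len(per_example_grads) for x in noisy]

def noise_norm(d, C, sigma, rng):
    """L2 norm of one d-dimensional noise draw; grows like sigma * C * sqrt(d)."""
    return l2_norm([rng.gauss(0.0, sigma * C) for _ in range(d)])

rng = random.Random(0)
C, sigma = 1.0, 1.0
grads = [[rng.gauss(0, 5) for _ in range(8)] for _ in range(4)]  # toy gradients
noisy_mean = dp_sgd_gradient(grads, C, sigma, rng)

small = noise_norm(10, C, sigma, rng)
large = noise_norm(1000, C, sigma, rng)
print("noise norm, d=10:  ", round(small, 2))
print("noise norm, d=1000:", round(large, 2))
```

The clipped signal has norm at most C regardless of dimension, while the noise norm scales with the square root of the dimension, which is the motivation for reducing the dimension of the update before adding noise.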
    Towards the Use of Saliency Maps for Explaining Low-Quality Electrocardiograms to End Users. (arXiv:2207.02726v1 [cs.LG])
    When using medical images for diagnosis, either by clinicians or artificial intelligence (AI) systems, it is important that the images are of high quality. When an image is of low quality, the medical exam that produced the image often needs to be redone. In telemedicine, a common problem is that the quality issue is only flagged once the patient has left the clinic, meaning they must return in order to have the exam redone. This can be especially difficult for people living in remote regions, who make up a substantial portion of the patients at Portal Telemedicina, a digital healthcare organization based in Brazil. In this paper, we report on ongoing work regarding (i) the development of an AI system for flagging and explaining low-quality medical images in real-time, (ii) an interview study to understand the explanation needs of stakeholders using the AI system at OurCompany, and, (iii) a longitudinal user study design to examine the effect of including explanations on the workflow of the technicians in our clinics. To the best of our knowledge, this would be the first longitudinal study on evaluating the effects of XAI methods on end-users -- stakeholders that use AI systems but do not have AI-specific expertise. We welcome feedback and suggestions on our experimental setup.  ( 3 min )
    Pre-training Transformers for Molecular Property Prediction Using Reaction Prediction. (arXiv:2207.02724v1 [cs.LG])
    Molecular property prediction is essential in chemistry, especially for drug discovery applications. However, available molecular property data is often limited, encouraging the transfer of information from related data. Transfer learning has had a tremendous impact in fields like Computer Vision and Natural Language Processing, signaling its potential in molecular property prediction. We present a pre-training procedure for molecular representation learning using reaction data and use it to pre-train a SMILES Transformer. We fine-tune and evaluate the pre-trained model on 12 molecular property prediction tasks from MoleculeNet within physical chemistry, biophysics, and physiology and show a statistically significant positive effect on 5 of the 12 tasks compared to a non-pre-trained baseline model.
    Careful seeding for the k-medoids algorithm with incremental k++ cluster construction. (arXiv:2207.02404v1 [cs.LG])
    The k-medoids algorithm is a popular variant of the k-means algorithm and widely used in pattern recognition and machine learning. A main drawback of the k-medoids algorithm is that it can be trapped in local optima. An improved k-medoids algorithm (INCKM) was recently proposed to overcome this drawback, based on constructing a candidate medoids subset with a parameter choosing procedure, but it may fail when dealing with imbalanced datasets. In this paper, we propose a novel incremental k-medoids algorithm (INCKPP) which dynamically increases the number of clusters from 2 to k through a nonparametric and stochastic k-means++ search procedure. Our algorithm can overcome the parameter selection problem in the improved k-medoids algorithm, improve the clustering performance, and deal with imbalanced datasets very well. But our algorithm has a weakness in computation efficiency. To address this issue, we propose a fast INCKPP algorithm (called INCKPP$_{sample}$) which preserves the computational efficiency of the simple and fast k-medoids algorithm with an improved clustering performance. The proposed algorithm is compared with three state-of-the-art algorithms: the improved k-medoids algorithm (INCKM), the simple and fast k-medoids algorithm (FKM) and the k-means++ algorithm (KPP). Extensive experiments on both synthetic and real world datasets including imbalanced datasets illustrate the effectiveness of the proposed algorithm.  ( 2 min )
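The k-means++ search that the abstract builds on can be sketched as a seeding routine that picks actual data points, which makes the seeds usable as initial medoids. This shows only the generic D² sampling step, not the incremental INCKPP procedure; the function name `kpp_seed_medoids` and the toy data are assumptions.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kpp_seed_medoids(points, k, rng):
    """Pick k data-point indices by k-means++ (D^2) sampling.

    The first index is uniform at random; each later index is drawn with
    probability proportional to the squared distance to the nearest index
    already chosen. Assumes at least k distinct points.
    """
    medoids = [rng.randrange(len(points))]
    d2 = [sq_dist(p, points[medoids[0]]) for p in points]
    while len(medoids) < k:
        idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        medoids.append(idx)
        # Update each point's squared distance to its nearest chosen medoid.
        d2 = [min(d, sq_dist(p, points[idx])) for d, p in zip(d2, points)]
    return medoids

# Toy data: three well-separated 2-D groups.
data_rng = random.Random(0)
centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
points = [(cx + data_rng.gauss(0, 1), cy + data_rng.gauss(0, 1))
          for cx, cy in centers for _ in range(20)]

medoids = kpp_seed_medoids(points, 3, random.Random(1))
print("seed medoid indices:", medoids)
```

Already chosen points have zero weight, so they cannot be drawn again; far-away points are favored, which tends to spread the seeds across clusters.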
    Nonparametric Factor Trajectory Learning for Dynamic Tensor Decomposition. (arXiv:2207.02446v1 [cs.LG])
    Tensor decomposition is a fundamental framework to analyze data that can be represented by multi-dimensional arrays. In practice, tensor data is often accompanied by temporal information, namely the time points when the entry values were generated. This information implies abundant, complex temporal variation patterns. However, current methods always assume the factor representations of the entities in each tensor mode are static, and never consider their temporal evolution. To fill this gap, we propose NONparametric FActor Trajectory learning for dynamic tensor decomposition (NONFAT). We place Gaussian process (GP) priors in the frequency domain and conduct inverse Fourier transform via Gauss-Laguerre quadrature to sample the trajectory functions. In this way, we can overcome data sparsity and obtain robust trajectory estimates across long time horizons. Given the trajectory values at specific time points, we use a second-level GP to sample the entry values and to capture the temporal relationship between the entities. For efficient and scalable inference, we leverage the matrix Gaussian structure in the model, introduce a matrix Gaussian posterior, and develop a nested sparse variational learning algorithm. We have shown the advantage of our method in several real-world applications.  ( 2 min )
    Robust Counterfactual Explanations for Tree-Based Ensembles. (arXiv:2207.02739v1 [cs.LG])
    Counterfactual explanations inform ways to achieve a desired outcome from a machine learning model. However, such explanations are not robust to certain real-world changes in the underlying model (e.g., retraining the model, changing hyperparameters, etc.), questioning their reliability in several applications, e.g., credit lending. In this work, we propose a novel strategy -- that we call RobX -- to generate robust counterfactuals for tree-based ensembles, e.g., XGBoost. Tree-based ensembles pose additional challenges in robust counterfactual generation, e.g., they have a non-smooth and non-differentiable objective function, and they can change a lot in the parameter space under retraining on very similar data. We first introduce a novel metric -- that we call Counterfactual Stability -- that attempts to quantify how robust a counterfactual is going to be to model changes under retraining, and comes with desirable theoretical properties. Our proposed strategy RobX works with any counterfactual generation method (base method) and searches for robust counterfactuals by iteratively refining the counterfactual generated by the base method using our metric Counterfactual Stability. We compare the performance of RobX with popular counterfactual generation methods (for tree-based ensembles) across benchmark datasets. The results demonstrate that our strategy generates counterfactuals that are significantly more robust (nearly 100% validity after actual model changes) and also realistic (in terms of local outlier factor) over existing state-of-the-art methods.  ( 3 min )
    Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs. (arXiv:2207.02295v1 [cs.NI])
    Cloud datacenters are exponentially growing both in numbers and size. This increase results in a network activity surge that warrants better congestion avoidance. The resulting challenge is two-fold: (i) designing algorithms that can be custom-tuned to the complex traffic patterns of a given datacenter; but, at the same time (ii) run on low-level hardware with the required low latency of effective Congestion Control (CC). In this work, we present a Reinforcement Learning (RL) based CC solution that learns from certain traffic scenarios and successfully generalizes to others. We then distill the RL neural network policy into binary decision trees to achieve the desired $\mu$sec decision latency required for real-time inference with RDMA. We deploy the distilled policy on NVIDIA NICs in a real network and demonstrate state-of-the-art performance, balancing all tested metrics simultaneously: bandwidth, latency, fairness, and packet drops.  ( 2 min )
    Evaluating Robustness to Dataset Shift via Parametric Robustness Sets. (arXiv:2205.15947v2 [cs.LG] UPDATED)
    We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. To ensure that these shifts are plausible, we parameterize them in terms of interpretable changes in causal mechanisms of observed variables. This defines a parametric robustness set of plausible distributions and a corresponding worst-case loss. While the loss under an individual parametric shift can be estimated via reweighting techniques such as importance sampling, the resulting worst-case optimization problem is non-convex, and the estimate may suffer from large variance. For small shifts, however, we can construct a local second-order approximation to the loss under shift and cast the problem of finding a worst-case shift as a particular non-convex quadratic optimization problem, for which efficient algorithms are available. We demonstrate that this second-order approximation can be estimated directly for shifts in conditional exponential family models, and we bound the approximation error. We apply our approach to a computer vision task (classifying gender from images), revealing sensitivity to shifts in non-causal attributes.
    Unsupervised Recurrent Federated Learning for Edge Popularity Prediction in Privacy-Preserving Mobile Edge Computing Networks. (arXiv:2207.00755v2 [cs.MM] UPDATED)
    Nowadays wireless communication is rapidly reshaping entire industry sectors. In particular, mobile edge computing (MEC) as an enabling technology for industrial Internet of things (IIoT) brings powerful computing/storage infrastructure closer to the mobile terminals and, thereby, significantly lowers the response latency. To reap the benefit of proactive caching at the network edge, precise knowledge of the popularity pattern among the end devices is essential. However, the complex and dynamic nature of the content popularity over space and time as well as the data-privacy requirements in many IIoT scenarios pose tough challenges to its acquisition. In this article, we propose an unsupervised and privacy-preserving popularity prediction framework for MEC-enabled IIoT. The concepts of local and global popularities are introduced and the time-varying popularity of each user is modelled as a model-free Markov chain. On this basis, a novel unsupervised recurrent federated learning (URFL) algorithm is proposed to predict the distributed popularity while achieving privacy preservation and unsupervised training. Simulations indicate that the proposed framework can enhance the prediction accuracy in terms of a reduced root-mean-squared error by up to $60.5\%-68.7\%$. Additionally, manual labeling and violation of users' data privacy are both avoided.
    Progressive Latent Replay for efficient Generative Rehearsal. (arXiv:2207.01562v2 [cs.CV] UPDATED)
    We introduce a new method for internal replay that modulates the frequency of rehearsal based on the depth of the network. While replay strategies mitigate the effects of catastrophic forgetting in neural networks, recent works on generative replay show that performing the rehearsal only on the deeper layers of the network improves the performance in continual learning. However, the generative approach introduces additional computational overhead, limiting its applications. Motivated by the observation that earlier layers of neural networks forget less abruptly, we propose to update network layers with varying frequency using intermediate-level features during replay. This reduces the computational burden by omitting computations for both deeper layers of the generator and earlier layers of the main model. We name our method Progressive Latent Replay and show that it outperforms Internal Replay while using significantly fewer resources.
    Flow Completion Network: Inferring the Fluid Dynamics from Incomplete Flow Information using Graph Neural Networks. (arXiv:2205.04739v2 [physics.flu-dyn] UPDATED)
    This paper introduces a novel neural network - flow completion network (FCN) - to infer the fluid dynamics, including the flow field and the force acting on the body, from the incomplete data based on Graph Convolution Attention Network. The FCN is composed of several graph convolution layers and spatial attention layers. It is designed to infer the velocity field and the vortex force contribution of the flow field when combined with the vortex force map (VFM) method. Compared with other neural networks adopted in fluid dynamics, the FCN is capable of dealing with both structured data and unstructured data. The performance of the proposed FCN is assessed by the computational fluid dynamics (CFD) data on the flow field around a circular cylinder. The force coefficients predicted by our model are validated against those obtained directly from CFD. Moreover, it is shown that our model effectively utilizes the existing flow field information and the gradient information simultaneously, giving a better performance than the traditional convolution neural network (CNN)-based and deep neural network (DNN)-based models. Specifically, among all the cases of different Reynolds numbers and different proportions of the training dataset, the results show that the proposed FCN achieves a maximum norm mean square error of 5.86% in the test dataset, which is much lower than those of the traditional CNN-based and DNN-based models (42.32% and 15.63% respectively).
    The rise of the lottery heroes: why zero-shot pruning is hard. (arXiv:2202.12400v2 [cs.LG] UPDATED)
    Recent advances in deep learning optimization showed that just a subset of parameters is really necessary to successfully train a model. Potentially, such a discovery has broad impact from theory to application; however, it is known that finding these trainable sub-networks is typically a costly process. This inhibits practical applications: can the learned sub-graph structures in deep learning models be found at training time? In this work we explore such a possibility, observing and motivating why common approaches typically fail in the extreme scenarios of interest, and proposing an approach which potentially enables training with reduced computational effort. The experiments on both challenging architectures and datasets suggest that such a computational gain is algorithmically accessible, and in particular a trade-off between accuracy achieved and training complexity deployed emerges.
    Motley: Benchmarking Heterogeneity and Personalization in Federated Learning. (arXiv:2206.09262v2 [cs.LG] UPDATED)
    Personalized federated learning considers learning models unique to each client in a heterogeneous network. The resulting client-specific models have been purported to improve metrics such as accuracy, fairness, and robustness in federated networks. However, despite a plethora of work in this area, it remains unclear: (1) which personalization techniques are most effective in various settings, and (2) how important personalization truly is for realistic federated applications. To better answer these questions, we propose Motley, a benchmark for personalized federated learning. Motley consists of a suite of cross-device and cross-silo federated datasets from varied problem domains, as well as thorough evaluation metrics for better understanding the possible impacts of personalization. We establish baselines on the benchmark by comparing a number of representative personalized federated learning methods. These initial results highlight strengths and weaknesses of existing approaches, and raise several open questions for the community. Motley aims to provide a reproducible means with which to advance developments in personalized and heterogeneity-aware federated learning, as well as the related areas of transfer learning, meta-learning, and multi-task learning.
    Adversarially Trained Actor Critic for Offline Reinforcement Learning. (arXiv:2202.02446v2 [cs.LG] UPDATED)
    We propose Adversarially Trained Actor Critic (ATAC), a new model-free algorithm for offline reinforcement learning (RL) under insufficient data coverage, based on the concept of relative pessimism. ATAC is designed as a two-player Stackelberg game: A policy actor competes against an adversarially trained value critic, who finds data-consistent scenarios where the actor is inferior to the data-collection behavior policy. We prove that, when the actor attains no regret in the two-player game, running ATAC produces a policy that provably 1) outperforms the behavior policy over a wide range of hyperparameters that control the degree of pessimism, and 2) competes with the best policy covered by data with appropriately chosen hyperparameters. Compared with existing works, notably our framework offers both theoretical guarantees for general function approximation and a deep RL implementation scalable to complex environments and large datasets. In the D4RL benchmark, ATAC consistently outperforms state-of-the-art offline RL algorithms on a range of continuous control tasks.
    Unfolding AIS transmission behavior for vessel movement modeling on noisy data leveraging machine learning. (arXiv:2202.13867v2 [cs.LG] UPDATED)
    The oceans are a source of an impressive mixture of complex data that could be used to uncover relationships yet to be discovered. Such data comes from the oceans and their surface, such as Automatic Identification System (AIS) messages used for tracking vessels' trajectories. AIS messages are transmitted over radio or satellite at ideally periodic time intervals but vary irregularly over time. As such, this paper aims to model the AIS message transmission behavior through neural networks for forecasting upcoming AIS messages' content from multiple vessels, particularly in a simultaneous approach despite messages' temporal irregularities as outliers. We present a set of experiments comprising multiple algorithms for forecasting tasks with horizon sizes of varying lengths. Deep learning models (e.g., neural networks) revealed themselves to adequately preserve vessels' spatial awareness regardless of temporal irregularity. We show how convolutional layers, feed-forward networks, and recurrent neural networks can improve such tasks by working together. Experimenting with short, medium, and large-sized sequences of messages, our model achieved a Relative Percentage Difference (lower is better) of 36/37/38%, whereas we observed 92/45/96% with Elman's RNN, 51/52/40% with the GRU, and 129/98/61% with the LSTM. These results support our model as a driver for improving the prediction of vessel routes when analyzing multiple vessels of diverging types simultaneously under temporally noisy data.
    SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy. (arXiv:2203.17001v2 [eess.AS] UPDATED)
    Deep learning based singing voice synthesis (SVS) systems have been demonstrated to flexibly generate singing of better quality than conventional statistical parametric methods. However, neural systems are generally data-hungry and have difficulty reaching reasonable singing quality with limited publicly available training data. In this work, we explore different data augmentation methods to boost the training of SVS systems, including several strategies customized to SVS based on pitch augmentation and mix-up augmentation. To further stabilize the training, we introduce the cycle-consistent training strategy. Extensive experiments on two public singing databases demonstrate that our proposed augmentation methods and the stabilizing training strategy can significantly improve the performance on both objective and subjective evaluations.
    ADAST: Attentive Cross-domain EEG-based Sleep Staging Framework with Iterative Self-Training. (arXiv:2107.04470v4 [cs.LG] UPDATED)
    Sleep staging is of great importance in the diagnosis and treatment of sleep disorders. Recently, numerous data-driven deep learning models have been proposed for automatic sleep staging. They mainly train the model on a large public labeled sleep dataset and test it on a smaller one with subjects of interest. However, they usually assume that the train and test data are drawn from the same distribution, which may not hold in real-world scenarios. Unsupervised domain adaption (UDA) has been recently developed to handle this domain shift problem. However, previous UDA methods applied for sleep staging have two main limitations. First, they rely on a totally shared model for the domain alignment, which may lose the domain-specific information during feature extraction. Second, they only align the source and target distributions globally without considering the class information in the target domain, which hinders the classification performance of the model while testing. In this work, we propose a novel adversarial learning framework called ADAST to tackle the domain shift problem in the unlabeled target domain. First, we develop an unshared attention mechanism to preserve the domain-specific features in both domains. Second, we design an iterative self-training strategy to improve the classification performance on the target domain via target domain pseudo labels. We also propose dual distinct classifiers to increase the robustness and quality of the pseudo labels. The experimental results on six cross-domain scenarios validate the efficacy of our proposed framework and its advantage over state-of-the-art UDA methods. The source code is available at https://github.com/emadeldeen24/ADAST.
    Fast Density Estimation for Density-based Clustering Methods. (arXiv:2109.11383v3 [cs.LG] UPDATED)
    Density-based clustering algorithms are widely used for discovering clusters in pattern recognition and machine learning since they can deal with non-hyperspherical clusters and are robust to outliers. However, the runtime of density-based algorithms is heavily dominated by finding fixed-radius near neighbors and calculating the density, which is time-consuming. Meanwhile, traditional acceleration methods using indexing techniques such as KD trees are not effective in processing high-dimensional data. In this paper, we propose a fast region query algorithm named fast principal component analysis pruning (FPCAP), which combines the fast principal component analysis technique with geometric information provided by principal attributes of the data. It can process high-dimensional data and be easily applied to density-based methods to prune unnecessary distance calculations when finding neighbors and estimating densities. As an application in density-based clustering methods, the FPCAP method is combined with the Density Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. The resulting improved DBSCAN (called IDBSCAN) preserves the advantages of DBSCAN while greatly reducing the computation of redundant distances. Experiments on seven benchmark datasets demonstrate that the proposed algorithm improves the computational efficiency significantly.
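One way projections can prune distance calculations in a fixed-radius query: for any unit vector u, |u·x − u·q| ≤ ||x − q||, so a point whose projection differs from the query's by more than eps cannot be a neighbor. The sketch below applies this bound along an approximate first principal direction. It is a generic illustration under that assumption, not the FPCAP algorithm itself, and all function names are hypothetical.

```python
import bisect
import math
import random

def leading_direction(points, iters=100):
    """Approximate first principal direction by power iteration on X^T X."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    u = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        proj = [sum(c[j] * u[j] for j in range(d)) for c in centered]
        v = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0:
            break
        u = [x / norm for x in v]
    return u

def build_index(points):
    """Sort points by their projection onto the leading direction."""
    u = leading_direction(points)
    keyed = sorted((sum(a * b for a, b in zip(p, u)), i)
                   for i, p in enumerate(points))
    return u, [k for k, _ in keyed], [i for _, i in keyed]

def radius_query(points, index, q, eps):
    """Indices of points within eps of q, checking only the projection window."""
    u, keys, order = index
    t = sum(a * b for a, b in zip(q, u))
    slack = eps * (1 + 1e-9)  # small margin against floating-point rounding
    lo = bisect.bisect_left(keys, t - slack)
    hi = bisect.bisect_right(keys, t + slack)
    return sorted(order[pos] for pos in range(lo, hi)
                  if math.dist(points[order[pos]], q) <= eps)

def brute_force(points, q, eps):
    return [i for i, p in enumerate(points) if math.dist(p, q) <= eps]

# Toy data: 200 points in 5 dimensions with one dominant direction.
rng = random.Random(0)
points = [[5 * rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(4)]
          for _ in range(200)]
index = build_index(points)
eps = 2.0
results = [radius_query(points, index, points[i], eps) for i in range(10)]
expected = [brute_force(points, points[i], eps) for i in range(10)]
print("pruned query matches brute force:", results == expected)
```

The pruning is exact for any unit direction; choosing a high-variance direction only makes the candidate window narrower, so fewer full distances are computed.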
    On the Effects of Artificial Data Modification. (arXiv:2110.13968v2 [cs.LG] UPDATED)
    Data distortion is commonly applied in vision models during both training (e.g. methods like MixUp and CutMix) and evaluation (e.g. shape-texture bias and robustness). This data modification can introduce artificial information. It is often assumed that the resulting artefacts are detrimental to training, whilst being negligible when analysing models. We investigate these assumptions and conclude that in some cases they are unfounded and lead to incorrect results. Specifically, we show that current shape bias identification methods and occlusion robustness measures are biased and propose a fairer alternative for the latter. Subsequently, through a series of experiments we seek to correct and strengthen the community's perception of how augmenting affects learning of vision models. Based on our empirical results we argue that the impact of the artefacts must be understood and exploited rather than eliminated.
    MoTiAC: Multi-Objective Actor-Critics for Real-Time Bidding. (arXiv:2002.07408v2 [cs.AI] UPDATED)
    Online Real-Time Bidding (RTB) is a complex auction game in which advertisers struggle to bid for ad impressions when a user request occurs. Considering display cost, Return on Investment (ROI), and other influential Key Performance Indicators (KPIs), large ad platforms try to balance the trade-off among various goals dynamically. To address the challenge, we propose a Multi-ObjecTive Actor-Critics algorithm based on reinforcement learning (RL), named MoTiAC, for the problem of bidding optimization with various goals. In MoTiAC, objective-specific agents update the global network asynchronously with different goals and perspectives, leading to a robust bidding policy. Unlike previous RL models, the proposed MoTiAC can simultaneously fulfill multi-objective tasks in complicated bidding environments. In addition, we mathematically prove that our model will converge to Pareto optimality. Finally, experiments on a large-scale real-world commercial dataset from Tencent verify the effectiveness of MoTiAC versus a set of recent approaches.
    Enhancing Adversarial Attacks on Single-Layer NVM Crossbar-Based Neural Networks with Power Consumption Information. (arXiv:2207.02764v1 [cs.LG])
    Adversarial attacks on state-of-the-art machine learning models pose a significant threat to the safety and security of mission-critical autonomous systems. This paper considers the additional vulnerability of machine learning models when attackers can measure the power consumption of their underlying hardware platform. In particular, we explore the utility of power consumption information for adversarial attacks on non-volatile memory crossbar-based single-layer neural networks. Our results from experiments with MNIST and CIFAR-10 datasets show that power consumption can reveal important information about the neural network's weight matrix, such as the 1-norm of its columns. That information can be used to infer the sensitivity of the network's loss with respect to different inputs. We also find that surrogate-based black box attacks that utilize crossbar power information can lead to improved attack efficiency.
    Landscape analysis for shallow neural networks: complete classification of critical points for affine target functions. (arXiv:2103.10922v3 [cs.LG] UPDATED)
    In this paper, we analyze the landscape of the true loss of neural networks with one hidden layer and ReLU, leaky ReLU, or quadratic activation. In all three cases, we provide a complete classification of the critical points in the case where the target function is affine and one-dimensional. In particular, we show that there exist no local maxima and clarify the structure of saddle points. Moreover, we prove that non-global local minima can only be caused by `dead' ReLU neurons. In particular, they do not appear in the case of leaky ReLU or quadratic activation. Our approach is of a combinatorial nature and builds on a careful analysis of the different types of hidden neurons that can occur.
    Epistemic Neural Networks. (arXiv:2107.08924v5 [cs.LG] UPDATED)
    Intelligence relies on an agent's knowledge of what it does not know. This capability can be assessed based on the quality of joint predictions of labels across multiple inputs. Conventional neural networks lack this capability and, since most research has focused on marginal predictions, this shortcoming has been largely overlooked. We introduce the epistemic neural network (ENN) as an interface for models that represent uncertainty as required to generate useful joint predictions. While prior approaches to uncertainty modeling such as Bayesian neural networks can be expressed as ENNs, this new interface facilitates comparison of joint predictions and the design of novel architectures and algorithms. In particular, we introduce the epinet: an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty. With an epinet, conventional neural networks outperform very large ensembles, consisting of hundreds or more particles, with orders of magnitude less computation. We demonstrate this efficacy across synthetic data, ImageNet, and some reinforcement learning tasks. As part of this effort we open-source experiment code.
    Topological Information Retrieval with Dilation-Invariant Bottleneck Comparative Measures. (arXiv:2104.01672v3 [stat.ML] UPDATED)
    Appropriately representing elements in a database so that queries may be accurately matched is a central task in information retrieval; recently, this has been achieved by embedding the graphical structure of the database into a manifold in a hierarchy-preserving manner using a variety of metrics. Persistent homology is a tool commonly used in topological data analysis that is able to rigorously characterize a database in terms of both its hierarchy and connectivity structure. Computing persistent homology on a variety of embedded datasets reveals that some commonly used embeddings fail to preserve the connectivity. We show that those embeddings which successfully retain the database topology coincide in persistent homology by introducing two dilation-invariant comparative measures to capture this effect: in particular, they address the issue of metric distortion on manifolds. We provide an algorithm for their computation that exhibits greatly reduced time complexity over existing methods. We use these measures to perform the first instance of topology-based information retrieval and demonstrate its increased performance over the standard bottleneck distance for persistent homology. We showcase our approach on databases of different data varieties including text, videos, and medical images.
    NAS-Bench-360: Benchmarking Neural Architecture Search on Diverse Tasks. (arXiv:2110.05668v4 [cs.CV] UPDATED)
    Most existing neural architecture search (NAS) benchmarks and algorithms prioritize well-studied tasks, e.g. image classification on CIFAR or ImageNet. This makes the performance of NAS approaches in more diverse areas poorly understood. In this paper, we present NAS-Bench-360, a benchmark suite to evaluate methods on domains beyond those traditionally studied in architecture search, and use it to address the following question: do state-of-the-art NAS methods perform well on diverse tasks? To construct the benchmark, we curate ten tasks spanning a diverse array of application domains, dataset sizes, problem dimensionalities, and learning objectives. Each task is carefully chosen to interoperate with modern CNN-based search methods while possibly being far-afield from its original development domain. To speed up and reduce the cost of NAS research, for two of the tasks we release the precomputed performance of 15,625 architectures comprising a standard CNN search space. Experimentally, we show the need for more robust NAS evaluation of the kind NAS-Bench-360 enables by showing that several modern NAS procedures perform inconsistently across the ten tasks, with many catastrophically poor results. We also demonstrate how NAS-Bench-360 and its associated precomputed results will enable future scientific discoveries by testing whether several recent hypotheses promoted in the NAS literature hold on diverse tasks. NAS-Bench-360 is hosted at https://nb360.ml.cmu.edu.
    Machine Learning for Stuttering Identification: Review, Challenges and Future Directions. (arXiv:2107.04057v3 [cs.SD] UPDATED)
    Stuttering is a speech disorder during which the flow of speech is interrupted by involuntary pauses and repetition of sounds. Stuttering identification is an interesting interdisciplinary research problem involving pathology, psychology, acoustics, and signal processing, which makes it hard and complicated to detect. Recent developments in machine and deep learning have dramatically revolutionized the speech domain; however, minimal attention has been given to stuttering identification. This work fills the gap by trying to bring researchers together from interdisciplinary fields. In this paper, we comprehensively review acoustic features and statistical and deep learning based stuttering/disfluency classification methods. We also present several challenges and possible future directions.
    Graph Trees with Attention. (arXiv:2207.02760v1 [cs.LG])
    When dealing with tabular data, models based on regression and decision trees are a popular choice due to the high accuracy they provide on such tasks and their ease of application as compared to other model classes. Yet, when it comes to graph-structured data, current tree learning algorithms do not provide tools to manage the structure of the data other than relying on feature engineering. In this work we address the above gap, and introduce Graph Trees with Attention (GTA), a new family of tree-based learning algorithms that are designed to operate on graphs. GTA leverages both the graph structure and the features at the vertices and employs an attention mechanism that allows decisions to concentrate on sub-structures of the graph. We analyze GTA models and show that they are strictly more expressive than plain decision trees. We also demonstrate the benefits of GTA empirically on multiple graph and node prediction benchmarks. In these experiments, GTA always outperformed other tree-based models and often outperformed other types of graph-learning algorithms such as Graph Neural Networks (GNNs) and Graph Kernels. Finally, we also provide an explainability mechanism for GTA, and demonstrate it can provide intuitive explanations.
    Improved conformalized quantile regression. (arXiv:2207.02808v1 [stat.ML])
    Conformalized quantile regression is a procedure that inherits the advantages of conformal prediction and quantile regression. That is, we use quantile regression to estimate the true conditional quantile and then apply a conformal step on a calibration set to ensure marginal coverage. In this way, we get adaptive prediction intervals that account for heteroscedasticity. However, the aforementioned conformal step lacks adaptiveness as described in (Romano et al., 2019). To overcome this limitation, instead of applying a single conformal step after estimating conditional quantiles with quantile regression, we propose to cluster the explanatory variables weighted by their permutation importance with an optimized k-means and apply k conformal steps. To show that this improved version outperforms the classic version of conformalized quantile regression and is more adaptive to heteroscedasticity, we extensively compare the prediction intervals of both in open datasets.
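As a rough illustration of the single conformal step that the abstract above refers to, the sketch below follows the standard conformalized quantile regression recipe of Romano et al. (2019): score each calibration point by how far it falls outside its predicted quantile band, then widen new intervals by a finite-sample-corrected quantile of those scores. The function name and argument names are hypothetical, and the clustered variant proposed in the paper is not reproduced; under that proposal one would presumably repeat this step within each cluster.

```python
import math
import numpy as np

def cqr_interval(q_lo_cal, q_hi_cal, y_cal, q_lo_new, q_hi_new, alpha=0.1):
    """Widen predicted quantile bands using one conformal calibration step."""
    # Conformity score: distance outside the predicted band
    # (negative when the target lies strictly inside it).
    scores = np.maximum(q_lo_cal - y_cal, y_cal - q_hi_cal)
    n = len(scores)

    # Finite-sample correction: take the ceil((n + 1)(1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        margin = np.inf            # too few calibration points for this alpha
    else:
        margin = np.sort(scores)[k - 1]

    return q_lo_new - margin, q_hi_new + margin
```

Because the margin is a single number shared by all query points, the band can only shift uniformly, which is the lack of adaptiveness the abstract mentions.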
    Avoiding Forgetting and Allowing Forward Transfer in Continual Learning via Sparse Networks. (arXiv:2110.05329v3 [cs.LG] UPDATED)
    Using task-specific components within a neural network in continual learning (CL) is a compelling strategy to address the stability-plasticity dilemma in fixed-capacity models without access to past data. Current methods focus only on selecting a sub-network for a new task that reduces forgetting of past tasks. However, this selection could limit the forward transfer of relevant past knowledge that helps in future learning. Our study reveals that satisfying both objectives jointly is more challenging when a unified classifier is used for all classes of seen tasks, i.e. class-incremental learning (class-IL), as it is prone to ambiguities between classes across tasks. Moreover, the challenge increases when the semantic similarity of classes across tasks increases. To address this challenge, we propose a new CL method, named AFAF, that aims to Avoid Forgetting and Allow Forward transfer in class-IL using fixed-capacity models. AFAF allocates a sub-network that enables selective transfer of relevant knowledge to a new task while preserving past knowledge, reusing some of the previously allocated components to utilize the fixed capacity, and addressing class ambiguities when similarities exist. The experiments show the effectiveness of AFAF in providing models with multiple desirable CL properties, while outperforming state-of-the-art methods on various challenging benchmarks with different semantic similarities.
    Novel Techniques to Assess Predictive Systems and Reduce Their Alarm Burden. (arXiv:2102.05691v3 [cs.LG] UPDATED)
    Machine prediction algorithms (e.g., binary classifiers) often are adopted on the basis of claimed performance using classic metrics such as sensitivity and predictive value. However, classifier performance depends heavily upon the context (workflow) in which the classifier operates. Classic metrics do not reflect the realized utility of a predictor unless certain implicit assumptions are met, and these assumptions cannot be met in many common clinical scenarios. This often results in suboptimal implementations and in disappointment when expected outcomes are not achieved. One common failure mode for classic metrics arises when multiple predictions can be made for the same event, particularly when redundant true positive predictions produce little additional value. This describes many clinical alerting systems. We explain why classic metrics cannot correctly represent predictor performance in such contexts, and introduce an improved performance assessment technique using utility functions to score predictions based on their utility in a specific workflow context. The resulting utility metrics (u-metrics) explicitly account for the effects of temporal relationships on prediction utility. Compared to traditional measures, u-metrics more accurately reflect the real world costs and benefits of a predictor operating in a live clinical context. The improvement can be significant. We also describe a formal approach to snoozing, a mitigation strategy in which some predictions are suppressed to improve predictor performance by reducing false positives while retaining event capture. Snoozing is especially useful for predictors that generate interruptive alarms. U-metrics correctly measure and predict the performance benefits of snoozing, whereas traditional metrics do not.
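The snoozing strategy mentioned in the abstract above can be pictured with a very small sketch. The abstract does not specify the paper's formal approach, so the fixed snooze window used here is an assumption, and `apply_snooze` is a hypothetical name.

```python
def apply_snooze(alert_times, snooze_duration):
    """Keep an alert only if it fires at least `snooze_duration` after the last kept alert."""
    kept = []
    last_kept = None
    for t in sorted(alert_times):
        # Alerts inside the snooze window of the last kept alert are suppressed.
        if last_kept is None or t - last_kept >= snooze_duration:
            kept.append(t)
            last_kept = t
    return kept
```

For example, `apply_snooze([0, 1, 2, 10, 11, 25], 5)` keeps the alerts at 0, 10 and 25 and suppresses the redundant ones that follow closely behind.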
    DexMV: Imitation Learning for Dexterous Manipulation from Human Videos. (arXiv:2108.05877v5 [cs.LG] UPDATED)
    While significant progress has been made on understanding hand-object interactions in computer vision, it is still very challenging for robots to perform complex dexterous manipulation. In this paper, we propose a new platform and pipeline DexMV (Dexterous Manipulation from Videos) for imitation learning. We design a platform with: (i) a simulation system for complex dexterous manipulation tasks with a multi-finger robot hand and (ii) a computer vision system to record large-scale demonstrations of a human hand conducting the same tasks. In our novel pipeline, we extract 3D hand and object poses from videos, and propose a novel demonstration translation method to convert human motion to robot demonstrations. We then apply and benchmark multiple imitation learning algorithms with the demonstrations. We show that the demonstrations can indeed improve robot learning by a large margin and solve the complex tasks which reinforcement learning alone cannot solve. More details can be found in the project page: https://yzqin.github.io/dexmv
    Histopathology DatasetGAN: Synthesizing Large-Resolution Histopathology Datasets. (arXiv:2207.02712v1 [eess.IV])
    Self-supervised learning (SSL) methods are enabling an increasing number of deep learning models to be trained on image datasets in domains where labels are difficult to obtain. These methods, however, struggle to scale to the high resolution of medical imaging datasets, where they are critical for achieving good generalization on label-scarce medical image datasets. In this work, we propose the Histopathology DatasetGAN (HDGAN) framework, an extension of the DatasetGAN semi-supervised framework for image generation and segmentation that scales well to large-resolution histopathology images. We make several adaptations from the original framework, including updating the generative backbone, selectively extracting latent features from the generator, and switching to memory-mapped arrays. These changes reduce the memory consumption of the framework, improving its applicability to medical imaging domains. We evaluate HDGAN on a thrombotic microangiopathy high-resolution tile dataset, demonstrating strong performance on the high-resolution image-annotation generation task. We hope that this work enables more application of deep learning models to medical datasets, in addition to encouraging more exploration of self-supervised frameworks within the medical imaging domain.
    Learning with Neighbor Consistency for Noisy Labels. (arXiv:2202.02200v2 [cs.CV] UPDATED)
    Recent advances in deep learning have relied on large, labelled datasets to train high-capacity models. However, collecting large datasets in a time- and cost-efficient manner often results in label noise. We present a method for learning from noisy labels that leverages similarities between training examples in feature space, encouraging the prediction of each example to be similar to its nearest neighbours. Compared to training algorithms that use multiple models or distinct stages, our approach takes the form of a simple, additional regularization term. It can be interpreted as an inductive version of the classical, transductive label propagation algorithm. We thoroughly evaluate our method on datasets evaluating both synthetic (CIFAR-10, CIFAR-100) and realistic (mini-WebVision, WebVision, Clothing1M, mini-ImageNet-Red) noise, and achieve competitive or state-of-the-art accuracies across all of them.
    BFE and AdaBFE: A New Approach in Learning Rate Automation for Stochastic Optimization. (arXiv:2207.02763v1 [cs.LG])
    In this paper, a new gradient-based optimization approach that automatically adjusts the learning rate is proposed. This approach can be applied to design both non-adaptive and adaptive learning rates. First, I introduce the non-adaptive learning rate optimization method, Binary Forward Exploration (BFE); the corresponding adaptive per-parameter learning rate method, Adaptive BFE (AdaBFE), can then be developed. This approach could be an alternative way to optimize the learning rate based on the stochastic gradient descent (SGD) algorithm, besides the current non-adaptive learning rate methods (e.g. SGD, momentum, Nesterov) and the adaptive learning rate methods (e.g. AdaGrad, AdaDelta, Adam). The purpose of developing this approach is not to beat the benchmarks of other methods but to provide a different perspective on optimizing the gradient descent method, although some comparative study with previous methods will be made in the following sections. This approach is expected to be heuristic or to inspire researchers to improve gradient-based optimization in combination with previous methods.
    Architectural Optimization and Feature Learning for High-Dimensional Time Series Datasets. (arXiv:2202.13486v2 [cs.LG] UPDATED)
    As our ability to sense increases, we are experiencing a transition from data-poor problems, in which the central issue is a lack of relevant data, to data-rich problems, in which the central issue is to identify a few relevant features in a sea of observations. Motivated by applications in gravitational-wave astrophysics, we study the problem of predicting the presence of transient noise artifacts in a gravitational wave detector from a rich collection of measurements from the detector and its environment. We argue that feature learning--in which relevant features are optimized from data--is critical to achieving high accuracy. We introduce models that reduce the error rate by over 60% compared to the previous state of the art, which used fixed, hand-crafted features. Feature learning is useful not only because it improves performance on prediction tasks; the results provide valuable information about patterns associated with phenomena of interest that would otherwise be undiscoverable. In our application, features found to be associated with transient noise provide diagnostic information about its origin and suggest mitigation strategies. Learning in high-dimensional settings is challenging. Through experiments with a variety of architectures, we identify two key factors in successful models: sparsity, for selecting relevant variables within the high-dimensional observations; and depth, which confers flexibility for handling complex interactions and robustness with respect to temporal variations. We illustrate their significance through systematic experiments on real detector data. Our results provide experimental corroboration of common assumptions in the machine-learning community and have direct applicability to improving our ability to sense gravitational waves, as well as to many other problem settings with similarly high-dimensional, noisy, or partly irrelevant data.
    Self-supervised Detransformation Autoencoder for Representation Learning in Open Set Recognition. (arXiv:2105.13557v2 [cs.LG] UPDATED)
    The objective of Open set recognition (OSR) is to learn a classifier that can reject the unknown samples while classifying the known classes accurately. In this paper, we propose a self-supervision method, Detransformation Autoencoder (DTAE), for the OSR problem. This proposed method engages in learning representations that are invariant to the transformations of the input data. Experiments on several standard image datasets indicate that the pre-training process significantly improves the model performance in the OSR tasks. Meanwhile, our proposed self-supervision method achieves significant gains in detecting the unknown class and classifying the known classes. Moreover, our analysis indicates that DTAE can yield representations that contain more target class information and less transformation information than RotNet.
    Trading with the Momentum Transformer: An Intelligent and Interpretable Architecture. (arXiv:2112.08534v2 [cs.LG] UPDATED)
    We introduce the Momentum Transformer, an attention-based deep learning architecture which outperforms benchmark momentum and mean-reversion trading strategies. Unlike state-of-the-art Long Short-Term Memory (LSTM) architectures, which are sequential in nature, the attention mechanism provides our architecture with a direct connection to all previous time-steps. Our architecture enables us to learn longer-term dependencies, improves performance when considering returns net of transaction costs and naturally adapts to new market regimes, such as during the SARS-CoV-2 crisis. The Momentum Transformer is inherently interpretable, providing us with greater insights into our deep learning momentum trading strategy, including how it blends different classical strategies and the past time-steps which are of the greatest significance to the model.
    Detecting and Diagnosing Terrestrial Gravitational-Wave Mimics Through Feature Learning. (arXiv:2203.05086v2 [astro-ph.IM] UPDATED)
    As engineered systems grow in complexity, there is an increasing need for automatic methods that can detect, diagnose, and even correct transient anomalies that inevitably arise and can be difficult or impossible to diagnose and fix manually. Among the most sensitive and complex systems of our civilization are the detectors that search for incredibly small variations in distance caused by gravitational waves -- phenomena originally predicted by Albert Einstein to emerge and propagate through the universe as the result of collisions between black holes and other massive objects in deep space. The extreme complexity and precision of such detectors causes them to be subject to transient noise issues that can significantly limit their sensitivity and effectiveness. In this work, we present a demonstration of a method that can detect and characterize emergent transient anomalies of such massively complex systems. We illustrate the performance, precision, and adaptability of the automated solution via one of the prevalent issues limiting gravitational-wave discoveries: noise artifacts of terrestrial origin that contaminate gravitational wave observatories' highly sensitive measurements and can obscure or even mimic the faint astrophysical signals for which they are listening. Specifically, we demonstrate how a highly interpretable convolutional classifier can automatically learn to detect transient anomalies from auxiliary detector data without needing to observe the anomalies themselves. We also illustrate several other useful features of the model, including how it performs automatic variable selection to reduce tens of thousands of auxiliary data channels to only a few relevant ones; how it identifies behavioral signatures predictive of anomalies in those channels; and how it can be used to investigate individual anomalies and the channels associated with them.
    Stochastic normalizing flows as non-equilibrium transformations. (arXiv:2201.08862v3 [hep-lat] UPDATED)
    Normalizing flows are a class of deep generative models that provide a promising route to sample lattice field theories more efficiently than conventional Monte Carlo simulations. In this work we show that the theoretical framework of stochastic normalizing flows, in which neural-network layers are combined with Monte Carlo updates, is the same that underlies out-of-equilibrium simulations based on Jarzynski's equality, which have been recently deployed to compute free-energy differences in lattice gauge theories. We lay out a strategy to optimize the efficiency of this extended class of generative models and present examples of applications.
    Artificial Intelligence-Assisted Optimization and Multiphase Analysis of Polygon PEM Fuel Cells. (arXiv:2205.06768v2 [cs.NE] UPDATED)
    This article presents new hexagonal and pentagonal PEM fuel cell models. The models have been optimized after achieving improved cell performance. The input parameters of the multi-objective optimization algorithm were the pressure and temperature at the inlet, and the consumption and output powers were the objective parameters. The output data of the numerical simulation were used to train deep neural networks and then modeled with polynomial regression. The target functions have been extracted using the RSM (Response Surface Method), and the targets were optimized using the multi-objective genetic algorithm (NSGA-II). Compared to the base model, the optimized pentagonal and hexagonal models increase the output current density by 21.8% and 39.9%, respectively.
    Speech Denoising in the Waveform Domain with Self-Attention. (arXiv:2202.07790v2 [cs.SD] UPDATED)
    In this work, we present CleanUNet, a causal speech denoising model on the raw waveform. The proposed model is based on an encoder-decoder architecture combined with several self-attention blocks to refine its bottleneck representations, which is crucial to obtain good results. The model is optimized through a set of losses defined over both waveform and multi-resolution spectrograms. The proposed method outperforms the state-of-the-art models in terms of denoised speech quality from various objective and subjective evaluation metrics.
    Deep Learning Approximation of Diffeomorphisms via Linear-Control Systems. (arXiv:2110.12393v2 [math.OC] UPDATED)
    In this paper we propose a Deep Learning architecture to approximate diffeomorphisms diffeotopic to the identity. We consider a control system of the form $\dot x = \sum_{i=1}^lF_i(x)u_i$, with linear dependence in the controls, and we use the corresponding flow to approximate the action of a diffeomorphism on a compact ensemble of points. Despite the simplicity of the control system, it has been recently shown that a Universal Approximation Property holds. The problem of minimizing the sum of the training error and of a regularizing term induces a gradient flow in the space of admissible controls. A possible training procedure for the discrete-time neural network consists of projecting the gradient flow onto a finite-dimensional subspace of the admissible controls. An alternative approach relies on an iterative method based on the Pontryagin Maximum Principle for the numerical resolution of Optimal Control problems. Here the maximization of the Hamiltonian can be carried out with an extremely low computational effort, owing to the linear dependence of the system in the control variables.
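To make the control system in the abstract above concrete, here is a minimal sketch of its flow acting on an ensemble of points. The explicit Euler discretization, the piecewise-constant controls, and the names `flow`, `fields` and `controls` are all assumptions for illustration; the abstract does not say which discretization the authors use.

```python
import numpy as np

def flow(points, fields, controls, T=1.0):
    """Explicit-Euler flow of dx/dt = sum_i F_i(x) u_i(t) on an ensemble of points."""
    x = np.array(points, dtype=float)
    h = T / len(controls)                  # one Euler step per control value
    for u in controls:                     # u holds one coefficient per vector field
        velocity = sum(u_i * F(x) for F, u_i in zip(fields, u))
        x = x + h * velocity               # each step acts like one residual layer
    return x
```

With `fields = [lambda x: np.ones_like(x), lambda x: x]`, a control of `[1.0, 0.0]` translates the points and `[0.0, 1.0]` dilates them; training would then amount to choosing the `controls` so that the final positions match the target diffeomorphism on the ensemble.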
    A Recurrent Differentiable Engine for Modeling Tensegrity Robots Trainable with Low-Frequency Data. (arXiv:2203.00041v2 [cs.RO] UPDATED)
    Tensegrity robots, composed of rigid rods and flexible cables, are difficult to accurately model and control given the presence of complex dynamics and a high number of DoFs. Differentiable physics engines have been recently proposed as a data-driven approach for model identification of such complex robotic systems. These engines are often executed at a high frequency to achieve accurate simulation. Ground truth trajectories for training differentiable engines, however, are not typically available at such high frequencies due to limitations of real-world sensors. The present work focuses on this frequency mismatch, which impacts the modeling accuracy. We propose a recurrent structure for a differentiable physics engine of tensegrity robots, which can be trained effectively even with low-frequency trajectories. To train this new recurrent engine in a robust way, this work introduces, relative to prior work: (i) a new implicit integration scheme, (ii) a progressive training pipeline, and (iii) a differentiable collision checker. A model of NASA's icosahedron SUPERballBot on MuJoCo is used as the ground truth system to collect training data. Simulated experiments show that once the recurrent differentiable engine has been trained given the low-frequency trajectories from MuJoCo, it is able to match the behavior of MuJoCo's system. The criterion for success is whether a locomotion strategy learned using the differentiable engine can be transferred back to the ground-truth system and result in a similar motion. Notably, the amount of ground truth data needed to train the differentiable engine, such that the policy is transferable to the ground truth system, is 1% of the data needed to train the policy directly on the ground-truth system.
    Benchmarking of DL Libraries and Models on Mobile Devices. (arXiv:2202.06512v2 [cs.LG] UPDATED)
    Deploying deep learning (DL) on mobile devices has been a notable trend in recent years. To support fast inference of on-device DL, DL libraries play as critical a role as algorithms and hardware do. Unfortunately, no prior work has examined the ecosystem of modern DL libs in depth or provided quantitative results on their performance. In this paper, we first build a comprehensive benchmark that includes 6 representative DL libs and 15 diversified DL models. We then perform extensive experiments on 10 mobile devices, which help reveal a complete landscape of the current mobile DL libs ecosystem. For example, we find that the best-performing DL lib is severely fragmented across different models and hardware, and the gap between those DL libs can be rather huge. In fact, the impacts of DL libs can overwhelm the optimizations from algorithms or hardware, e.g., model quantization and GPU/DSP-based heterogeneous computing. Finally, atop the observations, we summarize practical implications to different roles in the DL lib ecosystem.
    Reconstructing Nonlinear Dynamical Systems from Multi-Modal Time Series. (arXiv:2111.02922v3 [cs.LG] UPDATED)
    Empirically observed time series in physics, biology, or medicine, are commonly generated by some underlying dynamical system (DS) which is the target of scientific interest. There is an increasing interest to harvest machine learning methods to reconstruct this latent DS in a data-driven, unsupervised way. In many areas of science it is common to sample time series observations from many data modalities simultaneously, e.g. electrophysiological and behavioral time series in a typical neuroscience experiment. However, current machine learning tools for reconstructing DSs usually focus on just one data modality. Here we propose a general framework for multi-modal data integration for the purpose of nonlinear DS reconstruction and the analysis of cross-modal relations. This framework is based on dynamically interpretable recurrent neural networks as general approximators of nonlinear DSs, coupled to sets of modality-specific decoder models from the class of generalized linear models. Both an expectation-maximization and a variational inference algorithm for model training are advanced and compared. We show on nonlinear DS benchmarks that our algorithms can efficiently compensate for too noisy or missing information in one data channel by exploiting other channels, and demonstrate on experimental neuroscience data how the algorithm learns to link different data domains to the underlying dynamics.
    Adversarial Mask: Real-World Universal Adversarial Attack on Face Recognition Models. (arXiv:2111.10759v2 [cs.CV] UPDATED)
    Deep learning-based facial recognition (FR) models have demonstrated state-of-the-art performance in the past few years, even when wearing protective medical face masks became commonplace during the COVID-19 pandemic. Given the outstanding performance of these models, the machine learning research community has shown increasing interest in challenging their robustness. Initially, researchers presented adversarial attacks in the digital domain, and later the attacks were transferred to the physical domain. However, in many cases, attacks in the physical domain are conspicuous, and thus may raise suspicion in real-world environments (e.g., airports). In this paper, we propose Adversarial Mask, a physical universal adversarial perturbation (UAP) against state-of-the-art FR models that is applied on face masks in the form of a carefully crafted pattern. In our experiments, we examined the transferability of our adversarial mask to a wide range of FR model architectures and datasets. In addition, we validated our adversarial mask's effectiveness in real-world experiments (CCTV use case) by printing the adversarial pattern on a fabric face mask. In these experiments, the FR system was only able to identify 3.34% of the participants wearing the mask (compared to a minimum of 83.34% with other evaluated masks). A demo of our experiments can be found at: https://youtu.be/_TXkDO5z11w.
    Two-Sample Testing in Reinforcement Learning. (arXiv:2201.08078v2 [cs.LG] UPDATED)
    Value-based reinforcement-learning algorithms have shown strong performances in games, robotics, and other real-world applications. The most popular sample-based method is $Q$-Learning. It subsequently performs updates by adjusting the current $Q$-estimate towards the observed reward and the maximum of the $Q$-estimates of the next state. The procedure introduces maximization bias, which approaches like Double $Q$-Learning aim to mitigate. We frame the bias problem statistically and consider it an instance of estimating the maximum expected value (MEV) of a set of random variables. We propose the $T$-Estimator (TE) based on two-sample testing for the mean, that flexibly interpolates between over- and underestimation by adjusting the significance level of the underlying hypothesis tests. A generalization, termed $K$-Estimator (KE), obeys the same bias and variance bounds as the TE while relying on a nearly arbitrary kernel function. We introduce modifications of $Q$-Learning and the Bootstrapped Deep $Q$-Network (BDQN) using the TE and the KE. Furthermore, we propose an adaptive variant of the TE-based BDQN that dynamically adjusts the significance level to minimize the absolute estimation bias. All proposed estimators and algorithms are thoroughly tested and validated on diverse tasks and environments, illustrating the bias control and performance potential of the TE and KE.
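    The maximization bias mentioned above can be illustrated with a small simulation. This is a minimal sketch, not the paper's $T$-Estimator or $K$-Estimator: it only compares the plain max-of-sample-means estimator with a double (split-sample) estimator. The function names, the Gaussian rewards, and the sample sizes are all illustrative assumptions.

```python
import random
import statistics


def single_estimator(samples_per_action):
    """Maximum of the sample means (the estimator implicit in Q-Learning)."""
    return max(statistics.fmean(s) for s in samples_per_action)


def double_estimator(samples_per_action, rng):
    """Select the best action on one half of the data, evaluate it on the other."""
    half_a, half_b = [], []
    for samples in samples_per_action:
        shuffled = samples[:]
        rng.shuffle(shuffled)
        mid = len(shuffled) // 2
        half_a.append(shuffled[:mid])
        half_b.append(shuffled[mid:])
    best = max(range(len(half_a)), key=lambda i: statistics.fmean(half_a[i]))
    return statistics.fmean(half_b[best])


def average_estimates(n_actions=10, n_samples=20, n_trials=2000, seed=0):
    """Average both estimators over many trials (assumed toy setting)."""
    rng = random.Random(seed)
    single_total = 0.0
    double_total = 0.0
    for _ in range(n_trials):
        # Every action has true mean 0, so the true maximum expected value is 0.
        data = [
            [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
            for _ in range(n_actions)
        ]
        single_total += single_estimator(data)
        double_total += double_estimator(data, rng)
    return single_total / n_trials, double_total / n_trials
```

    In this toy setting all true means are equal, so the double estimator is unbiased while the single estimator is biased upwards; when the means differ, the double estimator can underestimate instead.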
    Expectation Distance-based Distributional Clustering for Noise-Robustness. (arXiv:2110.08871v3 [cs.LG] UPDATED)
    This paper presents a clustering technique that reduces the susceptibility to data noise by learning and clustering the data-distribution and then assigning the data to the cluster of its distribution and, in the process, reducing the impact of noise on clustering results. This method involves introducing a new distance among distributions, namely the expectation distance (denoted, ED), that goes beyond the state-of-the-art distribution distance of optimal mass transport (denoted, $W_2$ for $2$-Wasserstein): The latter essentially depends only on the marginal distributions while the former also employs the information about the joint distributions. Using the ED, the paper extends the classical $K$-means and $K$-medoids clustering to those over data-distributions (rather than raw data) and introduces $K$-medoids using $W_2$. The paper also presents the closed-form expressions of the ED distance measure for the case when the uncertainty is Gaussian. The implementation results of the proposed ED and the $W_2$ distance measures to cluster real-world weather data are also presented, which involves efficiently extracting and using underlying uncertainty information in the form of means and variances (that, for example, is adequate to characterize Gaussian distributions). The results show striking performance improvement over classical clustering of raw data, with higher accuracy realized for ED. This is because while $W_2$ employs only the marginal distributions ignoring the correlations, the proposed ED also uses the joint distributions factoring the correlations into the distance measures.
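    To make the remark about $W_2$ concrete, the sketch below evaluates the standard closed form of the $2$-Wasserstein distance between two one-dimensional Gaussians. It covers only $W_2$; the ED is not defined in the abstract and is not reproduced here. The function name is an illustrative assumption.

```python
import math


def w2_gaussian_1d(mean_a, std_a, mean_b, std_b):
    """2-Wasserstein distance between N(mean_a, std_a^2) and N(mean_b, std_b^2).

    The formula uses only each distribution's own mean and standard
    deviation, i.e. the marginals; no correlation term appears.
    """
    return math.sqrt((mean_a - mean_b) ** 2 + (std_a - std_b) ** 2)
```

    For example, `w2_gaussian_1d(0.0, 1.0, 3.0, 5.0)` evaluates to `5.0`, since the mean gap is 3 and the standard-deviation gap is 4.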
    SE(3) Equivariant Graph Neural Networks with Complete Local Frames. (arXiv:2110.14811v2 [cs.CE] UPDATED)
    Group equivariance (e.g. SE(3) equivariance) is a critical physical symmetry in science, from classical and quantum physics to computational biology. It enables robust and accurate prediction under arbitrary reference transformations. In light of this, great efforts have been put into encoding this symmetry into deep neural networks, which has been shown to improve the generalization performance and data efficiency for downstream tasks. Constructing an equivariant neural network generally brings high computational costs to ensure expressiveness. Therefore, how to better trade off expressiveness and computational efficiency plays a core role in the design of the equivariant deep learning models. In this paper, we propose a framework to construct SE(3) equivariant graph neural networks that can approximate the geometric quantities efficiently. Inspired by differential geometry and physics, we introduce equivariant local complete frames to graph neural networks, such that tensor information at given orders can be projected onto the frames. The local frame is constructed to form an orthonormal basis that avoids direction degeneration and ensures completeness. Since the frames are built only by cross product operations, our method is computationally efficient. We evaluate our method on two tasks: Newton mechanics modeling and equilibrium molecule conformation generation. Extensive experimental results demonstrate that our model achieves the best or competitive performance in two types of datasets.
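    One way to build an orthonormal local frame from cross products alone is sketched below. This is an assumed construction for illustration; the abstract does not spell out the exact recipe. It assumes the two positions are expressed relative to a common centre and are not collinear with it, and the name `local_frame` is hypothetical.

```python
import numpy as np


def local_frame(x_i, x_j):
    """Return a 3x3 array whose rows are an orthonormal frame for the pair (i, j).

    Assumes x_i and x_j are centred coordinates that are not collinear
    with the origin (otherwise the cross product vanishes).
    """
    a = x_i - x_j
    a = a / np.linalg.norm(a)
    b = np.cross(x_i, x_j)          # perpendicular to both x_i and x_j, hence to a
    b = b / np.linalg.norm(b)
    c = np.cross(a, b)              # unit length because a and b are orthonormal
    return np.stack([a, b, c])
```

    Because a rotation $R$ satisfies $R(u \times v) = (Ru) \times (Rv)$, rotating both inputs rotates every frame vector by the same $R$, which is the equivariance property the abstract relies on.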
    Neural network stochastic differential equation models with applications to financial data forecasting. (arXiv:2111.13164v5 [cs.LG] UPDATED)
    In this article, we employ a collection of stochastic differential equations with drift and diffusion coefficients approximated by neural networks to predict the trend of chaotic time series which has big jump properties. Our contributions are, first, we propose a model called L\'evy induced stochastic differential equation network, which explores compounded stochastic differential equations with $\alpha$-stable L\'evy motion to model complex time series data and solve the problem through neural network approximation. Second, we theoretically prove the convergence of our algorithm with respect to hyper-parameters of the neural network, and obtain the error bound without curse of dimensionality. Finally, we illustrate our method by applying it to real financial time series data and find the accuracy increases through the use of non-Gaussian L\'evy processes. We also present detailed comparisons in terms of data patterns, various models, different shapes of L\'evy motion and the prediction lengths.
    A Unified Survey on Anomaly, Novelty, Open-Set, and Out-of-Distribution Detection: Solutions and Future Challenges. (arXiv:2110.14051v3 [cs.CV] UPDATED)
    Machine learning models often encounter samples that diverge from the training distribution. Failing to recognize an out-of-distribution (OOD) sample, and consequently assigning that sample to an in-class label, significantly compromises the reliability of a model. The problem has gained significant attention due to its importance for safely deploying models in open-world settings. Detecting OOD samples is challenging due to the intractability of modeling all possible unknown distributions. To date, several research domains tackle the problem of detecting unfamiliar samples, including anomaly detection, novelty detection, one-class learning, open set recognition, and out-of-distribution detection. Despite having similar and shared concepts, out-of-distribution, open-set, and anomaly detection have been investigated independently. Accordingly, these research avenues have not cross-pollinated, creating research barriers. While some surveys intend to provide an overview of these approaches, they seem to only focus on a specific domain without examining the relationship between different domains. This survey aims to provide a cross-domain and comprehensive review of numerous eminent works in respective areas while identifying their commonalities. Researchers can benefit from the overview of research advances in different fields and develop future methodology synergistically. Furthermore, to the best of our knowledge, while there are surveys in anomaly detection or one-class learning, there is no comprehensive or up-to-date survey on out-of-distribution detection, which our survey covers extensively. Finally, having a unified cross-domain perspective, we discuss and shed light on future lines of research, intending to bring these fields closer together.
    Quantum Logic Gate Synthesis as a Markov Decision Process. (arXiv:1912.12002v2 [quant-ph] UPDATED)
    Reinforcement learning has witnessed recent applications to a variety of tasks in quantum programming. The underlying assumption is that those tasks could be modeled as Markov Decision Processes (MDPs). Here, we investigate the feasibility of this assumption by exploring its consequences for two fundamental tasks in quantum programming: state preparation and gate compilation. By forming discrete MDPs, focusing exclusively on the single-qubit case (both with and without noise), we solve for the optimal policy exactly through policy iteration. We find optimal paths that correspond to the shortest possible sequence of gates to prepare a state, or compile a gate, up to some target accuracy. As an example, we find sequences of $H$ and $T$ gates with length as small as $11$ producing $\sim 99\%$ fidelity for states of the form $(HT)^{n} |0\rangle$ with values as large as $n=10^{10}$. In the presence of gate noise, we demonstrate how the optimal policy adapts to the effects of noisy gates in order to achieve a higher state fidelity. Our work shows that one can meaningfully impose a discrete, stochastic and Markovian nature to a continuous, deterministic and non-Markovian quantum evolution, and provides theoretical insight into why reinforcement learning may be successfully used to find optimally short gate sequences in quantum programming.
    Astroconformer: Inferring Surface Gravity of Stars from Stellar Light Curves with Transformer. (arXiv:2207.02787v1 [astro-ph.SR])
    We introduce Astroconformer, a Transformer-based model to analyze stellar light curves from the Kepler mission. We demonstrate that Astroconformer can robustly infer the stellar surface gravity as a supervised task. Importantly, as Transformer captures long-range information in the time series, it outperforms the state-of-the-art data-driven method in the field, and the critical role of self-attention is proved through ablation experiments. Furthermore, the attention map from Astroconformer exemplifies the long-range correlation information learned by the model, leading to a more interpretable deep learning approach for asteroseismology. Besides data from Kepler, we also show that the method can generalize to sparse cadence light curves from the Rubin Observatory, paving the way for the new era of asteroseismology, harnessing information from long-cadence ground-based observations.
    Deep Learning-based automated classification of Chinese Speech Sound Disorders. (arXiv:2205.11748v4 [cs.SD] CROSS LISTED)
    This article describes a system for analyzing acoustic data to assist in the diagnosis and classification of children's speech sound disorders (SSDs) using a computer. The analysis concentrated on identifying and categorizing four distinct types of Chinese SSDs. The study collected and generated a speech corpus containing 2540 stopping, backing, final consonant deletion process (FCDP), and affrication samples from 90 children aged 3-6 years with normal or pathological articulatory features. Each recording was accompanied by a detailed diagnostic annotation by two speech-language pathologists (SLPs). Classification of the speech samples was accomplished using three well-established neural network models for image classification. The feature maps were created using three sets of Mel-frequency cepstral coefficients (MFCC) parameters extracted from speech sounds and aggregated into a three-dimensional data structure as model input. We employed six techniques for data augmentation to augment the available dataset while avoiding overfitting. The experiments examine the usability of four different categories of Chinese phrases and characters. Experiments with different data subsets demonstrate the system's ability to accurately detect the analyzed pronunciation disorders. The best multi-class classification using a single Chinese phrase achieves an accuracy of 74.4 percent.
    Federated Neural Architecture Search. (arXiv:2002.06352v5 [cs.LG] UPDATED)
    To preserve user privacy while enabling mobile intelligence, techniques have been proposed to train deep neural networks on decentralized data. However, training over decentralized data makes the design of neural architectures even more difficult than it already was. Such difficulty is further amplified when designing and deploying different neural architectures for heterogeneous mobile platforms. In this work, we propose integrating automatic neural architecture search into decentralized training, as a new DNN training paradigm called Federated Neural Architecture Search, namely federated NAS. To deal with the primary challenge of limited on-client computational and communication resources, we present FedNAS, a highly optimized framework for efficient federated NAS. FedNAS fully exploits the key opportunity of insufficient model candidate re-training during the architecture search process, and incorporates three key optimizations: parallel candidates training on partial clients, early dropping candidates with inferior performance, and dynamic round numbers. Tested on large-scale datasets and typical CNN architectures, FedNAS achieves model accuracy comparable to that of a state-of-the-art NAS algorithm that trains models with centralized data, and also reduces the client cost by up to two orders of magnitude compared to a straightforward design of federated NAS.
    A multi-task network approach for calculating discrimination-free insurance prices. (arXiv:2207.02799v1 [cs.LG])
    In applications of predictive modeling, such as insurance pricing, indirect or proxy discrimination is an issue of major concern. Namely, there exists the possibility that protected policyholder characteristics are implicitly inferred from non-protected ones by predictive models, and are thus having an undesirable (or illegal) impact on prices. A technical solution to this problem relies on building a best-estimate model using all policyholder characteristics (including protected ones) and then averaging out the protected characteristics for calculating individual prices. However, such approaches require full knowledge of policyholders' protected characteristics, which may in itself be problematic. Here, we address this issue by using a multi-task neural network architecture for claim predictions, which can be trained using only partial information on protected characteristics, and it produces prices that are free from proxy discrimination. We demonstrate the use of the proposed model and we find that its predictive accuracy is comparable to a conventional feedforward neural network (on full information). However, this multi-task network has clearly superior performance in the case of partially missing policyholder information.
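    The "averaging out" step described above can be sketched as a weighted sum over the values of the protected characteristic. This is a minimal illustration, not the paper's multi-task network: the weighting by portfolio-wide proportions is an assumption (the abstract does not specify the weights), and `best_estimate`, `protected_weights`, and the toy numbers are hypothetical.

```python
def discrimination_free_price(best_estimate, x, protected_weights):
    """Average a best-estimate model over the protected characteristic.

    best_estimate(x, d): predicted price given non-protected features x
        and protected value d (hypothetical model interface).
    protected_weights: mapping from each protected value d to its weight;
        the weights are assumed to sum to 1 and not to depend on x.
    """
    return sum(
        weight * best_estimate(x, d) for d, weight in protected_weights.items()
    )


def toy_model(x, d):
    """Toy best-estimate model: group "a" is charged a surcharge of 20."""
    return 100.0 + 10.0 * x + (20.0 if d == "a" else 0.0)


price = discrimination_free_price(toy_model, 2.0, {"a": 0.25, "b": 0.75})
```

    Because the weights do not depend on `x`, the resulting price cannot pick up the protected value indirectly through features correlated with it.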
    Integral Probability Metrics PAC-Bayes Bounds. (arXiv:2207.00614v2 [stat.ML] UPDATED)
    We present a PAC-Bayes-style generalization bound which enables the replacement of the KL-divergence with a variety of Integral Probability Metrics (IPM). We provide instances of this bound with the IPM being the total variation metric and the Wasserstein distance. A notable feature of the obtained bounds is that they naturally interpolate between classical uniform convergence bounds in the worst case (when the prior and posterior are far away from each other), and preferable bounds in better cases (when the posterior and prior are close). This illustrates the possibility of reinforcing classical generalization bounds with algorithm- and data-dependent components, thus making them more suitable to analyze algorithms that use a large hypothesis space.
    Simple and Efficient Heterogeneous Graph Neural Network. (arXiv:2207.02547v1 [cs.LG])
    Heterogeneous graph neural networks (HGNNs) deliver the powerful capability to embed rich structural and semantic information of a heterogeneous graph into low-dimensional node representations. Existing HGNNs usually learn to embed information using a hierarchical attention mechanism and repeated neighbor aggregation, suffering from unnecessary complexity and redundant computation. This paper proposes Simple and Efficient Heterogeneous Graph Neural Network (SeHGNN) which reduces this excess complexity through avoiding overused node-level attention within the same relation and pre-computing the neighbor aggregation in the pre-processing stage. Unlike previous work, SeHGNN utilizes a light-weight parameter-free neighbor aggregator to learn structural information for each metapath, and a transformer-based semantic aggregator to combine semantic information across metapaths for the final embedding of each node. As a result, SeHGNN offers a simple network structure, high prediction accuracy, and fast training speed. Extensive experiments on five real-world heterogeneous graphs demonstrate the superiority of SeHGNN over the state of the art in both accuracy and training speed. Code is available at https://github.com/ICT-GIMLab/SeHGNN.
    Transformers discover an elementary calculation system exploiting local attention and grid-like problem representation. (arXiv:2207.02536v1 [cs.LG])
    Mathematical reasoning is one of the most impressive achievements of human intellect but remains a formidable challenge for artificial intelligence systems. In this work we explore whether modern deep learning architectures can learn to solve a symbolic addition task by discovering effective arithmetic procedures. Although the problem might seem trivial at first glance, generalizing arithmetic knowledge to operations involving a higher number of terms, possibly composed by longer sequences of digits, has proven extremely challenging for neural networks. Here we show that universal transformers equipped with local attention and adaptive halting mechanisms can learn to exploit an external, grid-like memory to carry out multi-digit addition. The proposed model achieves remarkable accuracy even when tested with problems requiring extrapolation outside the training distribution; most notably, it does so by discovering human-like calculation strategies such as place value alignment.
    A Hybrid Approach for Binary Classification of Imbalanced Data. (arXiv:2207.02738v1 [cs.LG])
    Binary classification with an imbalanced dataset is challenging. Models tend to consider all samples as belonging to the majority class. Although existing solutions such as sampling methods, cost-sensitive methods, and ensemble learning methods improve the poor accuracy of the minority class, these methods are limited by overfitting problems or cost parameters that are difficult to decide. We propose HADR, a hybrid approach with dimension reduction that consists of data block construction, dimensionality reduction, and ensemble learning with deep neural network classifiers. We evaluate the performance on eight imbalanced public datasets in terms of recall, G-mean, and AUC. The results show that our model outperforms state-of-the-art methods.
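    The G-mean metric named above is commonly defined as the geometric mean of recall on the positive class and recall on the negative class. The sketch below uses that standard definition (assumed here, since the abstract does not state it) to show why a classifier that always predicts the majority class scores zero despite high accuracy. The function name and toy labels are illustrative.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (recall) and specificity."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


# Nine majority-class samples and one minority-class sample.
labels = [0] * 9 + [1]
always_majority = [0] * 10
accuracy = sum(t == p for t, p in zip(labels, always_majority)) / len(labels)
score = g_mean(labels, always_majority)
```

    Here `accuracy` is 0.9 while `score` is 0.0, which is why G-mean is preferred over accuracy on imbalanced data.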
    Text Enriched Sparse Hyperbolic Graph Convolutional Networks. (arXiv:2207.02368v1 [cs.IR])
    Heterogeneous networks, which connect informative nodes containing text with different edge types, are routinely used to store and process information in various real-world applications. Graph Neural Networks (GNNs) and their hyperbolic variants provide a promising approach to encode such networks in a low-dimensional latent space through neighborhood aggregation and hierarchical feature extraction, respectively. However, these approaches typically ignore metapath structures and the available semantic information. Furthermore, these approaches are sensitive to the noise present in the training data. To tackle these limitations, in this paper, we propose Text Enriched Sparse Hyperbolic Graph Convolution Network (TESH-GCN) to capture the graph's metapath structures using semantic signals and further improve prediction in large heterogeneous graphs. In TESH-GCN, we extract semantic node information, which successively acts as a connection signal to extract relevant nodes' local neighborhood and graph-level metapath features from the sparse adjacency tensor in a reformulated hyperbolic graph convolution layer. These extracted features in conjunction with semantic features from the language model (for robustness) are used for the final downstream task. Experiments on various heterogeneous graph datasets show that our model outperforms the current state-of-the-art approaches by a large margin on the task of link prediction. We also report a reduction in both the training time and model parameters compared to the existing hyperbolic approaches through a reformulated hyperbolic graph convolution. Furthermore, we illustrate the robustness of our model by experimenting with different levels of simulated noise in both the graph structure and text, and also, present a mechanism to explain TESH-GCN's prediction by analyzing the extracted metapaths.
    Cascaded Deep Hybrid Models for Multistep Household Energy Consumption Forecasting. (arXiv:2207.02589v1 [cs.LG])
    Sustainability requires increased energy efficiency with minimal waste. The future power systems should thus provide high levels of flexibility in controlling energy consumption. Precise projections of future energy demand/load at the aggregate and on the individual site levels are of great importance for decision makers and professionals in the energy industry. Forecasting energy loads has become more advantageous for energy providers and customers, allowing them to establish an efficient production strategy to satisfy demand. This study introduces two hybrid cascaded models for forecasting multistep household power consumption in different resolutions. The first model integrates Stationary Wavelet Transform (SWT), as an efficient signal preprocessing technique, with Convolutional Neural Networks and Long Short Term Memory (LSTM). The second hybrid model combines SWT with a self-attention based neural network architecture named transformer. The major constraint of using time-frequency analysis methods such as SWT in multistep energy forecasting problems is that they require sequential signals, making signal reconstruction problematic in multistep forecasting applications. The cascaded models can efficiently address this problem through using the recursive outputs. Experimental results show that the proposed hybrid models achieve superior prediction performance compared to the existing multistep power consumption prediction methods. The results will pave the way for more accurate and reliable forecasting of household power consumption.
    Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via Online Experiment Design. (arXiv:2207.02575v1 [cs.LG])
    While much progress has been made in understanding the minimax sample complexity of reinforcement learning (RL) -- the complexity of learning on the "worst-case" instance -- such measures of complexity often do not capture the true difficulty of learning. In practice, on an "easy" instance, we might hope to achieve a complexity far better than that achievable on the worst-case instance. In this work we seek to understand the "instance-dependent" complexity of learning near-optimal policies (PAC RL) in the setting of RL with linear function approximation. We propose an algorithm, \textsc{Pedel}, which achieves a fine-grained instance-dependent measure of complexity, the first of its kind in the RL with function approximation setting, thereby capturing the difficulty of learning on each particular problem instance. Through an explicit example, we show that \textsc{Pedel} yields provable gains over low-regret, minimax-optimal algorithms and that such algorithms are unable to hit the instance-optimal rate. Our approach relies on a novel online experiment design-based procedure which focuses the exploration budget on the "directions" most relevant to learning a near-optimal policy, and may be of independent interest.
    The Intrinsic Manifolds of Radiological Images and their Role in Deep Learning. (arXiv:2207.02797v1 [eess.IV])
    The manifold hypothesis is a core mechanism behind the success of deep learning, so understanding the intrinsic manifold structure of image data is central to studying how neural networks learn from the data. Intrinsic dataset manifolds and their relationship to learning difficulty have recently begun to be studied for the common domain of natural images, but little such research has been attempted for radiological images. We address this here. First, we compare the intrinsic manifold dimensionality of radiological and natural images. We also investigate the relationship between intrinsic dimensionality and generalization ability over a wide range of datasets. Our analysis shows that natural image datasets generally have a higher number of intrinsic dimensions than radiological images. However, the relationship between generalization ability and intrinsic dimensionality is much stronger for medical images, which could be explained as radiological images having intrinsic features that are more difficult to learn. These results give a more principled underpinning for the intuition that radiological images can be more challenging to apply deep learning to than natural image datasets common to machine learning research. We believe rather than directly applying models developed for natural images to the radiological imaging domain, more care should be taken to developing architectures and algorithms that are more tailored to the specific characteristics of this domain. The research shown in our paper, demonstrating these characteristics and the differences from natural images, is an important first step in this direction.
    Pure Transformers are Powerful Graph Learners. (arXiv:2207.02505v1 [cs.LG])
    We show that standard Transformers without graph-specific modifications can lead to promising results in graph learning both in theory and practice. Given a graph, we simply treat all nodes and edges as independent tokens, augment them with token embeddings, and feed them to a Transformer. With an appropriate choice of token embeddings, we prove that this approach is theoretically at least as expressive as an invariant graph network (2-IGN) composed of equivariant linear layers, which is already more expressive than all message-passing Graph Neural Networks (GNN). When trained on a large-scale graph dataset (PCQM4Mv2), our method coined Tokenized Graph Transformer (TokenGT) achieves significantly better results compared to GNN baselines and competitive results compared to Transformer variants with sophisticated graph-specific inductive bias. Our implementation is available at https://github.com/jw9730/tokengt.
    Instance-optimal PAC Algorithms for Contextual Bandits. (arXiv:2207.02357v1 [stat.ML])
    In the stochastic contextual bandit setting, regret-minimizing algorithms have been extensively researched, but their instance-minimizing best-arm identification counterparts remain seldom studied. In this work, we focus on the stochastic bandit problem in the $(\epsilon,\delta)$-$\textit{PAC}$ setting: given a policy class $\Pi$ the goal of the learner is to return a policy $\pi\in \Pi$ whose expected reward is within $\epsilon$ of the optimal policy with probability greater than $1-\delta$. We characterize the first $\textit{instance-dependent}$ PAC sample complexity of contextual bandits through a quantity $\rho_{\Pi}$, and provide matching upper and lower bounds in terms of $\rho_{\Pi}$ for the agnostic and linear contextual best-arm identification settings. We show that no algorithm can be simultaneously minimax-optimal for regret minimization and instance-dependent PAC for best-arm identification. Our main result is a new instance-optimal and computationally efficient algorithm that relies on a polynomial number of calls to an argmax oracle.
    Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation. (arXiv:2205.14141v2 [cs.CV] UPDATED)
    Masked image modeling (MIM) learns representations with remarkably good fine-tuning performances, overshadowing previous prevalent pre-training approaches such as image classification, instance contrastive learning, and image-text alignment. In this paper, we show that the inferior fine-tuning performance of these pre-training approaches can be significantly improved by a simple post-processing in the form of feature distillation (FD). The feature distillation converts the old representations to new representations that have a few desirable properties just like those representations produced by MIM. These properties, which we aggregately refer to as optimization friendliness, are identified and analyzed by a set of attention- and optimization-related diagnosis tools. With these properties, the new representations show strong fine-tuning performance. Specifically, the contrastive self-supervised learning methods are made as competitive in fine-tuning as the state-of-the-art masked image modeling (MIM) algorithms. The CLIP models' fine-tuning performance is also significantly improved, with a CLIP ViT-L model reaching 89.0% top-1 accuracy on ImageNet-1K classification. On the 3-billion-parameter SwinV2-G model, the fine-tuning accuracy on ADE20K semantic segmentation is improved by +1.5 mIoU to 61.4 mIoU, creating a new record. More importantly, our work provides a way for the future research to focus more effort on the generality and scalability of the learnt representations without being pre-occupied with optimization friendliness since it can be enhanced rather easily. The code will be available at https://github.com/SwinTransformer/Feature-Distillation.
    Characterizing and Mitigating the Difficulty in Training Physics-informed Artificial Neural Networks under Pointwise Constraints. (arXiv:2206.09321v2 [cs.LG] UPDATED)
    Neural networks can be used to learn the solution of partial differential equations (PDEs) on arbitrary domains without requiring a computational mesh. Common approaches integrate differential operators in training neural networks using a structured loss function. The most common training algorithm for neural networks is backpropagation, which relies on the gradient of the loss function with respect to the parameters of the network. In this work, we characterize the difficulty of training neural networks on physics by investigating the impact of differential operators in corrupting the back-propagated gradients. In particular, we show that perturbations present in the output of a neural network model during early stages of training lead to higher levels of noise in a structured loss function that is composed of high-order differential operators. These perturbations consequently corrupt the back-propagated gradients and impede convergence. We mitigate this issue by introducing auxiliary flux parameters to obtain a system of first-order differential equations. We formulate a non-linear unconstrained optimization problem using the augmented Lagrangian method that properly constrains the boundary conditions and adaptively focuses on regions of higher gradients that are difficult to learn. We apply our approach to learn the solution of various benchmark PDE problems and demonstrate orders of magnitude improvement over existing approaches.
    Self-Normalized Density Map (SNDM) for Counting Microbiological Objects. (arXiv:2203.09474v2 [cs.CV] UPDATED)
    The statistical properties of the density map (DM) approach to counting microbiological objects on images are studied in detail. The DM is given by U$^2$-Net. Two statistical methods for deep neural networks are utilized: the bootstrap and the Monte Carlo (MC) dropout. The detailed analysis of the uncertainties for the DM predictions leads to a deeper understanding of the DM model's deficiencies. Based on our investigation, we propose a self-normalization module in the network. The improved network model, called Self-Normalized Density Map (SNDM), can correct its output density map by itself to accurately predict the total number of objects in the image. The SNDM architecture outperforms the original model. Moreover, both statistical frameworks -- bootstrap and MC dropout -- have consistent statistical results for SNDM, which were not observed in the original model. The SNDM efficiency is comparable with that of detector-based models, such as Faster and Cascade R-CNN detectors.
    A Heterogeneous Graph Based Framework for Multimodal Neuroimaging Fusion Learning. (arXiv:2110.08465v4 [cs.LG] UPDATED)
    Graph neural networks (GNNs) provide powerful insights for brain neuroimaging technology from the view of graphical networks. However, most existing GNN-based models assume that the neuroimaging-produced brain connectome network is a homogeneous graph with a single type of node and edge. In fact, emerging studies have reported and emphasized the significance of heterogeneity among human brain activities, especially between the two cerebral hemispheres. Thus, homogeneous-structured brain network-based graph methods are insufficient for modelling complicated cerebral activity states. To overcome this problem, in this paper, we present a heterogeneous graph neural network (HeBrainGNN) for multimodal brain neuroimaging fusion learning. We first model the brain network as a heterogeneous graph with multitype nodes (i.e., left and right hemispheric nodes) and multitype edges (i.e., intra- and interhemispheric edges). Then, we propose a self-supervised pretraining strategy based on a heterogeneous brain network to address the potential overfitting problem caused by the conflict between a large parameter size and a small medical data sample size. Our results show the superiority of the proposed model over other existing methods in brain-related disease prediction tasks. Ablation experiments show that our heterogeneous graph-based model attaches more importance to hemispheric connections that may be neglected due to their low strength by previous homogeneous graph models. Other experiments also indicate that our proposed model with a pretraining strategy alleviates the problem of limited labelled data and yields a significant improvement in accuracy.
    Deep Contrastive Patch-Based Subspace Learning for Camera Image Signal Processing. (arXiv:2104.00253v3 [eess.IV] UPDATED)
    Camera Image Signal Processing (ISP) pipelines, including deep learning trained versions, can get appealing results in different image signal processing tasks. However, most if not all of these methods tend to apply a single filter that is homogeneous over the entire image. This is also particularly true when an encoder-decoder type deep architecture is trained for the task. However, it is natural to view a camera image as heterogeneous, as the color intensity and the artificial noise are distributed vastly differently, even across the two-dimensional domain of a single image. Varied Moire ringing, motion blur, color bleaching, or lens-based projection distortions can all potentially lead to a heterogeneous image artifact filtering problem. In this paper, we present a specific patch-based, local subspace deep neural network that improves Camera ISP to be robust to heterogeneous artifacts (especially image denoising). We call our three-fold deep trained model the Patch Subspace Learning Autoencoder (PSL-AE). PSL-AE does not necessarily assume uniform image distortion levels, nor repeated or similar artifact types within the image. Rather, PSL-AE first diagnostically encodes patches extracted from noisy and clean image pairs, with different artifact types and distortion levels, by contrastive learning. Then, each image's patches are encoded into soft clusters in their appropriate latent subspace, using a prior mixture model. Lastly, the decoders of the PSL-AE are also trained in an unsupervised manner customized for the image patches in each soft cluster. Our experimental results demonstrate the flexibility and performance that one can achieve through improved heterogeneous filtering, both on synthesized artifacts and on realistic SIDD image pairs.
    Domain Adaptive Hand Keypoint and Pixel Localization in the Wild. (arXiv:2203.08344v4 [cs.CV] UPDATED)
    We aim to improve the performance of regressing hand keypoints and segmenting pixel-level hand masks under new imaging conditions (e.g., outdoors) when we only have labeled images taken under very different conditions (e.g., indoors). In the real world, it is important that the model trained for both tasks works under various imaging conditions. However, the variation in imaging conditions covered by existing labeled hand datasets is limited. Thus, it is necessary to adapt the model trained on the labeled images (source) to unlabeled images (target) with unseen imaging conditions. While self-training domain adaptation methods (i.e., learning from the unlabeled target images in a self-supervised manner) have been developed for both tasks, their training may degrade performance when the predictions on the target images are noisy. To avoid this, it is crucial to assign a low importance (confidence) weight to the noisy predictions during self-training. In this paper, we propose to utilize the divergence of two predictions to estimate the confidence of the target image for both tasks. These predictions are given by two separate networks, and their divergence helps identify the noisy predictions. To integrate our proposed confidence estimation into self-training, we propose a teacher-student framework where the two networks (teachers) provide supervision to a network (student) for self-training, and the teachers are learned from the student by knowledge distillation. Our experiments show its superiority over state-of-the-art methods in adaptation settings with different lighting, grasping objects, backgrounds, and camera viewpoints. Our method improves the multi-task score on HO3D by 4% compared to the latest adversarial adaptation method. We also validate our method on Ego4D, a dataset of egocentric videos with rapid changes in imaging conditions outdoors.
    SAAC: Safe Reinforcement Learning as an Adversarial Game of Actor-Critics. (arXiv:2204.09424v2 [cs.LG] UPDATED)
    Although Reinforcement Learning (RL) is effective for sequential decision-making problems under uncertainty, it still fails to thrive in real-world systems where risk or safety is a binding constraint. In this paper, we formulate the RL problem with safety constraints as a non-zero-sum game. While deployed with maximum entropy RL, this formulation leads to a safe adversarially guided soft actor-critic framework, called SAAC. In SAAC, the adversary aims to break the safety constraint while the RL agent aims to maximize the constrained value function given the adversary's policy. The safety constraint on the agent's value function manifests only as a repulsion term between the agent's and the adversary's policies. Unlike previous approaches, SAAC can address different safety criteria such as safe exploration, mean-variance risk sensitivity, and CVaR-like coherent risk sensitivity. We illustrate the design of the adversary for these constraints. Then, in each of these variations, we show the agent differentiates itself from the adversary's unsafe actions in addition to learning to solve the task. Finally, for challenging continuous control tasks, we demonstrate that SAAC achieves faster convergence, better efficiency, and fewer failures to satisfy the safety constraints than risk-averse distributional RL and risk-neutral soft actor-critic algorithms.
    Predicting Kidney Transplant Survival using Multiple Feature Representations for HLAs. (arXiv:2103.03305v2 [cs.LG] UPDATED)
    Kidney transplantation can significantly enhance living standards for people suffering from end-stage renal disease. A significant factor that affects graft survival time (the time until the transplant fails and the patient requires another transplant) for kidney transplantation is the compatibility of the Human Leukocyte Antigens (HLAs) between the donor and recipient. In this paper, we propose 4 new biologically-relevant feature representations for incorporating HLA information into machine learning-based survival analysis algorithms. We evaluate our proposed HLA feature representations on a database of over 100,000 transplants and find that they improve prediction accuracy by about 1%, modest at the patient level but potentially significant at a societal level. Accurate prediction of survival times can improve transplant survival outcomes, enabling better allocation of donors to recipients and reducing the number of re-transplants due to graft failure with poorly matched donors.
    Variational Flow Graphical Model. (arXiv:2207.02722v1 [stat.ML])
    This paper introduces a novel approach to embed flow-based models with hierarchical structures. The proposed framework is named Variational Flow Graphical (VFG) Model. VFGs learn the representation of high dimensional data via a message-passing scheme by integrating flow-based functions through variational inference. By leveraging the expressive power of neural networks, VFGs produce a representation of the data using a lower dimension, thus overcoming the drawbacks of many flow-based models, which usually require a high dimensional latent space involving many trivial variables. Aggregation nodes are introduced in the VFG models to integrate forward-backward hierarchical information via a message-passing scheme. Maximizing the evidence lower bound (ELBO) of data likelihood aligns the forward and backward messages in each aggregation node, achieving a consistent node state. Algorithms have been developed to learn model parameters through gradient updates on the ELBO objective. The consistency of aggregation nodes enables VFGs to be applied to tractable inference on graphical structures. Besides representation learning and numerical inference, VFGs provide a new approach for distribution modeling on datasets with graphical latent structures. Additionally, a theoretical study shows that VFGs are universal approximators by leveraging the implicitly invertible flow-based structures. With flexible graphical structures and superior expressive power, VFGs could potentially be used to improve probabilistic inference. In the experiments, VFGs achieve improved evidence lower bound (ELBO) and likelihood values on multiple datasets.
    PAC Prediction Sets for Meta-Learning. (arXiv:2207.02440v1 [cs.LG])
    Uncertainty quantification is a key component of machine learning models targeted at safety-critical systems such as in healthcare or autonomous vehicles. We study this problem in the context of meta learning, where the goal is to quickly adapt a predictor to new tasks. In particular, we propose a novel algorithm to construct PAC prediction sets, which capture uncertainty via sets of labels, that can be adapted to new tasks with only a few training examples. These prediction sets satisfy an extension of the typical PAC guarantee to the meta learning setting; in particular, the PAC guarantee holds with high probability over future tasks. We demonstrate the efficacy of our approach on four datasets across three application domains: mini-ImageNet and CIFAR10-C in the visual domain, FewRel in the language domain, and the CDC Heart Dataset in the medical domain. In particular, our prediction sets satisfy the PAC guarantee while having smaller size compared to other baselines that also satisfy this guarantee.
    Enabling Fast Deep Learning on Tiny Energy-Harvesting IoT Devices. (arXiv:2111.14051v3 [cs.LG] UPDATED)
    Energy harvesting (EH) IoT devices that operate intermittently without batteries, coupled with advances in deep neural networks (DNNs), have opened up new opportunities for enabling sustainable smart applications. Nevertheless, implementing those computation- and memory-intensive intelligent algorithms on EH devices is extremely difficult due to the challenges of limited resources and an intermittent power supply that causes frequent failures. To address those challenges, this paper proposes a methodology that enables fast deep learning with low-energy accelerators for tiny energy harvesting devices. We first propose RAD, a resource-aware structured DNN training framework, which employs block circulant matrices and structured pruning to achieve high compression and to leverage the advantages of various vector operation accelerators. A DNN implementation method, ACE, is then proposed that employs low-energy accelerators to obtain maximum performance with small energy consumption. Finally, we further design FLEX, the system support for intermittent computation in energy harvesting situations. Experimental results from three different DNN models demonstrate that RAD, ACE, and FLEX can enable fast and correct inference on energy harvesting devices with up to 4.26X runtime reduction and up to 7.7X energy reduction, with higher accuracy than the state-of-the-art.
    Fast Sparse Decision Tree Optimization via Reference Ensembles. (arXiv:2112.00798v7 [cs.LG] UPDATED)
    Sparse decision tree optimization has been one of the most fundamental problems in AI since its inception and is a challenge at the core of interpretable machine learning. Sparse decision tree optimization is computationally hard, and despite steady effort since the 1960's, breakthroughs have only been made on the problem within the past few years, primarily on the problem of finding optimal sparse decision trees. However, current state-of-the-art algorithms often require impractical amounts of computation time and memory to find optimal or near-optimal trees for some real-world datasets, particularly those having several continuous-valued features. Given that the search spaces of these decision tree optimization problems are massive, can we practically hope to find a sparse decision tree that competes in accuracy with a black box machine learning model? We address this problem via smart guessing strategies that can be applied to any optimal branch-and-bound-based decision tree algorithm. We show that by using these guesses, we can reduce the run time by multiple orders of magnitude, while providing bounds on how far the resulting trees can deviate from the black box's accuracy and expressive power. Our approach enables guesses about how to bin continuous features, the size of the tree, and lower bounds on the error for the optimal decision tree. Our experiments show that in many cases we can rapidly construct sparse decision trees that match the accuracy of black box models. To summarize: when you are having trouble optimizing, just guess.
    Online Bilevel Optimization: Regret Analysis of Online Alternating Gradient Methods. (arXiv:2207.02829v1 [math.OC])
    Online optimization is a well-established optimization paradigm that aims to make a sequence of correct decisions given knowledge of the correct answer to previous decision tasks. Bilevel programming involves a hierarchical optimization problem where the feasible region of the so-called outer problem is restricted by the graph of the solution set mapping of the inner problem. This paper brings these two ideas together and studies an online bilevel optimization setting in which a sequence of time-varying bilevel problems is revealed one after the other. We extend the known regret bounds for single-level online algorithms to the bilevel setting. Specifically, we introduce new notions of bilevel regret, develop an online alternating time-averaged gradient method that is capable of leveraging smoothness, and provide regret bounds in terms of the path-length of the inner and outer minimizer sequences.
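    For orientation, the time-varying bilevel problem described in this abstract can be written in a generic form. The symbols below ($f_t$, $g_t$, $X$, $Y$, $S_t$) are illustrative assumptions and are not taken from the paper:

```latex
\min_{x \in X,\; y} \; f_t(x, y)
\quad \text{subject to} \quad
y \in S_t(x) := \operatorname*{arg\,min}_{y' \in Y} \; g_t(x, y'),
\qquad t = 1, \dots, T.
```

    Here $f_t$ is the outer objective revealed at round $t$, $g_t$ is the inner objective, and the constraint $y \in S_t(x)$ is the restriction to the graph of the inner solution set mapping.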
    AutoSpeed: A Linked Autoencoder Approach for Pulse-Echo Speed-of-Sound Imaging for Medical Ultrasound. (arXiv:2207.02392v1 [eess.IV])
    Quantitative ultrasound, e.g., speed-of-sound (SoS) in tissues, provides information about tissue properties that have diagnostic value. Recent studies showed the possibility of extracting SoS information from pulse-echo ultrasound raw data (a.k.a. RF data) using deep neural networks that are fully trained on simulated data. These methods take sensor domain data, i.e., RF data, as input and train a network in an end-to-end fashion to learn the implicit mapping between the RF data domain and SoS domain. However, such networks are prone to overfitting to simulated data which results in poor performance and instability when tested on measured data. We propose a novel method for SoS mapping employing learned representations from two linked autoencoders. We test our approach on simulated and measured data acquired from human breast mimicking phantoms. We show that SoS mapping is possible using linked autoencoders. The proposed method has a Mean Absolute Percentage Error (MAPE) of 2.39% on the simulated data. On the measured data, the predictions of the proposed method are close to the expected values with MAPE of 1.1%. Compared to an end-to-end trained network, the proposed method shows higher stability and reproducibility.
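    The error metric quoted in this abstract, the Mean Absolute Percentage Error, can be sketched as follows. This is the standard textbook definition, not code from the paper, and the speed-of-sound values in the example are made up for illustration:

```python
def mape(expected, predicted):
    """Mean Absolute Percentage Error, returned in percent.

    Standard definition: 100 * mean(|expected - predicted| / |expected|).
    Expected values must be non-zero.
    """
    if len(expected) != len(predicted) or len(expected) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((e - p) / e) for e, p in zip(expected, predicted))
    return 100.0 * total / len(expected)


# Hypothetical speed-of-sound values in m/s (not data from the paper).
expected_sos = [1500.0, 1540.0, 1580.0]
predicted_sos = [1515.0, 1540.0, 1564.2]
print(f"MAPE = {mape(expected_sos, predicted_sos):.2f}%")
```

    Because the error is normalized by the expected value, a fixed absolute error in m/s weighs slightly more for slower media.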
    TractoFormer: A Novel Fiber-level Whole Brain Tractography Analysis Framework Using Spectral Embedding and Vision Transformers. (arXiv:2207.02327v1 [eess.IV])
    Diffusion MRI tractography is an advanced imaging technique for quantitative mapping of the brain's structural connectivity. Whole brain tractography (WBT) data contains hundreds of thousands of individual fiber streamlines (estimated brain connections), and this data is usually parcellated to create compact representations for data analysis applications such as disease classification. In this paper, we propose a novel parcellation-free WBT analysis framework, TractoFormer, that leverages tractography information at the level of individual fiber streamlines and provides a natural mechanism for interpretation of results using the attention mechanism of transformers. TractoFormer includes two main contributions. First, we propose a novel and simple 2D image representation of WBT, TractoEmbedding, to encode 3D fiber spatial relationships and any feature of interest that can be computed from individual fibers (such as FA or MD). Second, we design a network based on vision transformers (ViTs) that includes: 1) data augmentation to overcome model overfitting on small datasets, 2) identification of discriminative fibers for interpretation of results, and 3) ensemble learning to leverage fiber information from different brain regions. In a synthetic data experiment, TractoFormer successfully identifies discriminative fibers with simulated group differences. In a disease classification experiment comparing several methods, TractoFormer achieves the highest accuracy in classifying schizophrenia vs control. Discriminative fibers are identified in left hemispheric frontal and parietal superficial white matter regions, which have previously been shown to be affected in schizophrenia patients.
    Ordinal Regression via Binary Preference vs Simple Regression: Statistical and Experimental Perspectives. (arXiv:2207.02454v1 [cs.LG])
    Ordinal regression with anchored reference samples (ORARS) has been proposed for predicting the subjective Mean Opinion Score (MOS) of input stimuli automatically. ORARS addresses the MOS prediction problem by pairing a test sample with each of the pre-scored anchored reference samples. A trained binary classifier is then used to predict which sample, test or anchor, is statistically better. Posteriors of the binary preference decision are then used to predict the MOS of the test sample. In this paper, we present a rigorous framework, analysis, and experiments to demonstrate that ORARS is advantageous over simple regression. The contributions of this work are: 1) Show that traditional regression can be reformulated into multiple preference tests to yield better performance, which is confirmed experimentally with simulations; 2) Generalize ORARS to other regression problems and verify its effectiveness; 3) Provide some prerequisite conditions which can ensure proper application of ORARS.
    Effective and Efficient Training for Sequential Recommendation using Recency Sampling. (arXiv:2207.02643v1 [cs.IR])
    Many modern sequential recommender systems use deep neural networks, which can effectively estimate the relevance of items but require a lot of time to train. Slow training increases expenses, hinders product development timescales and prevents the model from being regularly updated to adapt to changing user preferences. Training such sequential models involves appropriately sampling past user interactions to create a realistic training objective. The existing training objectives have limitations. For instance, next item prediction never uses the beginning of the sequence as a learning target, thereby potentially discarding valuable data. On the other hand, the item masking used by BERT4Rec is only weakly related to the goal of the sequential recommendation; therefore, it requires much more time to obtain an effective model. Hence, we propose a novel Recency-based Sampling of Sequences training objective that addresses both limitations. We apply our method to various recent and state-of-the-art model architectures, such as GRU4Rec, Caser, and SASRec. We show that the models enhanced with our method can achieve performance exceeding or very close to that of the state-of-the-art BERT4Rec, but with much less training time.
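    The abstract does not spell out how targets are sampled, so the following is only a minimal sketch of the general idea of recency-weighted target sampling. The exponential weighting, the parameter `alpha`, and the function names `recency_weights` and `sample_targets` are assumptions made for illustration, not the paper's definitions:

```python
import random


def recency_weights(n, alpha=0.8):
    """Probability of choosing position k (0 = oldest) as a training target.

    ASSUMPTION: exponential weighting alpha ** (n - 1 - k), so the most
    recent interaction gets the largest probability, while every position,
    including the start of the sequence, keeps a non-zero chance.
    """
    raw = [alpha ** (n - 1 - k) for k in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_targets(sequence, num_targets, alpha=0.8, rng=None):
    """Draw target items from a user sequence, favouring recent ones.

    Sampling is with replacement, which keeps the sketch short.
    """
    rng = rng or random.Random()
    weights = recency_weights(len(sequence), alpha)
    return rng.choices(sequence, weights=weights, k=num_targets)


# Hypothetical interaction history, oldest item first.
history = ["item_a", "item_b", "item_c", "item_d", "item_e"]
print(sample_targets(history, num_targets=3, rng=random.Random(0)))
```

    Under this assumed weighting, `alpha` close to 1 approaches uniform sampling over the sequence, while `alpha` close to 0 approaches plain next-item prediction on the last interaction.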
    Tractable Dendritic RNNs for Reconstructing Nonlinear Dynamical Systems. (arXiv:2207.02542v1 [cs.LG])
    In many scientific disciplines, we are interested in inferring the nonlinear dynamical system underlying a set of observed time series, a challenging task in the face of chaotic behavior and noise. Previous deep learning approaches toward this goal often suffered from a lack of interpretability and tractability. In particular, the high-dimensional latent spaces often required for a faithful embedding, even when the underlying dynamics lives on a lower-dimensional manifold, can hamper theoretical analysis. Motivated by the emerging principles of dendritic computation, we augment a dynamically interpretable and mathematically tractable piecewise-linear (PL) recurrent neural network (RNN) by a linear spline basis expansion. We show that this approach retains all the theoretically appealing properties of the simple PLRNN, yet boosts its capacity for approximating arbitrary nonlinear dynamical systems in comparatively low dimensions. We employ two frameworks for training the system, one combining back-propagation-through-time (BPTT) with teacher forcing, and another based on fast and scalable variational inference. We show that the dendritically expanded PLRNN achieves better reconstructions with fewer parameters and dimensions on various dynamical systems benchmarks and compares favorably to other methods, while retaining a tractable and interpretable structure.
    Ensemble feature selection with clustering for analysis of high-dimensional, correlated clinical data in the search for Alzheimer's disease biomarkers. (arXiv:2207.02380v1 [cs.LG])
    Healthcare datasets often contain groups of highly correlated features, such as features from the same biological system. When feature selection is applied to these datasets to identify the most important features, the biases inherent in some multivariate feature selectors due to correlated features make it difficult for these methods to distinguish between the important and irrelevant features and the results of the feature selection process can be unstable. Feature selection ensembles, which aggregate the results of multiple individual base feature selectors, have been investigated as a means of stabilising feature selection results, but do not address the problem of correlated features. We present a novel framework to create feature selection ensembles from multivariate feature selectors while taking into account the biases produced by groups of correlated features, using agglomerative hierarchical clustering in a pre-processing step. These methods were applied to two real-world datasets from studies of Alzheimer's disease (AD), a progressive neurodegenerative disease that has no cure and is not yet fully understood. Our results show a marked improvement in the stability of features selected over the models without clustering, and the features selected by these models are in keeping with the findings in the AD literature.
    Strong Heuristics for Named Entity Linking. (arXiv:2207.02824v1 [cs.CL])
    Named entity linking (NEL) in news is a challenging endeavour due to the frequency of unseen and emerging entities, which necessitates the use of unsupervised or zero-shot methods. However, such methods tend to come with caveats, such as no integration of suitable knowledge bases (like Wikidata) for emerging entities, a lack of scalability, and poor interpretability. Here, we consider person disambiguation in Quotebank, a massive corpus of speaker-attributed quotations from the news, and investigate the suitability of intuitive, lightweight, and scalable heuristics for NEL in web-scale corpora. Our best performing heuristic disambiguates 94% and 63% of the mentions on Quotebank and the AIDA-CoNLL benchmark, respectively. Additionally, the proposed heuristics compare favourably to the state-of-the-art unsupervised and zero-shot methods, Eigenthemes and mGENRE, respectively, thereby serving as strong baselines for unsupervised and zero-shot entity linking.
    Rethinking the Importance of Sampling in Physics-informed Neural Networks. (arXiv:2207.02338v1 [cs.LG])
    Physics-informed neural networks (PINNs) have emerged as a powerful tool for solving partial differential equations (PDEs) in a variety of domains. While previous research in PINNs has mainly focused on constructing and balancing loss functions during training to avoid poor minima, the effect of sampling collocation points on the performance of PINNs has largely been overlooked. In this work, we find that the performance of PINNs can vary significantly with different sampling strategies, and using a fixed set of collocation points can be quite detrimental to the convergence of PINNs to the correct solution. In particular, (1) we hypothesize that training of PINNs relies on successful "propagation" of the solution from initial and/or boundary condition points to interior points, and PINNs with poor sampling strategies can get stuck at trivial solutions if there are propagation failures. (2) We demonstrate that propagation failures are characterized by highly imbalanced PDE residual fields where very high residuals are observed over very narrow regions. (3) To mitigate propagation failure, we propose a novel evolutionary sampling (Evo) method that can incrementally accumulate collocation points in regions of high PDE residuals. We further provide an extension of Evo to respect the principle of causality while solving time-dependent PDEs. We empirically demonstrate the efficacy and efficiency of our proposed methods in a variety of PDE problems.
    Quantitative Assessment of DESIS Hyperspectral Data for Plant Biodiversity Estimation in Australia. (arXiv:2207.02482v1 [cs.LG])
    Diversity of terrestrial plants plays a key role in maintaining a stable, healthy, and productive ecosystem. Though remote sensing has been seen as a promising and cost-effective proxy for estimating plant diversity, there is a lack of quantitative studies on how confidently plant diversity can be inferred from spaceborne hyperspectral data. In this study, we assessed the ability of hyperspectral data captured by the DLR Earth Sensing Imaging Spectrometer (DESIS) to estimate plant species richness in the Southern Tablelands and Snowy Mountains regions in southeast Australia. Spectral features were first extracted from DESIS spectra with principal component analysis, canonical correlation analysis, and partial least squares analysis. Then regression was conducted between the extracted features and plant species richness with ordinary least squares regression, kernel ridge regression, and Gaussian process regression. Results were assessed with the coefficient of correlation ($r$) and Root-Mean-Square Error (RMSE), based on a two-fold cross validation scheme. With the best performing model, $r$ is 0.71 and RMSE is 5.99 for the Southern Tablelands region, while $r$ is 0.62 and RMSE is 6.20 for the Snowy Mountains region. The assessment results reported in this study provide support for future studies on understanding the relationship between spaceborne hyperspectral measurements and terrestrial plant biodiversity.
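    The two evaluation metrics named in this abstract can be sketched as below. This assumes that the "coefficient of correlation" is the Pearson correlation, which the abstract does not state explicitly, and the species-richness numbers are invented for illustration:

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(observed, predicted):
    """Root-Mean-Square Error between observations and predictions."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


# Hypothetical species-richness counts per plot (not data from the paper).
observed = [12, 25, 31, 18, 40]
predicted = [15, 22, 35, 20, 33]
print(f"r = {pearson_r(observed, predicted):.2f}, RMSE = {rmse(observed, predicted):.2f}")
```

    The two metrics are complementary: $r$ measures how well predictions track the ordering and spread of the observations, while RMSE is expressed in the units of the target (here, number of species).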
    Cooperative Distribution Alignment via JSD Upper Bound. (arXiv:2207.02286v1 [cs.LG])
    Unsupervised distribution alignment estimates a transformation that maps two or more source distributions to a shared aligned distribution given only samples from each distribution. This task has many applications including generative modeling, unsupervised domain adaptation, and socially aware learning. Most prior works use adversarial learning (i.e., min-max optimization), which can be challenging to optimize and evaluate. A few recent works explore non-adversarial flow-based (i.e., invertible) approaches, but they lack a unified perspective and are limited in efficiently aligning multiple distributions. Therefore, we propose to unify and generalize previous flow-based approaches under a single non-adversarial framework, which we prove is equivalent to minimizing an upper bound on the Jensen-Shannon Divergence (JSD). Importantly, our problem reduces to a min-min, i.e., cooperative, problem and can provide a natural evaluation metric for unsupervised distribution alignment. We present empirical results of our framework on both simulated and real-world datasets to demonstrate the benefits of our approach.
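    As background for the quantity being bounded, the Jensen-Shannon Divergence between two discrete distributions can be computed as in the sketch below. This is the standard two-distribution, equal-weight definition; it is not the paper's upper bound or its flow-based alignment method, and the function names are illustrative:

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions; terms with p_i = 0 are skipped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon Divergence with equal weights on the two distributions."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Two hypothetical distributions over three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
print(f"JSD(p, q) = {jsd(p, q):.4f} nats")
```

    The JSD is symmetric, equals zero exactly when the two distributions coincide, and is at most $\log 2$ nats, which is what makes it a natural target for measuring how well distributions have been aligned.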
    Composite FORCE learning of chaotic echo state networks for time-series prediction. (arXiv:2207.02420v1 [cs.LG])
    Echo state network (ESN), a kind of recurrent neural network, consists of a fixed reservoir in which neurons are connected randomly and recursively and obtains the desired output only by training output connection weights. First-order reduced and controlled error (FORCE) learning is an online supervised training approach that can change the chaotic activity of ESNs into specified activity patterns. This paper proposes a composite FORCE learning method based on recursive least squares to train ESNs whose initial activity is spontaneously chaotic, where a composite learning technique featuring dynamic regressor extension and memory data exploitation is applied to enhance parameter convergence. The proposed method is applied to a benchmark problem of predicting chaotic time series generated by the Mackey-Glass system, and numerical results have shown that it significantly improves learning and prediction performance compared with existing methods.
    Private Matrix Approximation and Geometry of Unitary Orbits. (arXiv:2207.02794v1 [cs.DS])
    Consider the following optimization problem: Given $n \times n$ matrices $A$ and $\Lambda$, maximize $\langle A, U\Lambda U^*\rangle$ where $U$ varies over the unitary group $\mathrm{U}(n)$. This problem seeks to approximate $A$ by a matrix whose spectrum is the same as $\Lambda$ and, by setting $\Lambda$ to be appropriate diagonal matrices, one can recover matrix approximation problems such as PCA and rank-$k$ approximation. We study the problem of designing differentially private algorithms for this optimization problem in settings where the matrix $A$ is constructed using users' private data. We give efficient and private algorithms that come with upper and lower bounds on the approximation error. Our results unify and improve upon several prior works on private matrix approximation problems. They rely on extensions of packing/covering number bounds for Grassmannians to unitary orbits which should be of independent interest.
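A worked special case may clarify how the objective above recovers PCA. The following assumes, beyond what the abstract states, that $A$ and $\Lambda$ are Hermitian with eigenvalues sorted as $a_1 \ge \dots \ge a_n$ and $\lambda_1 \ge \dots \ge \lambda_n$, and that $\langle A, B\rangle = \mathrm{Tr}(A^* B)$. Under those assumptions the non-private optimum is the classical sorted-eigenvalue pairing:

```latex
\max_{U \in \mathrm{U}(n)} \langle A, U \Lambda U^* \rangle
  = \sum_{i=1}^{n} a_i \lambda_i ,
\qquad
\Lambda = \mathrm{diag}(\underbrace{1,\dots,1}_{k},0,\dots,0)
\;\Longrightarrow\;
\max_{U \in \mathrm{U}(n)} \langle A, U \Lambda U^* \rangle
  = \sum_{i=1}^{k} a_i .
```

The maximum is attained when $U$ maps the eigenvectors of $\Lambda$ onto those of $A$ in matching order, so the second case selects the top-$k$ eigenspace of $A$, which is the PCA objective.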
    Predicting is not Understanding: Recognizing and Addressing Underspecification in Machine Learning. (arXiv:2207.02598v1 [cs.LG])
    Machine learning (ML) models are typically optimized for their accuracy on a given dataset. However, this predictive criterion rarely captures all desirable properties of a model, in particular how well it matches a domain expert's understanding of a task. Underspecification refers to the existence of multiple models that are indistinguishable in their in-domain accuracy, even though they differ in other desirable properties such as out-of-distribution (OOD) performance. Identifying these situations is critical for assessing the reliability of ML models. We formalize the concept of underspecification and propose a method to identify and partially address it. We train multiple models with an independence constraint that forces them to implement different functions. They discover predictive features that are otherwise ignored by standard empirical risk minimization (ERM), which we then distill into a global model with superior OOD performance. Importantly, we constrain the models to align with the data manifold to ensure that they discover meaningful features. We demonstrate the method on multiple datasets in computer vision (collages, WILDS-Camelyon17, GQA) and discuss general implications of underspecification. Most notably, in-domain performance cannot serve for OOD model selection without additional assumptions.  ( 2 min )
    Unified Embeddings of Structural and Functional Connectome via a Function-Constrained Structural Graph Variational Auto-Encoder. (arXiv:2207.02328v1 [q-bio.NC])
    Graph theoretical analyses have become standard tools in modeling functional and anatomical connectivity in the brain. With the advent of connectomics, the primary graphs or networks of interest are structural connectome (derived from DTI tractography) and functional connectome (derived from resting-state fMRI). However, most published connectome studies have focused on either structural or functional connectome, yet complementary information between them, when available in the same dataset, can be jointly leveraged to improve our understanding of the brain. To this end, we propose a function-constrained structural graph variational autoencoder (FCS-GVAE) capable of incorporating information from both functional and structural connectome in an unsupervised fashion. This leads to a joint low-dimensional embedding that establishes a unified spatial coordinate system for comparing across different subjects. We evaluate our approach using the publicly available OASIS-3 Alzheimer's disease (AD) dataset and show that a variational formulation is necessary to optimally encode functional brain dynamics. Further, the proposed joint embedding approach can more accurately distinguish different patient sub-populations than approaches that do not use complementary connectome information.  ( 2 min )
    Multi-Contrast MRI Segmentation Trained on Synthetic Images. (arXiv:2207.02469v1 [eess.IV])
    In our comprehensive experiments and evaluations, we show that it is possible to generate multiple contrast (even all synthetically) and use synthetically generated images to train an image segmentation engine. We showed promising segmentation results tested on real multi-contrast MRI scans when delineating muscle, fat, bone and bone marrow, all trained on synthetic images. Based on synthetic image training, our segmentation results were as high as 93.91\%, 94.11\%, 91.63\%, 95.33\%, for muscle, fat, bone, and bone marrow delineation, respectively. Results were not significantly different from the ones obtained when real images were used for segmentation training: 94.68\%, 94.67\%, 95.91\%, and 96.82\%, respectively.  ( 2 min )
    When does SGD favor flat minima? A quantitative characterization via linear stability. (arXiv:2207.02628v1 [stat.ML])
    The observation that stochastic gradient descent (SGD) favors flat minima has played a fundamental role in understanding implicit regularization of SGD and guiding the tuning of hyperparameters. In this paper, we provide a quantitative explanation of this striking phenomenon by relating the particular noise structure of SGD to its \emph{linear stability} (Wu et al., 2018). Specifically, we consider training over-parameterized models with square loss. We prove that if a global minimum $\theta^*$ is linearly stable for SGD, then it must satisfy $\|H(\theta^*)\|_F\leq O(\sqrt{B}/\eta)$, where $\|H(\theta^*)\|_F, B,\eta$ denote the Frobenius norm of Hessian at $\theta^*$, batch size, and learning rate, respectively. Otherwise, SGD will escape from that minimum \emph{exponentially} fast. Hence, for minima accessible to SGD, the flatness -- as measured by the Frobenius norm of the Hessian -- is bounded independently of the model size and sample size. The key to obtaining these results is exploiting the particular geometry awareness of SGD noise: 1) the noise magnitude is proportional to loss value; 2) the noise directions concentrate in the sharp directions of local landscape. This property of SGD noise provably holds for linear networks and random feature models (RFMs) and is empirically verified for nonlinear networks. Moreover, the validity and practical relevance of our theoretical findings are justified by extensive numerical experiments.  ( 3 min )
    Compositional Generalization in Grounded Language Learning via Induced Model Sparsity. (arXiv:2207.02518v1 [cs.CL])
    We provide a study of how induced model sparsity can help achieve compositional generalization and better sample efficiency in grounded language learning problems. We consider simple language-conditioned navigation problems in a grid world environment with disentangled observations. We show that standard neural architectures do not always yield compositional generalization. To address this, we design an agent that contains a goal identification module that encourages sparse correlations between words in the instruction and attributes of objects, composing them together to find the goal. The output of the goal identification module is the input to a value iteration network planner. Our agent maintains a high level of performance on goals containing novel combinations of properties even when learning from a handful of demonstrations. We examine the internal representations of our agent and find the correct correspondences between words in its dictionary and attributes in the environment.  ( 2 min )
    Ultra-Low-Bitrate Speech Coding with Pretrained Transformers. (arXiv:2207.02262v1 [cs.SD])
    Speech coding facilitates the transmission of speech over low-bandwidth networks with minimal distortion. Neural-network based speech codecs have recently demonstrated significant improvements in quality over traditional approaches. While this new generation of codecs is capable of synthesizing high-fidelity speech, their use of recurrent or convolutional layers often restricts their effective receptive fields, which prevents them from compressing speech efficiently. We propose to further reduce the bitrate of neural speech codecs through the use of pretrained Transformers, capable of exploiting long-range dependencies in the input signal due to their inductive bias. As such, we use a pretrained Transformer in tandem with a convolutional encoder, which is trained end-to-end with a quantizer and a generative adversarial net decoder. Our numerical experiments show that supplementing the convolutional encoder of a neural speech codec with Transformer speech embeddings yields a speech codec with a bitrate of $600\,\mathrm{bps}$ that outperforms the original neural speech codec in synthesized speech quality when trained at the same bitrate. Subjective human evaluations suggest that the quality of the resulting codec is comparable or better than that of conventional codecs operating at three to four times the rate.  ( 2 min )
    voxel2vec: A Natural Language Processing Approach to Learning Distributed Representations for Scientific Data. (arXiv:2207.02565v1 [cs.LG])
    Relationships in scientific data, such as the numerical and spatial distribution relations of features in univariate data, the scalar-value combinations' relations in multivariate data, and the association of volumes in time-varying and ensemble data, are intricate and complex. This paper presents voxel2vec, a novel unsupervised representation learning model, which is used to learn distributed representations of scalar values/scalar-value combinations in a low-dimensional vector space. Its basic assumption is that if two scalar values/scalar-value combinations have similar contexts, they usually have high similarity in terms of features. By representing scalar values/scalar-value combinations as symbols, voxel2vec learns the similarity between them in the context of spatial distribution and then allows us to explore the overall association between volumes by transfer prediction. We demonstrate the usefulness and effectiveness of voxel2vec by comparing it with the isosurface similarity map of univariate data and applying the learned distributed representations to feature classification for multivariate data and to association analysis for time-varying and ensemble data.  ( 2 min )
    Query-Efficient Adversarial Attack Based on Latin Hypercube Sampling. (arXiv:2207.02391v1 [cs.CV])
    To be applicable in real-world scenarios, Boundary Attacks (BAs) were proposed; they ensure a one hundred percent attack success rate using only decision information. However, existing BA methods craft adversarial examples by leveraging simple random sampling (SRS) to estimate the gradient, consuming a large number of model queries. To overcome the drawback of SRS, this paper proposes a Latin Hypercube Sampling based Boundary Attack (LHS-BA) to save query budget. Compared with SRS, LHS has better uniformity under the same limited number of random samples. Therefore, the average on these random samples is closer to the true gradient than that estimated by SRS. Various experiments are conducted on benchmark datasets including MNIST, CIFAR, and ImageNet-1K. Experimental results demonstrate the superiority of the proposed LHS-BA over the state-of-the-art BA methods in terms of query efficiency. The source code is publicly available at https://github.com/GZHU-DVL/LHS-BA.  ( 2 min )
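The uniformity claim in the abstract above can be illustrated with a small sketch of the two sampling schemes on the unit hypercube. This is a generic textbook construction of Latin hypercube sampling, not the LHS-BA implementation from the linked repository; all function names are hypothetical, and the gradient-estimation step of the attack is not shown.

```python
import random

def simple_random_sample(n, dim, rng):
    """n points drawn independently and uniformly from [0, 1)^dim."""
    return [[rng.random() for _ in range(dim)] for _ in range(n)]

def latin_hypercube_sample(n, dim, rng):
    """n points in [0, 1)^dim such that, along every axis, each of the
    n equal-width strata contains exactly one point."""
    columns = []
    for _ in range(dim):
        # One draw inside each stratum [k/n, (k+1)/n), then shuffle the order
        # so that strata are paired randomly across dimensions.
        column = [(k + rng.random()) / n for k in range(n)]
        rng.shuffle(column)
        columns.append(column)
    return [[columns[d][i] for d in range(dim)] for i in range(n)]

def occupied_strata(points, axis, n):
    """Number of distinct strata (out of n) hit along one axis."""
    return len({min(int(p[axis] * n), n - 1) for p in points})
```

With `n` samples, `occupied_strata` returns `n` on every axis for the Latin hypercube sample, whereas simple random sampling typically leaves some strata empty; this per-axis coverage is the sense in which LHS is more uniform for the same budget.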
    Distillation to Enhance the Portability of Risk Models Across Institutions with Large Patient Claims Database. (arXiv:2207.02445v1 [cs.LG])
    Artificial intelligence, and particularly machine learning (ML), is increasingly developed and deployed to support healthcare in a variety of settings. However, clinical decision support (CDS) technologies based on ML need to be portable if they are to be adopted on a broad scale. In this respect, models developed at one institution should be reusable at another. Yet there are numerous examples of portability failure, particularly due to naive application of ML models. Portability failure can lead to suboptimal care and medical errors, which ultimately could prevent the adoption of ML-based CDS in practice. One specific healthcare challenge that could benefit from enhanced portability is the prediction of 30-day readmission risk. Research to date has shown that deep learning models can be effective at modeling such risk. In this work, we investigate the practicality of model portability through a cross-site evaluation of readmission prediction models. To do so, we apply a recurrent neural network, augmented with self-attention and blended with expert features, to build readmission prediction models for two independent large scale claims datasets. We further present a novel transfer learning technique that adapts the well-known method of born-again network (BAN) training. Our experiments show that direct application of ML models trained at one institution and tested at another institution perform worse than models trained and tested at the same institution. We further show that the transfer learning approach based on the BAN produces models that are better than those trained on just a single institution's data. Notably, this improvement is consistent across both sites and occurs after a single retraining, which illustrates the potential for a cheap and general model transfer mechanism of readmission risk prediction.  ( 3 min )
    Generalization to translation shifts: a study in architectures and augmentations. (arXiv:2207.02349v1 [cs.CV])
    We provide a detailed evaluation of various image classification architectures (convolutional, vision transformer, and fully connected MLP networks) and data augmentation techniques towards generalization to large spatial translation shifts. We make the following observations: (a) In the absence of data augmentation, all architectures, including convolutional networks, suffer degradation in performance when evaluated on translated test distributions. Understandably, both the in-distribution accuracy and the degradation under shifts are significantly worse for non-convolutional architectures. (b) Across all architectures, even a minimal augmentation of $4$ pixel random crop improves the robustness of performance to much larger magnitude shifts of up to $1/4$ of image size ($8$-$16$ pixels) in the test data -- suggesting a form of meta generalization from augmentation. For non-convolutional architectures, while the absolute accuracy is still low, we see dramatic improvements in robustness to large translation shifts. (c) With a sufficiently advanced augmentation pipeline ($4$ pixel crop+RandAugmentation+Erasing+MixUp), all architectures can be trained to have competitive performance, both in terms of in-distribution accuracy and generalization to large translation shifts.  ( 2 min )
    Improving Trustworthiness of AI Disease Severity Rating in Medical Imaging with Ordinal Conformal Prediction Sets. (arXiv:2207.02238v1 [cs.LG])
    The regulatory approval and broad clinical deployment of medical AI have been hampered by the perception that deep learning models fail in unpredictable and possibly catastrophic ways. A lack of statistically rigorous uncertainty quantification is a significant factor undermining trust in AI results. Recent developments in distribution-free uncertainty quantification present practical solutions for these issues by providing reliability guarantees for black-box models on arbitrary data distributions as formally valid finite-sample prediction intervals. Our work applies these new uncertainty quantification methods -- specifically conformal prediction -- to a deep-learning model for grading the severity of spinal stenosis in lumbar spine MRI. We demonstrate a technique for forming ordinal prediction sets that are guaranteed to contain the correct stenosis severity within a user-defined probability (confidence interval). On a dataset of 409 MRI exams processed by the deep-learning model, the conformal method provides tight coverage with small prediction set sizes. Furthermore, we explore the potential clinical applicability of flagging cases with high uncertainty predictions (large prediction sets) by quantifying an increase in the prevalence of significant imaging abnormalities (e.g. motion artifacts, metallic artifacts, and tumors) that could degrade confidence in predictive performance when compared to a random sample of cases.  ( 2 min )
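The idea of ordinal prediction sets with a user-defined coverage level can be sketched with split conformal prediction. This is a simplified illustration under stated assumptions, not the authors' procedure: it assumes the model outputs a probability per severity grade, grows a contiguous interval of grades outward from the most likely one, and calibrates a threshold on held-out labelled exams. All function names are hypothetical.

```python
import math

def inclusion_scores(probs):
    """For each grade, the probability mass of the contiguous interval (grown
    outward from the most likely grade, always adding the likelier neighbour)
    at the moment that grade first enters the interval."""
    k = len(probs)
    lo = hi = max(range(k), key=lambda i: probs[i])
    total = probs[lo]
    scores = [0.0] * k
    scores[lo] = total
    while lo > 0 or hi < k - 1:
        left = probs[lo - 1] if lo > 0 else -1.0
        right = probs[hi + 1] if hi < k - 1 else -1.0
        if left >= right:
            lo -= 1
            total += left
            scores[lo] = total
        else:
            hi += 1
            total += right
            scores[hi] = total
    return scores

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal quantile of the calibration scores for miscoverage alpha."""
    scores = sorted(inclusion_scores(p)[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    # Too few calibration points for this alpha: fall back to the full set.
    return scores[rank - 1] if rank <= n else float("inf")

def prediction_set(probs, threshold):
    """Contiguous set of grades whose inclusion score is within the threshold."""
    return [g for g, s in enumerate(inclusion_scores(probs)) if s <= threshold]
```

If calibration and test exams are exchangeable, sets built this way contain the true grade with probability at least `1 - alpha` on average; a very confident prediction can yield an empty set in this simplified variant, which a practical system would need to handle. Large sets are the high-uncertainty cases the abstract proposes to flag.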
    Putting the Con in Context: Identifying Deceptive Actors in the Game of Mafia. (arXiv:2207.02253v1 [cs.CL])
    While neural networks demonstrate a remarkable ability to model linguistic content, capturing contextual information related to a speaker's conversational role is an open area of research. In this work, we analyze the effect of speaker role on language use through the game of Mafia, in which participants are assigned either an honest or a deceptive role. In addition to building a framework to collect a dataset of Mafia game records, we demonstrate that there are differences in the language produced by players with different roles. We confirm that classification models are able to rank deceptive players as more suspicious than honest ones based only on their use of language. Furthermore, we show that training models on two auxiliary tasks outperforms a standard BERT-based text classification approach. We also present methods for using our trained models to identify features that distinguish between player roles, which could be used to assist players during the Mafia game.  ( 2 min )
    Information Compression and Performance Evaluation of Tic-Tac-Toe's Evaluation Function Using Singular Value Decomposition. (arXiv:2207.02449v1 [cs.LG])
    We approximated the evaluation function for the game Tic-Tac-Toe by singular value decomposition (SVD) and investigated the effect of approximation accuracy on winning rate. We first prepared the perfect evaluation function of Tic-Tac-Toe and performed low-rank approximation by considering the evaluation function as a ninth-order tensor. We found that we can reduce the amount of information of the evaluation function by 70% without significantly degrading the performance. Approximation accuracy and winning rate were strongly correlated but not perfectly proportional. We also investigated how the decomposition method of the evaluation function affects the performance. We considered two decomposition methods: simple SVD regarding the evaluation function as a matrix and the Tucker decomposition by higher-order SVD (HOSVD). At the same compression ratio, the strategy with the approximated evaluation function obtained by HOSVD exhibited a significantly higher winning rate than that obtained by SVD. These results suggest that SVD can effectively compress board game strategies and an optimal compression method that depends on the game exists.  ( 2 min )
    Many-body localized hidden Born machine. (arXiv:2207.02346v1 [quant-ph])
    Born Machines are quantum-inspired generative models that leverage the probabilistic nature of quantum states. Here, we present a new architecture called many-body localized (MBL) hidden Born machine that uses both MBL dynamics and hidden units as learning resources. We theoretically prove that MBL Born machines possess more expressive power than classical models, and the introduction of hidden units boosts its learning power. We numerically demonstrate that the MBL hidden Born machine is capable of learning a toy dataset consisting of patterns of MNIST handwritten digits, quantum data obtained from quantum many-body states, and non-local parity data. In order to understand the mechanism behind learning, we track physical quantities such as von Neumann entanglement entropy and Hamming distance during learning, and compare the learning outcomes in the MBL, thermal, and Anderson localized phases. We show that the superior learning power of the MBL phase relies importantly on both localization and interaction. Our architecture and algorithm provide novel strategies of utilizing quantum many-body systems as learning resources, and reveal a powerful connection between disorder, interaction, and learning in quantum systems.  ( 2 min )
    OpenLDN: Learning to Discover Novel Classes for Open-World Semi-Supervised Learning. (arXiv:2207.02261v1 [cs.CV])
    Semi-supervised learning (SSL) is one of the dominant approaches to address the annotation bottleneck of supervised learning. Recent SSL methods can effectively leverage a large repository of unlabeled data to improve performance while relying on a small set of labeled data. One common assumption in most SSL methods is that the labeled and unlabeled data are from the same underlying data distribution. However, this is hardly the case in many real-world scenarios, which limits their applicability. In this work, instead, we attempt to solve the recently proposed challenging open-world SSL problem that does not make such an assumption. In the open-world SSL problem, the objective is to recognize samples of known classes, and simultaneously detect and cluster samples belonging to novel classes present in unlabeled data. This work introduces OpenLDN that utilizes a pairwise similarity loss to discover novel classes. Using a bi-level optimization rule this pairwise similarity loss exploits the information available in the labeled set to implicitly cluster novel class samples, while simultaneously recognizing samples from known classes. After discovering novel classes, OpenLDN transforms the open-world SSL problem into a standard SSL problem to achieve additional performance gains using existing SSL methods. Our extensive experiments demonstrate that OpenLDN outperforms the current state-of-the-art methods on multiple popular classification benchmarks while providing a better accuracy/training time trade-off.  ( 3 min )
    GAMa: Cross-view Video Geo-localization. (arXiv:2207.02431v1 [cs.CV])
    The existing work in cross-view geo-localization is based on images where a ground panorama is matched to an aerial image. In this work, we focus on ground videos instead of images, which provide additional contextual cues that are important for this task. There are no existing datasets for this problem; therefore, we propose the GAMa dataset, a large-scale dataset with ground videos and corresponding aerial images. We also propose a novel approach to solve this problem. At clip-level, a short video clip is matched with the corresponding aerial image and is later used to get video-level geo-localization of a long video. Moreover, we propose a hierarchical approach to further improve the clip-level geo-localization. It is a challenging dataset, unaligned and with a limited field of view, and our proposed method achieves a Top-1 recall rate of 19.4% and 45.1% @1.0mile. Code and dataset are available at the following link: https://github.com/svyas23/GAMa.  ( 2 min )
    Guiding Machine Perception with Psychophysics. (arXiv:2207.02241v1 [cs.CV])
    Gustav Fechner's 1860 delineation of psychophysics, the measurement of sensation in relation to its stimulus, is widely considered to be the advent of modern psychological science. In psychophysics, a researcher parametrically varies some aspects of a stimulus, and measures the resulting changes in a human subject's experience of that stimulus; doing so gives insight into the determining relationship between a sensation and the physical input that evoked it. This approach is used heavily in perceptual domains, including signal detection, threshold measurement, and ideal observer analysis. Scientific fields like vision science have always leaned heavily on the methods and procedures of psychophysics, but there is now growing appreciation of them by machine learning researchers, sparked by widening overlap between biological and artificial perception \cite{rojas2011automatic, scheirer2014perceptual,escalera2014chalearn,zhang2018agil, grieggs2021measuring}. Machine perception that is guided by behavioral measurements, as opposed to guidance restricted to arbitrarily assigned human labels, has significant potential to fuel further progress in artificial intelligence.  ( 2 min )
    EEPT: Early Discovery of Emerging Entities in Twitter with Semantic Similarity. (arXiv:2207.02434v1 [cs.CL])
    Some events that will happen in the future could be important for companies, governments, and even our personal lives. Predicting these events before they become established is helpful for efficient decision-making. We call such events emerging entities. They have not taken place yet, and there is no information about them in KB. However, some clues exist in different areas, especially on social media. Thus, retrieving these types of entities is possible. This paper proposes a method for the early discovery of emerging entities. We use semantic clustering of short messages. To evaluate the performance of our proposal, we devise and utilize a performance evaluation metric. The results show that our proposed method finds emerging entities that Twitter trends are not always capable of finding.  ( 2 min )
    Transfer Learning for Rapid Extraction of Thickness from Optical Spectra of Semiconductor Thin Films. (arXiv:2207.02209v1 [cs.LG])
    High-throughput experimentation with autonomous workflows, increasingly used to screen and optimize optoelectronic thin films, requires matching throughput of downstream characterizations. Despite being essential, thickness characterization lags in throughput. Although optical spectroscopic methods, e.g., spectrophotometry, provide quick measurements, a critical bottleneck is the ensuing manual fitting of optical oscillation models to the measured reflection and transmission. This study presents a machine-learning (ML) framework called thicknessML, which rapidly extracts film thickness from spectroscopic reflection and transmission. thicknessML leverages transfer learning to generalize to materials of different underlying optical oscillator models (i.e., different material classes). We demonstrate that thicknessML can extract film thickness from six perovskite samples in a two-stage process: (1) pre-training on a generic simulated dataset of Tauc-Lorentz oscillator, and (2) transfer learning to a simulated perovskite dataset of several literature perovskite refractive indices. Results show a pre-training thickness mean absolute percentage error (MAPE) of 5-7% and an experimental thickness MAPE of 6-19%.  ( 2 min )
    Learning Task Embeddings for Teamwork Adaptation in Multi-Agent Reinforcement Learning. (arXiv:2207.02249v1 [cs.MA])
    Successful deployment of multi-agent reinforcement learning often requires agents to adapt their behaviour. In this work, we discuss the problem of teamwork adaptation in which a team of agents needs to adapt their policies to solve novel tasks with limited fine-tuning. Motivated by the intuition that agents need to be able to identify and distinguish tasks in order to adapt their behaviour to the current task, we propose to learn multi-agent task embeddings (MATE). These task embeddings are trained using an encoder-decoder architecture optimised for reconstruction of the transition and reward functions which uniquely identify tasks. We show that a team of agents is able to adapt to novel tasks when provided with task embeddings. We propose three MATE training paradigms: independent MATE, centralised MATE, and mixed MATE which vary in the information used for the task encoding. We show that the embeddings learned by MATE identify tasks and provide useful information which agents leverage during adaptation to novel tasks.  ( 2 min )
    Linear Jamming Bandits: Sample-Efficient Learning for Non-Coherent Digital Jamming. (arXiv:2207.02365v1 [cs.LG])
    It has been shown (Amuru et al. 2015) that online learning algorithms can be effectively used to select optimal physical layer parameters for jamming against digital modulation schemes without a priori knowledge of the victim's transmission strategy. However, this learning problem involves solving a multi-armed bandit problem with a mixed action space that can grow very large. As a result, convergence to the optimal jamming strategy can be slow, especially when the victim and jammer's symbols are not perfectly synchronized. In this work, we remedy the sample efficiency issues by introducing a linear bandit algorithm that accounts for inherent similarities between actions. Further, we propose context features which are well-suited for the statistical features of the non-coherent jamming problem and demonstrate significantly improved convergence behavior compared to the prior art. Additionally, we show how prior knowledge about the victim's transmissions can be seamlessly integrated into the learning framework. We finally discuss limitations in the asymptotic regime.  ( 2 min )
    Multi-Label Retinal Disease Classification using Transformers. (arXiv:2207.02335v1 [cs.CV])
    Early detection of retinal diseases is one of the most important means of preventing partial or permanent blindness in patients. In this research, a novel multi-label classification system is proposed for the detection of multiple retinal diseases, using fundus images collected from a variety of sources. First, a new multi-label retinal disease dataset, the MuReD dataset, is constructed, using a number of publicly available datasets for fundus disease classification. Next, a sequence of post-processing steps is applied to ensure the quality of the image data and the range of diseases, present in the dataset. For the first time in fundus multi-label disease classification, a transformer-based model optimized through extensive experimentation is used for image analysis and decision making. Numerous experiments are performed to optimize the configuration of the proposed system. It is shown that the approach performs better than state-of-the-art works on the same task by 7.9% and 8.1% in terms of AUC score for disease detection and disease classification, respectively. The obtained results further support the potential applications of transformer-based architectures in the medical imaging field.  ( 3 min )
    BioTABQA: Instruction Learning for Biomedical Table Question Answering. (arXiv:2207.02419v1 [cs.CL])
    Table Question Answering (TQA) is an important but under-explored task. Most of the existing QA datasets are in unstructured text format and only a few of them use tables as the context. To the best of our knowledge, no TQA datasets exist in the biomedical domain, where tables are frequently used to present information. In this paper, we first curate a table question answering dataset, BioTABQA, using 22 templates and the context from a biomedical textbook on differential diagnosis. BioTABQA can be used not only to teach a model how to answer questions from tables but also to evaluate how a model generalizes to unseen questions, an important scenario for biomedical applications. To achieve the generalization evaluation, we divide the templates into 17 training and 5 cross-task evaluations. Then, we develop two baselines using single- and multi-task learning on BioTABQA. Furthermore, we explore instructional learning, a recent technique showing impressive generalizing performance. Experimental results show that our instruction-tuned model outperforms single and multi-task baselines on average by ~23% and ~6% across various evaluation settings, and more importantly, the instruction-tuned model outperforms baselines by ~5% on cross-tasks.  ( 2 min )
    Federated and Transfer Learning: A Survey on Adversaries and Defense Mechanisms. (arXiv:2207.02337v1 [cs.LG])
    The advent of federated learning has facilitated large-scale data exchange amongst machine learning models while maintaining privacy. Despite its brief history, federated learning is rapidly evolving to make wider use more practical. One of the most significant advancements in this domain is the incorporation of transfer learning into federated learning, which overcomes fundamental constraints of primary federated learning, particularly in terms of security. This chapter performs a comprehensive survey on the intersection of federated and transfer learning from a security point of view. The main goal of this study is to uncover potential vulnerabilities that might compromise the privacy and performance of systems that use federated and transfer learning, together with the corresponding defense mechanisms.  ( 2 min )
    Towards Realistic Semi-Supervised Learning. (arXiv:2207.02269v1 [cs.CV])
    Deep learning is pushing the state-of-the-art in many computer vision applications. However, it relies on large annotated data repositories, and capturing the unconstrained nature of the real-world data is yet to be solved. Semi-supervised learning (SSL) complements the annotated training data with a large corpus of unlabeled data to reduce annotation cost. The standard SSL approach assumes unlabeled data are from the same distribution as annotated data. Recently, ORCA [9] introduced a more realistic SSL problem, called open-world SSL, by assuming that the unannotated data might contain samples from unknown classes. This work proposes a novel approach to tackle SSL in the open-world setting, where we simultaneously learn to classify known and unknown classes. At the core of our method, we utilize sample uncertainty and incorporate prior knowledge about class distribution to generate reliable pseudo-labels for unlabeled data belonging to both known and unknown classes. Our extensive experimentation showcases the effectiveness of our approach on several benchmark datasets, where it substantially outperforms the existing state-of-the-art on seven diverse datasets including CIFAR-100 (17.6%), ImageNet-100 (5.7%), and Tiny ImageNet (9.9%).  ( 2 min )
    Swin Deformable Attention U-Net Transformer (SDAUT) for Explainable Fast MRI. (arXiv:2207.02390v1 [cs.CV])
    Fast MRI aims to reconstruct a high fidelity image from partially observed measurements. Exuberant development in fast MRI using deep learning has been witnessed recently. Meanwhile, novel deep learning paradigms, e.g., Transformer based models, are fast-growing in natural language processing and have been promptly developed for computer vision and medical image analysis due to their prominent performance. Nevertheless, due to the complexity of the Transformer, applying it to fast MRI may not be straightforward. The main obstacle is that the computational cost of the self-attention layer, which is the core part of the Transformer, can be expensive for high resolution MRI inputs. In this study, we propose a new Transformer architecture for solving fast MRI that couples the Shifted Windows Transformer with U-Net to reduce the network complexity. We incorporate deformable attention to construe the explainability of our reconstruction model. We empirically demonstrate that our method achieves consistently superior performance on the fast MRI task. Besides, compared to state-of-the-art Transformer models, our method has fewer network parameters while revealing explainability. The code is publicly available at https://github.com/ayanglab/SDAUT.  ( 2 min )
    Transformers are Adaptable Task Planners. (arXiv:2207.02442v1 [cs.RO])
    Every home is different, and every person likes things done in their particular way. Therefore, home robots of the future need to both reason about the sequential nature of day-to-day tasks and generalize to users' preferences. To this end, we propose a Transformer Task Planner (TTP) that learns high-level actions from demonstrations by leveraging object attribute-based representations. TTP can be pre-trained on multiple preferences and shows generalization to unseen preferences using a single demonstration as a prompt in a simulated dishwasher loading task. Further, we demonstrate real-world dish rearrangement using TTP with a Franka Panda robotic arm, prompted using a single human demonstration.  ( 2 min )
    State-Augmented Learnable Algorithms for Resource Management in Wireless Networks. (arXiv:2207.02242v1 [cs.LG])
    We consider resource management problems in multi-user wireless networks, which can be cast as optimizing a network-wide utility function, subject to constraints on the long-term average performance of users across the network. We propose a state-augmented algorithm for solving the aforementioned radio resource management (RRM) problems, where, alongside the instantaneous network state, the RRM policy takes as input the set of dual variables corresponding to the constraints, which evolve depending on how much the constraints are violated during execution. We theoretically show that the proposed state-augmented algorithm leads to feasible and near-optimal RRM decisions. Moreover, focusing on the problem of wireless power control using graph neural network (GNN) parameterizations, we demonstrate the superiority of the proposed RRM algorithm over baseline methods across a suite of numerical experiments.  ( 2 min )
  • Open

    Stochastic normalizing flows as non-equilibrium transformations. (arXiv:2201.08862v3 [hep-lat] UPDATED)
    Normalizing flows are a class of deep generative models that provide a promising route to sample lattice field theories more efficiently than conventional Monte Carlo simulations. In this work we show that the theoretical framework of stochastic normalizing flows, in which neural-network layers are combined with Monte Carlo updates, is the same that underlies out-of-equilibrium simulations based on Jarzynski's equality, which have been recently deployed to compute free-energy differences in lattice gauge theories. We lay out a strategy to optimize the efficiency of this extended class of generative models and present examples of applications.
    Distributional neural networks for electricity price forecasting. (arXiv:2207.02832v1 [q-fin.ST])
    We present a novel approach to probabilistic electricity price forecasting (EPF) which utilizes distributional artificial neural networks. The novel network structure for EPF is based on a regularized distributional multilayer perceptron (DMLP) which contains a probability layer. Using the TensorFlow Probability framework, the neural network's output is defined to be a distribution, either normal or potentially skewed and heavy-tailed Johnson's SU (JSU). The method is compared against state-of-the-art benchmarks in a forecasting study. The study comprises forecasting involving day-ahead electricity prices in the German market. The results show evidence of the importance of higher moments when modeling electricity prices.
    Evaluating Robustness to Dataset Shift via Parametric Robustness Sets. (arXiv:2205.15947v2 [cs.LG] UPDATED)
    We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. To ensure that these shifts are plausible, we parameterize them in terms of interpretable changes in causal mechanisms of observed variables. This defines a parametric robustness set of plausible distributions and a corresponding worst-case loss. While the loss under an individual parametric shift can be estimated via reweighting techniques such as importance sampling, the resulting worst-case optimization problem is non-convex, and the estimate may suffer from large variance. For small shifts, however, we can construct a local second-order approximation to the loss under shift and cast the problem of finding a worst-case shift as a particular non-convex quadratic optimization problem, for which efficient algorithms are available. We demonstrate that this second-order approximation can be estimated directly for shifts in conditional exponential family models, and we bound the approximation error. We apply our approach to a computer vision task (classifying gender from images), revealing sensitivity to shifts in non-causal attributes.
    Epistemic Neural Networks. (arXiv:2107.08924v5 [cs.LG] UPDATED)
    Intelligence relies on an agent's knowledge of what it does not know. This capability can be assessed based on the quality of joint predictions of labels across multiple inputs. Conventional neural networks lack this capability and, since most research has focused on marginal predictions, this shortcoming has been largely overlooked. We introduce the epistemic neural network (ENN) as an interface for models that represent uncertainty as required to generate useful joint predictions. While prior approaches to uncertainty modeling such as Bayesian neural networks can be expressed as ENNs, this new interface facilitates comparison of joint predictions and the design of novel architectures and algorithms. In particular, we introduce the epinet: an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty. With an epinet, conventional neural networks outperform very large ensembles, consisting of hundreds or more particles, with orders of magnitude less computation. We demonstrate this efficacy across synthetic data, ImageNet, and some reinforcement learning tasks. As part of this effort we open-source experiment code.
    Variational Flow Graphical Model. (arXiv:2207.02722v1 [stat.ML])
    This paper introduces a novel approach to embed flow-based models with hierarchical structures. The proposed framework is named Variational Flow Graphical (VFG) Model. VFGs learn the representation of high dimensional data via a message-passing scheme by integrating flow-based functions through variational inference. By leveraging the expressive power of neural networks, VFGs produce a representation of the data using a lower dimension, thus overcoming the drawbacks of many flow-based models, which usually require a high dimensional latent space involving many trivial variables. Aggregation nodes are introduced in the VFG models to integrate forward-backward hierarchical information via a message passing scheme. Maximizing the evidence lower bound (ELBO) of data likelihood aligns the forward and backward messages in each aggregation node, achieving a consistent node state. Algorithms have been developed to learn model parameters through gradient updates on the ELBO objective. The consistency of aggregation nodes enables VFGs to be applicable in tractable inference on graphical structures. Besides representation learning and numerical inference, VFGs provide a new approach for distribution modeling on datasets with graphical latent structures. Additionally, theoretical study shows that VFGs are universal approximators by leveraging the implicitly invertible flow-based structures. With flexible graphical structures and superior expressive power, VFGs could potentially be used to improve probabilistic inference. In the experiments, VFGs achieve improved evidence lower bound (ELBO) and likelihood values on multiple datasets.
    Improved conformalized quantile regression. (arXiv:2207.02808v1 [stat.ML])
    Conformalized quantile regression is a procedure that inherits the advantages of conformal prediction and quantile regression. That is, we use quantile regression to estimate the true conditional quantile and then apply a conformal step on a calibration set to ensure marginal coverage. In this way, we get adaptive prediction intervals that account for heteroscedasticity. However, the aforementioned conformal step lacks adaptiveness as described in (Romano et al., 2019). To overcome this limitation, instead of applying a single conformal step after estimating conditional quantiles with quantile regression, we propose to cluster the explanatory variables weighted by their permutation importance with an optimized k-means and apply k conformal steps. To show that this improved version outperforms the classic version of conformalized quantile regression and is more adaptive to heteroscedasticity, we extensively compare the prediction intervals of both in open datasets.
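    The single conformal step that the abstract above starts from can be sketched as follows. This is a minimal NumPy illustration of classic conformalized quantile regression, not the clustered variant the paper proposes; the function names `cqr_offset` and `cqr_interval` and the toy calibration data are assumptions made for this example, and the quantile predictions are taken as given rather than fitted.

```python
import math
import numpy as np

def cqr_offset(q_lo_cal, q_hi_cal, y_cal, alpha):
    """Conformal correction computed on a calibration set (hypothetical helper)."""
    # Conformity score: distance by which y falls outside [q_lo, q_hi];
    # it is negative when y lies strictly inside the predicted interval.
    scores = np.maximum(q_lo_cal - y_cal, y_cal - q_hi_cal)
    n = len(scores)
    # Rank of the finite-sample-corrected (1 - alpha) empirical quantile.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return float("inf")
    return float(np.sort(scores)[k - 1])

def cqr_interval(q_lo, q_hi, offset):
    """Widen (or shrink, if offset < 0) the quantile interval by the offset."""
    return q_lo - offset, q_hi + offset

# Toy calibration data (assumed): constant quantile predictions 0 and 1.
q_lo_cal = np.zeros(7)
q_hi_cal = np.ones(7)
y_cal = np.array([0.5, 1.25, 1.5, -0.25, 0.25, 2.0, 0.75])

offset = cqr_offset(q_lo_cal, q_hi_cal, y_cal, alpha=0.25)
lo, hi = cqr_interval(0.0, 1.0, offset)
```

    On one reading of the abstract, the proposed improvement would compute such an offset separately within each of the k clusters instead of once for the whole calibration set; that reading is an assumption, and the clustering itself is not shown here.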
    Topological Information Retrieval with Dilation-Invariant Bottleneck Comparative Measures. (arXiv:2104.01672v3 [stat.ML] UPDATED)
    Appropriately representing elements in a database so that queries may be accurately matched is a central task in information retrieval; recently, this has been achieved by embedding the graphical structure of the database into a manifold in a hierarchy-preserving manner using a variety of metrics. Persistent homology is a tool commonly used in topological data analysis that is able to rigorously characterize a database in terms of both its hierarchy and connectivity structure. Computing persistent homology on a variety of embedded datasets reveals that some commonly used embeddings fail to preserve the connectivity. We show that those embeddings which successfully retain the database topology coincide in persistent homology by introducing two dilation-invariant comparative measures to capture this effect: in particular, they address the issue of metric distortion on manifolds. We provide an algorithm for their computation that exhibits greatly reduced time complexity over existing methods. We use these measures to perform the first instance of topology-based information retrieval and demonstrate its increased performance over the standard bottleneck distance for persistent homology. We showcase our approach on databases of different data varieties including text, videos, and medical images.
    Don't Pay Attention to the Noise: Learning Self-supervised Representations of Light Curves with a Denoising Time Series Transformer. (arXiv:2207.02777v1 [astro-ph.IM])
    Astrophysical light curves are particularly challenging data objects due to the intensity and variety of noise contaminating them. Yet, despite the astronomical volumes of light curves available, the majority of algorithms used to process them are still operating on a per-sample basis. To remedy this, we propose a simple Transformer model -- called Denoising Time Series Transformer (DTST) -- and show that it excels at removing the noise and outliers in datasets of time series when trained with a masked objective, even when no clean targets are available. Moreover, the use of self-attention enables rich and illustrative queries into the learned representations. We present experiments on real stellar light curves from the Transiting Exoplanet Space Satellite (TESS), showing advantages of our approach compared to traditional denoising techniques.
    Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via Online Experiment Design. (arXiv:2207.02575v1 [cs.LG])
    While much progress has been made in understanding the minimax sample complexity of reinforcement learning (RL) -- the complexity of learning on the "worst-case" instance -- such measures of complexity often do not capture the true difficulty of learning. In practice, on an "easy" instance, we might hope to achieve a complexity far better than that achievable on the worst-case instance. In this work we seek to understand the "instance-dependent" complexity of learning near-optimal policies (PAC RL) in the setting of RL with linear function approximation. We propose an algorithm, Pedel, which achieves a fine-grained instance-dependent measure of complexity, the first of its kind in the RL with function approximation setting, thereby capturing the difficulty of learning on each particular problem instance. Through an explicit example, we show that Pedel yields provable gains over low-regret, minimax-optimal algorithms and that such algorithms are unable to hit the instance-optimal rate. Our approach relies on a novel online experiment design-based procedure which focuses the exploration budget on the "directions" most relevant to learning a near-optimal policy, and may be of independent interest.
    PAC Prediction Sets for Meta-Learning. (arXiv:2207.02440v1 [cs.LG])
    Uncertainty quantification is a key component of machine learning models targeted at safety-critical systems such as in healthcare or autonomous vehicles. We study this problem in the context of meta learning, where the goal is to quickly adapt a predictor to new tasks. In particular, we propose a novel algorithm to construct PAC prediction sets, which capture uncertainty via sets of labels, that can be adapted to new tasks with only a few training examples. These prediction sets satisfy an extension of the typical PAC guarantee to the meta learning setting; in particular, the PAC guarantee holds with high probability over future tasks. We demonstrate the efficacy of our approach on four datasets across three application domains: mini-ImageNet and CIFAR10-C in the visual domain, FewRel in the language domain, and the CDC Heart Dataset in the medical domain. In particular, our prediction sets satisfy the PAC guarantee while having smaller size compared to other baselines that also satisfy this guarantee.
    When does SGD favor flat minima? A quantitative characterization via linear stability. (arXiv:2207.02628v1 [stat.ML])
    The observation that stochastic gradient descent (SGD) favors flat minima has played a fundamental role in understanding implicit regularization of SGD and guiding the tuning of hyperparameters. In this paper, we provide a quantitative explanation of this striking phenomenon by relating the particular noise structure of SGD to its linear stability (Wu et al., 2018). Specifically, we consider training over-parameterized models with square loss. We prove that if a global minimum $\theta^*$ is linearly stable for SGD, then it must satisfy $\|H(\theta^*)\|_F\leq O(\sqrt{B}/\eta)$, where $\|H(\theta^*)\|_F$, $B$, and $\eta$ denote the Frobenius norm of the Hessian at $\theta^*$, the batch size, and the learning rate, respectively. Otherwise, SGD will escape from that minimum exponentially fast. Hence, for minima accessible to SGD, the flatness -- as measured by the Frobenius norm of the Hessian -- is bounded independently of the model size and sample size. The key to obtaining these results is exploiting the particular geometry awareness of SGD noise: 1) the noise magnitude is proportional to loss value; 2) the noise directions concentrate in the sharp directions of local landscape. This property of SGD noise provably holds for linear networks and random feature models (RFMs) and is empirically verified for nonlinear networks. Moreover, the validity and practical relevance of our theoretical findings are justified by extensive numerical experiments.
    Adaptive deep learning for nonparametric time series regression. (arXiv:2207.02546v1 [math.ST])
    In this paper, we develop a general theory for adaptive nonparametric estimation of mean functions of nonstationary and nonlinear time series using deep neural networks (DNNs). We first consider two types of DNN estimators, non-penalized and sparse-penalized DNN estimators, and establish their generalization error bounds for general nonstationary time series. We then derive minimax lower bounds for estimating mean functions belonging to a wide class of nonlinear autoregressive (AR) models that include nonlinear generalized additive AR, single index, and threshold AR models. Building upon the results, we show that the sparse-penalized DNN estimator is adaptive and attains the minimax optimal rates up to a poly-logarithmic factor for many nonlinear AR models. Through numerical simulations, we demonstrate the usefulness of the DNN methods for estimating nonlinear AR models with intrinsic low-dimensional structures and discontinuous or rough mean functions, which is consistent with our theory.
    Neural network stochastic differential equation models with applications to financial data forecasting. (arXiv:2111.13164v5 [cs.LG] UPDATED)
    In this article, we employ a collection of stochastic differential equations with drift and diffusion coefficients approximated by neural networks to predict the trend of chaotic time series that exhibit large jumps. Our contributions are as follows. First, we propose a model called the Lévy induced stochastic differential equation network, which explores compounded stochastic differential equations with $\alpha$-stable Lévy motion to model complex time series data and solves the problem through neural network approximation. Second, we theoretically prove the convergence of our algorithm with respect to hyper-parameters of the neural network, and obtain the error bound without the curse of dimensionality. Finally, we illustrate our method by applying it to real financial time series data and find that the accuracy increases through the use of non-Gaussian Lévy processes. We also present detailed comparisons in terms of data patterns, various models, different shapes of Lévy motion and the prediction lengths.
    Trading with the Momentum Transformer: An Intelligent and Interpretable Architecture. (arXiv:2112.08534v2 [cs.LG] UPDATED)
    We introduce the Momentum Transformer, an attention-based deep learning architecture which outperforms benchmark momentum and mean-reversion trading strategies. Unlike state-of-the-art Long Short-Term Memory (LSTM) architectures, which are sequential in nature, the attention mechanism provides our architecture with a direct connection to all previous time-steps. Our architecture enables us to learn longer-term dependencies, improves performance when considering returns net of transaction costs and naturally adapts to new market regimes, such as during the SARS-CoV-2 crisis. The Momentum Transformer is inherently interpretable, providing us with greater insights into our deep learning momentum trading strategy, including how it blends different classical strategies and the past time-steps which are of the greatest significance to the model.
    Private Matrix Approximation and Geometry of Unitary Orbits. (arXiv:2207.02794v1 [cs.DS])
    Consider the following optimization problem: Given $n \times n$ matrices $A$ and $\Lambda$, maximize $\langle A, U\Lambda U^*\rangle$ where $U$ varies over the unitary group $\mathrm{U}(n)$. This problem seeks to approximate $A$ by a matrix whose spectrum is the same as $\Lambda$ and, by setting $\Lambda$ to be appropriate diagonal matrices, one can recover matrix approximation problems such as PCA and rank-$k$ approximation. We study the problem of designing differentially private algorithms for this optimization problem in settings where the matrix $A$ is constructed using users' private data. We give efficient and private algorithms that come with upper and lower bounds on the approximation error. Our results unify and improve upon several prior works on private matrix approximation problems. They rely on extensions of packing/covering number bounds for Grassmannians to unitary orbits which should be of independent interest.
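    For orientation, the non-private version of this optimization problem has a closed-form solution in the real symmetric case, which the sketch below illustrates with NumPy. It assumes that $A$ is real symmetric and that $\Lambda$ is diagonal (passed as a vector `lam`); under those assumptions the maximum of $\langle A, U\Lambda U^*\rangle$ pairs the sorted eigenvalues of $A$ with the sorted entries of $\Lambda$. The function name `aligned_unitary` is made up for this example, and nothing here reproduces the paper's differentially private algorithms.

```python
import numpy as np

def aligned_unitary(A, lam):
    """Non-private maximizer of <A, U diag(lam) U^T> for symmetric A (sketch)."""
    # eigh returns eigenvalues in ascending order with matching eigenvector columns.
    eigvals, eigvecs = np.linalg.eigh(A)
    # ranks[r] is the position in lam of its r-th smallest entry.
    ranks = np.argsort(lam)
    # Pair the r-th smallest entry of lam with the r-th smallest eigenvector of A.
    U = np.empty_like(eigvecs)
    U[:, ranks] = eigvecs
    # Optimal value: inner product of the two sorted spectra.
    value = float(np.sort(lam) @ eigvals)
    return U, value

# Rank-1 PCA as a special case: lam = (1, 0) picks out the top eigenvector of A.
A = np.array([[2.0, 1.0], [1.0, 2.0]])
lam = np.array([1.0, 0.0])
U, value = aligned_unitary(A, lam)
achieved = float(np.trace(A @ U @ np.diag(lam) @ U.T))
```

    Setting `lam` to $k$ ones followed by zeros gives the rank-$k$ PCA objective mentioned in the abstract; a private algorithm would have to perturb this computation, which the sketch does not attempt.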
    Reconstructing Nonlinear Dynamical Systems from Multi-Modal Time Series. (arXiv:2111.02922v3 [cs.LG] UPDATED)
    Empirically observed time series in physics, biology, or medicine, are commonly generated by some underlying dynamical system (DS) which is the target of scientific interest. There is an increasing interest to harvest machine learning methods to reconstruct this latent DS in a data-driven, unsupervised way. In many areas of science it is common to sample time series observations from many data modalities simultaneously, e.g. electrophysiological and behavioral time series in a typical neuroscience experiment. However, current machine learning tools for reconstructing DSs usually focus on just one data modality. Here we propose a general framework for multi-modal data integration for the purpose of nonlinear DS reconstruction and the analysis of cross-modal relations. This framework is based on dynamically interpretable recurrent neural networks as general approximators of nonlinear DSs, coupled to sets of modality-specific decoder models from the class of generalized linear models. Both an expectation-maximization and a variational inference algorithm for model training are advanced and compared. We show on nonlinear DS benchmarks that our algorithms can efficiently compensate for too noisy or missing information in one data channel by exploiting other channels, and demonstrate on experimental neuroscience data how the algorithm learns to link different data domains to the underlying dynamics.
    Many-body localized hidden Born machine. (arXiv:2207.02346v1 [quant-ph])
    Born Machines are quantum-inspired generative models that leverage the probabilistic nature of quantum states. Here, we present a new architecture called many-body localized (MBL) hidden Born machine that uses both MBL dynamics and hidden units as learning resources. We theoretically prove that MBL Born machines possess more expressive power than classical models, and the introduction of hidden units boosts its learning power. We numerically demonstrate that the MBL hidden Born machine is capable of learning a toy dataset consisting of patterns of MNIST handwritten digits, quantum data obtained from quantum many-body states, and non-local parity data. In order to understand the mechanism behind learning, we track physical quantities such as von Neumann entanglement entropy and Hamming distance during learning, and compare the learning outcomes in the MBL, thermal, and Anderson localized phases. We show that the superior learning power of the MBL phase relies importantly on both localization and interaction. Our architecture and algorithm provide novel strategies of utilizing quantum many-body systems as learning resources, and reveal a powerful connection between disorder, interaction, and learning in quantum systems.
    Linear Jamming Bandits: Sample-Efficient Learning for Non-Coherent Digital Jamming. (arXiv:2207.02365v1 [cs.LG])
    It has been shown (Amuru et al. 2015) that online learning algorithms can be effectively used to select optimal physical layer parameters for jamming against digital modulation schemes without a priori knowledge of the victim's transmission strategy. However, this learning problem involves solving a multi-armed bandit problem with a mixed action space that can grow very large. As a result, convergence to the optimal jamming strategy can be slow, especially when the victim and jammer's symbols are not perfectly synchronized. In this work, we remedy the sample efficiency issues by introducing a linear bandit algorithm that accounts for inherent similarities between actions. Further, we propose context features which are well-suited for the statistical features of the non-coherent jamming problem and demonstrate significantly improved convergence behavior compared to the prior art. Additionally, we show how prior knowledge about the victim's transmissions can be seamlessly integrated into the learning framework. We finally discuss limitations in the asymptotic regime.
    Expectation Distance-based Distributional Clustering for Noise-Robustness. (arXiv:2110.08871v3 [cs.LG] UPDATED)
    This paper presents a clustering technique that reduces the susceptibility to data noise by learning and clustering the data-distribution and then assigning the data to the cluster of its distribution, thereby reducing the impact of noise on clustering results. This method involves introducing a new distance among distributions, namely the expectation distance (denoted ED), that goes beyond the state-of-the-art distribution distance of optimal mass transport (denoted $W_2$ for $2$-Wasserstein): the latter essentially depends only on the marginal distributions while the former also employs the information about the joint distributions. Using the ED, the paper extends the classical $K$-means and $K$-medoids clustering to those over data-distributions (rather than raw data) and introduces $K$-medoids using $W_2$. The paper also presents the closed-form expressions of the ED distance measure for the case when the uncertainty is Gaussian. The implementation results of the proposed ED and the $W_2$ distance measures to cluster real-world weather data are also presented, which involves efficiently extracting and using underlying uncertainty information in the form of means and variances (which, for example, are adequate to characterize Gaussian distributions). The results show striking performance improvement over classical clustering of raw data, with higher accuracy realized for ED. This is because while $W_2$ employs only the marginal distributions, ignoring the correlations, the proposed ED also uses the joint distributions, factoring the correlations into the distance measures.
    Integral Probability Metrics PAC-Bayes Bounds. (arXiv:2207.00614v2 [stat.ML] UPDATED)
    We present a PAC-Bayes-style generalization bound which enables the replacement of the KL-divergence with a variety of Integral Probability Metrics (IPM). We provide instances of this bound with the IPM being the total variation metric and the Wasserstein distance. A notable feature of the obtained bounds is that they naturally interpolate between classical uniform convergence bounds in the worst case (when the prior and posterior are far away from each other), and preferable bounds in better cases (when the posterior and prior are close). This illustrates the possibility of reinforcing classical generalization bounds with algorithm- and data-dependent components, thus making them more suitable to analyze algorithms that use a large hypothesis space.
    Instance-optimal PAC Algorithms for Contextual Bandits. (arXiv:2207.02357v1 [stat.ML])
    In the stochastic contextual bandit setting, regret-minimizing algorithms have been extensively researched, but their instance-minimizing best-arm identification counterparts remain seldom studied. In this work, we focus on the stochastic bandit problem in the $(\epsilon,\delta)$-PAC setting: given a policy class $\Pi$, the goal of the learner is to return a policy $\pi\in \Pi$ whose expected reward is within $\epsilon$ of the optimal policy with probability greater than $1-\delta$. We characterize the first instance-dependent PAC sample complexity of contextual bandits through a quantity $\rho_{\Pi}$, and provide matching upper and lower bounds in terms of $\rho_{\Pi}$ for the agnostic and linear contextual best-arm identification settings. We show that no algorithm can be simultaneously minimax-optimal for regret minimization and instance-dependent PAC for best-arm identification. Our main result is a new instance-optimal and computationally efficient algorithm that relies on a polynomial number of calls to an argmax oracle.
    Conditional Distribution Function Estimation Using Neural Networks for Censored and Uncensored Data. (arXiv:2207.02384v1 [stat.ME])
    Most work in neural networks focuses on estimating the conditional mean of a continuous response variable given a set of covariates. In this article, we consider estimating the conditional distribution function using neural networks for both censored and uncensored data. The algorithm is built upon the data structure particularly constructed for the Cox regression with time-dependent covariates. Without imposing any model assumption, we consider a loss function that is based on the full likelihood where the conditional hazard function is the only unknown nonparametric parameter, for which unconstrained optimization methods can be applied. Through simulation studies, we show the proposed method possesses desirable performance, whereas the partial likelihood method and the traditional neural networks with $L_2$ loss yield biased estimates when model assumptions are violated. We further illustrate the proposed method with several real-world data sets. The implementation of the proposed methods is made available at https://github.com/bingqing0729/NNCDE.

  • Open

    7+ Best Books to Learn Neural Networks in 2022 for Beginners (Updated)
    submitted by /u/Lakshmireddys [link] [comments]  ( 83 min )
    What are artificial intelligences that can automatically edit music, images, texts, beats in some way?
    submitted by /u/xXNOdrugsForMEXx [link] [comments]  ( 84 min )
    I got some midjourney invites left !
    I don’t got any friends to give the invites to so who needs one! submitted by /u/projhect-AI [link] [comments]  ( 84 min )
    Is there an app/site/software that uses AI image recognition to organize images by similarity? I'm looking to sort a bunch of dall-e images
    Tried to explain as much as possible in the title. I did a "run" of DALL-E and I have already used photoshop's macros to crop each of them in a different file bc I feel like there's an interesting experience in watching it go through similar but different iterations, but I would like it to be sorted by similarity to make the most impact. Can any of you recommend me a way to do that? The first result I found in google pinged the antivirus so I felt like getting recommendations was the way to go. Here's an example of the kind of images I'm talking about https://imgur.com/a/miG2WWZ submitted by /u/quiteawhile [link] [comments]  ( 84 min )
    Elon Musk: "I hope that the AI is nice to us ... I've lost a lot of sleep thinking about AI as an existential risk ... I think there should probably should be a regulatory agency that oversees advanced AI, because it's a public safety risk." (2-minute clip)
    submitted by /u/Farnectarine4825 [link] [comments]  ( 84 min )
    AI Dream 61 - EPIC Nebula Exploration by AI
    submitted by /u/LordPewPew777 [link] [comments]  ( 84 min )
    Meta's latest open source AI can translate 200 languages
    submitted by /u/much_successes [link] [comments]  ( 85 min )
    Want to animate your photos from midjourney in 3D, high resolution 4k? Check out my new tutorial!
    submitted by /u/nalr00n [link] [comments]  ( 84 min )
    No Language Left Behind: Translating 200 languages with a single model - by Meta AI
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 84 min )
    when AGI hits its stride, the cost of all goods and services will fall
    submitted by /u/bartturner [link] [comments]  ( 84 min )
    Socially engineered
    I created this Reddit account in 2021 for Crypto only. It's not been used for about 7 months. I'm creating this thread to highlight a pattern I'm noticing in email sent from [noreply@redditmail.com](mailto:noreply@redditmail.com). Every email sent in 2021 up to October was totally related to my Crypto interest and activity here on Reddit. From October 2nd 2021 to April 22nd 2022 there was a gap where Reddit did not send any highlights or promotional emails. It appears all of the emails sent this year know more about me than I have ever shared on Reddit. While I was modding a 3DS it sent 3DS suggestions. Since December I've been experimenting with GPT-3, and now I'm suggested content from this forum. Since childhood I've had a passionate interest in robotics, AI, and software/hardware in g…  ( 103 min )
    “Universal explainers”
    What do you think of David Deutsch's theory of “universal explainers”? https://www.lesswrong.com/posts/HDyePg6oySYQ9hY4i/david-deutsch-on-universal-explainers-and-ai submitted by /u/Equal-Lingonberry517 [link] [comments]  ( 83 min )
    Websites/Programs for testing artificial intelligence
    What are the sites/programs that you can test some kind of artificial intelligence for free and uncomplicated? submitted by /u/NaturalMagicCat [link] [comments]  ( 84 min )
  • Open

    7+ Best Books to Learn Neural Networks in 2022 for Beginners (Updated)
    submitted by /u/Lakshmireddys [link] [comments]  ( 84 min )
    A Tutorial on Using Neural Style PT to Transfer the Style of One Image to Another
    View the tutorial here: HERE This tutorial teaches you how to transfer the style of one image to another image using neural-style-pt. Below is an Imgur gallery showing off the transformation process. https://imgur.com/gallery/iMlkkQi Let me know if you have any questions or comments. submitted by /u/mshriver2 [link] [comments]  ( 84 min )
    Has anyone tried using an external NVIDIA GPU for machine learning on a MacBook Pro?
    submitted by /u/PopOk539 [link] [comments]  ( 84 min )
  • Open

    Break through language barriers with Amazon Transcribe, Amazon Translate, and Amazon Polly
    Imagine a surgeon taking video calls with patients across the globe without the need of a human translator. What if a fledgling startup could easily expand their product across borders and into new geographical markets by offering fluid, accurate, multilingual customer support and sales, all without the need of a live human translator? What happens […]  ( 10 min )
  • Open

    Dijkstra extends Pythagoras
    Suppose a triangle has sides a, b, and c. Label the angles opposite these three sides α, β, and γ respectively. Edsger Dijkstra published (EWD975-0) a note proving the following extension of the Pythagorean theorem: sgn(α + β – γ) = sgn(a² + b² – c²). Here the sgn function is -1, 0, or 1 […] Dijkstra extends Pythagoras first appeared on John D. Cook.  ( 4 min )
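    The identity quoted above can be checked numerically. The following is a minimal sketch, not taken from the post: the helper names and the floating-point tolerance are assumptions made for illustration. It recovers two angles from the side lengths with the law of cosines, takes the third from the angle sum, and compares the two signs.

```python
import math


def sgn(x, tol=1e-9):
    """Sign function with a small tolerance for floating-point noise (tol is an assumption)."""
    if x > tol:
        return 1
    if x < -tol:
        return -1
    return 0


def angles_from_sides(a, b, c):
    """Return the angles (alpha, beta, gamma) opposite sides a, b, c."""
    alpha = math.acos((b * b + c * c - a * a) / (2 * b * c))  # law of cosines
    beta = math.acos((a * a + c * c - b * b) / (2 * a * c))   # law of cosines
    gamma = math.pi - alpha - beta                            # angles sum to pi
    return alpha, beta, gamma


def dijkstra_identity_holds(a, b, c):
    """Check sgn(alpha + beta - gamma) == sgn(a^2 + b^2 - c^2) for one triangle."""
    alpha, beta, gamma = angles_from_sides(a, b, c)
    return sgn(alpha + beta - gamma) == sgn(a * a + b * b - c * c)


# One right, one acute, and one obtuse triangle.
for sides in [(3, 4, 5), (2, 2, 2), (2, 2, 3)]:
    print(sides, dijkstra_identity_holds(*sides))
```

    For the 3-4-5 triangle both sides of the identity are zero, which is the ordinary Pythagorean theorem; the other two cases cover the strict inequalities.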
  • Open

    [D] How would you measure the correlation of the gradient across iterations?
    One simple thing one could do is take the dot product between the current and the n-1 gradient. But this will of course not be very meaningful as what really matters is a (sort-of) average correlation across several iterations, which will not be revealed from doing such a local comparison (using gradients from step n and n-1). Ideally it would be a calculation that would not require keeping around old gradients. Any ideas? submitted by /u/fasttosmile [link] [comments]  ( 85 min )
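    One possible direction, sketched here as an assumption rather than an answer from the thread, is to keep an exponential moving average of past gradients and report the cosine similarity between each new gradient and that average. Only one extra vector is stored, so no history of old gradients is needed. The class name and the decay value `beta` are hypothetical, and gradients are plain Python lists to keep the sketch dependency-free.

```python
import math


class GradientCorrelationTracker:
    """Tracks cosine similarity between the current gradient and an EMA of earlier ones."""

    def __init__(self, beta=0.9):
        self.beta = beta  # EMA decay; 0.9 is an arbitrary illustrative choice
        self.ema = None   # running average of past gradients

    def update(self, grad):
        """Return cos(grad, EMA of earlier gradients), then fold grad into the EMA.

        Returns None on the first call, when there is no history yet.
        """
        if self.ema is None:
            self.ema = list(grad)
            return None
        dot = sum(g * m for g, m in zip(grad, self.ema))
        norm_g = math.sqrt(sum(g * g for g in grad))
        norm_m = math.sqrt(sum(m * m for m in self.ema))
        cos = dot / (norm_g * norm_m) if norm_g > 0 and norm_m > 0 else 0.0
        self.ema = [self.beta * m + (1 - self.beta) * g
                    for g, m in zip(grad, self.ema)]
        return cos


tracker = GradientCorrelationTracker(beta=0.9)
for grad in ([1.0, 0.0], [2.0, 0.0], [-1.0, 0.0]):
    print(tracker.update(grad))
```

    A value near 1 suggests the gradient keeps pointing in the same direction over the averaging window, a value near -1 suggests oscillation, and a value near 0 suggests little directional agreement.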
    [D] Handling OOV in sequence generation
    What are some methods to handle OOV words when generating sequences? For example for some n-gram implementations, I've seen all tokens removed from the candidate list of words to be sampled from given the prior n-gram, and if there are no other candidates the generated text is ended. Curious to learn about some other methods to deal with OOV. submitted by /u/MLJungle [link] [comments]  ( 85 min )
    [R] CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning
    Paper: https://arxiv.org/pdf/2207.01780.pdf Github: https://github.com/salesforce/CodeRL Abstract: Program synthesis or code generation aims to generate a program that satisfies a problem specification. Recent approaches using large-scale pretrained language models (LMs) have shown promising results, yet they have some critical limitations. In particular, they often follow a standard supervised fine-tuning procedure to train a code generation model only from the pairs of natural-language problem descriptions and ground-truth programs. Such paradigm largely ignores some important but potentially useful signals in the problem specification such as unit tests, which thus often results in poor performance when solving complex unseen coding tasks. To address the limitations, we propose "CodeRL", a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning (RL). Specifically, during training, we treat the code-generating LM as an actor network, and introduce a critic network that is trained to predict the functional correctness of generated programs and provide dense feedback signals to the actor. During inference, we introduce a new generation procedure with a critical sampling strategy that allows a model to automatically regenerate programs based on feedback from example unit tests and critic scores. For the model backbones, we extended the encoder-decoder architecture of CodeT5 with enhanced learning objectives, larger model sizes, and better pretraining data. Our method not only achieves new SOTA results on the challenging APPS benchmark, but also shows strong zero-shot transfer capability with new SOTA results on the simpler MBPP benchmark. 
https://preview.redd.it/goglny8a30a91.jpg?width=1218&format=pjpg&auto=webp&s=a6f50319637cf85fed2de1d08b407478f6a227aa https://preview.redd.it/vav9glra30a91.jpg?width=1234&format=pjpg&auto=webp&s=19ef106847c090fab438338fad912f1afd75db1a submitted by /u/Singularian2501 [link] [comments]  ( 86 min )
    [D] Why aren't there much people working on causal machine learning?
    It seems Judea Pearl, Yoshua Bengio, Elias Bareinboim and a handful of other researchers are the only people who are working on causal inference and machine learning. Is causal machine learning still a niche field? Also, do you know any researcher working on causal machine learning at Berkeley? submitted by /u/After_Philosopher572 [link] [comments]  ( 87 min )
    [D] Object Detection trained on simulated renderings unable to converge on real images - why?
    I wrote a program in Unity that generated millions of fake images using the HDRP rendering pipeline. For starters I only want to detect a bottle of "ITO EN" ice-tea. Here is an example (left is real, right is the fake rendering). I have a simple 3 layer resnet CNN with 3 blocks each, and use a Global Average Pooling layer at the end to visualize the detection. Using the simulation dataset only I get an accuracy of 97% or higher. Using the real dataset I only get ~70% accuracy. I wanna add that this is not a result of over-training, (a) because I use a validation set and stop training if it hasn't improved and (b) the test set performs very well. This is infuriating, because the image dataset is extremely diverse and I use a ton of image transformations in order to provide a very high level of diversity. I also use various levels of lighting, bloom, camera exposures, motion blur, changing materials for all assets, as well as changing the properties for the target (the bottle), such as glossiness, reflection, emissive lighting, and so on. Here is an example for the rendered dataset that is used for training, and here is an example for the real dataset. Anyone got an idea why this isn't working out? submitted by /u/tmuxed [link] [comments]  ( 88 min )
    [D] How to correctly transform Cityscapes Masks to Bounding Boxes?
    As the title suggests, I would like to know the correct way to pre-process the cityscapes dataset for object detection. There are multiple ways how this can be done. There is a version in Detectron2, in MM Detection, there is this. Which one is the correct way, without getting errors in the labels? Anybody worked with this before? Would be glad if anybody might have an idea. submitted by /u/SeucheAchat9115 [link] [comments]  ( 86 min )
    [P] Tutorial: Serverless MLOps pipelines with Vertex AI and ZenML
    At ZenML, we created a guide to easily run MLOps pipelines on Google Cloud Platform with Vertex AI. I thought I'd share it here because I think it might be useful for people who are just starting MLOps on GCP. Blog post: https://blog.zenml.io/vertex-ai-blog/ Full video: https://youtu.be/qgvmvexGv_c Why is this better than going through the Vertex AI SDK? ZenML steps and pipelines can be written with a simple decorator pattern that is easily approachable for a #datascientist. ZenML takes care of storing and versioning pythonic objects between steps of a pipeline. ZenML provides first-class integrations into other MLOps tools that you can leverage natively in your pipelines. For example, you can track experiments on MLFlow easily. ZenML pipelines can be run locally first, and then deployed instantly. You can run a ZenML pipeline not only on Vertex, but also #Airflow, #Kubeflow, #Kubernetes, or wherever else you'd like! Watch the full video: https://www.youtube.com/watch?v=qgvmvexGv_c&ab_channel=ZenML I bet the GCP Vertex AI folk here might like the above video. It isn't just about ZenML either but more of a broader look into the different components that go into running ML in production on GCP (Container registry, Cloud Storage, Secret Manager, Vertex, Cloud SQL). Would love to hear more feedback on the video or blog! submitted by /u/htahir1 [link] [comments]  ( 86 min )
    [P] Some new Sherlock Holmes stories (GPT-3)
    I thought I'd share some of the Sherlock Holmes stories I created with various prompts using GPT-3. Might be fun for some fans, but overall the stories are all a bit superficial, although some of them made me laugh (the fourth one in particular). John Watson was having a cup of tea in his flat when he heard a knock at the door. He got up to answer it, and found Sherlock Holmes standing there, looking rather grave. "Watson, I'm afraid I have some bad news," said Sherlock. "I've just been to Baker Street, where I found Mrs. Hudson in a state of hysterics. It seems that a Mr. Bartholomew Jones was found dead in his study, and Mrs. Hudson is convinced that it was murder." "That is rather shocking," said Watson. "Do you have any idea who might have done it?" "I have some suspicions," …  ( 100 min )
    [R] Detectron2 STMDA-RetinaNet
    Hello, I am happy to share with you one of my latest works on domain adaptation, built on top of the Detectron2 object detector model (RetinaNet). Link to the github repo STMDA-RetinaNet: https://github.com/fpv-iplab/STMDA-RetinaNet submitted by /u/CapitalShake3085 [link] [comments]  ( 85 min )
    [R] How Machine Learning is Used in Finance and Banking
    Machine learning solutions are already embedded in the finance and banking industry. In this article, we reviewed the most popular use cases of ML in banking and shared practical tips on how to implement it into your business. https://exadel.com/news/how-machine-learning-is-used-in-finance-and-banking submitted by /u/lklimusheuskaja [link] [comments]  ( 85 min )
    Jupyter Notebook Competition coming up! [News]
    The Jupyter Notebook Competition deadline is fast approaching! https://preview.redd.it/gy6m0myhyx991.png?width=1920&format=png&auto=webp&s=3039abe962df07df74740772994f17502fa686bb Don't miss out on your chance to contribute to a community-driven resource of notebooks on the Copernicus WEkEO platform, AND be in with a chance of winning cash prizes! Visit: https://www.eumetsat.int/features/new-jupyter-notebook-competition submitted by /u/EUMETSAT [link] [comments]  ( 85 min )
    [News] Ian Goodfellow joins DeepMind as a Research Scientist
    Per his tweet at https://twitter.com/goodfellow_ian/status/1544638709039091717, Goodfellow will be a research scientist under Oriol Vinyals' Deep Learning team. submitted by /u/The_Removed [link] [comments]  ( 90 min )
    [P] Comparing DevOps into MLOps to analyse tools doing well in the market
    Hi all, I've been an active practitioner in Deep Learning and then wanted to build something in MLOps. So wanted to dig deeper in how DevOps evolved and wanted to check if MLOps can take the same path. The findings are really great. Absolutely every tool doing well in the market is a clear replacement for DevOps tool in MLOps. Here is my blog on it. Looking for feedback. If you have any comments, let me know. Will add them. https://sachinchandra.substack.com/p/bringing-software-development-principles submitted by /u/scb_11 [link] [comments]  ( 89 min )
  • Open

    MLGO: A Machine Learning Framework for Compiler Optimization
    Posted by Yundi Qian, Software Engineer, Google Research and Mircea Trofin, Software Engineer, Google Core The question of how to compile faster and smaller code arose together with the birth of modern computers. Better code optimization can significantly reduce the operational cost of large datacenter applications. The size of compiled code matters the most to mobile and embedded systems or software deployed on secure boot partitions, where the compiled binary must fit in tight code size budgets. With advances in the field, the headroom has been heavily squeezed with increasingly complicated heuristics, impeding maintenance and further improvements. Recent research has shown that machine learning (ML) can unlock more opportunities in compiler optimization by replacing complicated heuri…  ( 25 min )
  • Open

    "Offline RL Policies Should be Trained to be Adaptive", Ghosh et al 2022
    submitted by /u/gwern [link] [comments]  ( 84 min )
    Reinforcement Learning without Reward Engineering
    submitted by /u/Euphetar [link] [comments]  ( 84 min )
    d4rl PyTorch Dataloader
    I need to load some offline RL data, which is accessible via a similar interface as `d4rl`. It uses an HDF5 file for storage under the hood. I want to write a Dataloader in PyTorch, which is something I haven't done before for custom data. I have started implementing a custom subclass of PyTorch's `Dataset`. In the docs it says that `__getitem__` shall return one example at the given index. I'm worried that naively getting one data point from the HDF5 file and returning that will be way too slow. Am I going to have to come up with a very smart `__getitem__` function that loads more than required from disk, saves that in a smart data structure, and next time checks that data structure first before issuing an I/O request? Edit: typo submitted by /u/lemlo100 [link] [comments]  ( 84 min )
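    The caching idea described in the post can be sketched without any PyTorch or HDF5 dependency. This is a minimal illustration under stated assumptions, not the poster's code: `BlockCachedDataset` and `read_block` are hypothetical names, and the backing store is a plain list standing in for the file. With h5py, `read_block` could plausibly be a slice of an open dataset, and in PyTorch the class would subclass `torch.utils.data.Dataset`, which only needs `__len__` and `__getitem__` for map-style access.

```python
class BlockCachedDataset:
    """Map-style dataset that reads fixed-size blocks and serves items from the cached block."""

    def __init__(self, read_block, length, block_size=1024):
        self.read_block = read_block  # hypothetical callable: (start, stop) -> sequence
        self.length = length
        self.block_size = block_size
        self._block_start = None      # start index of the block currently in memory
        self._block = None
        self.reads = 0                # number of backing-store reads, for inspection

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        start = (index // self.block_size) * self.block_size
        if start != self._block_start:
            # Cache miss: load the whole block that contains the requested index.
            stop = min(start + self.block_size, self.length)
            self._block = self.read_block(start, stop)
            self._block_start = start
            self.reads += 1
        return self._block[index - start]


# In-memory stand-in for the file; a real read_block would slice the HDF5 dataset.
data = list(range(10))
ds = BlockCachedDataset(lambda s, e: data[s:e], len(data), block_size=4)
```

    A single cached block only helps when indices arrive mostly in order. With fully shuffled sampling almost every access misses, so shuffling at block granularity or loading the whole array into memory (if it fits) may be the simpler option.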
    Multi-Armed Bandit versions
    Hello everyone! I just started working with multi-armed bandits. I have two directions I could explore, and if anyone knows any resources (books, research papers, etc.) that would be awesome! I would like to implement multiple agents which can share knowledge with each other. For example, two agents who sell ice cream. They want to offer the best flavor, but sell at different locations which can affect which is the actual best flavor at that location. So the best option might not be the same, but if a lot of customers start to buy a specific flavor at one place, it might be worth exploring for the other agent. Trying to determine best price (continuous value), arms are now placed at different prices. Since we now have distinct prices, the actual best price will most likely not be among the options. How could one tackle this problem? I’m not expecting anyone to take the time and explain my problems, but if you know of any good resources, please share! Thanks in advance! :) submitted by /u/AnkanTV [link] [comments]  ( 85 min )
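    For the pricing question, a common baseline is to discretize the price range into a grid of arms and run a standard bandit algorithm over it. The sketch below uses UCB1 and is an illustration under assumptions, not a recommendation from the thread: the function names, the price grid, and the demand curve `buy_probability` are all hypothetical, and revenue is divided by the largest price so that rewards stay in [0, 1] as UCB1 expects.

```python
import math
import random


def ucb1_select(counts, sums, t):
    """Pick an arm by UCB1: try each arm once, then maximise mean + exploration bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [sums[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
              for a in range(len(counts))]
    return max(range(len(counts)), key=scores.__getitem__)


def run_price_bandit(prices, buy_probability, rounds, seed=0):
    """Simulate UCB1 over a fixed price grid and return how often each price was tried."""
    rng = random.Random(seed)
    counts = [0] * len(prices)
    sums = [0.0] * len(prices)
    max_price = max(prices)
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, sums, t)
        sold = rng.random() < buy_probability(prices[arm])
        reward = prices[arm] / max_price if sold else 0.0  # normalised revenue
        counts[arm] += 1
        sums[arm] += reward
    return counts


# Hypothetical linear demand: nobody buys at a price of 4 or more.
pulls = run_price_bandit([1.0, 2.0, 3.0], lambda p: max(0.0, 1.0 - p / 4.0), rounds=500)
print(pulls)
```

    A fixed grid cannot hit a best price that lies between grid points, which is the limitation the post raises. Refining the grid around the currently best arm is one way to narrow the gap, at the cost of more exploration.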
  • Open

    Art by Artificial Intelligence: AI Generated Paintings
    AI has brought a new life to art.  ( 7 min )
    Your Predictions Are Only As Good As Your Data
    Testing Data Vs Training Data In Machine Learning  ( 14 min )
  • Open

    Startup lets doctors classify skin conditions with the snap of a picture
    Piction Health, founded by Susan Conover SM ’15, uses machine learning to help physicians identify and manage skin disease.  ( 8 min )
  • Open

    An Empirical Study of Implicit Regularization in Deep Offline RL. (arXiv:2207.02099v1 [cs.LG])
    Deep neural networks are the most commonly used function approximators in offline Reinforcement Learning these days. Prior works have shown that neural nets trained with TD-learning and gradient descent can exhibit implicit regularization that can be characterized by under-parameterization of these networks. Specifically, the rank of the penultimate feature layer, also called \textit{effective rank}, has been observed to drastically collapse during the training. In turn, this collapse has been argued to reduce the model's ability to further adapt in later stages of learning, leading to diminished final performance. Such an association between the effective rank and performance makes effective rank compelling for offline RL, primarily for offline policy evaluation. In this work, we conduct a careful empirical study on the relation between effective rank and performance on three offline RL datasets: bsuite, Atari, and DeepMind lab. We observe that a direct association exists only in restricted settings and disappears in the more extensive hyperparameter sweeps. Also, we empirically identify three phases of learning that explain the impact of implicit regularization on the learning dynamics and find that bootstrapping alone is insufficient to explain the collapse of the effective rank. Further, we show that several other factors could confound the relationship between effective rank and performance and conclude that studying this association under simplistic assumptions could be highly misleading.  ( 3 min )
    Regret analysis of the Piyavskii-Shubert algorithm for global Lipschitz optimization. (arXiv:2002.02390v4 [cs.LG] UPDATED)
    We consider the problem of maximizing a non-concave Lipschitz multivariate function over a compact domain by sequentially querying its (possibly perturbed) values. We study a natural algorithm designed originally by Piyavskii and Shubert in 1972, for which we prove new bounds on the number of evaluations of the function needed to reach or certify a given optimization accuracy. Our analysis uses a bandit-optimization viewpoint and solves an open problem from Hansen et al.\ (1991) by bounding the number of evaluations to certify a given accuracy with a near-optimal sum of packing numbers.  ( 2 min )
    De-Biasing Generative Models using Counterfactual Methods. (arXiv:2207.01575v2 [cs.LG] UPDATED)
    Variational autoencoders (VAEs) and other generative methods have garnered growing interest not just for their generative properties but also for the ability to disentangle a low-dimensional latent variable space. However, few existing generative models take causality into account. We propose a new decoder-based framework named the Causal Counterfactual Generative Model (CCGM), which includes a partially trainable causal layer in which a part of a causal model can be learned without significantly impacting reconstruction fidelity. By learning the causal relationships between image semantic labels or tabular variables, we can analyze biases, intervene on the generative model, and simulate new scenarios. Furthermore, by modifying the causal structure, we can generate samples outside the domain of the original training data and use such counterfactual models to de-bias datasets. Thus, datasets with known biases can still be used to train the causal generative model and learn the causal relationships, but we can produce de-biased datasets on the generative side. Our proposed method combines a causal latent space VAE model with specific modifications to emphasize causal fidelity, enabling finer control over the causal layer and the ability to learn a robust intervention framework. We explore how better disentanglement of causal learning and encoding/decoding generates higher causal intervention quality. We also compare our model against similar research to demonstrate the need for explicit generative de-biasing beyond interventions. Our initial experiments show that our model can generate images and tabular data with high fidelity to the causal framework and accommodate explicit de-biasing to ignore undesired relationships in the causal data compared to the baseline.  ( 3 min )
    Benchmarking Deep AUROC Optimization: Loss Functions and Algorithmic Choices. (arXiv:2203.14177v3 [cs.LG] UPDATED)
    The area under the ROC curve (AUROC) has been vigorously applied for imbalanced classification and moreover combined with deep learning techniques. However, there is no existing work that provides sound information for peers to choose appropriate deep AUROC maximization techniques. In this work, we fill this gap from three aspects. (i) We benchmark a variety of loss functions with different algorithmic choices for the deep AUROC optimization problem. We study the loss functions in two categories: pairwise loss and composite loss, which include a total of 10 loss functions. Interestingly, we find composite loss, as an innovative loss function class, shows more competitive performance than pairwise loss from both training convergence and testing generalization perspectives. Nevertheless, data with more corrupted labels favors a pairwise symmetric loss. (ii) Moreover, we benchmark and highlight the essential algorithmic choices such as positive sampling rate, regularization, normalization/activation, and optimizers. Key findings include: a higher positive sampling rate is likely to be beneficial for deep AUROC maximization; different datasets favor different regularization weights; appropriate normalization techniques, such as sigmoid and $\ell_2$ score normalization, could improve model performance. (iii) For the optimization aspect, we benchmark SGD-type, Momentum-type, and Adam-type optimizers for both pairwise and composite loss. Our findings show that although Adam-type methods are more competitive from the training perspective, they do not outperform others from the testing perspective.  ( 3 min )
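    As background for the "pairwise loss" category, AUROC equals the fraction of positive-negative score pairs that are ranked correctly, and pairwise losses replace that 0-1 comparison with a smooth surrogate. The sketch below is illustrative only: the function names and the `margin` parameter are assumptions, the square surrogate is one commonly used choice, and none of the benchmark's ten losses is reproduced here.

```python
def auroc(pos_scores, neg_scores):
    """Fraction of positive-negative pairs ranked correctly, counting ties as one half."""
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos_scores) * len(neg_scores))


def pairwise_square_loss(pos_scores, neg_scores, margin=1.0):
    """Mean of (margin - (p - n))^2 over all pairs: a smooth stand-in for ranking error."""
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            total += (margin - (p - n)) ** 2
    return total / (len(pos_scores) * len(neg_scores))


pos = [0.9, 0.8]
neg = [0.1, 0.85]
print(auroc(pos, neg))                 # 3 of the 4 pairs are ordered correctly
print(pairwise_square_loss(pos, neg))
```

    The double loop makes the quadratic cost in the number of samples explicit. Avoiding that per-pair cost is a common motivation for alternative formulations, although the abstract itself does not say this about composite losses.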
    Accelerating Hamiltonian Monte Carlo via Chebyshev Integration Time. (arXiv:2207.02189v1 [cs.LG])
    Hamiltonian Monte Carlo (HMC) is a popular method in sampling. While quite a few works study this method from various aspects, an interesting question is how to choose its integration time to achieve acceleration. In this work, we consider accelerating the process of sampling from a distribution $\pi(x) \propto \exp(-f(x))$ via HMC with a time-varying integration time. When the potential $f$ is $L$-smooth and $m$-strongly convex, i.e.\ for sampling from a log-smooth and strongly log-concave target distribution $\pi$, it is known that under a constant integration time, the number of iterations that ideal HMC takes to get an $\epsilon$ Wasserstein-2 distance to the target $\pi$ is $O( \kappa \log \frac{1}{\epsilon} )$, where $\kappa := \frac{L}{m}$ is the condition number. We propose a scheme of time-varying integration time based on the roots of Chebyshev polynomials. We show that in the case of quadratic potential $f$, i.e., when the target $\pi$ is a Gaussian distribution, ideal HMC with this choice of integration time only takes $O( \sqrt{\kappa} \log \frac{1}{\epsilon} )$ iterations to reach Wasserstein-2 distance less than $\epsilon$; this improvement in the dependence on the condition number is akin to acceleration in optimization. The design and analysis of HMC with the proposed integration time is built on the tools of Chebyshev polynomials. Experiments show the advantage of adopting our scheme of time-varying integration time even for sampling from distributions with smooth strongly convex potentials that are not quadratic.  ( 3 min )
    Data-driven synchronization-avoiding algorithms in the explicit distributed structural analysis of soft tissue. (arXiv:2207.02194v1 [cs.DC])
    We propose a data-driven framework to increase the computational efficiency of the explicit finite element method in the structural analysis of soft tissue. An encoder-decoder long short-term memory deep neural network is trained based on the data produced by an explicit, distributed finite element solver. We leverage this network to predict synchronized displacements at shared nodes, minimizing the amount of communication between processors. We perform extensive numerical experiments to quantify the accuracy and stability of the proposed synchronization-avoiding algorithm.  ( 2 min )
    Learning Stochastic Shortest Path with Linear Function Approximation. (arXiv:2110.12727v3 [cs.LG] UPDATED)
    We study the stochastic shortest path (SSP) problem in reinforcement learning with linear function approximation, where the transition kernel is represented as a linear mixture of unknown models. We call this class of SSP problems linear mixture SSPs. We propose a novel algorithm with Hoeffding-type confidence sets for learning the linear mixture SSP, which can attain an $\tilde{\mathcal{O}}(d B_{\star}^{1.5}\sqrt{K/c_{\min}})$ regret. Here $K$ is the number of episodes, $d$ is the dimension of the feature mapping in the mixture model, $B_{\star}$ bounds the expected cumulative cost of the optimal policy, and $c_{\min}>0$ is the lower bound of the cost function. Our algorithm also applies to the case when $c_{\min} = 0$, and an $\tilde{\mathcal{O}}(K^{2/3})$ regret is guaranteed. To the best of our knowledge, this is the first algorithm with a sublinear regret guarantee for learning linear mixture SSP. Moreover, we design a refined Bernstein-type confidence set and propose an improved algorithm, which provably achieves an $\tilde{\mathcal{O}}(d B_{\star}\sqrt{K/c_{\min}})$ regret. In complement to the regret upper bounds, we also prove a lower bound of $\Omega(dB_{\star} \sqrt{K})$. Hence, our improved algorithm matches the lower bound up to a $1/\sqrt{c_{\min}}$ factor and poly-logarithmic factors, achieving a near-optimal regret guarantee.  ( 3 min )
    $\pi$VAE: a stochastic process prior for Bayesian deep learning with MCMC. (arXiv:2002.06873v5 [cs.LG] UPDATED)
    Stochastic processes provide a mathematically elegant way to model complex data. In theory, they provide flexible priors over function classes that can encode a wide range of interesting assumptions. In practice, however, efficient inference by optimisation or marginalisation is difficult, a problem further exacerbated with big data and high dimensional input spaces. We propose a novel variational autoencoder (VAE) called the prior encoding variational autoencoder ($\pi$VAE). The $\pi$VAE is finitely exchangeable and Kolmogorov consistent, and thus is a continuous stochastic process. We use $\pi$VAE to learn low dimensional embeddings of function classes. We show that our framework can accurately learn expressive function classes such as Gaussian processes, but also properties of functions to enable statistical inference (such as the integral of a log Gaussian process). For popular tasks, such as spatial interpolation, $\pi$VAE achieves state-of-the-art performance both in terms of accuracy and computational efficiency. Perhaps most usefully, we demonstrate that the low dimensional independently distributed latent space representation learnt provides an elegant and scalable means of performing Bayesian inference for stochastic processes within probabilistic programming languages such as Stan.  ( 3 min )
    The StarCraft Multi-Agent Challenges+ : Learning of Multi-Stage Tasks and Environmental Factors without Precise Reward Functions. (arXiv:2207.02007v1 [cs.LG])
    In this paper, we propose a novel benchmark called the StarCraft Multi-Agent Challenges+, where agents learn to perform multi-stage tasks and to use environmental factors without precise reward functions. The previous challenges (SMAC) recognized as a standard benchmark of Multi-Agent Reinforcement Learning are mainly concerned with ensuring that all agents cooperatively eliminate approaching adversaries only through fine manipulation with obvious reward functions. This challenge, on the other hand, is interested in the exploration capability of MARL algorithms to efficiently learn implicit multi-stage tasks and environmental factors as well as micro-control. This study covers both offensive and defensive scenarios. In the offensive scenarios, agents must learn to first find opponents and then eliminate them. The defensive scenarios require agents to use topographic features. For example, agents need to position themselves behind protective structures to make it harder for enemies to attack. We investigate MARL algorithms under SMAC+ and observe that recent approaches work well in similar settings to the previous challenges, but misbehave in offensive scenarios. Additionally, we observe that an enhanced exploration approach has a positive effect on performance but is not able to completely solve all scenarios. This study proposes new directions for future research.  ( 3 min )
    Probability density estimation for sets of large graphs with respect to spectral information using stochastic block models. (arXiv:2207.02168v1 [cs.LG])
    For graph-valued data sampled iid from a distribution $\mu$, the sample moments are computed with respect to a choice of metric. In this work, we equip the set of graphs with the pseudo-metric defined by the $\ell_2$ norm of the difference between the eigenvalues of the respective adjacency matrices. We use this pseudo-metric and the respective sample moments of a graph-valued data set to infer the parameters of a distribution $\hat{\mu}$ and interpret this distribution as an approximation of $\mu$. We verify experimentally that complex distributions $\mu$ can be approximated well taking this approach.  ( 2 min )
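    The spectral pseudo-metric described above can be illustrated with a minimal sketch. It assumes undirected graphs with the same number of nodes, given as symmetric NumPy adjacency matrices; the function names and the two toy graphs are hypothetical and are not taken from the paper.

```python
import numpy as np

def spectral_distance(adj_a, adj_b):
    """l2 distance between the sorted adjacency eigenvalues of two graphs.

    Assumes both matrices are symmetric and have the same shape.
    """
    eig_a = np.linalg.eigvalsh(adj_a)  # eigenvalues in ascending order
    eig_b = np.linalg.eigvalsh(adj_b)
    return float(np.linalg.norm(eig_a - eig_b))

def relabel(adj, perm):
    """Return the adjacency matrix of the same graph with nodes permuted."""
    return adj[np.ix_(perm, perm)]

# Toy graphs (illustrative only): a path and a star on four nodes.
path = np.array([[0, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [0, 0, 1, 0]], dtype=float)
star = np.array([[0, 1, 1, 1],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0]], dtype=float)

d_same = spectral_distance(path, relabel(path, [2, 0, 3, 1]))  # relabeled copy
d_diff = spectral_distance(path, star)
```

    Relabeling the nodes leaves the spectrum unchanged, so the distance between a graph and its relabeled copy is zero up to rounding. Because non-isomorphic graphs can also share a spectrum, the distance is only a pseudo-metric.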
    Federated Split GANs. (arXiv:2207.01750v1 [cs.LG])
    Mobile devices and the immense amount and variety of data they generate are key enablers of machine learning (ML)-based applications. Traditional ML techniques have shifted toward new paradigms such as federated (FL) and split learning (SL) to improve the protection of users' data privacy. However, these paradigms often rely on server(s) located in the edge or cloud to train computationally-heavy parts of an ML model to avoid draining the limited resources on client devices, resulting in exposing device data to such third parties. This work proposes an alternative approach to train computationally-heavy ML models on users' devices themselves, where the corresponding device data resides. Specifically, we focus on GANs (generative adversarial networks) and leverage their inherent privacy-preserving attribute. We train the discriminative part of a GAN with raw data on users' devices, whereas the generative model is trained remotely (e.g., on a server), for which there is no need to access true sensor data. Moreover, our approach ensures that the computational load of training the discriminative model is shared among users' devices, in proportion to their computation capabilities, by means of SL. We implement our proposed collaborative training scheme of a computationally-heavy GAN model on real resource-constrained devices. The results show that our system preserves data privacy, keeps a short training time, and yields the same accuracy as model training on unconstrained devices (e.g., cloud). Our code can be found on https://github.com/YukariSonz/FSL-GAN  ( 3 min )
    Task-agnostic Defense against Adversarial Patch Attacks. (arXiv:2207.01795v1 [cs.CV])
    Adversarial patch attacks mislead neural networks by injecting adversarial pixels within a designated local region. Patch attacks can be highly effective in a variety of tasks and are physically realizable via attachment (e.g., a sticker) to real-world objects. Despite the diversity in attack patterns, adversarial patches tend to be highly textured and different in appearance from natural images. We exploit this property and present PatchZero, a task-agnostic defense against white-box adversarial patches. Specifically, our defense detects the adversarial pixels and "zeros out" the patch region by repainting with mean pixel values. We formulate the patch detection problem as a semantic segmentation task such that our model can generalize to patches of any size and shape. We further design a two-stage adversarial training scheme to defend against the stronger adaptive attacks. We thoroughly evaluate PatchZero on the image classification (ImageNet, RESISC45), object detection (PASCAL VOC), and video classification (UCF101) datasets. Our method achieves SOTA robust accuracy without any degradation in the benign performance.  ( 2 min )
    Individual Topology Structure of Eye Movement Trajectories. (arXiv:2205.10667v4 [cs.CV] UPDATED)
    Traditionally, extracting patterns from eye movement data relies on statistics of different macro-events such as fixations and saccades. This requires an additional preprocessing step to separate the eye movement subtypes, often with a number of parameters on which the classification results depend. Besides that, definitions of such macro-events are formulated in different ways by different researchers. We propose an application of a new class of features to the quantitative analysis of the structure of individual eye movement trajectories. This new class of features, based on algebraic topology, allows extracting patterns from different modalities of gaze such as time series of coordinates and amplitudes, heatmaps, and point clouds in a unified way at all scales from micro to macro. We experimentally demonstrate the competitiveness of the new class of features with the traditional ones and their significant synergy when used together for the person authentication task on the recently published eye movement trajectories dataset.  ( 2 min )
    Predicting Out-of-Domain Generalization with Local Manifold Smoothness. (arXiv:2207.02093v1 [cs.LG])
    Understanding how machine learning models generalize to new environments is a critical part of their safe deployment. Recent work has proposed a variety of complexity measures that directly predict or theoretically bound the generalization capacity of a model. However, these methods rely on a strong set of assumptions that in practice are not always satisfied. Motivated by the limited settings in which existing measures can be applied, we propose a novel complexity measure based on the local manifold smoothness of a classifier. We define local manifold smoothness as a classifier's output sensitivity to perturbations in the manifold neighborhood around a given test point. Intuitively, a classifier that is less sensitive to these perturbations should generalize better. To estimate smoothness we sample points using data augmentation and measure the fraction of these points classified into the majority class. Our method only requires selecting a data augmentation method and makes no other assumptions about the model or data distributions, meaning it can be applied even in out-of-domain (OOD) settings where existing methods cannot. In experiments on robustness benchmarks in image classification, sentiment analysis, and natural language inference, we demonstrate a strong and robust correlation between our manifold smoothness measure and actual OOD generalization on over 3,000 models evaluated on over 100 train/test domain pairs.  ( 3 min )
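    The smoothness estimate described above (sample augmented points, then measure the fraction classified into the majority class) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy classifier, the Gaussian jitter used as the augmentation, and all function names are assumptions.

```python
import numpy as np

def local_smoothness(predict, x, augment, n_samples=100, rng=None):
    """Fraction of augmented neighbours of x assigned to the majority class."""
    rng = np.random.default_rng(0) if rng is None else rng
    neighbours = np.stack([augment(x, rng) for _ in range(n_samples)])
    labels = predict(neighbours)   # integer class ids, shape (n_samples,)
    counts = np.bincount(labels)   # votes per class
    return counts.max() / n_samples

def toy_predict(batch):
    # Hypothetical classifier: class 1 when the first coordinate is positive.
    return (batch[:, 0] > 0).astype(int)

def gaussian_jitter(x, rng, scale=0.1):
    # Hypothetical augmentation: small additive Gaussian noise.
    return x + rng.normal(0.0, scale, size=x.shape)

far = local_smoothness(toy_predict, np.array([2.0, 0.0]), gaussian_jitter)
near = local_smoothness(toy_predict, np.array([0.0, 0.0]), gaussian_jitter)
```

    A point far from the decision boundary scores close to 1, while a point on the boundary scores close to 0.5. Averaging the score over the test points of a domain would give one model-level value; that aggregation step is likewise an assumption here.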
    Multi-Agent Broad Reinforcement Learning for Intelligent Traffic Light Control. (arXiv:2203.04310v2 [cs.LG] UPDATED)
    An Intelligent Traffic Light Control System (ITLCS) is a typical Multi-Agent System (MAS), which comprises multiple roads and traffic lights. Constructing a model of MAS for ITLCS is the basis for alleviating traffic congestion. Existing approaches of MAS are largely based on Multi-Agent Deep Reinforcement Learning (MADRL). Although the Deep Neural Network (DNN) of MADRL is effective, the training time is long, and the parameters are difficult to trace. Recently, Broad Learning Systems (BLS) provided a selective way for learning in the deep neural networks by a flat network. Moreover, Broad Reinforcement Learning (BRL) extends BLS to the Single Agent Deep Reinforcement Learning (SADRL) problem with promising results. However, BRL does not focus on the intricate structures and interaction of agents. Motivated by the feature of MADRL and the issue of BRL, we propose a Multi-Agent Broad Reinforcement Learning (MABRL) framework to explore the function of BLS in MAS. Firstly, unlike most existing MADRL approaches, which use a series of deep neural network structures, we model each agent with broad networks. Then, we introduce a dynamic self-cycling interaction mechanism to confirm the "3W" information: When to interact, Which agents to consider, and What information to transmit. Finally, we conduct experiments based on the intelligent traffic light control scenario. We compare the MABRL approach with six different approaches, and experimental results on three datasets verify the effectiveness of MABRL.  ( 3 min )
    Multi-Scored Sleep Databases: How to Exploit the Multiple-Labels in Automated Sleep Scoring. (arXiv:2207.01910v1 [cs.LG])
    Study Objectives: Inter-scorer variability in scoring polysomnograms is a well-known problem. Most of the existing automated sleep scoring systems are trained using labels annotated by a single scorer, whose subjective evaluation is transferred to the model. When annotations from two or more scorers are available, the scoring models are usually trained on the scorer consensus. The averaged scorer's subjectivity is transferred into the model, losing information about the internal variability among different scorers. In this study, we aim to insert the multiple-knowledge of the different physicians into the training procedure. The goal is to optimize model training, exploiting the full information that can be extracted from the consensus of a group of scorers. Methods: We train two lightweight deep learning based models on three different multi-scored databases. We exploit the label smoothing technique together with a soft-consensus (LSSC) distribution to insert the multiple-knowledge in the training procedure of the model. We introduce the averaged cosine similarity metric (ACS) to quantify the similarity between the hypnodensity-graph generated by the models trained with LSSC and the hypnodensity-graph generated by the scorer consensus. Results: The performance of the models improves on all the databases when we train the models with our LSSC. We found an increase in ACS (up to 6.4%) between the hypnodensity-graph generated by the models trained with LSSC and the hypnodensity-graph generated by the consensus. Conclusions: Our approach definitely enables a model to better adapt to the consensus of the group of scorers. Future work will focus on further investigations on different scoring architectures.  ( 3 min )
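    One plausible reading of label smoothing with a soft-consensus distribution, together with an averaged cosine similarity, is sketched below. The abstract does not give the formulas, so everything here is an assumption: the target is taken to be a blend of the one-hot majority label and the per-epoch vote fractions, `alpha` is a hypothetical smoothing weight, and ACS is taken to be the mean per-epoch cosine similarity between two probability arrays.

```python
import numpy as np

def soft_consensus(votes, n_classes):
    """votes: (n_epochs, n_scorers) integer labels -> per-epoch vote fractions."""
    one_hot = np.eye(n_classes)[votes]   # (n_epochs, n_scorers, n_classes)
    return one_hot.mean(axis=1)

def lssc_targets(votes, n_classes, alpha=0.5):
    """Assumed form: (1 - alpha) * one-hot majority + alpha * soft consensus."""
    consensus = soft_consensus(votes, n_classes)
    # Ties are broken toward the lowest class index by argmax.
    majority = np.eye(n_classes)[consensus.argmax(axis=1)]
    return (1 - alpha) * majority + alpha * consensus

def averaged_cosine_similarity(p, q):
    """Mean over epochs of the cosine similarity between rows of p and q."""
    num = (p * q).sum(axis=1)
    den = np.linalg.norm(p, axis=1) * np.linalg.norm(q, axis=1)
    return float((num / den).mean())

# Two epochs scored by three hypothetical scorers, three classes.
votes = np.array([[0, 0, 1],
                  [2, 2, 2]])
targets = lssc_targets(votes, n_classes=3, alpha=0.5)
consensus = soft_consensus(votes, n_classes=3)
acs_self = averaged_cosine_similarity(consensus, consensus)
```

    With these votes, the first epoch's target keeps some mass on the minority label instead of collapsing to a one-hot vector, which is the information a hard consensus label would discard.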
    Graph Clustering with Graph Neural Networks. (arXiv:2006.16904v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have achieved state-of-the-art results on many graph analysis tasks such as node classification and link prediction. However, important unsupervised problems on graphs, such as graph clustering, have proved more resistant to advances in GNNs. Graph clustering has the same overall goal as node pooling in GNNs - does this mean that GNN pooling methods do a good job at clustering graphs? Surprisingly, the answer is no - current GNN pooling methods often fail to recover the cluster structure in cases where simple baselines, such as k-means applied on learned representations, work well. We investigate further by carefully designing a set of experiments to study different signal-to-noise scenarios both in graph structure and attribute data. To address these methods' poor performance in clustering, we introduce Deep Modularity Networks (DMoN), an unsupervised pooling method inspired by the modularity measure of clustering quality, and show how it tackles recovery of the challenging clustering structure of real-world graphs. Similarly, on real-world data, we show that DMoN produces high quality clusters which correlate strongly with ground truth labels, achieving state-of-the-art results with over 40% improvement over other pooling methods across different metrics.  ( 3 min )
    StyleFlow For Content-Fixed Image to Image Translation. (arXiv:2207.01909v1 [cs.CV])
    Image-to-image (I2I) translation is a challenging topic in computer vision. We divide this problem into three tasks: strongly constrained translation, normally constrained translation, and weakly constrained translation. The constraint here indicates the extent to which the content or semantic information in the original image is preserved. Although previous approaches have achieved good performance in weakly constrained tasks, they failed to fully preserve the content in both strongly and normally constrained tasks, such as photo-realism synthesis, style transfer, and colorization. To achieve content-preserving transfer in strongly constrained and normally constrained tasks, we propose StyleFlow, a new I2I translation model that consists of normalizing flows and a novel Style-Aware Normalization (SAN) module. With the invertible network structure, StyleFlow first projects input images into deep feature space in the forward pass, while the backward pass utilizes the SAN module to perform content-fixed feature transformation and then projects back to image space. Our model supports both image-guided translation and multi-modal synthesis. We evaluate our model in several I2I translation benchmarks, and the results show that the proposed model has advantages over previous methods in both strongly constrained and normally constrained tasks.  ( 2 min )
    Sedentary Behavior Estimation with Hip-worn Accelerometer Data: Segmentation, Classification and Thresholding. (arXiv:2207.01809v1 [cs.LG])
    Cohort studies are increasingly using accelerometers for physical activity and sedentary behavior estimation. These devices tend to be less error-prone than self-report, can capture activity throughout the day, and are economical. However, previous methods for estimating sedentary behavior based on hip-worn data are often invalid or suboptimal under free-living situations and subject-to-subject variation. In this paper, we propose a local Markov switching model that takes this situation into account, and introduce a general procedure for posture classification and sedentary behavior analysis that fits the model naturally. Our method features changepoint detection methods in time series and also a two-stage classification step that labels data into three classes (sitting, standing, stepping). Through a rigorous training-testing paradigm, we show that our approach achieves > 80% accuracy. In addition, our method is robust and easy to interpret.  ( 2 min )
    An Approximation Method for Fitted Random Forests. (arXiv:2207.02184v1 [stat.ML])
    Random Forests (RF) is a popular machine learning method for classification and regression problems. It applies bagging to decision tree models. One of the primary advantages of the Random Forests model is the reduction in the variance of the forecast. In large scale applications of the model with millions of data points and hundreds of features, the size of the fitted objects can get very large and reach the limits on the available space in production setups, depending on the number and depth of the trees. This could be especially challenging when trained models need to be downloaded on-demand to small devices with limited memory. There is a need to approximate the trained RF models to significantly reduce the model size without losing too much prediction accuracy. In this project we study methods that approximate each fitted tree in the Random Forests model using the multinomial allocation of the data points to the leaves. Specifically, we begin by studying whether fitting a multinomial logistic regression (and subsequently, a generalized additive model (GAM) extension) to the output of each tree helps reduce the size while preserving the prediction quality.  ( 2 min )
    CAPITAL: Optimal Subgroup Identification via Constrained Policy Tree Search. (arXiv:2110.05636v2 [stat.ML] UPDATED)
    Personalized medicine, a paradigm of medicine tailored to a patient's characteristics, is an increasingly attractive field in health care. An important goal of personalized medicine is to identify a subgroup of patients, based on baseline covariates, that benefits more from the targeted treatment than other comparative treatments. Most of the current subgroup identification methods only focus on obtaining a subgroup with an enhanced treatment effect without paying attention to subgroup size. Yet, a clinically meaningful subgroup learning approach should identify the maximum number of patients who can benefit from the better treatment. In this paper, we present an optimal subgroup selection rule (SSR) that maximizes the number of selected patients, and in the meantime, achieves the pre-specified clinically meaningful mean outcome, such as the average treatment effect. We derive two equivalent theoretical forms of the optimal SSR based on the contrast function that describes the treatment-covariates interaction in the outcome. We further propose a ConstrAined PolIcy Tree seArch aLgorithm (CAPITAL) to find the optimal SSR within the interpretable decision tree class. The proposed method is flexible to handle multiple constraints that penalize the inclusion of patients with negative treatment effects, and to address time to event data using the restricted mean survival time as the clinically interesting mean outcome. Extensive simulations, comparison studies, and real data applications are conducted to demonstrate the validity and utility of our method.
    CEN : Cooperatively Evolving Networks. (arXiv:2207.02192v1 [cs.LG])
    A finitely repeated game is a dynamic game in which a simultaneous game is played finitely many times. GANs contain two competing modules: the generator module is trained to generate new examples, and the discriminator module is trained to discriminate real examples from generated examples. The training procedure of a GAN is a finitely repeated game in which each module tries to optimize its error at every instance of the simultaneous game in a non-cooperative manner. We observed that we can achieve more accurate training if, at each instance of the simultaneous game, the stronger module cooperates with the weaker module and only the weaker module optimizes its error.
    TT-PINN: A Tensor-Compressed Neural PDE Solver for Edge Computing. (arXiv:2207.01751v1 [cs.LG])
    Physics-informed neural networks (PINNs) have been increasingly employed due to their capability of modeling complex physics systems. To achieve better expressiveness, increasingly larger network sizes are required in many problems. This has caused challenges when we need to train PINNs on edge devices with limited memory, computing and energy resources. To enable training PINNs on edge devices, this paper proposes an end-to-end compressed PINN based on Tensor-Train decomposition. In solving a Helmholtz equation, our proposed model significantly outperforms the original PINNs with few parameters and achieves satisfactory prediction with up to 15$\times$ overall parameter reduction.
    NeuralPassthrough: Learned Real-Time View Synthesis for VR. (arXiv:2207.02186v1 [cs.CV])
    Virtual reality (VR) headsets provide an immersive, stereoscopic visual experience, but at the cost of blocking users from directly observing their physical environment. Passthrough techniques are intended to address this limitation by leveraging outward-facing cameras to reconstruct the images that would otherwise be seen by the user without the headset. This is inherently a real-time view synthesis challenge, since passthrough cameras cannot be physically co-located with the eyes. Existing passthrough techniques suffer from distracting reconstruction artifacts, largely due to the lack of accurate depth information (especially for near-field and disoccluded objects), and also exhibit limited image quality (e.g., being low resolution and monochromatic). In this paper, we propose the first learned passthrough method and assess its performance using a custom VR headset that contains a stereo pair of RGB cameras. Through both simulations and experiments, we demonstrate that our learned passthrough method delivers superior image quality compared to state-of-the-art methods, while meeting strict VR requirements for real-time, perspective-correct stereoscopic view synthesis over a wide field of view for desktop-connected headsets.
    Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation. (arXiv:2204.10020v2 [eess.AS] UPDATED)
    Data augmentation via voice conversion (VC) has been successfully applied to low-resource expressive text-to-speech (TTS) when only neutral data for the target speaker are available. Although the quality of VC is crucial for this approach, it is challenging to learn a stable VC model because the amount of data is limited in low-resource scenarios, and highly expressive speech has large acoustic variety. To address this issue, we propose a novel data augmentation method that combines pitch-shifting and VC techniques. Because pitch-shift data augmentation enables the coverage of a variety of pitch dynamics, it greatly stabilizes training for both VC and TTS models, even when only 1,000 utterances of the target speaker's neutral data are available. Subjective test results showed that a FastSpeech 2-based emotional TTS system with the proposed method improved naturalness and emotional similarity compared with conventional methods.
    On Effective Scheduling of Model-based Reinforcement Learning. (arXiv:2111.08550v3 [cs.LG] UPDATED)
    Model-based reinforcement learning has attracted wide attention due to its superior sample efficiency. Despite its impressive success so far, it is still unclear how to appropriately schedule the important hyperparameters to achieve adequate performance, such as the real data ratio for policy optimization in Dyna-style model-based algorithms. In this paper, we first theoretically analyze the role of real data in policy training, which suggests that gradually increasing the ratio of real data yields better performance. Inspired by the analysis, we propose a framework named AutoMBPO to automatically schedule the real data ratio as well as other hyperparameters in training model-based policy optimization (MBPO) algorithm, a representative running case of model-based methods. On several continuous control tasks, the MBPO instance trained with hyperparameters scheduled by AutoMBPO can significantly surpass the original one, and the real data ratio schedule found by AutoMBPO shows consistency with our theoretical analysis.
    Automatic inspection of cultural monuments using deep and tensor-based learning on hyperspectral imagery. (arXiv:2207.02163v1 [cs.CV])
    In Cultural Heritage, hyperspectral images are commonly used since they provide extended information regarding the optical properties of materials. Thus, the processing of such high-dimensional data becomes challenging from the perspective of machine learning techniques to be applied. In this paper, we propose a Rank-$R$ tensor-based learning model to identify and classify material defects on Cultural Heritage monuments. In contrast to conventional deep learning approaches, the proposed high order tensor-based learning demonstrates greater accuracy and robustness against overfitting. Experimental results on real-world data from UNESCO protected areas indicate the superiority of the proposed scheme compared to conventional deep learning models.
    DBN-Mix: Training Dual Branch Network Using Bilateral Mixup Augmentation for Long-Tailed Visual Recognition. (arXiv:2207.02173v1 [cs.CV])
    There is a growing interest in the challenging visual perception task of learning from long-tailed class distributions. The extreme class imbalance in the training dataset biases the model to prefer to recognize majority-class data over minority-class data. Recently, the dual branch network (DBN) framework has been proposed, where two branch networks, the conventional branch and the re-balancing branch, were employed to improve the accuracy of long-tailed visual recognition. The re-balancing branch uses a reversed sampler to generate class-balanced training samples to mitigate bias due to class imbalance. Although this strategy has been quite successful in handling bias, using a reversed sampler for training can degrade the representation learning performance. To alleviate this issue, the conventional method used a carefully designed cumulative learning strategy, in which the influence of the re-balancing branch gradually increases throughout the entire training phase. In this study, we aim to develop a simple yet effective method to improve the performance of DBN without cumulative learning, which is difficult to optimize. We devise a simple data augmentation method termed bilateral mixup augmentation, which combines one sample from the uniform sampler with another sample from the reversed sampler to produce a training sample. Furthermore, we present class-conditional temperature scaling that mitigates bias toward the majority class for the proposed DBN architecture. Our experiments performed on widely used long-tailed visual recognition datasets show that bilateral mixup augmentation is quite effective in improving the representation learning performance of DBNs, and that the proposed method achieves state-of-the-art performance for some categories.
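    The idea of combining one sample from the uniform sampler with one from the reversed sampler can be sketched as below. This is an illustration under stated assumptions rather than the paper's method: the reversed sampler is assumed to draw classes with probability inversely proportional to class frequency, the mixing weight is drawn from a Beta distribution as in standard mixup, labels are mixed with the same weight, and every class is assumed to have at least one sample. All names are hypothetical.

```python
import numpy as np

def reversed_class_probs(class_counts):
    """Class sampling probabilities inversely proportional to class frequency."""
    inv = 1.0 / np.asarray(class_counts, dtype=float)
    return inv / inv.sum()

def bilateral_mixup(features, labels, n_classes, rng, beta=1.0):
    """Mix one uniformly drawn sample with one drawn by the reversed sampler."""
    counts = np.bincount(labels, minlength=n_classes)
    # Uniform sampler: every instance is equally likely.
    i = rng.integers(len(labels))
    # Reversed sampler: pick a class (rare classes favoured), then an instance.
    cls = rng.choice(n_classes, p=reversed_class_probs(counts))
    j = rng.choice(np.flatnonzero(labels == cls))
    lam = rng.beta(beta, beta)   # mixing weight, as in standard mixup
    one_hot = np.eye(n_classes)
    x = lam * features[i] + (1 - lam) * features[j]
    y = lam * one_hot[labels[i]] + (1 - lam) * one_hot[labels[j]]
    return x, y

# Toy long-tailed data: 90 samples of class 0 and 10 of class 1.
rng = np.random.default_rng(0)
labels = np.array([0] * 90 + [1] * 10)
features = rng.normal(size=(100, 4))
x_mix, y_mix = bilateral_mixup(features, labels, n_classes=2, rng=rng)
```

    With a 90/10 class split the reversed sampler picks the minority class with probability 0.9, so most mixed samples contain a minority-class component even though the uniform sampler mostly returns majority-class instances.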
    Convolutional Filtering and Neural Networks with Non Commutative Algebras. (arXiv:2108.09923v2 [cs.LG] UPDATED)
    In this paper we provide stability results for algebraic neural networks (AlgNNs) based on non commutative algebras. AlgNNs are stacked layered structures with each layer associated to an algebraic signal model (ASM) determined by an algebra, a vector space, and a homomorphism. Signals are modeled as elements of the vector space, filters are elements in the algebra, while the homomorphism provides a realization of the filters as concrete operators. We study the stability of the algebraic filters in non commutative algebras to perturbations on the homomorphisms, and we provide conditions under which stability is guaranteed. We show that the commutativity between shift operators and between shifts and perturbations does not affect the property of an architecture of being stable. This provides an answer to the question of whether shift invariance was a necessary attribute of convolutional architectures to guarantee stability. Additionally, we show that although the frequency responses of filters in non commutative algebras exhibit substantial differences with respect to filters in commutative algebras, their derivatives for stable filters have a similar behavior.
    Path Integral Stochastic Optimal Control for Sampling Transition Paths. (arXiv:2207.02149v1 [q-bio.BM])
    We consider the problem of Sampling Transition Paths. Given two metastable conformational states of a molecular system, e.g., a folded and unfolded protein, we aim to sample the most likely transition path between the two states. Sampling such a transition path is computationally expensive due to the existence of high free energy barriers between the two states. To circumvent this, previous work has focused on simplifying the trajectories to occur along specific molecular descriptors called Collective Variables (CVs). However, finding CVs is not trivial and requires chemical intuition. For larger molecules, where intuition is not sufficient, using these CV-based methods biases the transition along possibly irrelevant dimensions. Instead, this work proposes a method for sampling transition paths that considers the entire geometry of the molecules. To achieve this, we first relate the problem to recent work on the Schrödinger bridge problem and stochastic optimal control. Using this relation, we construct a method that takes into account important characteristics of molecular systems such as second-order dynamics and invariance to rotations and translations. We demonstrate our method on the commonly studied Alanine Dipeptide, but also consider larger proteins such as Polyproline and Chignolin.
    A survey of multimodal deep generative models. (arXiv:2207.02127v1 [cs.LG])
    Multimodal learning is a framework for building models that make predictions based on different types of modalities. Important challenges in multimodal learning are the inference of shared representations from arbitrary modalities and cross-modal generation via these representations; however, achieving this requires taking the heterogeneous nature of multimodal data into account. In recent years, deep generative models, i.e., generative models in which distributions are parameterized by deep neural networks, have attracted much attention, especially variational autoencoders, which are suitable for accomplishing the above challenges because they can consider heterogeneity and infer good representations of data. Therefore, various multimodal generative models based on variational autoencoders, called multimodal deep generative models, have been proposed in recent years. In this paper, we provide a categorized survey of studies on multimodal deep generative models.
    Discovering Quantum Phase Transitions with Fermionic Neural Networks. (arXiv:2202.05183v3 [physics.comp-ph] UPDATED)
    Deep neural networks have been extremely successful as highly accurate wave function ansätze for variational Monte Carlo calculations of molecular ground states. We present an extension of one such ansatz, FermiNet, to calculations of the ground states of periodic Hamiltonians, and study the homogeneous electron gas. FermiNet calculations of the ground-state energies of small electron gas systems are in excellent agreement with previous initiator full configuration interaction quantum Monte Carlo and diffusion Monte Carlo calculations. We investigate the spin-polarized homogeneous electron gas and demonstrate that the same neural network architecture is capable of accurately representing both the delocalized Fermi liquid state and the localized Wigner crystal state. The network is given no a priori knowledge that a phase transition exists, but converges on the translationally invariant ground state at high density and spontaneously breaks the symmetry to produce the crystalline ground state at low density.
    A Safe Semi-supervised Graph Convolution Network. (arXiv:2207.01960v1 [cs.LG])
    In the semi-supervised learning field, Graph Convolution Network (GCN), as a variant model of GNN, has achieved promising results for non-Euclidean data by introducing convolution into GNN. However, GCN and its variant models fail to safely use the information of risky unlabeled data, which will degrade the performance of semi-supervised learning. Therefore, we propose a Safe GCN framework (Safe-GCN) to improve the learning performance. In the Safe-GCN, we design an iterative process to label the unlabeled data. In each iteration, a GCN and its supervised version (S-GCN) are learned to find the unlabeled data with high confidence. The high-confidence unlabeled data and their pseudo labels are then added to the label set. Finally, both the added unlabeled data and the labeled data are used to train an S-GCN, which can achieve safe exploration of the risky unlabeled data and enable safe use of large numbers of unlabeled data. The performance of Safe-GCN is evaluated on three well-known citation network datasets and the obtained results demonstrate the effectiveness of the proposed framework over several graph-based semi-supervised learning methods.
    Rethinking Attention-Model Explainability through Faithfulness Violation Test. (arXiv:2201.12114v3 [cs.LG] UPDATED)
    Attention mechanisms are dominating the explainability of deep models. They produce probability distributions over the input, which are widely deemed as feature-importance indicators. However, in this paper, we find one critical limitation in attention explanations: weakness in identifying the polarity of feature impact. This would be somewhat misleading -- features with higher attention weights may not faithfully contribute to model predictions; instead, they can impose suppression effects. With this finding, we reflect on the explainability of current attention-based techniques, such as Attention$\odot$Gradient and LRP-based attention explanations. We first propose an actionable diagnostic methodology (henceforth faithfulness violation test) to measure the consistency between explanation weights and the impact polarity. Through extensive experiments, we then show that most tested explanation methods are unexpectedly hindered by the faithfulness violation issue, especially the raw attention. Empirical analyses on the factors affecting violation issues further provide useful observations for adopting explanation methods in attention models.
    A Cross-City Federated Transfer Learning Framework: A Case Study on Urban Region Profiling. (arXiv:2206.00007v2 [cs.LG] UPDATED)
    The data insufficiency problem (i.e., missing data and label scarcity), caused by inadequate services and infrastructure or by unbalanced development levels across cities, has seriously affected urban computing tasks in real scenarios. Prior transfer learning methods offer an elegant solution to data insufficiency, but they address only one kind of insufficiency and fail to consider both. In addition, most previous cross-city transfer methods overlook inter-city data privacy, which is a public concern in practical applications. To address these challenges, we propose a novel Cross-city Federated Transfer Learning framework (CcFTL) to cope with the data insufficiency and privacy problems. Concretely, CcFTL transfers relational knowledge from multiple data-rich source cities to the target city. In addition, the model parameters specific to the target task are first trained on the source data and then fine-tuned to the target city by parameter transfer. With our adaptation of federated training and homomorphic encryption settings, CcFTL can effectively deal with the data privacy problem among cities. We take urban region profiling as an application of smart cities and evaluate the proposed method with a real-world study. The experiments demonstrate the notable superiority of our framework over several competitive state-of-the-art models.
    PRoA: A Probabilistic Robustness Assessment against Functional Perturbations. (arXiv:2207.02036v1 [cs.LG])
    In safety-critical deep learning applications, robustness measurement is a vital pre-deployment phase. However, existing robustness verification methods are not sufficiently practical for deploying machine learning systems in the real world. On the one hand, these methods attempt to claim that no perturbations can "fool" deep neural networks (DNNs), which may be too stringent in practice. On the other hand, existing works rigorously consider $L_p$ bounded additive perturbations on the pixel space, although perturbations such as colour shifting and geometric transformations occur more frequently in the real world. Thus, from a practical standpoint, we present a novel and general probabilistic robustness assessment method (PRoA) based on adaptive concentration, which can measure the robustness of deep learning models against functional perturbations. PRoA can provide statistical guarantees on the probabilistic robustness of a model, i.e., the probability of failure encountered by the trained model after deployment. Our experiments demonstrate the effectiveness and flexibility of PRoA in evaluating probabilistic robustness against a broad range of functional perturbations, and PRoA scales well to various large-scale deep neural networks compared to existing state-of-the-art baselines. For the purpose of reproducibility, we release our tool on GitHub: https://github.com/TrustAI/PRoA.
    FedTune: Automatic Tuning of Federated Learning Hyper-Parameters from System Perspective. (arXiv:2110.03061v5 [cs.LG] UPDATED)
    Federated learning (FL) hyper-parameters significantly affect the training overheads in terms of computation time, transmission time, computation load, and transmission load. However, the current practice of manually selecting FL hyper-parameters puts a high burden on FL practitioners, since different applications have different training preferences. In this paper, we propose FedTune, an automatic FL hyper-parameter tuning algorithm tailored to applications' diverse system requirements for FL training. FedTune is lightweight and flexible, achieving 8.48%-26.75% improvement for different datasets compared to fixed FL hyper-parameters.
    FedChain: Chained Algorithms for Near-Optimal Communication Cost in Federated Learning. (arXiv:2108.06869v4 [cs.LG] UPDATED)
    Federated learning (FL) aims to minimize the communication complexity of training a model over heterogeneous data distributed across many clients. A common approach is local methods, where clients take multiple optimization steps over local data before communicating with the server (e.g., FedAvg). Local methods can exploit similarity between clients' data. However, in existing analyses, this comes at the cost of slow convergence in terms of the dependence on the number of communication rounds R. On the other hand, global methods, where clients simply return a gradient vector in each round (e.g., SGD), converge faster in terms of R but fail to exploit the similarity between clients even when clients are homogeneous. We propose FedChain, an algorithmic framework that combines the strengths of local methods and global methods to achieve fast convergence in terms of R while leveraging the similarity between clients. Using FedChain, we instantiate algorithms that improve upon previously known rates in the general convex and PL settings, and are near-optimal (via an algorithm-independent lower bound that we show) for problems that satisfy strong convexity. Empirical results support this theoretical gain over existing methods.
    Evaluation of Semantic Answer Similarity Metrics. (arXiv:2206.12664v2 [cs.CL] UPDATED)
    There are several issues with the existing general machine translation or natural language generation evaluation metrics, and question-answering (QA) systems are no different in that context. To build robust QA systems, we need equivalently robust evaluation systems to verify whether model predictions to questions are similar to ground-truth annotations. The ability to compare similarity based on semantics as opposed to pure string overlap is important to compare models fairly and to indicate more realistic acceptance criteria in real-life applications. We build upon what is, to our knowledge, the first paper that uses transformer-based model metrics to assess semantic answer similarity, and we achieve higher correlations to human judgement in the case of no lexical overlap. We propose cross-encoder augmented bi-encoder and BERTScore models for semantic answer similarity, trained on a new dataset consisting of name pairs of US-American public figures. To the best of our knowledge, we provide the first dataset of co-referent name string pairs along with their similarities, which can be used for training.
    The optimal reservoir computer for nonlinear dynamics. (arXiv:2202.05159v2 [cs.LG] UPDATED)
    Analysis and prediction of real-world complex systems of nonlinear dynamics relies largely on surrogate models. Reservoir computers (RC) have proven useful in replicating the climate of chaotic dynamics. The quality of surrogate models based on RCs is crucially dependent on judiciously determined optimal implementation that involves selecting optimal reservoir topology and hyperparameters. By systematically applying Bayesian hyperparameter optimization and using ensembles of reservoirs of various topology we show that the topology of linked reservoirs has no significance in forecasting dynamics of the chaotic Lorenz system. By simulations we show that simple reservoirs of unconnected nodes outperform reservoirs of linked reservoirs as surrogate models for the Lorenz system in different regimes. We give a derivation for why reservoirs of unconnected nodes have the maximum entropy and hence are optimal. We conclude that the performance of an RC is based on mere functional transformation, not in its dynamical properties as has been generally presumed. Hence, RC could be improved by including information on dynamics more strongly in the model.
    Data-Dependent Randomized Smoothing. (arXiv:2012.04351v4 [cs.LG] UPDATED)
    Randomized smoothing is a recent technique that achieves state-of-the-art performance in training certifiably robust deep neural networks. While the smoothing family of distributions is often connected to the choice of the norm used for certification, the parameters of these distributions are always set as global hyperparameters independent of the input data on which a network is certified. In this work, we revisit Gaussian randomized smoothing and show that the variance of the Gaussian distribution can be optimized at each input so as to maximize the certification radius for the construction of the smooth classifier. Since the data-dependent classifier does not directly enjoy sound certification with existing approaches, we propose a memory-enhanced data-dependent smooth classifier that is certifiable by construction. This new approach is generic, parameter-free, and easy to implement. In fact, we show that our data-dependent framework can be seamlessly incorporated into 3 randomized smoothing approaches, leading to consistently improved certified accuracy. When this framework is used in the training routine of these approaches followed by a data-dependent certification, we achieve 9% and 6% improvement over the certified accuracy of the strongest baseline for a radius of 0.5 on CIFAR10 and ImageNet.
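To make the idea of optimizing the Gaussian variance per input concrete, the sketch below uses the standard Gaussian smoothing certificate, radius = sigma * Phi^{-1}(p_A), where p_A is the probability that the noisy input keeps the top class (with the runner-up bounded by 1 - p_A). Everything else is an assumption for illustration: the toy 1-D "band" classifier, the grid search, and the cap `p_max`, which stands in for the finite-sample confidence bound a real certification procedure would use. It does not implement the paper's memory-enhanced classifier.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for Phi and Phi^{-1}


def certified_radius(sigma, p_a):
    """L2 radius sigma * Phi^{-1}(p_a); 0.0 when the top class is not a majority.

    Requires p_a < 1, since Phi^{-1}(1) is infinite.
    """
    if p_a <= 0.5:
        return 0.0
    return sigma * STD_NORMAL.inv_cdf(p_a)


def top_class_probability(x, sigma, low=-1.0, high=3.0):
    """P(low < x + sigma * eps < high) for eps ~ N(0, 1).

    Toy 1-D classifier (hypothetical): the top class is the band (low, high).
    """
    return STD_NORMAL.cdf((high - x) / sigma) - STD_NORMAL.cdf((low - x) / sigma)


def best_sigma(x, sigmas, p_max=0.999):
    """Grid-search the sigma that maximizes the certified radius at input x."""
    scored = [
        (certified_radius(s, min(top_class_probability(x, s), p_max)), s)
        for s in sigmas
    ]
    radius, sigma = max(scored)
    return sigma, radius
```

In this toy setting a very small sigma is penalized by the cap on p_A, while a very large sigma pushes probability mass out of the band, so the best radius sits at an intermediate, input-dependent value.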
    No-Regret Learning in Partially-Informed Auctions. (arXiv:2202.10606v2 [cs.LG] UPDATED)
    Auctions with partially-revealed information about items are broadly employed in real-world applications, but the underlying mechanisms have limited theoretical support. In this work, we study a machine learning formulation of these types of mechanisms, presenting algorithms that are no-regret from the buyer's perspective. Specifically, a buyer who wishes to maximize his utility interacts repeatedly with a platform over a series of $T$ rounds. In each round, a new item is drawn from an unknown distribution and the platform publishes a price together with incomplete, "masked" information about the item. The buyer then decides whether to purchase the item. We formalize this problem as an online learning task where the goal is to have low regret with respect to a myopic oracle that has perfect knowledge of the distribution over items and the seller's masking function. When the distribution over items is known to the buyer and the mask is a SimHash function mapping $\mathbb{R}^d$ to $\{0,1\}^{\ell}$, our algorithm has regret $\tilde O((Td\ell)^{1/2})$. In a fully agnostic setting when the mask is an arbitrary function mapping to a set of size $n$ and the prices are stochastic, our algorithm has regret $\tilde O((Tn)^{1/2})$.
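For readers unfamiliar with the mask used in the first regret bound, SimHash is commonly defined as the sign pattern of random hyperplane projections, which gives a map from $\mathbb{R}^d$ to $\{0,1\}^{\ell}$. The sketch below assumes that common definition with Gaussian hyperplanes; `make_simhash` and its `seed` parameter are hypothetical names, and the paper's exact construction may differ.

```python
import random


def make_simhash(d, ell, seed=0):
    """Build a SimHash-style mask from R^d to {0,1}^ell.

    Draws ell random Gaussian hyperplanes; bit i is 1 when the input lies
    on the non-negative side of hyperplane i.
    """
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(ell)]

    def mask(x):
        if len(x) != d:
            raise ValueError("expected a vector of length %d" % d)
        return tuple(
            1 if sum(r_i * x_i for r_i, x_i in zip(row, x)) >= 0 else 0
            for row in planes
        )

    return mask
```

The mask depends only on the direction of the item vector, not its magnitude, so the buyer observes which side of each hyperplane the item falls on and nothing more.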
    On the Nash equilibrium of moment-matching GANs for stationary Gaussian processes. (arXiv:2203.07136v2 [stat.ML] UPDATED)
    Generative Adversarial Networks (GANs) learn an implicit generative model from data samples through a two-player game. In this paper, we study the existence of Nash equilibrium of the game which is consistent as the number of data samples grows to infinity. In a realizable setting where the goal is to estimate the ground-truth generator of a stationary Gaussian process, we show that the existence of consistent Nash equilibrium depends crucially on the choice of the discriminator family. The discriminator defined from second-order statistical moments can result in non-existence of Nash equilibrium, existence of consistent non-Nash equilibrium, or existence and uniqueness of consistent Nash equilibrium, depending on whether symmetry properties of the generator family are respected. We further study the local stability and global convergence of gradient descent-ascent methods towards consistent equilibrium.
    A Generative Framework for Personalized Learning and Estimation: Theory, Algorithms, and Privacy. (arXiv:2207.01771v1 [cs.LG])
    A distinguishing characteristic of federated learning is that the (local) client data could have statistical heterogeneity. This heterogeneity has motivated the design of personalized learning, where individual (personalized) models are trained through collaboration. Various personalization methods have been proposed in the literature, with seemingly very different forms and methods, ranging from the use of a single global model for local regularization and model interpolation to the use of multiple global models for personalized clustering. In this work, we begin with a generative framework that could potentially unify several different algorithms as well as suggest new algorithms. We apply our generative framework to personalized estimation and connect it to the classical empirical Bayes methodology. We develop private personalized estimation under this framework. We then use our generative framework for learning, which unifies several known personalized FL algorithms and also suggests new ones; we propose and study a new algorithm, AdaPeD, based on knowledge distillation, which numerically outperforms several known algorithms. We also develop privacy for personalized learning methods with guarantees for user-level privacy and composition. We numerically evaluate the performance as well as the privacy for both the estimation and learning problems, demonstrating the advantages of our proposed methods.
    Lane-GNN: Integrating GNN for Predicting Drivers' Lane Change Intention. (arXiv:2207.00824v2 [cs.LG] UPDATED)
    Intelligent highway traffic networks play an important role in modern transportation infrastructure. A variable speed limit (VSL) system can be deployed in the highway traffic network to provide useful and dynamic speed limit information so that drivers can travel with enhanced safety. Such a system is usually designed with a steady advisory speed in mind, so that traffic can move smoothly when drivers follow that speed rather than speeding up whenever there is a gap and slowing down at congestion. However, little attention has been given to vehicles' behaviour when drivers leave the road network governed by a VSL system, which may involve unexpected acceleration, deceleration and frequent lane changes, resulting in chaos for subsequent highway road users. In this paper, we focus on the detection of traffic flow anomalies due to drivers' lane change intention on highway traffic networks after a VSL system. More specifically, we apply graph modelling to the traffic flow data generated by a popular mobility simulator, SUMO, at the road segment level. We then evaluate the performance of lane change detection using the proposed Lane-GNN scheme, an attention temporal graph convolutional neural network, and compare its performance with a temporal convolutional neural network (TCNN) as our baseline. Our experimental results show that the proposed Lane-GNN can detect drivers' lane change intention within 90 seconds with an accuracy of 99.42% under certain assumptions. Finally, some interpretation methods are applied to the trained models to further illustrate our findings.
    Minimax Estimation of Linear Functions of Eigenvectors in the Face of Small Eigen-Gaps. (arXiv:2104.03298v2 [math.ST] UPDATED)
    Eigenvector perturbation analysis plays a vital role in various data science applications. A large body of prior work, however, has focused on establishing $\ell_{2}$ eigenvector perturbation bounds, which are often highly inadequate for tasks that rely on the fine-grained behavior of an eigenvector. This paper makes progress on this by studying the perturbation of linear functions of an unknown eigenvector. Focusing on two fundamental problems (matrix denoising and principal component analysis) in the presence of Gaussian noise, we develop a suite of statistical theory that characterizes the perturbation of arbitrary linear functions of an unknown eigenvector. In order to mitigate a non-negligible bias issue inherent to the natural "plug-in" estimator, we develop de-biased estimators that (1) achieve minimax lower bounds for a family of scenarios (modulo some logarithmic factor), and (2) can be computed in a data-driven manner without sample splitting. Notably, the proposed estimators are nearly minimax optimal even when the associated eigen-gap is substantially smaller than what is required in prior statistical theory.
    Neural Network Gaussian Processes by Increasing Depth. (arXiv:2108.12862v3 [cs.LG] UPDATED)
    Recent years have witnessed an increasing interest in the correspondence between infinitely wide networks and Gaussian processes. Despite the effectiveness and elegance of the current neural network Gaussian process theory, to the best of our knowledge, all the neural network Gaussian processes are essentially induced by increasing width. However, in the era of deep learning, what concerns us more regarding a neural network is its depth as well as how depth impacts the behaviors of a network. Inspired by a width-depth symmetry consideration, we use a shortcut network to show that increasing the depth of a neural network can also give rise to a Gaussian process, which is a valuable addition to the existing theory and contributes to revealing the true picture of deep learning. Beyond the proposed Gaussian process by depth, we theoretically characterize its uniform tightness property and the smallest eigenvalue of the Gaussian process kernel. These characterizations can not only enhance our understanding of the proposed depth-induced Gaussian process but also pave the way for future applications. Lastly, we examine the performance of the proposed Gaussian process by regression experiments on two benchmark data sets.
    Progressive Subsampling for Oversampled Data -- Application to Quantitative MRI. (arXiv:2203.09268v3 [eess.IV] UPDATED)
    We present PROSUB: PROgressive SUBsampling, a deep learning based, automated methodology that subsamples an oversampled data set (e.g. multi-channeled 3D images) with minimal loss of information. We build upon a recent dual-network approach that won the MICCAI MUlti-DIffusion (MUDI) quantitative MRI measurement sampling-reconstruction challenge but suffers from deep learning training instability because it subsamples with a hard decision boundary. PROSUB uses the paradigm of recursive feature elimination (RFE) and progressively subsamples measurements during deep learning training, improving optimization stability. PROSUB also integrates a neural architecture search (NAS) paradigm, allowing the network architecture hyperparameters to respond to the subsampling process. We show PROSUB outperforms the winner of the MUDI MICCAI challenge, producing large improvements >18% MSE on the MUDI challenge sub-tasks and qualitative improvements on downstream processes useful for clinical applications. We also show the benefits of incorporating NAS and analyze the effect of PROSUB's components. As our method generalizes to other problems beyond MRI measurement selection-reconstruction, our code is available at https://github.com/sbb-gh/PROSUB
    Learning Optimal Transport Between two Empirical Distributions with Normalizing Flows. (arXiv:2207.01246v2 [cs.LG] UPDATED)
    Optimal transport (OT) provides effective tools for comparing and mapping probability measures. We propose to leverage the flexibility of neural networks to learn an approximate optimal transport map. More precisely, we present a new and original method to address the problem of transporting a finite set of samples associated with a first underlying unknown distribution towards another finite set of samples drawn from another unknown distribution. We show that a particular instance of invertible neural networks, namely the normalizing flows, can be used to approximate the solution of this OT problem between a pair of empirical distributions. To this aim, we propose to relax the Monge formulation of OT by replacing the equality constraint on the push-forward measure by the minimization of the corresponding Wasserstein distance. The push-forward operator to be retrieved is then restricted to be a normalizing flow which is trained by optimizing the resulting cost function. This approach allows the transport map to be discretized as a composition of functions. Each of these functions is associated to one sub-flow of the network, whose output provides intermediate steps of the transport between the original and target measures. This discretization yields also a set of intermediate barycenters between the two measures of interest. Experiments conducted on toy examples as well as a challenging task of unsupervised translation demonstrate the interest of the proposed method. Finally, some experiments show that the proposed approach leads to a good approximation of the true OT.
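The relaxed objective in this abstract minimizes a Wasserstein distance between the pushed-forward samples and the target samples. In one dimension, for two empirical distributions with the same number of equally weighted samples, that distance has a closed form: sort both samples and match them in order. The sketch below shows only this 1-D special case as a reference point; it is not the normalizing-flow method, and `wasserstein2_1d` is a hypothetical helper name.

```python
import math


def wasserstein2_1d(xs, ys):
    """2-Wasserstein distance between two equal-size 1-D empirical samples.

    In 1-D the optimal (Monge) map pairs the i-th smallest source point
    with the i-th smallest target point, so sorting solves the problem.
    Returns the distance and the list of matched (source, target) pairs.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = list(zip(sorted(xs), sorted(ys)))
    mean_sq_cost = sum((x - y) ** 2 for x, y in pairs) / len(pairs)
    return math.sqrt(mean_sq_cost), pairs
```

A learned transport map in 1-D can be sanity-checked against this matching: its outputs on the source samples should approach the sorted target samples as training progresses.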
    A Deep Learning Approach for the solution of Probability Density Evolution of Stochastic Systems. (arXiv:2207.01907v1 [cs.LG])
    Derivation of the probability density evolution provides invaluable insight into the behavior of many stochastic systems and their performance. However, for most real-time applications, numerical determination of the probability density evolution is a formidable task. The latter is due to the required temporal and spatial discretization schemes that render most computational solutions prohibitive and impractical. In this respect, the development of an efficient computational surrogate model is of paramount importance. Recent studies on physics-constrained networks show that a suitable surrogate can be achieved by encoding physical insight into a deep neural network. To this aim, the present work introduces DeepPDEM, which utilizes the concept of physics-informed networks to solve the evolution of the probability density via a deep learning method. DeepPDEM learns the General Density Evolution Equation (GDEE) of stochastic structures. This approach paves the way for a mesh-free learning method that can solve the density evolution problem without prior simulation data. Moreover, it can also serve as an efficient surrogate for the solution at any other spatiotemporal points within optimization schemes or real-time applications. To demonstrate the potential applicability of the proposed framework, two network architectures with different activation functions as well as two optimizers are investigated. Numerical implementation on three different problems verifies the accuracy and efficacy of the proposed method.
    DiffML: End-to-end Differentiable ML Pipelines. (arXiv:2207.01269v2 [cs.DB] UPDATED)
    In this paper, we present our vision of differentiable ML pipelines, called DiffML, to automate the construction of ML pipelines in an end-to-end fashion. The idea is that DiffML allows jointly training not just the ML model itself but also the entire pipeline, including data preprocessing steps such as data cleaning and feature selection. Our core idea is to formulate all pipeline steps in a differentiable way such that the entire pipeline can be trained using backpropagation. However, this is a non-trivial problem and opens up many new research questions. To show the feasibility of this direction, we demonstrate initial ideas and a general principle of how typical preprocessing steps such as data cleaning, feature selection and dataset selection can be formulated as differentiable programs and jointly learned with the ML model. Moreover, we discuss a research roadmap and core challenges that have to be systematically tackled to enable fully differentiable ML pipelines.
    An adaptive music generation architecture for games based on the deep learning Transformer mode. (arXiv:2207.01698v1 [cs.SD])
    This paper presents an architecture for generating music for video games based on the Transformer deep learning model. The system generates music in various layers, following the standard layering strategy currently used by composers designing video game music. The music is adaptive to the psychological context of the player, according to the arousal-valence model. Our motivation is to customize music according to the player's tastes, who can select his preferred style of music through a set of training examples of music. We discuss current limitations and prospects for the future, such as collaborative and interactive control of the musical components.
    A Probabilistic State Space Model for Joint Inference from Differential Equations and Data. (arXiv:2103.10153v3 [stat.ML] UPDATED)
    Mechanistic models with differential equations are a key component of scientific applications of machine learning. Inference in such models is usually computationally demanding, because it involves repeatedly solving the differential equation. The main problem here is that the numerical solver is hard to combine with standard inference techniques. Recent work in probabilistic numerics has developed a new class of solvers for ordinary differential equations (ODEs) that phrase the solution process directly in terms of Bayesian filtering. We here show that this allows such methods to be combined very directly, with conceptual and numerical ease, with latent force models in the ODE itself. It then becomes possible to perform approximate Bayesian inference on the latent force as well as the ODE solution in a single, linear complexity pass of an extended Kalman filter / smoother - that is, at the cost of computing a single ODE solution. We demonstrate the expressiveness and performance of the algorithm by training, among others, a non-parametric SIRD model on data from the COVID-19 outbreak.
    An Optimization-based Algorithm for Non-stationary Kernel Bandits without Prior Knowledge. (arXiv:2205.14775v2 [stat.ML] UPDATED)
    We propose an algorithm for non-stationary kernel bandits that does not require prior knowledge of the degree of non-stationarity. The algorithm follows randomized strategies obtained by solving optimization problems that balance exploration and exploitation. It adapts to non-stationarity by restarting when a change in the reward function is detected. Our algorithm enjoys a tighter dynamic regret bound than previous work on the non-stationary kernel bandit setting. Moreover, when applied to the non-stationary linear bandit setting by using a linear kernel, our algorithm is nearly minimax optimal, solving an open problem in the non-stationary linear bandit literature. We extend our algorithm to use a neural network for dynamically adapting the feature mapping to observed data. We prove a dynamic regret bound of the extension using the neural tangent kernel theory. We demonstrate empirically that our algorithm and the extension can adapt to varying degrees of non-stationarity.
    Features Based Adaptive Augmentation for Graph Contrastive Learning. (arXiv:2207.01792v1 [cs.LG])
    Self-supervised learning aims to eliminate the need for expensive annotation in graph representation learning, where graph contrastive learning (GCL) is trained with self-supervision signals containing data-data pairs. These data-data pairs are generated by augmentation that applies stochastic functions to the original graph. We argue that some features can be more critical than others depending on the downstream task, and that applying a stochastic function uniformly will damage the influential features, leading to diminished accuracy. To fix this issue, we introduce a Feature Based Adaptive Augmentation (FebAA) approach, which identifies and preserves potentially influential features and corrupts the remaining ones. We implement FebAA as a plug-and-play layer and use it with state-of-the-art Deep Graph Contrastive Learning (GRACE) and Bootstrapped Graph Latents (BGRL). We successfully improved the accuracy of GRACE and BGRL on eight graph representation learning benchmark datasets.
    A Single-Loop Smoothed Gradient Descent-Ascent Algorithm for Nonconvex-Concave Min-Max Problems. (arXiv:2010.15768v2 [math.OC] UPDATED)
    Nonconvex-concave min-max problems arise in many machine learning applications, including minimizing a pointwise maximum of a set of nonconvex functions and robust adversarial training of neural networks. A popular approach to solving this problem is the gradient descent-ascent (GDA) algorithm, which unfortunately can exhibit oscillation in the case of nonconvexity. In this paper, we introduce a "smoothing" scheme which can be combined with GDA to stabilize the oscillation and ensure convergence to a stationary solution. We prove that the stabilized GDA algorithm can achieve an $O(1/\epsilon^2)$ iteration complexity for minimizing the pointwise maximum of a finite collection of nonconvex functions. Moreover, the smoothed GDA algorithm achieves an $O(1/\epsilon^4)$ iteration complexity for general nonconvex-concave problems. Extensions of this stabilized GDA algorithm to multi-block cases are presented. To the best of our knowledge, this is the first algorithm to achieve $O(1/\epsilon^2)$ for a class of nonconvex-concave problems. We illustrate the practical efficiency of the stabilized GDA algorithm on robust training.
    Explainability in Deep Reinforcement Learning, a Review into Current Methods and Applications. (arXiv:2207.01911v1 [cs.LG])
    The use of Deep Reinforcement Learning (DRL) schemes has increased dramatically since their first introduction in 2015. Though they are finding uses in many different applications, they still suffer from a lack of interpretability. This has bred a lack of understanding and trust in the use of DRL solutions among researchers and the general public. To solve this problem, the field of explainable artificial intelligence (XAI) has emerged. It comprises a variety of methods that look to open the DRL black boxes, ranging from the use of interpretable symbolic decision trees to numerical methods like Shapley Values. This review looks at which methods are being used and in which applications they are being used. This is done to identify which models are best suited to each application and whether a method is being underutilised.
    UniCR: Universally Approximated Certified Robustness via Randomized Smoothing. (arXiv:2207.02152v1 [cs.LG])
    We study certified robustness of machine learning classifiers against adversarial perturbations. In particular, we propose the first universally approximated certified robustness (UniCR) framework, which can approximate the robustness certification of any input on any classifier against any $\ell_p$ perturbations with noise generated by any continuous probability distribution. Compared with the state-of-the-art certified defenses, UniCR provides many significant benefits: (1) the first universal robustness certification framework for the above 4 'any's; (2) automatic robustness certification that avoids case-by-case analysis, (3) tightness validation of certified robustness, and (4) optimality validation of noise distributions used by randomized smoothing. We conduct extensive experiments to validate the above benefits of UniCR and the advantages of UniCR over state-of-the-art certified defenses against $\ell_p$ perturbations.
    Compactness Score: A Fast Filter Method for Unsupervised Feature Selection. (arXiv:2201.13194v2 [cs.LG] UPDATED)
    With the flourishing of the information age, massive amounts of data are generated every day. Because these data are large-scale and high-dimensional, it is often difficult to make good decisions in practical applications. Therefore, efficient big data analytics methods are urgently needed. In feature engineering, feature selection is an important research topic that aims to select "excellent" features from candidate ones. Feature selection serves several purposes, such as dimensionality reduction, model effect improvement, and model performance improvement. In many classification tasks, researchers have found that data points tend to be close to each other if they are from the same class; thus, local compactness is of great importance for the evaluation of a feature. In this manuscript, we propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features. To demonstrate its efficiency and accuracy, extensive experiments are performed on several data sets. The effectiveness and superiority of our method are then shown on clustering tasks. Here, the performance is indicated by several well-known evaluation metrics, while the efficiency is reflected by the corresponding running time. As revealed by the simulation results, our proposed algorithm seems to be more accurate and efficient compared with existing algorithms.
    Bayesian NVH metamodels to assess interior cabin noise using measurement databases. (arXiv:2207.02120v1 [stat.AP])
    In recent years, a great emphasis has been put on engineering the acoustic signature of vehicles that represents the overall comfort level for passengers. Due to highly uncertain behavior of production cars, probabilistic metamodels or surrogates can be useful to estimate the NVH dispersion and assess different NVH risks. These metamodels follow physical behaviors and shall aid as a design space exploration tool during the early stage design process to support the NVH optimization. The measurement databases constitute different noise paths such as aerodynamic noise (wind-tunnel test), tire-pavement interaction noise (rolling noise), and noise due to electric motors (whining noise). This research work proposes a global NVH metamodeling technique for broadband noises such as aerodynamic and rolling noises exploiting the Bayesian framework that takes into account the prior (domain-expert) knowledge about complex physical mechanisms. Generalized additive models (GAMs) with polynomials and Gaussian basis functions are used to model the dependency of sound pressure level (SPL) on predictor variables. Moreover, parametric bootstrap algorithm based on data-generating mechanism using the point estimates is used to estimate the dispersion in unknown parameters. Probabilistic modelling is carried out using an open-source library PyMC3 that utilizes No-U-Turn sampler (NUTS) and the developed models are validated using Cross-Validation technique.
    Near out-of-distribution detection for low-resolution radar micro-Doppler signatures. (arXiv:2205.07869v2 [eess.SP] UPDATED)
    Near out-of-distribution detection (OODD) aims at discriminating semantically similar data points without the supervision required for classification. This paper puts forward an OODD use case for radar targets detection extensible to other kinds of sensors and detection scenarios. We emphasize the relevance of OODD and its specific supervision requirements for the detection of a multimodal, diverse targets class among other similar radar targets and clutter in real-life critical systems. We propose a comparison of deep and non-deep OODD methods on simulated low-resolution pulse radar micro-Doppler signatures, considering both a spectral and a covariance matrix input representation. The covariance representation aims at estimating whether dedicated second-order processing is appropriate to discriminate signatures. The potential contributions of labeled anomalies in training, self-supervised learning, contrastive learning insights and innovative training losses are discussed, and the impact of training set contamination caused by mislabelling is investigated.
    Conflicting Interactions Among Protections Mechanisms for Machine Learning Models. (arXiv:2207.01991v1 [cs.LG])
    Nowadays, systems based on machine learning (ML) are widely used in different domains. Given their popularity, ML models have become targets for various attacks. As a result, research at the intersection of security and privacy, and ML has flourished. The research community has been exploring the attack vectors and potential mitigations separately. However, practitioners will likely need to deploy defences against several threats simultaneously. A solution that is optimal for a specific concern may interact negatively with solutions intended to address other concerns. In this work, we explore the potential for conflicting interactions between different solutions that enhance the security/privacy of ML-based systems. We focus on model and data ownership, exploring how ownership verification techniques interact with other ML security/privacy techniques like differentially private training, and robustness against model evasion. We provide a framework, and conduct systematic analysis of pairwise interactions. We show that many pairs are incompatible. Where possible, we provide relaxations to the hyperparameters or the techniques themselves that allow for the simultaneous deployment. Lastly, we discuss the implications and provide guidelines for future work.
    Towards trustworthy Energy Disaggregation: A review of challenges, methods and perspectives for Non-Intrusive Load Monitoring. (arXiv:2207.02009v1 [cs.LG])
    Non-intrusive load monitoring (NILM) is the task of disaggregating the total power consumption into its individual sub-components. Over the years, signal processing and machine learning algorithms have been combined to achieve this. Numerous publications and extensive research on energy disaggregation, or NILM, have aimed to bring state-of-the-art methods to the desired performance. The initial interest of the scientific community in formulating and mathematically describing the NILM problem using machine learning tools has now shifted towards a more practical NILM. Nowadays, we are in the mature NILM period, where there is an attempt to apply NILM in real-life application scenarios. Thus, complexity of the algorithms, transferability, reliability, practicality and, in general, trustworthiness are the main issues of interest. This review narrows the gap between the early immature NILM era and the mature one. In particular, the paper provides a comprehensive literature review of the NILM methods for residential appliances only. The paper analyzes, summarizes and presents the outcomes of a large number of recently published scholarly articles. Also, the paper discusses the highlights of these methods and introduces the research dilemmas that should be taken into consideration by researchers to apply NILM methods. Finally, we show the need for transferring the traditional disaggregation models into a practical and trustworthy framework.
    PLATINUM: Semi-Supervised Model Agnostic Meta-Learning using Submodular Mutual Information. (arXiv:2201.12928v2 [cs.LG] UPDATED)
    Few-shot classification (FSC) requires training models using a few (typically one to five) data points per class. Meta learning has proven to be able to learn a parametrized model for FSC by training on various other classification tasks. In this work, we propose PLATINUM (semi-suPervised modeL Agnostic meTa-learnIng usiNg sUbmodular Mutual information), a novel semi-supervised model agnostic meta-learning framework that uses the submodular mutual information (SMI) functions to boost the performance of FSC. PLATINUM leverages unlabeled data in the inner and outer loop using SMI functions during meta-training and obtains richer meta-learned parameterizations for meta-test. We study the performance of PLATINUM in two scenarios - 1) where the unlabeled data points belong to the same set of classes as the labeled set of a certain episode, and 2) where there exist out-of-distribution classes that do not belong to the labeled set. We evaluate our method on various settings on the miniImageNet, tieredImageNet and Fewshot-CIFAR100 datasets. Our experiments show that PLATINUM outperforms MAML and semi-supervised approaches like pseudo-labeling for semi-supervised FSC, especially for a small ratio of labeled examples per class.
    On the Efficiency of Subclass Knowledge Distillation in Classification Tasks. (arXiv:2109.05587v3 [cs.LG] UPDATED)
    This work introduces a novel knowledge distillation framework for classification tasks where information on existing subclasses is available and taken into consideration. In classification tasks with a small number of classes or binary detection (two classes) the amount of information transferred from the teacher to the student network is restricted, thus limiting the utility of knowledge distillation. Performance can be improved by leveraging information about possible subclasses within the available classes in the classification task. To that end, we propose the so-called Subclass Knowledge Distillation (SKD) framework, which is the process of transferring the subclasses' prediction knowledge from a large teacher model into a smaller student one. Through SKD, additional meaningful information which is not in the teacher's class logits but exists in subclasses (e.g., similarities inside classes) will be conveyed to the student and boost its performance. Mathematically, we measure how many extra information bits the teacher can provide for the student via SKD framework. The framework developed is evaluated in clinical application, namely colorectal polyp binary classification. In this application, clinician-provided annotations are used to define subclasses based on the annotation label's variability in a curriculum style of learning. A lightweight, low complexity student trained with the proposed framework achieves an F1-score of 85.05%, an improvement of 2.14% and 1.49% gain over the student that trains without and with conventional knowledge distillation, respectively. These results show that the extra subclasses' knowledge (i.e., 0.4656 label bits per training sample in our experiment) can provide more information about the teacher generalization, and therefore SKD can benefit from using more information to increase the student performance.
    Multimodal Frame-Scoring Transformer for Video Summarization. (arXiv:2207.01814v1 [cs.LG])
    As the amount of video content has mushroomed in recent years, automatic video summarization has become useful when we want to just peek at the content of a video. However, there are two underlying limitations in the generic video summarization task. First, most previous approaches read in just visual features as input, leaving other modality features behind. Second, existing datasets for generic video summarization are relatively insufficient to train a caption generator and multimodal feature extractors. To address these two problems, this paper proposes the Multimodal Frame-Scoring Transformer (MFST) framework exploiting visual, text and audio features and scoring a video with respect to frames. Our MFST framework first extracts features for each modality (visual-text-audio) using pretrained encoders. Then, MFST trains the multimodal frame-scoring transformer that uses video-text-audio representations as inputs and predicts frame-level scores. Our extensive experiments with previous models and ablation studies on TVSum and SumMe datasets demonstrate the effectiveness and superiority of our proposed method.
    Deriving Surface Resistivity from Polarimetric SAR Data Using Dual-Input UNet. (arXiv:2207.01811v1 [physics.geo-ph])
    Traditional survey methods for finding surface resistivity are time-consuming and labor intensive. Very few studies have focused on finding the resistivity/conductivity using remote sensing data and deep learning techniques. In this line of work, we assessed the correlation between surface resistivity and Synthetic Aperture Radar (SAR) by applying various deep learning methods and tested our hypothesis in the Coso Geothermal Area, USA. For detecting the resistivity, L-band full polarimetric SAR data acquired by UAVSAR were used, and MT (Magnetotellurics) inverted resistivity data of the area were used as the ground truth. We conducted experiments to compare various deep learning architectures and suggest the use of Dual Input UNet (DI-UNet) architecture. DI-UNet uses a deep learning architecture to predict the resistivity using full polarimetric SAR data by promising a quick survey addition to the traditional method. Our proposed approach accomplished improved outcomes for the mapping of MT resistivity from SAR data.
    Defending against the Label-flipping Attack in Federated Learning. (arXiv:2207.01982v1 [cs.CR])
    Federated learning (FL) provides autonomy and privacy by design to participating peers, who cooperatively build a machine learning (ML) model while keeping their private data in their devices. However, that same autonomy opens the door for malicious peers to poison the model by conducting either untargeted or targeted poisoning attacks. The label-flipping (LF) attack is a targeted poisoning attack where the attackers poison their training data by flipping the labels of some examples from one class (i.e., the source class) to another (i.e., the target class). Unfortunately, this attack is easy to perform and hard to detect, and it negatively impacts the performance of the global model. Existing defenses against LF are limited by assumptions on the distribution of the peers' data and/or do not perform well with high-dimensional models. In this paper, we deeply investigate the LF attack behavior and find that the contradicting objectives of attackers and honest peers on the source class examples are reflected in the parameter gradients corresponding to the neurons of the source and target classes in the output layer, making those gradients good discriminative features for the attack detection. Accordingly, we propose a novel defense that first dynamically extracts those gradients from the peers' local updates, and then clusters the extracted gradients, analyzes the resulting clusters and filters out potential bad updates before model aggregation. Extensive empirical analysis on three data sets shows the proposed defense's effectiveness against the LF attack regardless of the data distribution or model dimensionality. Also, the proposed defense outperforms several state-of-the-art defenses by offering lower test error, higher overall accuracy, higher source class accuracy, lower attack success rate, and higher stability of the source class accuracy.
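    As a rough illustration of the label-flipping attack described in this abstract, the Python sketch below relabels a malicious peer's source-class examples as the target class. The function name `flip_labels` and the choice to flip every source-class example are assumptions made for illustration; the abstract only says that the labels of some examples are flipped, and it does not specify the defense in enough detail to sketch here.

```python
def flip_labels(labels, source_class, target_class):
    """Return a poisoned copy of `labels` (hypothetical helper).

    Every example labelled `source_class` is relabelled `target_class`;
    all other labels are left unchanged. The input list is not modified.
    """
    return [target_class if y == source_class else y for y in labels]


# A malicious peer's local labels before and after the attack.
clean = [0, 1, 2, 1, 0]
poisoned = flip_labels(clean, source_class=1, target_class=2)
```

    Only the source class is affected, which is why the attack is targeted: accuracy on the other classes can stay largely intact while source-class examples are pushed toward the target class.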
    Disentangling private classes through regularization. (arXiv:2207.02000v1 [cs.LG])
    Deep learning models are nowadays broadly deployed to solve an incredibly large variety of tasks. However, little attention has been devoted to connected legal aspects. In 2016, the European Union approved the General Data Protection Regulation which entered into force in 2018. Its main rationale was to protect the privacy and data protection of its citizens by the way of operating of the so-called "Data Economy". As data is the fuel of modern Artificial Intelligence, it is argued that the GDPR can be partly applicable to a series of algorithmic decision making tasks before a more structured AI Regulation enters into force. In the meantime, AI should not allow undesired information leakage deviating from the purpose for which it is created. In this work we propose DisP, an approach for deep learning models disentangling the information related to some classes we desire to keep private, from the data processed by AI. In particular, DisP is a regularization strategy de-correlating the features belonging to the same private class at training time, hiding the information of private classes membership. Our experiments on state-of-the-art deep learning models show the effectiveness of DisP, minimizing the risk of extraction for the classes we desire to keep private.
    Ask-AC: An Initiative Advisor-in-the-Loop Actor-Critic Framework. (arXiv:2207.01955v1 [cs.LG])
    Despite the promising results achieved, state-of-the-art interactive reinforcement learning schemes rely on passively receiving supervision signals from advisor experts, in the form of either continuous monitoring or pre-defined rules, which inevitably result in a cumbersome and expensive learning process. In this paper, we introduce a novel initiative advisor-in-the-loop actor-critic framework, termed as Ask-AC, that replaces the unilateral advisor-guidance mechanism with a bidirectional learner-initiative one, and thereby enables a customized and efficacious message exchange between learner and advisor. At the heart of Ask-AC are two complementary components, namely action requester and adaptive state selector, that can be readily incorporated into various discrete actor-critic architectures. The former component allows the agent to initiatively seek advisor intervention in the presence of uncertain states, while the latter identifies the unstable states potentially missed by the former especially when environment changes, and then learns to promote the ask action on such states. Experimental results on both stationary and non-stationary environments and across different actor-critic backbones demonstrate that the proposed framework significantly improves the learning efficiency of the agent, and achieves the performances on par with those obtained by continuous advisor monitoring.
    Network Support for High-performance Distributed Machine Learning. (arXiv:2102.03394v2 [cs.NI] UPDATED)
    The traditional approach to distributed machine learning is to adapt learning algorithms to the network, e.g., reducing updates to curb overhead. Networks based on intelligent edge, instead, make it possible to follow the opposite approach, i.e., to define the logical network topology around the learning task to perform, so as to meet the desired learning performance. In this paper, we propose a system model that captures such aspects in the context of supervised machine learning, accounting for both learning nodes (that perform computations) and information nodes (that provide data). We then formulate the problem of selecting (i) which learning and information nodes should cooperate to complete the learning task, and (ii) the number of iterations to perform, in order to minimize the learning cost while meeting the target prediction error and execution time. After proving important properties of the above problem, we devise an algorithm, named DoubleClimb, that can find a 1+1/|I|-competitive solution (with I being the set of information nodes), with cubic worst-case complexity. Our performance evaluation, leveraging a real-world network topology and considering both classification and regression tasks, also shows that DoubleClimb closely matches the optimum, outperforming state-of-the-art alternatives.
    VisRuler: Visual Analytics for Extracting Decision Rules from Bagged and Boosted Decision Trees. (arXiv:2112.00334v3 [cs.LG] UPDATED)
    Bagging and boosting are two popular ensemble methods in machine learning (ML) that produce many individual decision trees. Due to the inherent ensemble characteristic of these methods, they typically outperform single decision trees or other ML models in predictive performance. However, numerous decision paths are generated for each decision tree, increasing the overall complexity of the model and hindering its use in domains that require trustworthy and explainable decisions, such as finance, social care, and health care. Thus, the interpretability of bagging and boosting algorithms, such as random forest and adaptive boosting, reduces as the number of decisions rises. In this paper, we propose a visual analytics tool that aims to assist users in extracting decisions from such ML models via a thorough visual inspection workflow that includes selecting a set of robust and diverse models (originating from different ensemble learning algorithms), choosing important features according to their global contribution, and deciding which decisions are essential for global explanation (or locally, for specific cases). The outcome is a final decision based on the class agreement of several models and the explored manual decisions exported by users. We evaluated the applicability and effectiveness of VisRuler via a use case, a usage scenario, and a user study. The evaluation revealed that most users managed to successfully use our system to explore decision rules visually, performing the proposed tasks and answering the given questions in a satisfying way.
    Image Amodal Completion: A Survey. (arXiv:2207.02062v1 [cs.CV])
    Existing computer vision systems can compete with humans in understanding the visible parts of objects, but still fall far short of humans when it comes to depicting the invisible parts of partially occluded objects. Image amodal completion aims to equip computers with human-like amodal completion functions to understand an intact object despite it being partially occluded. The main purpose of this survey is to provide an intuitive understanding of the research hotspots, key technologies and future trends in the field of image amodal completion. Firstly, we present a comprehensive review of the latest literature in this emerging field, exploring three key tasks in image amodal completion, including amodal shape completion, amodal appearance completion, and order perception. Then we examine popular datasets related to image amodal completion along with their common data collection methods and evaluation metrics. Finally, we discuss real-world applications and future research directions for image amodal completion, facilitating the reader's understanding of the challenges of existing technologies and upcoming research trends.
    Improving Covariance Conditioning of the SVD Meta-layer by Orthogonality. (arXiv:2207.02119v1 [cs.CV])
    Inserting an SVD meta-layer into neural networks is prone to make the covariance ill-conditioned, which could harm the model in the training stability and generalization abilities. In this paper, we systematically study how to improve the covariance conditioning by enforcing orthogonality to the Pre-SVD layer. Existing orthogonal treatments on the weights are first investigated. However, these techniques can improve the conditioning but would hurt the performance. To avoid such a side effect, we propose the Nearest Orthogonal Gradient (NOG) and Optimal Learning Rate (OLR). The effectiveness of our methods is validated in two applications: decorrelated Batch Normalization (BN) and Global Covariance Pooling (GCP). Extensive experiments on visual recognition demonstrate that our methods can simultaneously improve the covariance conditioning and generalization. Moreover, the combinations with orthogonal weight can further boost the performances.
    One-Shot Transfer Learning of Physics-Informed Neural Networks. (arXiv:2110.11286v2 [cs.LG] UPDATED)
    Solving differential equations efficiently and accurately sits at the heart of progress in many areas of scientific research, from classical dynamical systems to quantum mechanics. There is a surge of interest in using Physics-Informed Neural Networks (PINNs) to tackle such problems as they provide numerous benefits over traditional numerical approaches. Despite their potential benefits for solving differential equations, transfer learning has been underexplored. In this study, we present a general framework for transfer learning PINNs that results in one-shot inference for linear systems of both ordinary and partial differential equations. This means that highly accurate solutions to many unknown differential equations can be obtained instantaneously without retraining an entire network. We demonstrate the efficacy of the proposed deep learning approach by solving several real-world problems, such as first- and second-order linear ordinary equations, the Poisson equation, and the time-dependent Schrödinger complex-value partial differential equation.
    A Boosting Algorithm for Positive-Unlabeled Learning. (arXiv:2205.09485v2 [cs.LG] UPDATED)
    Positive-unlabeled (PU) learning deals with binary classification problems when only positive (P) and unlabeled (U) data are available. Many PU methods based on linear models and neural networks have been proposed; however, there is still a lack of study on how theoretically sound boosting-style algorithms could work with P and U data. Considering that in some scenarios neural networks cannot perform as well as boosting algorithms even with fully-supervised data, we propose a novel boosting algorithm for PU learning: Ada-PU, which compares against neural networks. Ada-PU follows the general procedure of AdaBoost while two different distributions of P data are maintained and updated. After a weak classifier is learned on the newly updated distribution, the corresponding combining weight for the final ensemble is estimated using only PU data. We demonstrated that with a smaller set of base classifiers, the proposed method is guaranteed to keep the theoretical properties of boosting algorithms. In experiments, we showed that Ada-PU outperforms neural networks on benchmark PU datasets. We also study a real-world dataset UNSW-NB15 in cyber security and demonstrated that Ada-PU has superior performance for malicious activity detection.
    Degree-Based Random Walk Approach for Graph Embedding. (arXiv:2110.13627v2 [cs.SI] UPDATED)
    Graph embedding, representing local and global neighborhood information by numerical vectors, is a crucial part of the mathematical modeling of a wide range of real-world systems. Among the embedding algorithms, random walk-based algorithms have proven to be very successful. These algorithms collect information by creating numerous random walks with a predefined number of steps. Creating random walks is the most demanding part of the embedding process. The computation demand increases with the size of the network. Moreover, for real-world networks, considering all nodes on the same footing, the abundance of low-degree nodes creates an imbalanced data problem. In this work, a computationally less intensive and node connectivity aware uniform sampling method is proposed. In the proposed method, the number of random walks created for a node is proportional to the degree of the node. The advantages of the proposed algorithm become more pronounced when the algorithm is applied to large graphs. A comparative study using two networks, namely CORA and CiteSeer, is presented. Compared with the fixed number of walks case, the proposed method requires 50% less computational effort to reach the same accuracy for node classification and link prediction calculations.
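    The degree-proportional sampling idea in this abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `degree_proportional_walks`, the `walks_per_degree` scaling factor, the adjacency-list input, and the uniform choice of the next neighbor are all hypothetical choices made for illustration.

```python
import random


def degree_proportional_walks(adj, walk_length, walks_per_degree=1, seed=0):
    """Generate uniform random walks on a graph given as an adjacency list.

    Each node starts `walks_per_degree * degree(node)` walks, so that
    high-degree nodes contribute more walks than low-degree ones.
    """
    rng = random.Random(seed)
    walks = []
    for node, neighbors in adj.items():
        n_walks = walks_per_degree * len(neighbors)
        for _ in range(n_walks):
            walk = [node]
            while len(walk) < walk_length:
                current_neighbors = adj[walk[-1]]
                if not current_neighbors:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(current_neighbors))
            walks.append(walk)
    return walks


# Star graph: node 0 has degree 3, the leaves have degree 1.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
walks = degree_proportional_walks(adj, walk_length=4)
```

    With a fixed number of walks per node, this star graph would produce the same number of walks from each leaf as from the hub; here the hub starts three walks and each leaf starts one. The resulting walks could then be fed to a skip-gram style embedding model, which is outside the scope of this sketch.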
    "Even if ..." -- Diverse Semifactual Explanations of Reject. (arXiv:2207.01898v1 [cs.LG])
    Machine learning based decision making systems applied in safety critical areas require reliable high certainty predictions. For this purpose, the system can be extended by a reject option which allows the system to reject inputs where only a prediction with an unacceptably low certainty would be possible. While being able to reject uncertain samples is important, it is also of importance to be able to explain why a particular sample was rejected. With the ongoing rise of eXplainable AI (XAI), a lot of explanation methodologies for machine learning based systems have been developed -- explaining reject options, however, is still a novel field where only very little prior work exists. In this work, we propose to explain rejects by semifactual explanations, an instance of example-based explanation methods, which themselves have not been widely considered in the XAI community yet. We propose a conceptual modeling of semifactual explanations for arbitrary reject options and empirically evaluate a specific implementation on a conformal prediction based reject option.
    An Empirical Study of Language Model Integration for Transducer based Speech Recognition. (arXiv:2203.16776v3 [eess.AS] UPDATED)
    Utilizing text-only data with an external language model (ELM) in end-to-end RNN-Transducer (RNN-T) for speech recognition is challenging. Recently, a class of methods such as density ratio (DR) and internal language model estimation (ILME) have been developed, outperforming the classic shallow fusion (SF) method. The basic idea behind these methods is that RNN-T posterior should first subtract the implicitly learned internal language model (ILM) prior, in order to integrate the ELM. While recent studies suggest that RNN-T only learns some low-order language model information, the DR method uses a well-trained neural language model with full context, which may be inappropriate for the estimation of ILM and deteriorate the integration performance. Based on the DR method, we propose a low-order density ratio method (LODR) by replacing the estimation with a low-order weak language model. Extensive empirical experiments are conducted on both in-domain and cross-domain scenarios on English LibriSpeech & Tedlium-2 and Chinese WenetSpeech & AISHELL-1 datasets. It is shown that LODR consistently outperforms SF in all tasks, while performing generally close to ILME and better than DR in most tests.
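    The score combination these integration methods perform can be written as a weighted sum of log-probabilities, sketched below in Python. The function name `fused_score`, the argument names, and the weight values are assumptions for illustration and are not taken from the paper; only the overall form (subtract an internal-LM estimate, add the external LM) follows the abstract.

```python
def fused_score(logp_rnnt, logp_elm, logp_ilm=0.0, elm_weight=0.0, ilm_weight=0.0):
    """Combine per-hypothesis log-probabilities during decoding (hypothetical helper).

    logp_rnnt: log-probability from the RNN-T model.
    logp_elm:  log-probability from the external language model (ELM).
    logp_ilm:  estimate of the internal language model (ILM) log-probability;
               in an LODR-style setup this would come from a low-order LM.
    """
    return logp_rnnt - ilm_weight * logp_ilm + elm_weight * logp_elm


# Shallow fusion: only the ELM term is added.
sf = fused_score(-2.0, -1.0, elm_weight=0.5)

# Density-ratio style: the ILM estimate is subtracted before adding the ELM.
dr_like = fused_score(-2.0, -1.0, logp_ilm=-3.0, elm_weight=0.5, ilm_weight=0.25)
```

    In this form, DR and LODR differ only in where `logp_ilm` comes from: a full-context neural language model for DR, and a low-order weak language model for LODR.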
    Bayesian approaches for Quantifying Clinicians' Variability in Medical Image Quantification. (arXiv:2207.01868v1 [eess.IV])
    Medical imaging, including MRI, CT, and Ultrasound, plays a vital role in clinical decisions. Accurate segmentation is essential to measure the structure of interest from the image. However, manual segmentation is highly operator-dependent, which leads to high inter and intra-variability of quantitative measurements. In this paper, we explore the feasibility that Bayesian predictive distribution parameterized by deep neural networks can capture the clinicians' inter-intra variability. By exploring and analyzing recently emerged approximate inference schemes, we evaluate whether approximate Bayesian deep learning with the posterior over segmentations can learn inter-intra rater variability both in segmentation and clinical measurements. The experiments are performed with two different imaging modalities: MRI and ultrasound. We empirically demonstrated that Bayesian predictive distribution parameterized by deep neural networks could approximate the clinicians' inter-intra variability. We show a new perspective in analyzing medical images quantitatively by providing clinical measurement uncertainty.
    Application of multilayer perceptron with data augmentation in nuclear physics. (arXiv:2205.07953v2 [cs.LG] UPDATED)
    Neural networks have become popular in many fields of science since they serve as promising, reliable and powerful tools. In this work, we study the effect of data augmentation on the predictive power of neural network models for nuclear physics data. We present two different data augmentation techniques, and we conduct a detailed analysis in terms of different depths, optimizers, activation functions and random seed values to show the success and robustness of the model. Using the experimental uncertainties for data augmentation for the first time, the size of the training data set is artificially boosted and the changes in the root-mean-square error between the model predictions on the test set and the experimental data are investigated. Our results show that the data augmentation decreases the prediction errors, stabilizes the model and prevents overfitting. The extrapolation capabilities of the MLP models are also tested for newly measured nuclei in AME2020 mass table, and it is shown that the predictions are significantly improved by using data augmentation.
    Fidelity of Ensemble Aggregation for Saliency Map Explanations using Bayesian Optimization Techniques. (arXiv:2207.01565v2 [cs.CV] UPDATED)
    In recent years, an abundance of feature attribution methods for explaining neural networks have been developed. Especially in the field of computer vision, many methods for generating saliency maps providing pixel attributions exist. However, their explanations often contradict each other and it is not clear which explanation to trust. A natural solution to this problem is the aggregation of multiple explanations. We present and compare different pixel-based aggregation schemes with the goal of generating a new explanation, whose fidelity to the model's decision is higher than each individual explanation. Using methods from the field of Bayesian Optimization, we incorporate the variance between the individual explanations into the aggregation process. Additionally, we analyze the effect of multiple normalization techniques on ensemble aggregation.
    Resource Allocation in Multicore Elastic Optical Networks: A Deep Reinforcement Learning Approach. (arXiv:2207.02074v1 [cs.LG])
    A deep reinforcement learning approach is applied, for the first time, to solve the routing, modulation, spectrum and core allocation (RMSCA) problem in dynamic multicore fiber elastic optical networks (MCF-EONs). To do so, a new environment - compatible with OpenAI's Gym - was designed and implemented to emulate the operation of MCF-EONs. The new environment processes the agent actions (selection of route, core and spectrum slot) by considering the network state and physical-layer-related aspects. The latter includes the available modulation formats and their reach and the inter-core crosstalk (XT), an MCF-related impairment. If the resulting quality of the signal is acceptable, the environment allocates the resources selected by the agent. After processing the agent's action, the environment is configured to give the agent a numerical reward and information about the new network state. The blocking performance of four different agents was compared through simulation to three baseline heuristics used in MCF-EONs. Results obtained for the NSFNet and COST239 network topologies show that the best-performing agent achieves, on average, up to a four-times decrease in blocking probability compared with the best-performing baseline heuristic methods.
    Content Addressable Memory Without Catastrophic Forgetting by Heteroassociation with a Fixed Scaffold. (arXiv:2202.00159v3 [cs.AI] UPDATED)
    Content-addressable memory (CAM) networks, so-called because stored items can be recalled by partial or corrupted versions of the items, exhibit near-perfect recall of a small number of information-dense patterns below capacity and a 'memory cliff' beyond, such that inserting a single additional pattern results in catastrophic loss of all stored patterns. We propose a novel CAM architecture, Memory Scaffold with Heteroassociation (MESH), that factorizes the problems of internal attractor dynamics and association with external content to generate a CAM continuum without a memory cliff: Small numbers of patterns are stored with complete information recovery matching standard CAMs, while inserting more patterns still results in partial recall of every pattern, with a graceful trade-off between pattern number and pattern richness. Motivated by the architecture of the Entorhinal-Hippocampal memory circuit in the brain, MESH is a tripartite architecture with pairwise interactions that uses a predetermined set of internally stabilized states together with heteroassociation between the internal states and arbitrary external patterns. We show analytically and experimentally that for any number of stored patterns, MESH nearly saturates the total information bound (given by the number of synapses) for CAM networks, outperforming all existing CAM models.
    QuPeD: Quantized Personalization via Distillation with Applications to Federated Learning. (arXiv:2107.13892v2 [cs.LG] UPDATED)
    Traditionally, federated learning (FL) aims to train a single global model while collaboratively using multiple clients and a server. Two natural challenges that FL algorithms face are heterogeneity in data across clients and collaboration of clients with diverse resources. In this work, we introduce a quantized and personalized FL algorithm QuPeD that facilitates collective (personalized model compression) training via knowledge distillation (KD) among clients who have access to heterogeneous data and resources. For personalization, we allow clients to learn compressed personalized models with different quantization parameters and model dimensions/structures. Towards this, first we propose an algorithm for learning quantized models through a relaxed optimization problem, where quantization values are also optimized over. When each client participating in the (federated) learning process has different requirements for the compressed model (both in model dimension and precision), we formulate a compressed personalization framework by introducing knowledge distillation loss for local client objectives collaborating through a global model. We develop an alternating proximal gradient update for solving this compressed personalization problem, and analyze its convergence properties. Numerically, we validate that QuPeD outperforms competing personalized FL methods, FedAvg, and local training of clients in various heterogeneous settings.
    Local Multi-Label Explanations for Random Forest. (arXiv:2207.01994v1 [cs.LG])
    Multi-label classification is a challenging task, particularly in domains where the number of labels to be predicted is large. Deep neural networks are often effective at multi-label classification of images and textual data. When dealing with tabular data, however, conventional machine learning algorithms, such as tree ensembles, appear to outperform the competition. Random forest, being a popular ensemble algorithm, has found use in a wide range of real-world problems. Such problems include fraud detection in the financial domain, crime hotspot detection in the legal sector, and in the biomedical field, disease probability prediction when patient records are accessible. Since they have an impact on people's lives, these domains usually require decision-making systems to be explainable. Random Forest falls short on this property, especially when a large number of tree predictors are used. This issue was addressed in recent work named LionForests, for single-label classification and regression. In this work, we adapt this technique to multi-label classification problems, by employing three different strategies regarding the labels that the explanation covers. Finally, we provide a set of qualitative and quantitative experiments to assess the efficacy of this approach.
    Deterministic Decoupling of Global Features and its Application to Data Analysis. (arXiv:2207.02132v1 [cs.LG])
    We introduce a method for deterministic decoupling of global features and show its applicability to improve data analysis performance, as well as to open new avenues for feature transfer. We propose a new formalism that is based on defining transformations on submanifolds, by following trajectories along the features' gradients. Through these transformations we define a normalization that, we demonstrate, allows for decoupling differentiable features. By applying this to sampling moments, we obtain a quasi-analytic solution for the orthokurtosis, a normalized version of the kurtosis that is not just decoupled from mean and variance, but also from skewness. We apply this method in the original data domain and at the output of a filter bank to regression and classification problems based on global descriptors, obtaining a consistent and significant improvement in performance as compared to using classical (non-decoupled) descriptors.  ( 2 min )
    Modeling and Correcting Bias in Sequential Evaluation. (arXiv:2205.01607v2 [stat.ML] UPDATED)
    We consider the problem of sequential evaluation, in which an evaluator observes candidates in a sequence and assigns scores to these candidates in an online, irrevocable fashion. Motivated by the psychology literature that has studied sequential bias in such settings -- namely, dependencies between the evaluation outcome and the order in which the candidates appear -- we propose a natural model for the evaluator's rating process that captures the lack of calibration inherent to such a task. We conduct crowdsourcing experiments to demonstrate various facets of our model. We then proceed to study how to correct sequential bias under our model by posing this as a statistical inference problem. We propose a near-linear time, online algorithm for this task and prove guarantees in terms of two canonical ranking metrics. We also prove that our algorithm is information theoretically optimal, by establishing matching lower bounds in both metrics. Finally, we show that our algorithm outperforms the de facto method of using the rankings induced by the reported scores.
    Entity Linking in Tabular Data Needs the Right Attention. (arXiv:2207.01937v1 [cs.CL])
    Understanding the semantic meaning of tabular data requires Entity Linking (EL), in order to associate each cell value with a real-world entity in a Knowledge Base (KB). In this work, we focus on end-to-end solutions for EL on tabular data that do not rely on fact lookup in the target KB. Tabular data contains heterogeneous and sparse context, including column headers, cell values and table captions. We experiment with various models to generate a vector representation for each cell value to be linked. Our results show that it is critical to apply an attention mechanism as well as an attention mask, so that the model can only attend to the most relevant context and avoid information dilution. The most relevant context includes: same-row cells, same-column cells, headers and caption. Computational complexity, however, grows quadratically with the size of tabular data for such a complex model. We achieve constant memory usage by introducing a Tabular Entity Linking Lite model (TELL) that generates vector representation for a cell based only on its value, the table headers and the table caption. TELL achieves 80.8% accuracy on Wikipedia tables, which is only 0.1% lower than the state-of-the-art model with quadratic memory usage.  ( 2 min )
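    To make the "most relevant context" concrete, here is a small sketch of an attention mask over table cells, assuming the table is a plain grid and the target cell is given by (row, column) indices. Only the same-row and same-column rule from the abstract is modeled; headers and the caption are treated as always visible and left out. `cell_attention_mask` is a hypothetical helper, not the paper's implementation.

```python
def cell_attention_mask(n_rows, n_cols, target):
    """Return an n_rows x n_cols boolean grid.

    An entry is True when the target cell may attend to that cell,
    i.e. the cell shares a row or a column with the target.
    """
    target_row, target_col = target
    return [
        [(r == target_row) or (c == target_col) for c in range(n_cols)]
        for r in range(n_rows)
    ]


def visible_cells(mask):
    """Count how many cells the target is allowed to attend to."""
    return sum(sum(1 for allowed in row if allowed) for row in mask)
```

    For an R x C table the mask admits R + C - 1 cells instead of all R * C, which is the sense in which masking limits information dilution.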
    A Causal Approach for Business Optimization: Application on an Online Marketplace. (arXiv:2207.01722v1 [cs.LG])
    A common sales strategy involves having account executives (AEs) actively reach out and contact potential customers. However, not all contact attempts have a positive effect: some attempts do not change customer decisions, while others might even interfere with the desired outcome. In this work we propose using causal inference to estimate the effect of contacting each potential customer and setting the contact policy accordingly. We demonstrate this approach on data from Worthy.com, an online jewelry marketplace. We examined the Worthy business process to identify relevant decisions and outcomes, and formalized assumptions on how they were made. Using causal tools, we selected a decision point where improving AE contact activity appeared to be promising. We then generated a personalized policy and recommended reaching out only to customers for whom it would be beneficial. Finally, we validated the results in an A/B test over a 3-month period, resulting in an increase in item delivery rate of the targeted population by 22% (p-value=0.026). This policy is now being used on an ongoing basis.
    Insights into the origin of halo mass profiles from machine learning. (arXiv:2205.04474v2 [astro-ph.CO] UPDATED)
    The mass distribution of dark matter haloes is the result of the hierarchical growth of initial density perturbations through mass accretion and mergers. We use an interpretable machine-learning framework to provide physical insights into the origin of the spherically-averaged mass profile of dark matter haloes. We train a gradient-boosted-trees algorithm to predict the final mass profiles of cluster-sized haloes, and measure the importance of the different inputs provided to the algorithm. We find two primary scales in the initial conditions (ICs) that impact the final mass profile: the density at approximately the scale of the haloes' Lagrangian patch $R_L$ ($R\sim 0.7\, R_L$) and that in the large-scale environment ($R\sim 1.7~R_L$). The model also identifies three primary time-scales in the halo assembly history that affect the final profile: (i) the formation time of the virialized, collapsed material inside the halo, (ii) the dynamical time, which captures the dynamically unrelaxed, infalling component of the halo over its first orbit, (iii) a third, most recent time-scale, which captures the impact on the outer profile of recent massive merger events. While the inner profile retains memory of the ICs, this information alone is insufficient to yield accurate predictions for the outer profile. As we add information about the haloes' mass accretion history, we find a significant improvement in the predicted profiles at all radii. Our machine-learning framework provides novel insights into the role of the ICs and the mass assembly history in determining the final mass profile of cluster-sized haloes.
    Adapting to Online Label Shift with Provable Guarantees. (arXiv:2207.02121v1 [cs.LG])
    The standard supervised learning paradigm works effectively when training data shares the same distribution as the upcoming testing samples. However, this assumption is often violated in real-world applications, especially when testing data appear in an online fashion. In this paper, we formulate and investigate the problem of online label shift (OLaS): the learner trains an initial model from the labeled offline data and then deploys it to an unlabeled online environment where the underlying label distribution changes over time but the label-conditional density does not. The non-stationary nature of the environment and the lack of supervision make the problem challenging to tackle. To address the difficulty, we construct a new unbiased risk estimator that utilizes the unlabeled data, which exhibits many benign properties albeit with potential non-convexity. Building upon that, we propose novel online ensemble algorithms to deal with the non-stationarity of the environments. Our approach enjoys optimal dynamic regret, indicating that the performance is competitive with a clairvoyant who knows the online environments in hindsight and then chooses the best decision for each round. The obtained dynamic regret bound scales with the intensity and pattern of label distribution shift, hence exhibiting the adaptivity in the OLaS problem. Extensive experiments are conducted to validate the effectiveness and support our theoretical findings.  ( 2 min )
    Improved Global Guarantees for the Nonconvex Burer--Monteiro Factorization via Rank Overparameterization. (arXiv:2207.01789v1 [math.OC])
    We consider minimizing a twice-differentiable, $L$-smooth, and $\mu$-strongly convex objective $\phi$ over an $n\times n$ positive semidefinite matrix $M\succeq0$, under the assumption that the minimizer $M^{\star}$ has low rank $r^{\star}\ll n$. Following the Burer--Monteiro approach, we instead minimize the nonconvex objective $f(X)=\phi(XX^{T})$ over a factor matrix $X$ of size $n\times r$. This substantially reduces the number of variables from $O(n^{2})$ to as few as $O(n)$ and also enforces positive semidefiniteness for free, but at the cost of giving up the convexity of the original problem. In this paper, we prove that if the search rank $r\ge r^{\star}$ is overparameterized by a constant factor with respect to the true rank $r^{\star}$, namely as in $r>\frac{1}{4}(L/\mu-1)^{2}r^{\star}$, then despite nonconvexity, local optimization is guaranteed to globally converge from any initial point to the global optimum. This significantly improves upon a previous rank overparameterization threshold of $r\ge n$, which is known to be sharp if $\phi$ is allowed to be nonsmooth and/or non-strongly convex, but would increase the number of variables back up to $O(n^{2})$. Conversely, without rank overparameterization, we prove that such a global guarantee is possible if and only if $\phi$ is almost perfectly conditioned, with a condition number of $L/\mu<3$. Therefore, we conclude that a small amount of overparameterization can lead to large improvements in theoretical guarantees for the nonconvex Burer--Monteiro factorization.  ( 3 min )
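    As a worked illustration of the threshold $r>\frac{1}{4}(L/\mu-1)^{2}r^{\star}$, the sketch below computes the smallest integer search rank that satisfies both $r\ge r^{\star}$ and the strict inequality. The function name `min_search_rank` is hypothetical; only the formula comes from the abstract.

```python
import math


def min_search_rank(L, mu, r_star):
    """Smallest integer r with r >= r_star and r > (1/4) * (L/mu - 1)^2 * r_star."""
    bound = 0.25 * (L / mu - 1.0) ** 2 * r_star
    # floor(bound) + 1 is the smallest integer strictly greater than bound.
    return max(r_star, math.floor(bound) + 1)
```

    For a condition number $L/\mu<3$ the factor $\frac{1}{4}(L/\mu-1)^{2}$ is below 1, so the function returns $r^{\star}$ itself; this matches the abstract's statement that no overparameterization is needed in that regime.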
    A Neural Tangent Kernel Perspective of GANs. (arXiv:2106.05566v4 [cs.LG] UPDATED)
    We propose a novel theoretical framework of analysis for Generative Adversarial Networks (GANs). We reveal a fundamental flaw of previous analyses which, by incorrectly modeling GANs' training scheme, are subject to ill-defined discriminator gradients. We overcome this issue which impedes a principled study of GAN training, solving it within our framework by taking into account the discriminator's architecture. To this end, we leverage the theory of infinite-width neural networks for the discriminator via its Neural Tangent Kernel. We characterize the trained discriminator for a wide range of losses and establish general differentiability properties of the network. From this, we derive new insights about the convergence of the generated distribution, advancing our understanding of GANs' training dynamics. We empirically corroborate these results via an analysis toolkit based on our framework, unveiling intuitions that are consistent with GAN practice.  ( 3 min )
    Approximating Discontinuous Nash Equilibrial Values of Two-Player General-Sum Differential Games. (arXiv:2207.01773v1 [cs.LG])
    Finding Nash equilibrial policies for two-player differential games requires solving Hamilton-Jacobi-Isaacs PDEs. Recent studies achieved success in circumventing the curse of dimensionality in solving such PDEs with underlying applications to human-robot interactions (HRI), by adopting self-supervised (physics-informed) neural networks as universal value approximators. This paper extends from previous SOTA on zero-sum games with continuous values to general-sum games with discontinuous values, where the discontinuity is caused by that of the players' losses. We show that due to its lack of convergence proof and generalization analysis on discontinuous losses, the existing self-supervised learning technique fails to generalize and raises safety concerns in an autonomous driving application. Our solution is to first pre-train the value network on supervised Nash equilibria, and then refine it by minimizing a loss that combines the supervised data with the PDE and boundary conditions. Importantly, the demonstrated advantage of the proposed learning method against purely supervised and self-supervised approaches requires careful choice of the neural activation function: Among $\texttt{relu}$, $\texttt{sin}$, and $\texttt{tanh}$, we show that $\texttt{tanh}$ is the only choice that achieves optimal generalization and safety performance. Our conjecture is that $\texttt{tanh}$ (similar to $\texttt{sin}$) allows continuity of value and its gradient, which is sufficient for the convergence of learning, and at the same time is expressive enough (similar to $\texttt{relu}$) at approximating discontinuous value landscapes. Lastly, we apply our method to approximating control policies for an incomplete-information interaction and demonstrate its contribution to safe interactions.  ( 3 min )
    Correlation between entropy and generalizability in a neural network. (arXiv:2207.01996v1 [cond-mat.stat-mech])
    Although neural networks can solve very complex machine-learning problems, the theoretical reason for their generalizability is still not fully understood. Here we use the Wang-Landau Monte Carlo algorithm to calculate the entropy (logarithm of the volume of a part of the parameter space) at a given test accuracy, and a given training loss function value or training accuracy. Our results show that entropic forces help generalizability. Although our study is on a very simple application of neural networks (a spiral dataset and a small, fully-connected neural network), our approach should be useful in explaining the generalizability of more complicated neural networks in future works.  ( 2 min )
    Unsupervised Crowdsourcing with Accuracy and Cost Guarantees. (arXiv:2207.01988v1 [cs.LG])
    We consider the problem of cost-optimal utilization of a crowdsourcing platform for binary, unsupervised classification of a collection of items, given a prescribed error threshold. Workers on the crowdsourcing platform are assumed to be divided into multiple classes, based on their skill, experience, and/or past performance. We model each worker class via an unknown confusion matrix, and a (known) price to be paid per label prediction. For this setting, we propose algorithms for acquiring label predictions from workers, and for inferring the true labels of items. We prove that if the number of (unlabeled) items available is large enough, our algorithms satisfy the prescribed error thresholds, incurring a cost that is near-optimal. Finally, we validate our algorithms, and some heuristics inspired by them, through an extensive case study.  ( 2 min )
    Learning to Accelerate Approximate Methods for Solving Integer Programming via Early Fixing. (arXiv:2207.02087v1 [cs.DM])
    Integer programming (IP) is an important and challenging problem. Approximate methods have shown promising performance on both effectiveness and efficiency for solving the IP problem. However, we observed that a large fraction of variables solved by some iterative approximate methods fluctuate around their final converged discrete states over many iterations. Inspired by this observation, we aim to accelerate these approximate methods by fixing these fluctuating variables early to their converged states while not significantly harming the solution accuracy. To this end, we propose an early fixing framework along with the approximate method. We formulate the whole early fixing process as a Markov decision process, and train it using imitation learning. A policy network will evaluate the posterior probability of each free variable concerning its discrete candidate states in each block of iterations. Specifically, we adopt the powerful multi-headed attention mechanism in the policy network. Extensive experiments on our proposed early fixing framework are conducted on three different IP applications: constrained linear programming, MRF energy minimization and sparse adversarial attack. The first is a linear IP problem, while the latter two are quadratic IP problems. We extend the problem scale from regular size to significantly large size. The extensive experiments reveal the competitiveness of our early fixing framework: the runtime speeds up significantly, while the solution quality does not degrade much, and in some cases better solutions are even obtained. Our proposed early fixing framework can be regarded as an acceleration extension of ADMM methods for solving integer programming. The source codes are available at https://github.com/SCLBD/Accelerated-Lpbox-ADMM.  ( 3 min )
    Meta-Learning a Real-Time Tabular AutoML Method For Small Data. (arXiv:2207.01848v1 [cs.LG])
    We present TabPFN, an AutoML method that is competitive with the state of the art on small tabular datasets while being over 1,000$\times$ faster. Our method is very simple: it is fully entailed in the weights of a single neural network, and a single forward pass directly yields predictions for a new dataset. Our AutoML method is meta-learned using the Transformer-based Prior-Data Fitted Network (PFN) architecture and approximates Bayesian inference with a prior that is based on assumptions of simplicity and causal structures. The prior contains a large space of structural causal models and Bayesian neural networks with a bias for small architectures and thus low complexity. Furthermore, we extend the PFN approach to differentiably calibrate the prior's hyperparameters on real data. By doing so, we separate our abstract prior assumptions from their heuristic calibration on real data. Afterwards, the calibrated hyperparameters are fixed and TabPFN can be applied to any new tabular dataset at the push of a button. Finally, on 30 datasets from the OpenML-CC18 suite we show that our method outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems with predictions produced in less than a second. We provide all our code and our final trained TabPFN in the supplementary materials.  ( 2 min )
    Learning Matchable Image Transformations for Long-term Metric Visual Localization. (arXiv:1904.01080v5 [cs.CV] UPDATED)
    Long-term metric self-localization is an essential capability of autonomous mobile robots, but remains challenging for vision-based systems due to appearance changes caused by lighting, weather, or seasonal variations. While experience-based mapping has proven to be an effective technique for bridging the 'appearance gap,' the number of experiences required for reliable metric localization over days or months can be very large, and methods for reducing the necessary number of experiences are needed for this approach to scale. Taking inspiration from color constancy theory, we learn a nonlinear RGB-to-grayscale mapping that explicitly maximizes the number of inlier feature matches for images captured under different lighting and weather conditions, and use it as a pre-processing step in a conventional single-experience localization pipeline to improve its robustness to appearance change. We train this mapping by approximating the target non-differentiable localization pipeline with a deep neural network, and find that incorporating a learned low-dimensional context feature can further improve cross-appearance feature matching. Using synthetic and real-world datasets, we demonstrate substantial improvements in localization performance across day-night cycles, enabling continuous metric localization over a 30-hour period using a single mapping experience, and allowing experience-based localization to scale to long deployments with dramatically reduced data requirements.  ( 3 min )
    Recent Deep Semi-supervised Learning Approaches and Related Works. (arXiv:2106.11528v2 [cs.LG] UPDATED)
    The author of this work presents an overview of recent semi-supervised learning approaches and related works. Despite the remarkable success of neural networks in various applications, there exist a few formidable constraints, including the need for a large amount of labeled data. Therefore, semi-supervised learning, a learning scheme in which scarce labels and a larger amount of unlabeled data are utilized to train models (e.g., deep neural networks), is becoming more important. Based on the key assumptions of semi-supervised learning, which are the manifold assumption, cluster assumption, and continuity assumption, the work reviews the recent semi-supervised learning approaches. In particular, methods that use deep neural networks in a semi-supervised learning setting are primarily discussed. In addition, the existing works are first classified based on the underlying idea and explained, and then the holistic approaches that unify the aforementioned ideas are detailed.  ( 2 min )
    Neural Networks and the Chomsky Hierarchy. (arXiv:2207.02098v1 [cs.LG])
    Reliable generalization lies at the heart of safe ML and AI. However, understanding when and how neural networks generalize remains one of the most important unsolved problems in the field. In this work, we conduct an extensive empirical study (2200 models, 16 tasks) to investigate whether insights from the theory of computation can predict the limits of neural network generalization in practice. We demonstrate that grouping tasks according to the Chomsky hierarchy allows us to forecast whether certain architectures will be able to generalize to out-of-distribution inputs. This includes negative results where even extensive amounts of data and training time never led to any non-trivial generalization, despite models having sufficient capacity to perfectly fit the training data. Our results show that, for our subset of tasks, RNNs and Transformers fail to generalize on non-regular tasks, LSTMs can solve regular and counter-language tasks, and only networks augmented with structured memory (such as a stack or memory tape) can successfully generalize on context-free and context-sensitive tasks.  ( 2 min )
    Deep Network Approximation: Achieving Arbitrary Accuracy with Fixed Number of Neurons. (arXiv:2107.02397v6 [cs.LG] UPDATED)
    This paper develops simple feed-forward neural networks that achieve the universal approximation property for all continuous functions with a fixed finite number of neurons. These neural networks are simple because they are designed with a simple and computable continuous activation function $\sigma$ leveraging a triangular-wave function and the softsign function. We prove that $\sigma$-activated networks with width $36d(2d+1)$ and depth $11$ can approximate any continuous function on a $d$-dimensional hypercube within an arbitrarily small error. Hence, for supervised learning and its related regression problems, the hypothesis space generated by these networks with a size not smaller than $36d(2d+1)\times 11$ is dense in the continuous function space $C([a,b]^d)$ and therefore dense in the Lebesgue spaces $L^p([a,b]^d)$ for $p\in [1,\infty)$. Furthermore, classification functions arising from image and signal classification are in the hypothesis space generated by $\sigma$-activated networks with width $36d(2d+1)$ and depth $12$, when there exist pairwise disjoint bounded closed subsets of $\mathbb{R}^d$ such that the samples of the same class are located in the same subset. Finally, we use numerical experimentation to show that replacing the ReLU activation function by ours would improve the experiment results.  ( 3 min )
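    The stated width and depth can be turned into concrete numbers. The sketch below simply evaluates width $36d(2d+1)$, depth $11$, and their product for a given input dimension $d$; `network_shape` is a hypothetical helper, and reading "size" as width times depth follows the abstract's expression $36d(2d+1)\times 11$.

```python
def network_shape(d):
    """Return (width, depth, width * depth) for input dimension d."""
    width = 36 * d * (2 * d + 1)
    depth = 11
    return width, depth, width * depth
```

    For example, a 28 x 28 grayscale image flattened to $d=784$ gives a width of 44,283,456, which shows that the bound is fixed for a given $d$ but grows quadratically in $d$.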
    ICE-NODE: Integration of Clinical Embeddings with Neural Ordinary Differential Equations. (arXiv:2207.01873v1 [cs.LG])
    Early diagnosis of disease can result in improved health outcomes, such as higher survival rates and lower treatment costs. With the massive amount of information in electronic health records (EHRs), there is great potential to use machine learning (ML) methods to model disease progression aimed at early prediction of disease onset and other outcomes. In this work, we employ recent innovations in neural ODEs to harness the full temporal information of EHRs. We propose ICE-NODE (Integration of Clinical Embeddings with Neural Ordinary Differential Equations), an architecture that temporally integrates embeddings of clinical codes and neural ODEs to learn and predict patient trajectories in EHRs. We apply our method to the publicly available MIMIC-III and MIMIC-IV datasets, reporting improved prediction results compared to state-of-the-art methods, specifically for clinical codes that are not frequently observed in EHRs. We also show that ICE-NODE is more competent at predicting certain medical conditions, like acute renal failure and pulmonary heart disease, and is also able to produce patient risk trajectories over time that can be exploited for further predictions.  ( 2 min )
    ST-CoNAL: Consistency-Based Acquisition Criterion Using Temporal Self-Ensemble for Active Learning. (arXiv:2207.02182v1 [cs.CV])
    Modern deep learning has achieved great success in various fields. However, it requires the labeling of huge amounts of data, which is expensive and labor-intensive. Active learning (AL), which identifies the most informative samples to be labeled, is becoming increasingly important to maximize the efficiency of the training process. The existing AL methods mostly use only a single final fixed model for acquiring the samples to be labeled. This strategy may not be good enough in that the structural uncertainty of a model for given training data is not considered to acquire the samples. In this study, we propose a novel acquisition criterion based on temporal self-ensemble generated by conventional stochastic gradient descent (SGD) optimization. These self-ensemble models are obtained by capturing the intermediate network weights obtained through SGD iterations. Our acquisition function relies on a consistency measure between the student and teacher models. The student models are given a fixed number of temporal self-ensemble models, and the teacher model is constructed by averaging the weights of the student models. Using the proposed acquisition criterion, we present an AL algorithm, namely student-teacher consistency-based AL (ST-CoNAL). Experiments conducted for image classification tasks on CIFAR-10, CIFAR-100, Caltech-256, and Tiny ImageNet datasets demonstrate that the proposed ST-CoNAL achieves significantly better performance than the existing acquisition methods. Furthermore, extensive experiments show the robustness and effectiveness of our methods.  ( 3 min )
    CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations. (arXiv:2207.02185v1 [cs.CV])
    Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions. In this paper, we aim to solve two key challenges in this task: utilizing multilingual instructions for improved instruction-path grounding and navigating through new environments that are unseen during training. To address these challenges, we propose CLEAR: Cross-Lingual and Environment-Agnostic Representations. First, our agent learns a shared and visually-aligned cross-lingual language representation for the three languages (English, Hindi and Telugu) in the Room-Across-Room dataset. Our language representation learning is guided by text pairs that are aligned by visual information. Second, our agent learns an environment-agnostic visual representation by maximizing the similarity between semantically-aligned image pairs (with constraints on object-matching) from different environments. Our environment agnostic visual representation can mitigate the environment bias induced by low-level visual information. Empirically, on the Room-Across-Room dataset, we show that our multilingual agent gets large improvements in all metrics over the strong baseline model when generalizing to unseen environments with the cross-lingual language representation and the environment-agnostic visual representation. Furthermore, we show that our learned language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation task, and present detailed qualitative and quantitative generalization and grounding analysis. Our code is available at https://github.com/jialuli-luka/CLEAR  ( 3 min )
    Federated Phish Bowl: LSTM-Based Decentralized Phishing Email Detection. (arXiv:2110.06025v2 [cs.CR] UPDATED)
    With increasingly more sophisticated phishing campaigns in recent years, phishing emails lure people using more legitimate-looking personal contexts. To tackle this problem, instead of traditional heuristics-based algorithms, more adaptive detection systems such as natural language processing (NLP)-powered approaches are essential to understanding phishing text representations. Nevertheless, concerns surrounding the collection of phishing data that might cover confidential information hinder the effectiveness of model learning. We propose a decentralized phishing email detection framework called Federated Phish Bowl (FedPB) which facilitates collaborative phishing detection with privacy. In particular, we devise a knowledge-sharing mechanism with federated learning (FL). Using long short-term memory (LSTM) for phishing detection, the framework adapts by sharing a global word embedding matrix across the clients, with each client running its local model with Non-IID data. We collected the most recent phishing samples to study the effectiveness of the proposed method using different client numbers and data distributions. The results show that FedPB can attain a competitive performance with a centralized phishing detector, with generality to various cases of FL retaining a prediction accuracy of 83%.  ( 2 min )
    The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks. (arXiv:2110.06296v2 [cs.LG] UPDATED)
    In this paper, we conjecture that if the permutation invariance of neural networks is taken into account, SGD solutions will likely have no barrier in the linear interpolation between them. Although it is a bold conjecture, we show how extensive empirical attempts fall short of refuting it. We further provide a preliminary theoretical result to support our conjecture. Our conjecture has implications for lottery ticket hypothesis, distributed training, and ensemble methods.  ( 2 min )
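A toy example may help make the conjecture concrete. The sketch below uses one common way to quantify a barrier (the largest gap between the loss on the interpolation path and the interpolated endpoint losses); that definition, the tiny two-unit ReLU network, and all function names are assumptions for illustration, not taken from the paper.

```python
def relu(z):
    return max(0.0, z)

def forward(params, x):
    """One-hidden-layer ReLU network: sum_j v_j * relu(w_j * x)."""
    w, v = params
    return sum(vj * relu(wj * x) for wj, vj in zip(w, v))

def loss(params, data):
    """Mean squared error over (x, y) pairs."""
    return sum((forward(params, x) - y) ** 2 for x, y in data) / len(data)

def interpolate(a, b, alpha):
    """Point alpha*a + (1 - alpha)*b in parameter space."""
    return tuple([alpha * pa + (1 - alpha) * pb for pa, pb in zip(ga, gb)]
                 for ga, gb in zip(a, b))

def barrier(a, b, data, steps=10):
    """Largest excess of the path loss over the interpolated endpoint losses."""
    la, lb = loss(a, data), loss(b, data)
    gaps = []
    for i in range(steps + 1):
        alpha = i / steps
        path_loss = loss(interpolate(a, b, alpha), data)
        gaps.append(path_loss - (alpha * la + (1 - alpha) * lb))
    return max(gaps)

def permute(params, order):
    """Reorder hidden units; the network function is unchanged."""
    w, v = params
    return ([w[i] for i in order], [v[i] for i in order])

# Both parameter sets compute |x| exactly, with the hidden units swapped.
data = [(x, abs(x)) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
theta_a = ([1.0, -1.0], [1.0, 1.0])
theta_b = ([-1.0, 1.0], [1.0, 1.0])

naive = barrier(theta_a, theta_b, data)
aligned = barrier(theta_a, permute(theta_b, [1, 0]), data)
```

In this toy case the two solutions are separated by a barrier when interpolated directly, and the barrier vanishes once the hidden units of one solution are permuted to match the other.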
    Offline RL Policies Should be Trained to be Adaptive. (arXiv:2207.02200v1 [cs.LG])
    Offline RL algorithms must account for the fact that the dataset they are provided may leave many facets of the environment unknown. The most common way to approach this challenge is to employ pessimistic or conservative methods, which avoid behaviors that are too dissimilar from those in the training dataset. However, relying exclusively on conservatism has drawbacks: performance is sensitive to the exact degree of conservatism, and conservative objectives can recover highly suboptimal policies. In this work, we propose that offline RL methods should instead be adaptive in the presence of uncertainty. We show that acting optimally in offline RL in a Bayesian sense involves solving an implicit POMDP. As a result, optimal policies for offline RL must be adaptive, depending not just on the current state but rather all the transitions seen so far during evaluation. We present a model-free algorithm for approximating this optimal adaptive policy, and demonstrate the efficacy of learning such adaptive policies in offline RL benchmarks.  ( 2 min )
    An Intrusion Detection System based on Deep Belief Networks. (arXiv:2207.02117v1 [cs.CR])
    The rapid growth of connected devices has led to the proliferation of novel cyber-security threats known as zero-day attacks. Traditional behaviour-based IDS rely on DNN to detect these attacks. The quality of the dataset used to train the DNN plays a critical role in the detection performance, with underrepresented samples causing poor performances. In this paper, we develop and evaluate the performance of DBN on detecting cyber-attacks within a network of connected devices. The CICIDS2017 dataset was used to train and evaluate the performance of our proposed DBN approach. Several class balancing techniques were applied and evaluated. Lastly, we compare our approach against a conventional MLP model and the existing state-of-the-art. Our proposed DBN approach shows competitive and promising results, with significant performance improvement on the detection of attacks underrepresented in the training dataset.  ( 2 min )
    Investigating Why Contrastive Learning Benefits Robustness Against Label Noise. (arXiv:2201.12498v4 [cs.LG] UPDATED)
    Self-supervised Contrastive Learning (CL) has been recently shown to be very effective in preventing deep networks from overfitting noisy labels. Despite its empirical success, the theoretical understanding of the effect of contrastive learning on boosting robustness is very limited. In this work, we rigorously prove that the representation matrix learned by contrastive learning boosts robustness, by having: (i) one prominent singular value corresponding to each sub-class in the data, and significantly smaller remaining singular values; and (ii) a large alignment between the prominent singular vectors and the clean labels of each sub-class. The above properties enable a linear layer trained on such representations to effectively learn the clean labels without overfitting the noise. We further show that the low-rank structure of the Jacobian of deep networks pre-trained with contrastive learning allows them to achieve a superior performance initially, when fine-tuned on noisy labels. Finally, we demonstrate that the initial robustness provided by contrastive learning enables robust training methods to achieve state-of-the-art performance under extreme noise levels, e.g., an average of 27.18% and 15.58% increase in accuracy on CIFAR-10 and CIFAR-100 with 80% symmetric noisy labels, and 4.11% increase in accuracy on WebVision.  ( 3 min )
    Formalizing and Estimating Distribution Inference Risks. (arXiv:2109.06024v6 [cs.LG] UPDATED)
    Distribution inference, sometimes called property inference, infers statistical properties about a training set from access to a model trained on that data. Distribution inference attacks can pose serious risks when models are trained on private data, but are difficult to distinguish from the intrinsic purpose of statistical machine learning -- namely, to produce models that capture statistical properties about a distribution. Motivated by Yeom et al.'s membership inference framework, we propose a formal definition of distribution inference attacks that is general enough to describe a broad class of attacks distinguishing between possible training distributions. We show how our definition captures previous ratio-based property inference attacks as well as new kinds of attack including revealing the average node degree or clustering coefficient of a training graph. To understand distribution inference risks, we introduce a metric that quantifies observed leakage by relating it to the leakage that would occur if samples from the training distribution were provided directly to the adversary. We report on a series of experiments across a range of different distributions using both novel black-box attacks and improved versions of the state-of-the-art white-box attacks. Our results show that inexpensive attacks are often as effective as expensive meta-classifier attacks, and that there are surprising asymmetries in the effectiveness of attacks. Code is available at https://github.com/iamgroot42/FormEstDistRisks  ( 3 min )
    Synthesizing Speech from Intracranial Depth Electrodes using an Encoder-Decoder Framework. (arXiv:2111.01457v2 [cs.SD] UPDATED)
    Speech Neuroprostheses have the potential to enable communication for people with dysarthria or anarthria. Recent advances have demonstrated high-quality text decoding and speech synthesis from electrocorticographic grids placed on the cortical surface. Here, we investigate a less invasive measurement modality in three participants, namely stereotactic EEG (sEEG) that provides sparse sampling from multiple brain regions, including subcortical regions. To evaluate whether sEEG can also be used to synthesize high-quality audio from neural recordings, we employ a recurrent encoder-decoder model based on modern deep learning methods. We find that speech can indeed be reconstructed with correlations up to 0.8 from these minimally invasive recordings, despite limited amounts of training data.  ( 2 min )
    Creativity and Machine Learning: A Survey. (arXiv:2104.02726v3 [cs.LG] UPDATED)
    There is a growing interest in the area of machine learning and creativity. This survey presents an overview of the history and the state of the art of computational creativity theories, key machine learning techniques (including generative deep learning), and corresponding automatic evaluation methods. After presenting a critical discussion of the key contributions in this area, we outline the current research challenges and emerging opportunities in this field.  ( 2 min )
    Frustratingly Easy Transferability Estimation. (arXiv:2106.09362v3 [cs.LG] UPDATED)
    Transferability estimation has been an essential tool in selecting a pre-trained model, and the layers in it, for transfer learning, so as to maximize the performance on a target task and prevent negative transfer. Existing estimation algorithms either require intensive training on target tasks or have difficulties in evaluating the transferability between layers. To this end, we propose a simple, efficient, and effective transferability measure named TransRate. Through a single pass over examples of a target task, TransRate measures the transferability as the mutual information between features of target examples extracted by a pre-trained model and their labels. We overcome the challenge of efficient mutual information estimation by resorting to coding rate that serves as an effective alternative to entropy. From the perspective of feature representation, the resulting TransRate evaluates both completeness (whether features contain sufficient information of a target task) and compactness (whether features of each class are compact enough for good generalization) of pre-trained features. Theoretically, we have analyzed the close connection of TransRate to the performance after transfer learning. Despite its extraordinary simplicity in 10 lines of code, TransRate performs remarkably well in extensive evaluations on 32 pre-trained models and 16 downstream tasks.  ( 3 min )
    Balancing Profit, Risk, and Sustainability for Portfolio Management. (arXiv:2207.02134v1 [q-fin.PM])
    Stock portfolio optimization is the process of continuous reallocation of funds to a selection of stocks. This is a particularly well-suited problem for reinforcement learning, as daily rewards are compounding and objective functions may include more than just profit, e.g., risk and sustainability. We developed a novel utility function with the Sharpe ratio representing risk and the environmental, social, and governance score (ESG) representing sustainability. We show that a state-of-the-art policy gradient method - multi-agent deep deterministic policy gradients (MADDPG) - fails to find the optimum policy due to flat policy gradients and we therefore replaced gradient descent with a genetic algorithm for parameter optimization. We show that our system outperforms MADDPG while improving on deep Q-learning approaches by allowing for continuous action spaces. Crucially, by incorporating risk and sustainability criteria in the utility function, we improve on the state-of-the-art in reinforcement learning for portfolio optimization; risk and sustainability are essential in any modern trading strategy and we propose a system that does not merely report these metrics, but that actively optimizes the portfolio to improve on them.  ( 2 min )
    Continual 3D Convolutional Neural Networks for Real-time Processing of Videos. (arXiv:2106.00050v3 [cs.CV] UPDATED)
    We introduce Continual 3D Convolutional Neural Networks (Co3D CNNs), a new computational formulation of spatio-temporal 3D CNNs, in which videos are processed frame-by-frame rather than by clip. In online tasks demanding frame-wise predictions, Co3D CNNs dispense with the computational redundancies of regular 3D CNNs, namely the repeated convolutions over frames, which appear in overlapping clips. We show that Continual 3D CNNs can reuse preexisting 3D-CNN weights to reduce the per-prediction floating point operations (FLOPs) in proportion to the temporal receptive field while retaining similar memory requirements and accuracy. This is validated with multiple models on Kinetics-400 and Charades with remarkable results: CoX3D models attain state-of-the-art complexity/accuracy trade-offs on Kinetics-400 with 12.1-15.3x reductions of FLOPs and 2.3-3.8% improvements in accuracy compared to regular X3D models while reducing peak memory consumption by up to 48%. Moreover, we investigate the transient response of Co3D CNNs at start-up and perform extensive benchmarks of on-hardware processing characteristics for publicly available 3D CNNs.  ( 2 min )
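The frame-by-frame idea can be illustrated with a single temporal convolution over scalar "frames". This is a simplified sketch, not the Co3D implementation: the class name `ContinualTemporalConv`, the scalar frames, and the partial-sum buffer layout are assumptions chosen to keep the example small.

```python
from collections import deque

def clip_conv(frames, kernel):
    """Clip-based reference: re-reads a full window of frames per output."""
    T = len(kernel)
    return [sum(k * f for k, f in zip(kernel, frames[t - T + 1:t + 1]))
            for t in range(T - 1, len(frames))]

class ContinualTemporalConv:
    """Processes one frame per step and keeps partial sums for future outputs."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        # partial[j] accumulates the output that is due j steps from now.
        self.partial = deque([0.0] * len(self.kernel))

    def step(self, frame):
        T = len(self.kernel)
        # The new frame is the newest input of the output due now (j = 0)
        # and an older input of the outputs that are due later.
        for j in range(T):
            self.partial[j] += self.kernel[T - 1 - j] * frame
        out = self.partial.popleft()
        self.partial.append(0.0)
        return out

frames = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 10.0, 100.0]

layer = ContinualTemporalConv(kernel)
stream = [layer.step(f) for f in frames]

# The first len(kernel) - 1 outputs are the start-up transient (zero history).
steady = stream[len(kernel) - 1:]
reference = clip_conv(frames, kernel)
```

Each frame is multiplied with the kernel only when it arrives, so no frame has to be re-read from an overlapping clip, and after the start-up transient the streamed outputs match the clip-based ones.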
    opPINN: Physics-Informed Neural Network with operator learning to approximate solutions to the Fokker-Planck-Landau equation. (arXiv:2207.01765v1 [math.NA])
    We propose a hybrid framework opPINN: physics-informed neural network (PINN) with operator learning for approximating the solution to the Fokker-Planck-Landau (FPL) equation. The opPINN framework is divided into two steps: Step 1 and Step 2. After the operator surrogate models are trained during Step 1, PINN can effectively approximate the solution to the FPL equation during Step 2 by using the pre-trained surrogate models. The operator surrogate models greatly reduce the computational cost and boost PINN by approximating the complex Landau collision integral in the FPL equation. The operator surrogate models can also be combined with the traditional numerical schemes. It provides a high efficiency in computational time when the number of velocity modes becomes larger. Using the opPINN framework, we provide the neural network solutions for the FPL equation under the various types of initial conditions, and interaction models in two and three dimensions. Furthermore, based on the theoretical properties of the FPL equation, we show that the approximated neural network solution converges to the a priori classical solution of the FPL equation as the pre-defined loss function is reduced.  ( 2 min )
    Efficient Representation Learning via Adaptive Context Pooling. (arXiv:2207.01844v1 [cs.LG])
    Self-attention mechanisms model long-range context by using pairwise attention between all input tokens. In doing so, they assume a fixed attention granularity defined by the individual tokens (e.g., text characters or image pixels), which may not be optimal for modeling complex dependencies at higher levels. In this paper, we propose ContextPool to address this problem by adapting the attention granularity for each token. Inspired by the success of ConvNets that are combined with pooling to capture long-range dependencies, we learn to pool neighboring features for each token before computing attention in a given attention layer. The pooling weights and support size are adaptively determined, allowing the pooled features to encode meaningful context with varying scale. We show that ContextPool makes attention models more expressive, achieving strong performance often with fewer layers and thus significantly reduced cost. Experiments validate that our ContextPool module, when plugged into transformer models, matches or surpasses state-of-the-art performance using less compute on several language and image benchmarks, outperforms recent works with learned context sizes or sparse attention patterns, and is also applicable to ConvNets for efficient feature learning.  ( 2 min )
    A Unified Meta-Learning Framework for Dynamic Transfer Learning. (arXiv:2207.01784v1 [cs.LG])
    Transfer learning refers to the transfer of knowledge or information from a relevant source task to a target task. However, most existing works assume both tasks are sampled from a stationary task distribution, thereby leading to the sub-optimal performance for dynamic tasks drawn from a non-stationary task distribution in real scenarios. To bridge this gap, in this paper, we study a more realistic and challenging transfer learning setting with dynamic tasks, i.e., source and target tasks are continuously evolving over time. We theoretically show that the expected error on the dynamic target task can be tightly bounded in terms of source knowledge and consecutive distribution discrepancy across tasks. This result motivates us to propose a generic meta-learning framework L2E for modeling the knowledge transferability on dynamic tasks. It is centered around a task-guided meta-learning problem with a group of meta-pairs of tasks, based on which we are able to learn the prior model initialization for fast adaptation on the newest target task. L2E enjoys the following properties: (1) effective knowledge transferability across dynamic tasks; (2) fast adaptation to the new target task; (3) mitigation of catastrophic forgetting on historical target tasks; and (4) flexibility in incorporating any existing static transfer learning algorithms. Extensive experiments on various image data sets demonstrate the effectiveness of the proposed L2E framework.  ( 2 min )
    Vector Quantisation for Robust Segmentation. (arXiv:2207.01919v1 [eess.IV])
    The reliability of segmentation models in the medical domain depends on the model's robustness to perturbations in the input space. Robustness is a particular challenge in medical imaging exhibiting various sources of image noise, corruptions, and domain shifts. Obtaining robustness is often attempted via simulating heterogeneous environments, either heuristically in the form of data augmentation or by learning to generate specific perturbations in an adversarial manner. We propose and justify that learning a discrete representation in a low dimensional embedding space improves robustness of a segmentation model. This is achieved with a dictionary learning method called vector quantisation. We use a set of experiments designed to analyse robustness in both the latent and output space under domain shift and noise perturbations in the input space. We adapt the popular UNet architecture, inserting a quantisation block in the bottleneck. We demonstrate improved segmentation accuracy and better robustness on three segmentation tasks. Code is available at https://github.com/AinkaranSanthi/Vector-Quantisation-for-Robust-Segmentation  ( 2 min )
    Randomized-to-Canonical Model Predictive Control for Real-world Visual Robotic Manipulation. (arXiv:2207.01840v1 [cs.RO])
    Many works have recently explored Sim-to-real transferable visual model predictive control (MPC). However, such works are limited to one-shot transfer, where real-world data must be collected once to perform the sim-to-real transfer, which remains a significant human effort in transferring the models learned in simulations to new domains in the real world. To alleviate this problem, we first propose a novel model-learning framework called Kalman Randomized-to-Canonical Model (KRC-model). This framework is capable of extracting task-relevant intrinsic features and their dynamics from randomized images. We then propose Kalman Randomized-to-Canonical Model Predictive Control (KRC-MPC) as a zero-shot sim-to-real transferable visual MPC using KRC-model. The effectiveness of our method is evaluated through a valve rotation task by a robot hand in both simulation and the real world, and a block mating task in simulation. The experimental results show that KRC-MPC can be applied to various real domains and tasks in a zero-shot manner.  ( 2 min )
    Machine Learning in Access Control: A Taxonomy and Survey. (arXiv:2207.01739v1 [cs.CR])
    An increasing body of work has recognized the importance of exploiting machine learning (ML) advancements to address the need for efficient automation in extracting access control attributes, policy mining, policy verification, access decisions, etc. In this work, we survey and summarize various ML approaches to solve different access control problems. We propose a novel taxonomy of the ML model's application in the access control domain. We highlight current limitations and open challenges such as lack of public real-world datasets, administration of ML-based access control systems, understanding a black-box ML model's decision, etc., and enumerate future research directions.  ( 2 min )
    Anomaly-aware multiple instance learning for rare anemia disorder classification. (arXiv:2207.01742v1 [cs.LG])
    Deep learning-based classification of rare anemia disorders is challenged by the lack of training data and instance-level annotations. Multiple Instance Learning (MIL) has been shown to be an effective solution, yet it suffers from low accuracy and limited explainability. Although the inclusion of attention mechanisms has addressed these issues, their effectiveness highly depends on the amount and diversity of cells in the training samples. Consequently, the poor machine learning performance on rare anemia disorder classification from blood samples remains unresolved. In this paper, we propose an interpretable pooling method for MIL to address these limitations. By benefiting from instance-level information of negative bags (i.e., homogeneous benign cells from healthy individuals), our approach increases the contribution of anomalous instances. We show that our strategy outperforms standard MIL classification algorithms and provides a meaningful explanation behind its decisions. Moreover, it can denote anomalous instances of rare blood diseases that are not seen during the training phase.  ( 2 min )
    On A Mallows-type Model For (Ranked) Choices. (arXiv:2207.01783v1 [cs.LG])
    In a preference learning setting, every participant chooses an ordered list of $k$ most preferred items among a displayed set of candidates. (The set can be different for every participant.) We identify a distance-based ranking model for the population's preferences and their (ranked) choice behavior. The ranking model resembles the Mallows model but uses a new distance function called Reverse Major Index (RMJ). We find that despite the need to sum over all permutations, the RMJ-based ranking distribution aggregates into (ranked) choice probabilities with simple closed-form expression. We develop effective methods to estimate the model parameters and showcase their generalization power using real data, especially when there is a limited variety of display sets.  ( 2 min )
    Ensemble feature selection with data-driven thresholding for Alzheimer's disease biomarker discovery. (arXiv:2207.01822v1 [cs.LG])
    Healthcare datasets present many challenges to both machine learning and statistics as their data are typically heterogeneous, censored, high-dimensional and have missing information. Feature selection is often used to identify the important features but can produce unstable results when applied to high-dimensional data, selecting a different set of features on each iteration. The stability of feature selection can be improved with the use of feature selection ensembles, which aggregate the results of multiple base feature selectors. A threshold must be applied to the final aggregated feature set to separate the relevant features from the redundant ones. A fixed threshold, which is typically applied, offers no guarantee that the final set of selected features contains only relevant features. This work develops several data-driven thresholds to automatically identify the relevant features in an ensemble feature selector and evaluates their predictive accuracy and stability. To demonstrate the applicability of these methods to clinical data, they are applied to data from two real-world Alzheimer's disease (AD) studies. AD is a progressive neurodegenerative disease with no known cure, that begins at least 2-3 decades before overt symptoms appear, presenting an opportunity for researchers to identify early biomarkers that might identify patients at risk of developing AD. Features identified by applying these methods to both datasets reflect current findings in the AD literature.  ( 3 min )
    Discrete Tree Flows via Tree-Structured Permutations. (arXiv:2207.01744v1 [cs.LG])
    While normalizing flows for continuous data have been extensively researched, flows for discrete data have only recently been explored. These prior models, however, suffer from limitations that are distinct from those of continuous flows. Most notably, discrete flow-based models cannot be straightforwardly optimized with conventional deep learning methods because gradients of discrete functions are undefined or zero. Previous works approximate pseudo-gradients of the discrete functions but do not solve the problem on a fundamental level. In addition to that, backpropagation can be computationally burdensome compared to alternative discrete algorithms such as decision tree algorithms. Our approach seeks to reduce computational burden and remove the need for pseudo-gradients by developing a discrete flow based on decision trees -- building upon the success of efficient tree-based methods for classification and regression for discrete data. We first define a tree-structured permutation (TSP) that compactly encodes a permutation of discrete data where the inverse is easy to compute; thus, we can efficiently compute the density value and sample new data. We then propose a decision tree algorithm to build TSPs that learns the tree structure and permutations at each node via novel criteria. We empirically demonstrate the feasibility of our method on multiple datasets.  ( 2 min )
    GSMFlow: Generation Shifts Mitigating Flow for Generalized Zero-Shot Learning. (arXiv:2207.01798v1 [cs.CV])
    Generalized Zero-Shot Learning (GZSL) aims to recognize images from both the seen and unseen classes by transferring semantic knowledge from seen to unseen classes. It is a promising solution to take advantage of generative models to hallucinate realistic unseen samples based on the knowledge learned from the seen classes. However, due to the generation shifts, the samples synthesized by most existing methods may drift from the real distribution of the unseen data. To address this issue, we propose a novel flow-based generative framework that consists of multiple conditional affine coupling layers for learning unseen data generation. Specifically, we discover and address three potential problems that trigger the generation shifts, i.e., semantic inconsistency, variance collapse, and structure disorder. First, to enhance the reflection of the semantic information in the generated samples, we explicitly embed the semantic information into the transformation in each conditional affine coupling layer. Second, to recover the intrinsic variance of the real unseen features, we introduce a boundary sample mining strategy with entropy maximization to discover more difficult visual variants of semantic prototypes and thereby adjust the decision boundary of the classifiers. Third, a relative positioning strategy is proposed to revise the attribute embeddings, guiding them to fully preserve the inter-class geometric structure and further avoid structure disorder in the semantic space. Extensive experimental results on four GZSL benchmark datasets demonstrate that GSMFlow achieves the state-of-the-art performance on GZSL.  ( 3 min )
    How Much More Data Do I Need? Estimating Requirements for Downstream Tasks. (arXiv:2207.01725v1 [cs.CV])
    Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance? This question is of critical importance in applications such as autonomous driving or medical imaging where collecting data is expensive and time-consuming. Overestimating or underestimating data requirements incurs substantial costs that could be avoided with an adequate budget. Prior work on neural scaling laws suggests that the power-law function can fit the validation performance curve and extrapolate it to larger data set sizes. We find that this does not immediately translate to the more difficult downstream task of estimating the required data set size to meet a target performance. In this work, we consider a broad class of computer vision tasks and systematically investigate a family of functions that generalize the power-law function to allow for better estimation of data requirements. Finally, we show that incorporating a tuned correction factor and collecting over multiple rounds significantly improves the performance of the data estimators. Using our guidelines, practitioners can accurately estimate data requirements of machine learning systems to gain savings in both development time and data acquisition costs.  ( 3 min )
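The baseline power-law extrapolation mentioned above can be sketched as follows. The sketch assumes the simplest form, error = a * n^(-b), fitted by least squares in log-log space; the generalized function family and the tuned correction factor studied in the paper are not reproduced, and the function names and measurements are hypothetical.

```python
import math

def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(n)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_y - slope * mean_x)
    b = -slope
    return a, b

def required_size(a, b, target_error):
    """Invert error = a * n**(-b) to get the data set size for a target error."""
    return (a / target_error) ** (1.0 / b)

# Hypothetical validation errors measured at three training set sizes.
sizes = [100, 400, 1600]
errors = [0.2, 0.1, 0.05]

a, b = fit_power_law(sizes, errors)
needed = required_size(a, b, target_error=0.025)
```

As the abstract cautions, such a direct extrapolation can misestimate the requirement on real curves, which is the motivation for correction factors and multi-round collection.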
    CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning. (arXiv:2207.01780v1 [cs.LG])
    Program synthesis or code generation aims to generate a program that satisfies a problem specification. Recent approaches using large-scale pretrained language models (LMs) have shown promising results, yet they have some critical limitations. In particular, they often follow a standard supervised fine-tuning procedure to train a code generation model only from the pairs of natural-language problem descriptions and ground-truth programs. Such a paradigm largely ignores some important but potentially useful signals in the problem specification such as unit tests, which thus often results in poor performance when solving complex unseen coding tasks. To address the limitations, we propose "CodeRL", a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning (RL). Specifically, during training, we treat the code-generating LM as an actor network, and introduce a critic network that is trained to predict the functional correctness of generated programs and provide dense feedback signals to the actor. During inference, we introduce a new generation procedure with a critical sampling strategy that allows a model to automatically regenerate programs based on feedback from example unit tests and critic scores. For the model backbones, we extended the encoder-decoder architecture of CodeT5 with enhanced learning objectives, larger model sizes, and better pretraining data. Our method not only achieves new SOTA results on the challenging APPS benchmark, but also shows strong zero-shot transfer capability with new SOTA results on the simpler MBPP benchmark.  ( 3 min )
    PoF: Post-Training of Feature Extractor for Improving Generalization. (arXiv:2207.01847v1 [cs.LG])
    It has been intensively investigated that the local shape, especially flatness, of the loss landscape near a minimum plays an important role for generalization of deep models. We developed a training algorithm called PoF: Post-Training of Feature Extractor that updates the feature extractor part of an already-trained deep model to search for a flatter minimum. The characteristics are two-fold: 1) Feature extractor is trained under parameter perturbations in the higher-layer parameter space, based on observations that suggest flattening higher-layer parameter space, and 2) the perturbation range is determined in a data-driven manner aiming to reduce a part of test loss caused by the positive loss curvature. We provide a theoretical analysis that shows the proposed algorithm implicitly reduces the target Hessian components as well as the loss. Experimental results show that PoF improved model performance against baseline methods on both CIFAR-10 and CIFAR-100 datasets for only 10-epoch post-training, and on SVHN dataset for 50-epoch post-training. Source code is available at: https://github.com/DensoITLab/PoF-v1  ( 2 min )
    Robust Reinforcement Learning in Continuous Control Tasks with Uncertainty Set Regularization. (arXiv:2207.02016v1 [cs.LG])
    Reinforcement learning (RL) is recognized as lacking generalization and robustness under environmental perturbations, which excessively restricts its application for real-world robotics. Prior work claimed that adding regularization to the value function is equivalent to learning a robust policy with uncertain transitions. Although the regularization-robustness transformation is appealing for its simplicity and efficiency, it is still lacking in continuous control tasks. In this paper, we propose a new regularizer named Uncertainty Set Regularizer (USR), by formulating the uncertainty set on the parameter space of the transition function. In particular, USR is flexible enough to be plugged into any existing RL framework. To deal with unknown uncertainty sets, we further propose a novel adversarial approach to generate them based on the value function. We evaluate USR on the Real-world Reinforcement Learning (RWRL) benchmark, demonstrating improvements in the robust performance for perturbed testing environments.  ( 2 min )
    Do Not Take It for Granted: Comparing Open-Source Libraries for Software Development Effort Estimation. (arXiv:2207.01705v1 [cs.SE])
    In the past two decades, several Machine Learning (ML) libraries have become freely available. Many studies have used such libraries to carry out empirical investigations on predictive Software Engineering (SE) tasks. However, the differences stemming from using one library over another have been overlooked, implicitly assuming that using any of these libraries would provide the user with the same or very similar results. This paper aims at raising awareness of the differences incurred when using different ML libraries for software development effort estimation (SEE), one of the most widely studied SE prediction tasks. To this end, we investigate 4 deterministic machine learners as provided by 3 of the most popular ML open-source libraries written in different languages (namely, Scikit-Learn, Caret and Weka). We carry out a thorough empirical study comparing the performance of the machine learners on 5 SEE datasets in the two most common SEE scenarios (i.e., out-of-the-box-ml and tuned-ml) as well as an in-depth analysis of the documentation and code of their APIs. The results of our study reveal that the predictions provided by the 3 libraries differ in 95% of the cases on average across a total of 105 cases studied. These differences are significantly large in most cases and yield misestimations of up to approx. 3,000 hours per project. Moreover, our API analysis reveals that these libraries provide the user with different levels of control over the parameters one can manipulate, and an overall lack of clarity and consistency, which might mislead users. Our findings highlight that the ML library is an important design choice for SEE studies, which can lead to a difference in performance. However, such a difference is under-documented. We conclude by highlighting open challenges with suggestions for the developers of libraries as well as for the researchers and practitioners using them.  ( 3 min )
    The Deep Ritz Method for Parametric $p$-Dirichlet Problems. (arXiv:2207.01894v1 [math.NA])
    We establish error estimates for the approximation of parametric $p$-Dirichlet problems deploying the Deep Ritz Method. Parametric dependencies include, e.g., varying geometries and exponents $p\in (1,\infty)$. Combining the derived error estimates with quantitative approximation theorems yields error decay rates and establishes that the Deep Ritz Method retains the favorable approximation capabilities of neural networks in the approximation of high dimensional functions which makes the method attractive for parametric problems. Finally, we present numerical examples to illustrate potential applications.  ( 2 min )
    What Do Graph Convolutional Neural Networks Learn?. (arXiv:2207.01839v1 [cs.LG])
    Graph neural networks (GNNs) have gained traction over the past few years for their superior performance in numerous machine learning tasks. Graph Convolutional Neural Networks (GCN) are a common variant of GNNs that are known to have high performance in semi-supervised node classification (SSNC), and work well under the assumption of homophily. Recent literature has highlighted that GCNs can achieve strong performance on heterophilous graphs under certain "special conditions". These arguments motivate us to understand why, and how, GCNs learn to perform SSNC. We find a positive correlation between similarity of latent node embeddings of nodes within a class and the performance of a GCN. Our investigation on underlying graph structures of a dataset finds that a GCN's SSNC performance is significantly influenced by the consistency and uniqueness in neighborhood structure of nodes within a class.  ( 2 min )
    FACT: High-Dimensional Random Forests Inference. (arXiv:2207.01678v1 [stat.ML])
    Random forests is one of the most widely used machine learning methods over the past decade thanks to its outstanding empirical performance. Yet, because of its black-box nature, the results by random forests can be hard to interpret in many big data applications. Quantifying the usefulness of individual features in random forests learning can greatly enhance its interpretability. Existing studies have shown that some popularly used feature importance measures for random forests suffer from the bias issue. In addition, there is a lack of comprehensive size and power analyses for most of these existing methods. In this paper, we approach the problem via hypothesis testing, and suggest a framework of the self-normalized feature-residual correlation test (FACT) for evaluating the significance of a given feature in the random forests model with bias-resistance property, where our null hypothesis concerns whether the feature is conditionally independent of the response given all other features. Such an endeavor on random forests inference is empowered by some recent developments on high-dimensional random forests consistency. The vanilla version of our FACT test can suffer from the bias issue in the presence of feature dependency. We exploit the techniques of imbalancing and conditioning for bias correction. We further incorporate the ensemble idea into the FACT statistic through feature transformations for the enhanced power. Under a fairly general high-dimensional nonparametric model setting with dependent features, we formally establish that FACT can provide theoretically justified random forests feature p-values and enjoy appealing power through nonasymptotic analyses. The theoretical results and finite-sample advantages of the newly suggested method are illustrated with several simulation examples and an economic forecasting application in relation to COVID-19.  ( 3 min )
    Slice-by-slice deep learning aided oropharyngeal cancer segmentation with adaptive thresholding for spatial uncertainty on FDG PET and CT images. (arXiv:2207.01623v1 [eess.IV])
    Tumor segmentation is a fundamental step for radiotherapy treatment planning. To define an accurate segmentation of the primary tumor (GTVp) of oropharyngeal cancer patients (OPC), simultaneous assessment of different image modalities is needed, and each image volume is explored slice-by-slice from different orientations. Moreover, the manual fixed boundary of segmentation neglects the spatial uncertainty known to occur in tumor delineation. This study proposes a novel automatic deep learning (DL) model to assist radiation oncologists in a slice-by-slice adaptive GTVp segmentation on registered FDG PET/CT images. We included 138 OPC patients treated with (chemo)radiation in our institute. Our DL framework exploits both inter and intra-slice context. Sequences of 3 consecutive 2D slices of concatenated FDG PET/CT images and GTVp contours were used as input. A 3-fold cross validation was performed three times, training on sequences extracted from the Axial (A), Sagittal (S), and Coronal (C) plane of 113 patients. Since consecutive sequences in a volume contain overlapping slices, each slice resulted in three outcome predictions that were averaged. In the A, S, and C planes, the output shows areas with different probabilities of predicting the tumor. The performance of the models was assessed on 25 patients at different probability thresholds using the mean Dice Score Coefficient (DSC). Predictions were the closest to the ground truth at a probability threshold of 0.9 (DSC of 0.70 in the A, 0.77 in the S, and 0.80 in the C plane). The promising results of the proposed DL model show that the probability maps on registered FDG PET/CT images could guide radiation oncologists in a slice-by-slice adaptive GTVp segmentation.  ( 3 min )
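    The evaluation step described above (thresholding a probability map and scoring it against the ground-truth contour) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `dice_score`, the array shapes, and the convention for two empty masks are assumptions made here; only the 0.9 threshold comes from the abstract.

```python
import numpy as np


def dice_score(prob_map, truth_mask, threshold=0.9):
    """Dice coefficient between a thresholded probability map and a binary mask.

    Hypothetical helper: the abstract reports Dice scores at several
    probability thresholds but does not describe its implementation.
    """
    pred = np.asarray(prob_map) >= threshold
    truth = np.asarray(truth_mask).astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        # Both masks empty: treated as perfect agreement (a convention chosen here).
        return 1.0
    return 2.0 * np.logical_and(pred, truth).sum() / denom


# Toy 2x2 "slice": three pixels pass the threshold, two of them are true tumor.
probabilities = [[0.95, 0.20], [0.92, 0.91]]
ground_truth = [[1, 0], [1, 0]]
score = dice_score(probabilities, ground_truth)
```

    Sweeping `threshold` over a range of values and averaging the score over patients would reproduce the kind of threshold comparison the abstract reports.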
  • Open

    Offline RL Policies Should be Trained to be Adaptive. (arXiv:2207.02200v1 [cs.LG])
    Offline RL algorithms must account for the fact that the dataset they are provided may leave many facets of the environment unknown. The most common way to approach this challenge is to employ pessimistic or conservative methods, which avoid behaviors that are too dissimilar from those in the training dataset. However, relying exclusively on conservatism has drawbacks: performance is sensitive to the exact degree of conservatism, and conservative objectives can recover highly suboptimal policies. In this work, we propose that offline RL methods should instead be adaptive in the presence of uncertainty. We show that acting optimally in offline RL in a Bayesian sense involves solving an implicit POMDP. As a result, optimal policies for offline RL must be adaptive, depending not just on the current state but rather on all the transitions seen so far during evaluation. We present a model-free algorithm for approximating this optimal adaptive policy, and demonstrate the efficacy of learning such adaptive policies in offline RL benchmarks.  ( 2 min )
    A Generative Framework for Personalized Learning and Estimation: Theory, Algorithms, and Privacy. (arXiv:2207.01771v1 [cs.LG])
    A distinguishing characteristic of federated learning is that the (local) client data could have statistical heterogeneity. This heterogeneity has motivated the design of personalized learning, where individual (personalized) models are trained through collaboration. Various personalization methods have been proposed in the literature, with seemingly very different forms and methods, ranging from the use of a single global model for local regularization and model interpolation to the use of multiple global models for personalized clustering. In this work, we begin with a generative framework that could potentially unify several different algorithms as well as suggest new algorithms. We apply our generative framework to personalized estimation, and connect it to the classical empirical Bayes methodology. We develop private personalized estimation under this framework. We then use our generative framework for learning, which unifies several known personalized FL algorithms and also suggests new ones; we propose and study a new algorithm, AdaPeD, based on knowledge distillation, which numerically outperforms several known algorithms. We also develop privacy for personalized learning methods with guarantees for user-level privacy and composition. We numerically evaluate the performance as well as the privacy for both the estimation and learning problems, demonstrating the advantages of our proposed methods.
    On Effective Scheduling of Model-based Reinforcement Learning. (arXiv:2111.08550v3 [cs.LG] UPDATED)
    Model-based reinforcement learning has attracted wide attention due to its superior sample efficiency. Despite its impressive success so far, it is still unclear how to appropriately schedule the important hyperparameters to achieve adequate performance, such as the real data ratio for policy optimization in Dyna-style model-based algorithms. In this paper, we first theoretically analyze the role of real data in policy training, which suggests that gradually increasing the ratio of real data yields better performance. Inspired by the analysis, we propose a framework named AutoMBPO to automatically schedule the real data ratio as well as other hyperparameters in training model-based policy optimization (MBPO) algorithm, a representative running case of model-based methods. On several continuous control tasks, the MBPO instance trained with hyperparameters scheduled by AutoMBPO can significantly surpass the original one, and the real data ratio schedule found by AutoMBPO shows consistency with our theoretical analysis.
    DAS-PINNs: A deep adaptive sampling method for solving high-dimensional partial differential equations. (arXiv:2112.14038v2 [math.NA] UPDATED)
    In this work we propose a deep adaptive sampling (DAS) method for solving partial differential equations (PDEs), where deep neural networks are utilized to approximate the solutions of PDEs and deep generative models are employed to generate new collocation points that refine the training set. The overall procedure of DAS consists of two components: solving the PDEs by minimizing the residual loss on the collocation points in the training set, and generating a new training set to further improve the accuracy of the current approximate solution. In particular, we treat the residual as a probability density function and approximate it with a deep generative model, called KRnet. The new samples from KRnet are consistent with the distribution induced by the residual, i.e., more samples are located in the region of large residual and fewer samples are located in the region of small residual. Analogous to classical adaptive methods such as the adaptive finite element, KRnet acts as an error indicator that guides the refinement of the training set. Compared to the neural network approximation obtained with uniformly distributed collocation points, the developed algorithms can significantly improve the accuracy, especially for low regularity and high-dimensional problems. We demonstrate the effectiveness of the proposed DAS method with numerical experiments.
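    The idea of drawing more collocation points where the residual is large can be illustrated with a much simpler stand-in for KRnet. The sketch below does not use a generative model: it resamples from a fixed pool of candidate points with probability proportional to the squared residual. The function name `resample_by_residual`, the squared-residual weighting, and the candidate-pool setup are all assumptions for illustration.

```python
import numpy as np


def resample_by_residual(candidates, residuals, n_new, rng=None):
    """Draw n_new points from `candidates`, favoring large-residual regions.

    Simplified stand-in for the generative-model step described in the
    abstract: a discrete distribution over a candidate pool replaces KRnet.
    """
    rng = np.random.default_rng() if rng is None else rng
    candidates = np.asarray(candidates)
    weights = np.square(np.asarray(residuals, dtype=float))
    total = weights.sum()
    if total == 0:
        # Residual vanishes everywhere: fall back to uniform sampling.
        probs = np.full(len(weights), 1.0 / len(weights))
    else:
        probs = weights / total
    idx = rng.choice(len(candidates), size=n_new, replace=True, p=probs)
    return candidates[idx]


# Five 1-D candidate points; only the last one has a nonzero residual.
pool = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
new_points = resample_by_residual(pool, [0, 0, 0, 0, 3.0], n_new=10,
                                  rng=np.random.default_rng(0))
```

    In a full adaptive loop the new points would be added to the training set, the network retrained, and the residual re-evaluated before the next resampling round.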
    Learning Stochastic Shortest Path with Linear Function Approximation. (arXiv:2110.12727v3 [cs.LG] UPDATED)
    We study the stochastic shortest path (SSP) problem in reinforcement learning with linear function approximation, where the transition kernel is represented as a linear mixture of unknown models. We call this class of SSP problems linear mixture SSPs. We propose a novel algorithm with Hoeffding-type confidence sets for learning the linear mixture SSP, which can attain an $\tilde{\mathcal{O}}(d B_{\star}^{1.5}\sqrt{K/c_{\min}})$ regret. Here $K$ is the number of episodes, $d$ is the dimension of the feature mapping in the mixture model, $B_{\star}$ bounds the expected cumulative cost of the optimal policy, and $c_{\min}>0$ is the lower bound of the cost function. Our algorithm also applies to the case when $c_{\min} = 0$, and an $\tilde{\mathcal{O}}(K^{2/3})$ regret is guaranteed. To the best of our knowledge, this is the first algorithm with a sublinear regret guarantee for learning linear mixture SSP. Moreover, we design a refined Bernstein-type confidence set and propose an improved algorithm, which provably achieves an $\tilde{\mathcal{O}}(d B_{\star}\sqrt{K/c_{\min}})$ regret. In complement to the regret upper bounds, we also prove a lower bound of $\Omega(dB_{\star} \sqrt{K})$. Hence, our improved algorithm matches the lower bound up to a $1/\sqrt{c_{\min}}$ factor and poly-logarithmic factors, achieving a near-optimal regret guarantee.
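    The stated gap to the lower bound can be checked by dividing each upper bound by $dB_{\star}\sqrt{K}$. The display below is a worked ratio built only from the bounds quoted in the abstract, with constants and poly-logarithmic factors ignored.

```latex
\[
\underbrace{\frac{d\,B_{\star}\sqrt{K/c_{\min}}}{d\,B_{\star}\sqrt{K}}}_{\text{Bernstein-type}}
  = \frac{1}{\sqrt{c_{\min}}},
\qquad
\underbrace{\frac{d\,B_{\star}^{1.5}\sqrt{K/c_{\min}}}{d\,B_{\star}\sqrt{K}}}_{\text{Hoeffding-type}}
  = \sqrt{\frac{B_{\star}}{c_{\min}}}.
\]
```

    The refined confidence set therefore removes the extra $\sqrt{B_{\star}}$ factor, leaving only the $1/\sqrt{c_{\min}}$ gap mentioned in the abstract.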
    An Approximation Method for Fitted Random Forests. (arXiv:2207.02184v1 [stat.ML])
    Random Forests (RF) is a popular machine learning method for classification and regression problems. It involves a bagging application to decision tree models. One of the primary advantages of the Random Forests model is the reduction in the variance of the forecast. In large scale applications of the model with millions of data points and hundreds of features, the size of the fitted objects can get very large and reach the limits on the available space in production setups, depending on the number and depth of the trees. This could be especially challenging when trained models need to be downloaded on-demand to small devices with limited memory. There is a need to approximate the trained RF models to significantly reduce the model size without losing too much prediction accuracy. In this project we study methods that approximate each fitted tree in the Random Forests model using the multinomial allocation of the data points to the leaves. Specifically, we begin by studying whether fitting a multinomial logistic regression (and subsequently, a generalized additive model (GAM) extension) to the output of each tree helps reduce the size while preserving the prediction quality.  ( 2 min )
    PRoA: A Probabilistic Robustness Assessment against Functional Perturbations. (arXiv:2207.02036v1 [cs.LG])
    In safety-critical deep learning applications robustness measurement is a vital pre-deployment phase. However, existing robustness verification methods are not sufficiently practical for deploying machine learning systems in the real world. On the one hand, these methods attempt to claim that no perturbations can "fool" deep neural networks (DNNs), which may be too stringent in practice. On the other hand, existing works rigorously consider $L_p$ bounded additive perturbations on the pixel space, although perturbations, such as colour shifting and geometric transformations, are more practically and frequently occurring in the real world. Thus, from the practical standpoint, we present a novel and general probabilistic robustness assessment method (PRoA) based on the adaptive concentration, and it can measure the robustness of deep learning models against functional perturbations. PRoA can provide statistical guarantees on the probabilistic robustness of a model, i.e., the probability of failure encountered by the trained model after deployment. Our experiments demonstrate the effectiveness and flexibility of PRoA in terms of evaluating the probabilistic robustness against a broad range of functional perturbations, and PRoA can scale well to various large-scale deep neural networks compared to existing state-of-the-art baselines. For the purpose of reproducibility, we release our tool on GitHub: https://github.com/TrustAI/PRoA.
    Graph Clustering with Graph Neural Networks. (arXiv:2006.16904v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have achieved state-of-the-art results on many graph analysis tasks such as node classification and link prediction. However, important unsupervised problems on graphs, such as graph clustering, have proved more resistant to advances in GNNs. Graph clustering has the same overall goal as node pooling in GNNs - does this mean that GNN pooling methods do a good job at clustering graphs? Surprisingly, the answer is no - current GNN pooling methods often fail to recover the cluster structure in cases where simple baselines, such as k-means applied on learned representations, work well. We investigate further by carefully designing a set of experiments to study different signal-to-noise scenarios both in graph structure and attribute data. To address these methods' poor performance in clustering, we introduce Deep Modularity Networks (DMoN), an unsupervised pooling method inspired by the modularity measure of clustering quality, and show how it tackles recovery of the challenging clustering structure of real-world graphs. Similarly, on real-world data, we show that DMoN produces high quality clusters which correlate strongly with ground truth labels, achieving state-of-the-art results with over 40% improvement over other pooling methods across different metrics.
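    The modularity measure that inspires DMoN can be computed directly for a hard cluster assignment. The sketch below implements the standard Newman modularity for an undirected graph given as a dense adjacency matrix; it is not the DMoN training objective, which the abstract does not spell out, and the function name `modularity` is chosen here for illustration.

```python
import numpy as np


def modularity(adjacency, labels):
    """Newman modularity Q of a hard clustering of an undirected graph.

    Q = (1 / 2m) * sum_ij (A_ij - d_i * d_j / 2m) * [c_i == c_j],
    where d is the degree vector and 2m is the sum of all degrees.
    """
    A = np.asarray(adjacency, dtype=float)
    labels = np.asarray(labels)
    degrees = A.sum(axis=1)
    two_m = degrees.sum()
    same_cluster = labels[:, None] == labels[None, :]
    B = A - np.outer(degrees, degrees) / two_m  # modularity matrix
    return float((B * same_cluster).sum() / two_m)


# Two disconnected edges: 0-1 and 2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]])
q_good = modularity(A, [0, 0, 1, 1])  # clusters match the components
q_flat = modularity(A, [0, 0, 0, 0])  # everything in one cluster
```

    Higher values indicate that more edges fall inside clusters than a degree-preserving random graph would predict; a single all-encompassing cluster always scores zero.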
    Variational Bayes for high-dimensional proportional hazards models with applications within gene expression. (arXiv:2112.10270v2 [stat.ME] UPDATED)
    Few Bayesian methods for analyzing high-dimensional sparse survival data provide scalable variable selection, effect estimation and uncertainty quantification. Such methods often either sacrifice uncertainty quantification by computing maximum a posteriori estimates, or quantify the uncertainty at high (unscalable) computational expense. We bridge this gap and develop an interpretable and scalable Bayesian proportional hazards model for prediction and variable selection, referred to as SVB. Our method, based on a mean-field variational approximation, overcomes the high computational cost of MCMC whilst retaining useful features, providing a posterior distribution for the parameters and offering a natural mechanism for variable selection via posterior inclusion probabilities. The performance of our proposed method is assessed via extensive simulations and compared against other state-of-the-art Bayesian variable selection methods, demonstrating comparable or better performance. Finally, we demonstrate how the proposed method can be used for variable selection on two transcriptomic datasets with censored survival outcomes, and how the uncertainty quantification offered by our method can be used to provide an interpretable assessment of patient risk.
    An Optimization-based Algorithm for Non-stationary Kernel Bandits without Prior Knowledge. (arXiv:2205.14775v2 [stat.ML] UPDATED)
    We propose an algorithm for non-stationary kernel bandits that does not require prior knowledge of the degree of non-stationarity. The algorithm follows randomized strategies obtained by solving optimization problems that balance exploration and exploitation. It adapts to non-stationarity by restarting when a change in the reward function is detected. Our algorithm enjoys a tighter dynamic regret bound than previous work on the non-stationary kernel bandit setting. Moreover, when applied to the non-stationary linear bandit setting by using a linear kernel, our algorithm is nearly minimax optimal, solving an open problem in the non-stationary linear bandit literature. We extend our algorithm to use a neural network for dynamically adapting the feature mapping to observed data. We prove a dynamic regret bound of the extension using the neural tangent kernel theory. We demonstrate empirically that our algorithm and the extension can adapt to varying degrees of non-stationarity.
    A survey of multimodal deep generative models. (arXiv:2207.02127v1 [cs.LG])
    Multimodal learning is a framework for building models that make predictions based on different types of modalities. Important challenges in multimodal learning are the inference of shared representations from arbitrary modalities and cross-modal generation via these representations; however, achieving this requires taking the heterogeneous nature of multimodal data into account. In recent years, deep generative models, i.e., generative models in which distributions are parameterized by deep neural networks, have attracted much attention, especially variational autoencoders, which are suitable for accomplishing the above challenges because they can consider heterogeneity and infer good representations of data. Therefore, various multimodal generative models based on variational autoencoders, called multimodal deep generative models, have been proposed in recent years. In this paper, we provide a categorized survey of studies on multimodal deep generative models.
    $\pi$VAE: a stochastic process prior for Bayesian deep learning with MCMC. (arXiv:2002.06873v5 [cs.LG] UPDATED)
    Stochastic processes provide a mathematically elegant way to model complex data. In theory, they provide flexible priors over function classes that can encode a wide range of interesting assumptions. In practice, however, efficient inference by optimisation or marginalisation is difficult, a problem further exacerbated with big data and high dimensional input spaces. We propose a novel variational autoencoder (VAE) called the prior encoding variational autoencoder ($\pi$VAE). The $\pi$VAE is finitely exchangeable and Kolmogorov consistent, and thus is a continuous stochastic process. We use $\pi$VAE to learn low dimensional embeddings of function classes. We show that our framework can accurately learn expressive function classes such as Gaussian processes, but also properties of functions to enable statistical inference (such as the integral of a log Gaussian process). For popular tasks, such as spatial interpolation, $\pi$VAE achieves state-of-the-art performance both in terms of accuracy and computational efficiency. Perhaps most usefully, we demonstrate that the low dimensional independently distributed latent space representation learnt provides an elegant and scalable means of performing Bayesian inference for stochastic processes within probabilistic programming languages such as Stan.
    CAPITAL: Optimal Subgroup Identification via Constrained Policy Tree Search. (arXiv:2110.05636v2 [stat.ML] UPDATED)
    Personalized medicine, a paradigm of medicine tailored to a patient's characteristics, is an increasingly attractive field in health care. An important goal of personalized medicine is to identify a subgroup of patients, based on baseline covariates, that benefits more from the targeted treatment than other comparative treatments. Most of the current subgroup identification methods only focus on obtaining a subgroup with an enhanced treatment effect without paying attention to subgroup size. Yet, a clinically meaningful subgroup learning approach should identify the maximum number of patients who can benefit from the better treatment. In this paper, we present an optimal subgroup selection rule (SSR) that maximizes the number of selected patients, and in the meantime, achieves the pre-specified clinically meaningful mean outcome, such as the average treatment effect. We derive two equivalent theoretical forms of the optimal SSR based on the contrast function that describes the treatment-covariates interaction in the outcome. We further propose a ConstrAined PolIcy Tree seArch aLgorithm (CAPITAL) to find the optimal SSR within the interpretable decision tree class. The proposed method is flexible to handle multiple constraints that penalize the inclusion of patients with negative treatment effects, and to address time to event data using the restricted mean survival time as the clinically interesting mean outcome. Extensive simulations, comparison studies, and real data applications are conducted to demonstrate the validity and utility of our method.
    Learning Optimal Transport Between two Empirical Distributions with Normalizing Flows. (arXiv:2207.01246v2 [cs.LG] UPDATED)
    Optimal transport (OT) provides effective tools for comparing and mapping probability measures. We propose to leverage the flexibility of neural networks to learn an approximate optimal transport map. More precisely, we present a new method to address the problem of transporting a finite set of samples associated with a first underlying unknown distribution towards another finite set of samples drawn from another unknown distribution. We show that a particular instance of invertible neural networks, namely normalizing flows, can be used to approximate the solution of this OT problem between a pair of empirical distributions. To this aim, we propose to relax the Monge formulation of OT by replacing the equality constraint on the push-forward measure by the minimization of the corresponding Wasserstein distance. The push-forward operator to be retrieved is then restricted to be a normalizing flow which is trained by optimizing the resulting cost function. This approach allows the transport map to be discretized as a composition of functions. Each of these functions is associated with one sub-flow of the network, whose output provides intermediate steps of the transport between the original and target measures. This discretization also yields a set of intermediate barycenters between the two measures of interest. Experiments conducted on toy examples as well as a challenging task of unsupervised translation demonstrate the interest of the proposed method. Finally, some experiments show that the proposed approach leads to a good approximation of the true OT.
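    The relaxation described above can be written schematically as follows. The first line is the standard Monge problem; the second is one plausible penalized form in which the push-forward constraint is replaced by a Wasserstein term. The ground cost $c$, the weight $\lambda > 0$, the order $p$, and the flow parameters $\theta$ are notation assumed here; the abstract does not give the exact objective.

```latex
\begin{align*}
\text{(Monge)} \quad
  & \min_{T} \int c\bigl(x, T(x)\bigr)\, \mathrm{d}\mu(x)
    \quad \text{subject to} \quad T_{\#}\mu = \nu, \\
\text{(relaxed)} \quad
  & \min_{\theta} \int c\bigl(x, T_{\theta}(x)\bigr)\, \mathrm{d}\mu(x)
    + \lambda\, W_{p}\bigl((T_{\theta})_{\#}\mu,\; \nu\bigr),
  \qquad T_{\theta} = T_{\theta_{L}} \circ \cdots \circ T_{\theta_{1}} .
\end{align*}
```

    Writing $T_{\theta}$ as a composition of $L$ sub-flows matches the abstract's remark that each sub-flow output gives an intermediate step of the transport.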
    Meta-Learning a Real-Time Tabular AutoML Method For Small Data. (arXiv:2207.01848v1 [cs.LG])
    We present TabPFN, an AutoML method that is competitive with the state of the art on small tabular datasets while being over 1,000$\times$ faster. Our method is very simple: it is fully entailed in the weights of a single neural network, and a single forward pass directly yields predictions for a new dataset. Our AutoML method is meta-learned using the Transformer-based Prior-Data Fitted Network (PFN) architecture and approximates Bayesian inference with a prior that is based on assumptions of simplicity and causal structures. The prior contains a large space of structural causal models and Bayesian neural networks with a bias for small architectures and thus low complexity. Furthermore, we extend the PFN approach to differentiably calibrate the prior's hyperparameters on real data. By doing so, we separate our abstract prior assumptions from their heuristic calibration on real data. Afterwards, the calibrated hyperparameters are fixed and TabPFN can be applied to any new tabular dataset at the push of a button. Finally, on 30 datasets from the OpenML-CC18 suite we show that our method outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems with predictions produced in less than a second. We provide all our code and our final trained TabPFN in the supplementary materials.
    Making Sense of Dependence: Efficient Black-box Explanations Using Dependence Measure. (arXiv:2206.06219v2 [cs.CV] UPDATED)
    This paper presents a new efficient black-box attribution method based on Hilbert-Schmidt Independence Criterion (HSIC), a dependence measure based on Reproducing Kernel Hilbert Spaces (RKHS). HSIC measures the dependence between regions of an input image and the output of a model based on kernel embeddings of distributions. It thus provides explanations enriched by RKHS representation capabilities. HSIC can be estimated very efficiently, significantly reducing the computational cost compared to other black-box attribution methods. Our experiments show that HSIC is up to 8 times faster than the previous best black-box attribution methods while being as faithful. Indeed, we improve or match the state-of-the-art of both black-box and white-box attribution methods for several fidelity metrics on Imagenet with various recent model architectures. Importantly, we show that these advances can be transposed to efficiently and faithfully explain object detection models such as YOLOv4. Finally, we extend the traditional attribution methods by proposing a new kernel enabling an orthogonal decomposition of importance scores based on HSIC, allowing us to evaluate not only the importance of each image patch but also the importance of their pairwise interactions.
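    As a rough illustration of the dependence measure the abstract relies on, the following sketch computes the standard biased empirical HSIC estimator, trace(KHLH)/(n-1)^2, between two samples. The Gaussian kernel, the bandwidths, and the function names `rbf_kernel` and `hsic_biased` are assumptions made for this example; the paper's own kernels and its masking-based sampling of image regions are not reproduced here.

```python
import numpy as np


def rbf_kernel(x, sigma=1.0):
    """Gram matrix of a Gaussian (RBF) kernel; samples are the rows of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D array as n scalar samples
    sq_norms = np.sum(x * x, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))


def hsic_biased(x, y, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    k = rbf_kernel(x, sigma_x)
    l = rbf_kernel(y, sigma_y)
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return float(np.trace(k @ h @ l @ h) / (n - 1) ** 2)
```

    In an attribution setting, `x` would hold the sampled perturbations of one image region and `y` the corresponding model outputs, so a larger value suggests that the region matters more to the prediction.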
    Estimating means of bounded random variables by betting. (arXiv:2010.09686v6 [math.ST] UPDATED)
    This paper derives confidence intervals (CI) and time-uniform confidence sequences (CS) for the classical problem of estimating an unknown mean from bounded observations. We present a general approach for deriving concentration bounds, that can be seen as a generalization (and improvement) of the celebrated Chernoff method. At its heart, it is based on deriving a new class of composite nonnegative martingales, with strong connections to testing by betting and the method of mixtures. We show how to extend these ideas to sampling without replacement, another heavily studied problem. In all cases, our bounds are adaptive to the unknown variance, and empirically vastly outperform existing approaches based on Hoeffding or empirical Bernstein inequalities and their recent supermartingale generalizations. In short, we establish a new state-of-the-art for four fundamental problems: CSs and CIs for bounded means, when sampling with and without replacement.
    A Probabilistic State Space Model for Joint Inference from Differential Equations and Data. (arXiv:2103.10153v3 [stat.ML] UPDATED)
    Mechanistic models with differential equations are a key component of scientific applications of machine learning. Inference in such models is usually computationally demanding, because it involves repeatedly solving the differential equation. The main problem here is that the numerical solver is hard to combine with standard inference techniques. Recent work in probabilistic numerics has developed a new class of solvers for ordinary differential equations (ODEs) that phrase the solution process directly in terms of Bayesian filtering. We here show that this allows such methods to be combined very directly, with conceptual and numerical ease, with latent force models in the ODE itself. It then becomes possible to perform approximate Bayesian inference on the latent force as well as the ODE solution in a single, linear complexity pass of an extended Kalman filter / smoother - that is, at the cost of computing a single ODE solution. We demonstrate the expressiveness and performance of the algorithm by training, among others, a non-parametric SIRD model on data from the COVID-19 outbreak.
    DMS, AE, DAA: methods and applications of adaptive time series model selection, ensemble, and financial evaluation. (arXiv:2110.11156v3 [stat.AP] UPDATED)
    We introduce three adaptive time series learning methods, called Dynamic Model Selection (DMS), Adaptive Ensemble (AE), and Dynamic Asset Allocation (DAA). The methods respectively handle model selection, ensembling, and contextual evaluation in financial time series. Empirically, we use the methods to forecast the returns of four key indices in the US market, incorporating information from the VIX and Yield curves. We present financial applications of the learning results, including fully-automated portfolios and dynamic hedging strategies. The strategies strongly outperform long-only benchmarks over our testing period, spanning from Q4 2015 to the end of 2021. The key outputs of the learning methods are interpreted during the 2020 market crash.
    VisRuler: Visual Analytics for Extracting Decision Rules from Bagged and Boosted Decision Trees. (arXiv:2112.00334v3 [cs.LG] UPDATED)
    Bagging and boosting are two popular ensemble methods in machine learning (ML) that produce many individual decision trees. Due to the inherent ensemble characteristic of these methods, they typically outperform single decision trees or other ML models in predictive performance. However, numerous decision paths are generated for each decision tree, increasing the overall complexity of the model and hindering its use in domains that require trustworthy and explainable decisions, such as finance, social care, and health care. Thus, the interpretability of bagging and boosting algorithms, such as random forest and adaptive boosting, reduces as the number of decisions rises. In this paper, we propose a visual analytics tool that aims to assist users in extracting decisions from such ML models via a thorough visual inspection workflow that includes selecting a set of robust and diverse models (originating from different ensemble learning algorithms), choosing important features according to their global contribution, and deciding which decisions are essential for global explanation (or locally, for specific cases). The outcome is a final decision based on the class agreement of several models and the explored manual decisions exported by users. We evaluated the applicability and effectiveness of VisRuler via a use case, a usage scenario, and a user study. The evaluation revealed that most users managed to successfully use our system to explore decision rules visually, performing the proposed tasks and answering the given questions in a satisfying way.
    Regret analysis of the Piyavskii-Shubert algorithm for global Lipschitz optimization. (arXiv:2002.02390v4 [cs.LG] UPDATED)
    We consider the problem of maximizing a non-concave Lipschitz multivariate function over a compact domain by sequentially querying its (possibly perturbed) values. We study a natural algorithm designed originally by Piyavskii and Shubert in 1972, for which we prove new bounds on the number of evaluations of the function needed to reach or certify a given optimization accuracy. Our analysis uses a bandit-optimization viewpoint and solves an open problem from Hansen et al.\ (1991) by bounding the number of evaluations to certify a given accuracy with a near-optimal sum of packing numbers.
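    For readers unfamiliar with the algorithm, here is a minimal one-dimensional, noise-free sketch of the Piyavskii-Shubert idea: maintain a piecewise-linear upper bound built from the Lipschitz constant and always query the point where that bound is largest. The paper treats the multivariate and possibly perturbed case; the function name `piyavskii_shubert` and its parameters are assumptions made for this illustration, and the Lipschitz constant is assumed to be known.

```python
import bisect
import math


def piyavskii_shubert(f, a, b, lipschitz, n_evals=50):
    """Maximize a Lipschitz function f on [a, b] (1-D sketch).

    Returns (best_x, best_value, upper_bound), where upper_bound is a
    certified upper bound on max f, valid if `lipschitz` is correct.
    """
    xs = [a, b]
    fs = [f(a), f(b)]

    def interval_peak(i):
        # Peak of the sawtooth upper bound between xs[i] and xs[i + 1].
        gap = xs[i + 1] - xs[i]
        bound = 0.5 * (fs[i] + fs[i + 1]) + 0.5 * lipschitz * gap
        x_peak = 0.5 * (xs[i] + xs[i + 1]) + (fs[i + 1] - fs[i]) / (2.0 * lipschitz)
        return bound, x_peak

    for _ in range(n_evals - 2):
        # Query where the current upper bound is largest.
        _, x_next = max(interval_peak(i) for i in range(len(xs) - 1))
        pos = bisect.bisect(xs, x_next)
        xs.insert(pos, x_next)
        fs.insert(pos, f(x_next))

    upper = max(interval_peak(i)[0] for i in range(len(xs) - 1))
    i_best = max(range(len(xs)), key=fs.__getitem__)
    return xs[i_best], fs[i_best], upper
```

    The difference `upper_bound - best_value` is the certified optimization error, which is the quantity whose evaluation cost the abstract bounds.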
    Minimax Estimation of Linear Functions of Eigenvectors in the Face of Small Eigen-Gaps. (arXiv:2104.03298v2 [math.ST] UPDATED)
    Eigenvector perturbation analysis plays a vital role in various data science applications. A large body of prior works, however, focused on establishing $\ell_{2}$ eigenvector perturbation bounds, which are often highly inadequate in addressing tasks that rely on fine-grained behavior of an eigenvector. This paper makes progress on this by studying the perturbation of linear functions of an unknown eigenvector. Focusing on two fundamental problems -- matrix denoising and principal component analysis -- in the presence of Gaussian noise, we develop a suite of statistical theory that characterizes the perturbation of arbitrary linear functions of an unknown eigenvector. In order to mitigate a non-negligible bias issue inherent to the natural ``plug-in'' estimator, we develop de-biased estimators that (1) achieve minimax lower bounds for a family of scenarios (modulo some logarithmic factor), and (2) can be computed in a data-driven manner without sample splitting. Noteworthily, the proposed estimators are nearly minimax optimal even when the associated eigen-gap is {\em substantially smaller} than what is required in prior statistical theory.
    Adapting to Online Label Shift with Provable Guarantees. (arXiv:2207.02121v1 [cs.LG])
    The standard supervised learning paradigm works effectively when training data shares the same distribution as the upcoming testing samples. However, this assumption is often violated in real-world applications, especially when testing data appear in an online fashion. In this paper, we formulate and investigate the problem of online label shift (OLaS): the learner trains an initial model from the labeled offline data and then deploys it to an unlabeled online environment where the underlying label distribution changes over time but the label-conditional density does not. The non-stationary nature of the environment and the lack of supervision make the problem challenging. To address the difficulty, we construct a new unbiased risk estimator that utilizes the unlabeled data, which exhibits many benign properties albeit with potential non-convexity. Building upon that, we propose novel online ensemble algorithms to deal with the non-stationarity of the environments. Our approach enjoys optimal dynamic regret, indicating that the performance is competitive with a clairvoyant who knows the online environments in hindsight and then chooses the best decision for each round. The obtained dynamic regret bound scales with the intensity and pattern of label distribution shift, hence exhibiting the adaptivity in the OLaS problem. Extensive experiments are conducted to validate the effectiveness and support our theoretical findings.
    A Neural Tangent Kernel Perspective of GANs. (arXiv:2106.05566v4 [cs.LG] UPDATED)
    We propose a novel theoretical framework of analysis for Generative Adversarial Networks (GANs). We reveal a fundamental flaw of previous analyses which, by incorrectly modeling GANs' training scheme, are subject to ill-defined discriminator gradients. We overcome this issue which impedes a principled study of GAN training, solving it within our framework by taking into account the discriminator's architecture. To this end, we leverage the theory of infinite-width neural networks for the discriminator via its Neural Tangent Kernel. We characterize the trained discriminator for a wide range of losses and establish general differentiability properties of the network. From this, we derive new insights about the convergence of the generated distribution, advancing our understanding of GANs' training dynamics. We empirically corroborate these results via an analysis toolkit based on our framework, unveiling intuitions that are consistent with GAN practice.
    Best Subset Selection with Efficient Primal-Dual Algorithm. (arXiv:2207.02058v1 [stat.ME])
    Best subset selection is considered the `gold standard' for many sparse learning problems. A variety of optimization techniques have been proposed to attack this non-convex and NP-hard problem. In this paper, we investigate the dual forms of a family of $\ell_0$-regularized problems. An efficient primal-dual method has been developed based on the primal and dual problem structures. By leveraging the dual range estimation along with the incremental strategy, our algorithm potentially reduces redundant computation and improves the solutions of best subset selection. Theoretical analysis and experiments on synthetic and real-world datasets validate the efficiency and statistical properties of the proposed solutions.
    Predicting Out-of-Domain Generalization with Local Manifold Smoothness. (arXiv:2207.02093v1 [cs.LG])
    Understanding how machine learning models generalize to new environments is a critical part of their safe deployment. Recent work has proposed a variety of complexity measures that directly predict or theoretically bound the generalization capacity of a model. However, these methods rely on a strong set of assumptions that in practice are not always satisfied. Motivated by the limited settings in which existing measures can be applied, we propose a novel complexity measure based on the local manifold smoothness of a classifier. We define local manifold smoothness as a classifier's output sensitivity to perturbations in the manifold neighborhood around a given test point. Intuitively, a classifier that is less sensitive to these perturbations should generalize better. To estimate smoothness we sample points using data augmentation and measure the fraction of these points classified into the majority class. Our method only requires selecting a data augmentation method and makes no other assumptions about the model or data distributions, meaning it can be applied even in out-of-domain (OOD) settings where existing methods cannot. In experiments on robustness benchmarks in image classification, sentiment analysis, and natural language inference, we demonstrate a strong and robust correlation between our manifold smoothness measure and actual OOD generalization on over 3,000 models evaluated on over 100 train/test domain pairs.  ( 3 min )
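    The estimation step described in the abstract (sample augmented neighbours of a test point and measure the fraction assigned to the majority class) can be sketched as follows. The names `local_smoothness`, `predict`, and `augment`, as well as the default sample count, are assumptions for this illustration and are not taken from the paper's code.

```python
import numpy as np


def local_smoothness(predict, x, augment, n_samples=32, rng=None):
    """Fraction of augmented copies of x that fall into the majority class.

    predict: maps a batch (n, ...) to non-negative integer class labels (n,)
    augment: maps (x, rng) to one randomly perturbed copy of x
    """
    rng = np.random.default_rng() if rng is None else rng
    neighbours = np.stack([augment(x, rng) for _ in range(n_samples)])
    labels = np.asarray(predict(neighbours), dtype=int)
    counts = np.bincount(labels)  # votes per predicted class
    return counts.max() / n_samples
```

    Averaging this score over a set of unlabeled test points would give a single number per model; no labels are needed, which is consistent with the abstract's claim that only a data augmentation method has to be chosen.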
    Deep Network Approximation: Achieving Arbitrary Accuracy with Fixed Number of Neurons. (arXiv:2107.02397v6 [cs.LG] UPDATED)
    This paper develops simple feed-forward neural networks that achieve the universal approximation property for all continuous functions with a fixed finite number of neurons. These neural networks are simple because they are designed with a simple and computable continuous activation function $\sigma$ leveraging a triangular-wave function and the softsign function. We prove that $\sigma$-activated networks with width $36d(2d+1)$ and depth $11$ can approximate any continuous function on a $d$-dimensional hypercube within an arbitrarily small error. Hence, for supervised learning and its related regression problems, the hypothesis space generated by these networks with a size not smaller than $36d(2d+1)\times 11$ is dense in the continuous function space $C([a,b]^d)$ and therefore dense in the Lebesgue spaces $L^p([a,b]^d)$ for $p\in [1,\infty)$. Furthermore, classification functions arising from image and signal classification are in the hypothesis space generated by $\sigma$-activated networks with width $36d(2d+1)$ and depth $12$, when there exist pairwise disjoint bounded closed subsets of $\mathbb{R}^d$ such that the samples of the same class are located in the same subset. Finally, we use numerical experimentation to show that replacing the ReLU activation function by ours would improve the experiment results.  ( 3 min )
    Improved Global Guarantees for the Nonconvex Burer--Monteiro Factorization via Rank Overparameterization. (arXiv:2207.01789v1 [math.OC])
    We consider minimizing a twice-differentiable, $L$-smooth, and $\mu$-strongly convex objective $\phi$ over an $n\times n$ positive semidefinite matrix $M\succeq0$, under the assumption that the minimizer $M^{\star}$ has low rank $r^{\star}\ll n$. Following the Burer--Monteiro approach, we instead minimize the nonconvex objective $f(X)=\phi(XX^{T})$ over a factor matrix $X$ of size $n\times r$. This substantially reduces the number of variables from $O(n^{2})$ to as few as $O(n)$ and also enforces positive semidefiniteness for free, but at the cost of giving up the convexity of the original problem. In this paper, we prove that if the search rank $r\ge r^{\star}$ is overparameterized by a constant factor with respect to the true rank $r^{\star}$, namely as in $r>\frac{1}{4}(L/\mu-1)^{2}r^{\star}$, then despite nonconvexity, local optimization is guaranteed to globally converge from any initial point to the global optimum. This significantly improves upon a previous rank overparameterization threshold of $r\ge n$, which is known to be sharp if $\phi$ is allowed to be nonsmooth and/or non-strongly convex, but would increase the number of variables back up to $O(n^{2})$. Conversely, without rank overparameterization, we prove that such a global guarantee is possible if and only if $\phi$ is almost perfectly conditioned, with a condition number of $L/\mu<3$. Therefore, we conclude that a small amount of overparameterization can lead to large improvements in theoretical guarantees for the nonconvex Burer--Monteiro factorization.
    On the Nash equilibrium of moment-matching GANs for stationary Gaussian processes. (arXiv:2203.07136v2 [stat.ML] UPDATED)
    Generative Adversarial Networks (GANs) learn an implicit generative model from data samples through a two-player game. In this paper, we study the existence of Nash equilibrium of the game which is consistent as the number of data samples grows to infinity. In a realizable setting where the goal is to estimate the ground-truth generator of a stationary Gaussian process, we show that the existence of consistent Nash equilibrium depends crucially on the choice of the discriminator family. The discriminator defined from second-order statistical moments can result in non-existence of Nash equilibrium, existence of consistent non-Nash equilibrium, or existence and uniqueness of consistent Nash equilibrium, depending on whether symmetry properties of the generator family are respected. We further study the local stability and global convergence of gradient descent-ascent methods towards consistent equilibrium.
    Modeling and Correcting Bias in Sequential Evaluation. (arXiv:2205.01607v2 [stat.ML] UPDATED)
    We consider the problem of sequential evaluation, in which an evaluator observes candidates in a sequence and assigns scores to these candidates in an online, irrevocable fashion. Motivated by the psychology literature that has studied sequential bias in such settings -- namely, dependencies between the evaluation outcome and the order in which the candidates appear -- we propose a natural model for the evaluator's rating process that captures the lack of calibration inherent to such a task. We conduct crowdsourcing experiments to demonstrate various facets of our model. We then proceed to study how to correct sequential bias under our model by posing this as a statistical inference problem. We propose a near-linear time, online algorithm for this task and prove guarantees in terms of two canonical ranking metrics. We also prove that our algorithm is information theoretically optimal, by establishing matching lower bounds in both metrics. Finally, we show that our algorithm outperforms the de facto method of using the rankings induced by the reported scores.
    What Do Graph Convolutional Neural Networks Learn?. (arXiv:2207.01839v1 [cs.LG])
    Graph neural networks (GNNs) have gained traction over the past few years for their superior performance in numerous machine learning tasks. Graph Convolutional Neural Networks (GCN) are a common variant of GNNs that are known to have high performance in semi-supervised node classification (SSNC), and work well under the assumption of homophily. Recent literature has highlighted that GCNs can achieve strong performance on heterophilous graphs under certain "special conditions". These arguments motivate us to understand why, and how, GCNs learn to perform SSNC. We find a positive correlation between similarity of latent node embeddings of nodes within a class and the performance of a GCN. Our investigation on underlying graph structures of a dataset finds that a GCN's SSNC performance is significantly influenced by the consistency and uniqueness in neighborhood structure of nodes within a class.

  • Open

    [R] Automated Taxonomic Identification of Insects with Expert-Level Accuracy Using Effective Feature Transfer from Convolutional Networks
    An interesting article in the journal Systematic Biology about identifying insects: https://academic.oup.com/sysbio/article/68/6/876/5368535 See also: "Deep learning and computer vision will transform entomology". submitted by /u/1_like_science
    [D] Extracting predicate to apply formal logic rules in autonomous driving dataset or CARLA simulator
    In formal-logic-based autonomous driving datasets, we have a set of rules, usually written in first-order logic or temporal logic. But to apply the rules, we need to extract predicates from the perception system. For example, how do I attach a predicate like standing_at_intersection to a perception scene obtained from an AD dataset such as Lyft or Argoverse, or from the CARLA simulator, so that I can apply rules to those specific scenarios? I could not find any papers or explanations of how to connect predicates in formal logic with the scene interpretation of a dataset. Any help or links to resources are appreciated. submitted by /u/projekt_treadstone
    [P] Reward function as a way to represent multiple targets
    I've been assigned a problem at work with multiple targets, and I've been thinking about the best way to design a model that would optimize towards all of them. An idea that occurred to me is to create a reward function that would "encapsulate" all these targets in such a way that the higher the reward, the better the outcome is for all the targets. In my case, it's a task distribution system where the workers have the option to decline a task if for whatever reason the task doesn't suit them, and one of my targets is to minimize the number of declines. But we also need to make sure the workload is balanced, and we are not overwhelming someone while under-utilizing the rest of the team; that would be my second target, and we can use the standard deviation as a way to measure the workload balance (the closer to 0 the std is, the better). Essentially, the targets we want to optimize towards are to reduce the number of declines and to reduce the std of the overall task distribution. So, my reward function could be: - score 0 if the task is declined; - if the task is accepted, then I can take the delta of the std before and after. The bigger the delta, the more the std was reduced, so the more even the distribution became. That way, the reward score would in a way represent both my targets (and would be the labels), and then it's simply a matter of training a regression model. Then for a new task, I predict the reward score for each task-worker pair, and finally assign the tasks by taking the argmax of the predicted scores. I know that rewards are popular in the RL field, but this wouldn't necessarily be an RL problem. In fact, I googled this idea, but the vast majority of articles and papers covering reward functions are RL-related. I'm wondering if anyone has tried anything like this before, or has any thoughts. All comments are appreciated. submitted by /u/Travolta1984
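    The reward described in the post can be written down in a few lines. This is only a sketch of one reading of it: the function name `reward`, the use of the population standard deviation, and the choice to count each task as one unit of workload are all assumptions made for illustration.

```python
import statistics


def reward(workloads, worker, accepted):
    """Reward for offering one task to `worker`.

    workloads: current number of tasks per worker (before the assignment)
    accepted:  whether the worker accepted the task
    """
    if not accepted:
        return 0.0  # declined tasks score zero, as in the post
    std_before = statistics.pstdev(workloads)
    after = list(workloads)
    after[worker] += 1  # assumption: every task adds one unit of load
    std_after = statistics.pstdev(after)
    # Positive when the assignment made the distribution more even.
    return std_before - std_after
```

    One consequence worth checking against the intended behaviour: an accepted task that makes the distribution less even gets a negative reward, which ranks it below a declined task.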
    [P] No, we don't have to choose batch sizes as powers of 2
    Prompted by a recent discussion on social media, I did some benchmarks and wrote down my thoughts on why it doesn't really make a difference whether we choose batch sizes as powers of 2: https://sebastianraschka.com/blog/2022/batch-size-2.html What is your experience: do you stick to batch sizes as powers of 2, or do you choose batch sizes more freely? Do you notice a substantial difference when you choose batch sizes as powers of 2 (or multiples of 8)? submitted by /u/seraschka
    [D] How do you share a server for multiple training jobs?
    First of all, using the cloud is not a cost-effective solution for us. We have an absolute beast of a server, though everything grinds to a halt when some training sessions are going on: some libraries just ignore the num_cpu settings and use all the CPUs (and even when more cores are free, everything seems to get much slower). Here's the build: 2x AMD EPYC 7763 (64 cores, 2 threads each), 2TB memory, 8 RTX A6000, 4TB SSD (NVMe). How do you all share a single computer resource amongst co-workers? We have this expensive machine, but when someone runs their training, others have a hard time running basic pandas operations (starting other training jobs just slows down ALL training jobs). To me, it seems like the hardware should be more than enough to run multiple training jobs concurrently. Any tips on how to use it efficiently? One solution I've been thinking about is to use docker for each training job and to put hard limits on CPU / memory usage. Is this something closer to best practice? submitted by /u/tadf2
    [P] Concrete dropout implementation for tensorflow 2.0
    Hello everyone! I updated the concrete dropout implementation from the original authors to work with tensorflow 2.0, tweaked the code a bit and turned it into a pip package! If you are interested, you can find it on PyPI by searching "concretedropout". There is also a link in the comments. For those of you who don't know what concrete dropout is, it's a technique which allows for the training of the dropout probability in a layer, which may save a lot of time since it removes the need to grid search for the best dropout parameters. For more information, see the original paper: arXiv:1705.07832 submitted by /u/TrPhantom8
    [D] Looking for a fast OCR repo
    Currently, we use Google as our OCR service provider, but we've already had some serious issues with them and their customer support is terrible. Therefore we would like to change and move away from third-party providers in general. By now we have a sufficient amount of data to train our own OCR model, so I am looking for a custom fine-tunable model that is fast and accurate. I've found PaddleOCR and mmocr, but their inference speed for documents like invoices on CPU is quite slow (10s/page on my computer). I'm looking for something in the 1s/page range, similar to Google's OCR. We probably don't need all the power and language knowledge these libraries provide, as we only operate on documents in mainly 4 Latin languages. Does anybody know a good starting point? submitted by /u/mkeySeraSera
    [P] Using transformers for time-series forecasting
    I'm currently using different machine learning techniques on a time series and testing their forecast performance. This dataset has both an independent variable and exploratory variables. I've used an LSTM in Python to forecast and, while searching for more recent techniques, found transformers. They seem to have been developed for NLP but have been used for time-series forecasts. How well do these transformers perform, and are there any resources or libraries I should look into? EDIT: the data I'll be using is of daily periodicity without weekends. It will have 2+ years of observations (currently working with 3 years, and some other datasets have longer periods but "worse" information). submitted by /u/DoruSonic
    WACV 2023 Paper Registration. [R]
    Does anyone know how to register for the WACV 2023 conference? submitted by /u/jeryyjohnson
  • Open

    Tom Cruise without the power of Scientology.
    submitted by /u/cganimater
    AI Dream 58 - Incredible Stellar Trip - vqgan clip
    submitted by /u/LordPewPew777
    GitHub Copilot is the first real product of large language models
    submitted by /u/bendee983
    The four big misconceptions of AI research
    submitted by /u/Zirius_Sadfaces
    Closest majors/fields to AI
    I have just graduated from school and I wanted to major in AI engineering. Unfortunately, I found out that I can't, not even in close majors like computer science. The only two options I have now are Software Engineering and Computer Engineering, which are both far from my interests, such as AI, Machine Learning, Simulations, and 3D Engines. My second plan is now to get a master's degree in AI after finishing my bachelor's degree in one of the two majors I mentioned, which I could later afford on my own given their job prospects. My problem with Software Engineering is that it's too restrictive and I'm not a big fan of making software, apps and websites, but I know two friends who have already majored in it. Computer Engineering does seem more interesting, with the making of hardware components, but it includes Electrical Engineering, which I'm NOT a fan of. My view on majors other than AI is pretty superficial, so I'm not claiming to have formed an educated opinion on what to opt for. Do I have to go with Software Engineering or Computer Engineering according to my plan and interests? I'm open to opinions, even if there are other majors than these two! submitted by /u/CATEXEBRAIN
    Introducing the RAVEN MVP - a general purpose AI companion (with a live DEMO)
    submitted by /u/DavidKShapiro
    The Shortest Guide to Launch Your Career The AI Way (Infographic)
    This infographic shows the many ways in which AI is transforming business, as well as the leading job roles created by this change and the top skills needed to ride the AI wave. submitted by /u/Emily-joe
    Brain-Supervised Image Editing
    Despite recent advances in deep neural models for semantic image editing, present approaches are dependent on explicit human input. Previous work assumes the availability of manually curated datasets for supervised learning, while for unsupervised approaches the human inspection of discovered components is required to identify those which modify worthwhile semantic features. Here, we present a novel alternative: the utilization of brain responses as a supervision signal for learning semantic feature representations. Participants (N=30) in a neurophysiological experiment were shown artificially generated faces and instructed to look for a particular semantic feature, such as "old" or "smiling", while their brain responses were recorded via electroencephalog…
    Why is open ended conversational AI not more popular?
    Until recently, with projects like LaMDA and BlenderBot, the area of open-ended conversational AI was either completely untouched or kept purely as research. Very few of these projects have actually been used in applications for just having a two-way conversation with a user. For people in the field, why is this the case? Do companies not see a path forward with open-ended dialogue systems? submitted by /u/holamyeung
    still plugging Starryai
    submitted by /u/rikusorasephiroth
    Hi all, every week I host AI sessions on DPhi. While all our resources are free, we create them with passion & quality. Happy to share this upcoming session on Tesla Autopilot. Would love to see you join. Link for it - https://dphi.tech/live-sessions/tesla-autopilot-ml?utm_source=reddit. 😃
    submitted by /u/muditjps
  • Open

    "Watch and Match: Supercharging Imitation with Regularized Optimal Transport (ROT)", Haldar et al 2022
    submitted by /u/gwern
    TRPO Practical Implementation vs Lagrangian
    Hi all, so TRPO enforces a constraint on the approximated KL divergence (which is clear to me). However, I was wondering why they solve such a constrained optimization problem using the "hard way" (i.e., a linear approx. on the objective and the quadratic one on the constraint), when they could have used a simpler Lagrangian dual. Is there any advantage in doing that over using a Lagrangian? Thanks! submitted by /u/Beautiful_Zebra_198
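    For reference, the approximated problem mentioned in the question can be written out explicitly. The notation below is an assumption introduced for this note: $g$ is the gradient of the surrogate objective at the current parameters $\theta_k$, $H$ is the Hessian of the KL divergence there, and $\delta$ is the trust-region size. Under this reading, the linear-quadratic subproblem is itself solved through its Lagrangian, which gives a closed-form step:

```latex
\begin{aligned}
&\max_{\theta}\; g^{\top}(\theta-\theta_k)
  \quad\text{s.t.}\quad
  \tfrac{1}{2}(\theta-\theta_k)^{\top} H (\theta-\theta_k) \le \delta, \\
&\mathcal{L}(\theta,\lambda)
  = g^{\top}(\theta-\theta_k)
  - \lambda\Bigl(\tfrac{1}{2}(\theta-\theta_k)^{\top} H (\theta-\theta_k) - \delta\Bigr),
  \qquad \lambda \ge 0, \\
&\nabla_{\theta}\mathcal{L} = 0
  \;\Rightarrow\;
  \theta-\theta_k = \tfrac{1}{\lambda} H^{-1} g,
  \qquad
  \lambda = \sqrt{\frac{g^{\top} H^{-1} g}{2\delta}}, \\
&\theta^{\star} = \theta_k + \sqrt{\frac{2\delta}{g^{\top} H^{-1} g}}\; H^{-1} g .
\end{aligned}
```

    The multiplier $\lambda$ follows from the constraint being active at the optimum, so the step length is fixed by $\delta$ rather than by a penalty coefficient that would have to be tuned.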
    Could someone help me with this question or point me to some helpful resources?
    Check whether, and if so where, a static environment is implicitly assumed in the Monte Carlo algorithm, temporal-difference learning (TD(0)), the Dyna-Q architecture, and R-Max. How could you modify the relevant learning methods so that they can in principle adapt to changing environments? It can be assumed that ε is sufficiently large. submitted by /u/Garbage-Shoddy
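    One place where stationarity is often assumed implicitly is the step size: a sample-average update (step size 1/n) weights all past observations equally, while a constant step size keeps tracking recent ones. The sketch below compares the two on a stream of rewards; it is a generic illustration, not a full answer for any of the four methods, and the function name `running_estimates` is an assumption.

```python
def running_estimates(rewards, alpha=0.1):
    """Return (sample_average, constant_step_estimate) after seeing all rewards."""
    avg, const = 0.0, 0.0
    for n, r in enumerate(rewards, start=1):
        avg += (r - avg) / n          # step size 1/n: assumes a fixed target
        const += alpha * (r - const)  # constant step size: tracks a moving target
    return avg, const
```

    If the reward switches from 0 to 1 halfway through a stream, the sample average ends near 0.5 while the constant-step estimate ends close to 1, i.e. close to the current value.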
    Goal-Conditionned policy on Ant Maze
    Hi, I'm trying to learn a goal conditioned policy (a policy that make an agent reach a goal, that change at each episode) on ant maze (the mujoco one). This task looks really tough, I easily learned that king of policy on a grid world but my agent (tried with SAC and DDPG) failed to find a goal-conditioned policy in d4RL. Even the DDPG code given buy D4RL authors fail to do it.Did anybody here already did it? Do you have any git repository to share with me, or any tips? Thanks a lot in advance. Links:D4RL environments: https://github.com/rail-berkeley/d4rl D4RL DDPG implementation I'm talking about: https://github.com/rail-berkeley/d4rl_evaluations/tree/master/bcq/continuous_bcq submitted by /u/hbonnavaud [link] [comments]  ( 85 min )
    Learning the CartPole so fast
    That you do not have time to get bored by watching it in real time. So it sounds like a challenge: does any of you know a faster learning algorithm for gym CartPole? Sorry the repository is messy. cartpole_play.py is the main file; its local dependencies are sdr_util.py and sdr_value_map.py, and these are all that is needed. Its global dependencies are numpy, numba, gym, and pygame if you want rendering. A short explanation of the algorithm: after each fall, two bit-pair correlation value maps are updated to chart dangerous states in the environment, then the agent picks the least dangerous action at every step. Somewhat like a Q-table, yet quite efficient, since it highlights the specific value correlations between different state parameters that are most significant. submitted by /u/blimpyway [link] [comments]  ( 84 min )
    Are there any human-level chess AIs that don't use MCTS (Monte Carlo Tree Search)?
    From the papers I've read, it seems like all the existing methods (AlphaGo, AlphaGo Zero, MuZero, EfficientZero) use MCTS at some point. Are there methods that don't perform search (i.e. directly predict the action from the state only, maybe with something like policy gradients) that have been shown to reach human-level performance at chess? submitted by /u/Lairv [link] [comments]  ( 90 min )
  • Open

    DSC Weekly 05 July 2022: Standardizing a Metaverse
    Facebook’s announcement last year about creating the Metaverse (and subsequent rebranding to Meta) kicked off a great deal of PR from the tech industry as everyone from established game companies to decentralized finance wildcats raced to plant their flag in the ground. Roughly a year has passed and in the interim the initial fervor has… The post DSC Weekly 05 July 2022: Standardizing a Metaverse appeared first on Data Science Central.  ( 20 min )
    Wanna become Value-driven? Time for a Culture Shift!
    I am honored to collaborate on this week’s blog with Fran Willis White, an industry expert on the role of change leadership and employee empowerment to drive cultural transformation.  In collaborating on this blog, I discovered many similarities in the role of empowerment in the data science development process to optimize business outcomes, as well… The post Wanna become Value-driven? Time for a Culture Shift! appeared first on Data Science Central.  ( 19 min )
    Education Trends 2022: Data Science in schools
    Data Science is a growing field that has emerged in many key areas of our world. Data Science has become a global phenomenon and has significantly improved the performance of many industries. Data Science has even incorporated education under its umbrella. Today we will be discussing the importance of data science for education & some… The post Education Trends 2022: Data Science in schools appeared first on Data Science Central.  ( 20 min )
    Databricks open sourcing delta lake is good news for AI
    Last week, Databricks open sourced all of Delta Lake (Delta Lake 2.0) to the Linux Foundation.  There is also a new release of MLflow (MLflow 2.0), which is a machine learning operations platform for management of ML pipelines.  In Databricks parlance, a Delta Lake represents a data architecture that has both storage and analytics capabilities; … The post Databricks open sourcing delta lake is good news for AI appeared first on Data Science Central.  ( 17 min )
  • Open

    Use Amazon SageMaker Data Wrangler in Amazon SageMaker Studio with a default lifecycle configuration
    If you use the default lifecycle configuration for your domain or user profile in Amazon SageMaker Studio and use Amazon SageMaker Data Wrangler for data preparation, then this post is for you. In this post, we show how you can create a Data Wrangler flow and use it for data preparation in a Studio environment […]  ( 8 min )
  • Open

    Watching the Watchers: Democratizing AI To Audit The State
    Socially disadvantaged communities have often raised legitimate concerns about being over-policed and under-protected. Now, the rise of AI…  ( 12 min )
    How Much Does an AI Solution Cost?
    Since a customized AI solution is always individual, no one can give you a general cost estimate.  ( 8 min )
  • Open

    Computer Graphics Artist Xueguo Yang Shares Fractal Art Series This Week ‘In the NVIDIA Studio’
    Putting art, mathematics and computers together in the mid-1980s created a new genre of digital media: fractal art. In the NVIDIA Studio this week, computer graphics (CG) artist, educator and curator Xueguo Yang shares his insights behind fractal art — which uses algorithms to artistically represent calculations derived from geometric objects as digital images and animations. The post Computer Graphics Artist Xueguo Yang Shares Fractal Art Series This Week ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.  ( 7 min )

  • Open

    “Japanese Samurai”
    submitted by /u/pixelz_ai [link] [comments]  ( 82 min )
    Researchers from George Mason and Emory University Develop ‘RES’: a Robust Python Framework for Learning to Explain DNNs (Deep Neural Networks) with Explanation Supervision
    The study of explainability, or explainable AI, is currently receiving a lot of attention as DNNs become accessible in a variety of application domains. Many explainability techniques that attempt to provide a local explanation of a DNN's prediction for a particular instance, such as techniques that provide saliency maps for understanding which sub-parts of an instance are most responsible for the model prediction, have been proposed in an effort to open the black box of DNNs. While local explanation techniques have seen rapid growth in research in recent years, the majority of attention has been placed on handling the generation of explanations rather than understanding whether the explanations are accurate or reasonable, what to do if they are, and how to modify the model to produce more accurate or reasonable explanations. Continue reading | Check out the paper and GitHub submitted by /u/ai-lover [link] [comments]  ( 84 min )
    Researchers at Stanford have developed an Artificial Intelligence (AI) model, EG3D, that can generate random images of faces and other objects with high resolution together with underlying geometric structures
    Artificially intelligent models have recently advanced to the point that users will soon be able to utilize these models to immediately construct and alter nearly photorealistic three-dimensional sceneries from the comfort of their laptops. Since these technologies make it simple to generate hyperrealistic avatars, they will revolutionize the way artists working on video games and CGI for movies approach their work. For quite some time, AIs have been able to create realistic 2D images. However, 3D scenarios have proven to be more challenging due to the enormous computer power needed. The AI model EG3D, created by a team of Stanford academics, can be used to produce random high-resolution images of faces and other things having an underlying geometric structure. This model is one of the first 3D models now in use to reach rendering quality close to photorealism. Continue reading | Checkout the paper, github submitted by /u/ai-lover [link] [comments]  ( 84 min )
    AI Researchers deserve a Nobel Prize!
    Why is there a Nobel Prize? The Nobel Prize was set up when businessman and entrepreneur Alfred Nobel died and left the majority of his fortune to the establishment of prizes in physics, chemistry, physiology or medicine, literature and peace. His will stated that the prizes should be awarded to “those who, during the preceding year, shall have conferred the greatest benefit to humankind.” [source: https://www.nobelprize.org ] I think AI technologies are everywhere; in physics, chemistry, physiology, medicine, literature, peace, etc. The Nobel Peace Prize foundation should dedicate a prize to AI researchers who invent technologies that change human life. Who agrees? submitted by /u/aymenSekhri [link] [comments]  ( 83 min )
    Is General Intelligence "Compact"? | LessWrong
    submitted by /u/DragonGod2718 [link] [comments]  ( 83 min )
    Bringing Python to Browser for Doing Image Processing
    Ever wondered whether we could learn Python in the browser and run machine learning apps? I recently came to know about PyScript, which can be used to run Python in the browser. Still, I couldn't find a single example or post that demonstrates image processing using PyScript, so I decided to figure it out myself, create one, and share it with the community. The article can be found at the link below: https://blog.devgenius.io/bringing-python-to-browser-for-doing-image-processing-c34f5bba9c1d submitted by /u/VikasOjha666 [link] [comments]  ( 83 min )
    DARK TEMPLES ESCAPADE | 4K DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 83 min )
    The first really "scary" Windows bot will just automatically click any X in the top right corner of any new window.
    submitted by /u/OmitsWordsByAccident [link] [comments]  ( 83 min )
    Implementing Simple Neural Network in C#
    submitted by /u/RubiksCodeNMZ [link] [comments]  ( 83 min )
    NBA Teams Mug Rugs, (Coasters) Which one is the best?
    submitted by /u/aysheshandmade [link] [comments]  ( 82 min )
    6 Best Artificial Intelligence courses for Healthcare You should learn 2022
    https://codingvidya.com/best-artificial-intelligence-courses-for-healthcare/ submitted by /u/Lakshmireddys [link] [comments]  ( 82 min )
  • Open

    [P] Poniard: a companion library for scikit-learn that helps with model evaluation and comparison
    TL;DR: Check out Poniard, a new Python library that helps with machine learning model evaluation. You can go ahead and install it with pip. Links to source code and documentation are at the end of this post. ----- For the past few months I've been working on Poniard, a Python library that streamlines ML model evaluation and comparison, built on top of scikit-learn. In a nutshell: load some data, select some models, some metrics and a cross-validation strategy, and go to town. Poniard tries to have a small footprint, a simple API and sane defaults. But above all it strives to have the user stay in control of their modeling experience; you should always know what's going on. This is deliberately NOT an AutoML tool. When I started this project I was trying to speed up a very uninteresting process, i.e., looping through multiple estimators to arrive at a list of metrics for comparison. Along the way I included easy hyperparameter tuning, plotting, an extensible plugin framework (which includes Weights and Biases and Pandas Profiling out of the box) and did as much as I could to make the experience simple and transparent. Poniard is not exactly groundbreaking, and there are projects in a similar vein that do so much more. However, they tend to have a more complicated API and more dependencies, which are some of the things that I actively tried to avoid. Links: GitHub, example notebooks (including Colab links), documentation, PyPI. submitted by /u/rafa10pj [link] [comments]  ( 85 min )
    [P] Bulk AI Text Generation (No/Low code)
    https://textgenerator.app.nz/bulk-text-generator You can upload a CSV and get lots of text generated; it works in many languages and for code too. There's also an API. The main selling points (vs. OpenAI, the main competitor): works faster; Curie/Babbage quality, but also works across languages/code without needing to specify a model; massively cheaper pricing/huge cost savings :); easier to control: you can specify max_sentences to make it generate up to a specific number of sentences, and min_probability to make it generate only the next few likely words, e.g. for autocomplete of code/writing. I originally created https://textgenerator.app.nz/ as an API primarily for developers, but the bulk generator now lets non-technical users pre-generate a lot of variety too: branching stories, games, marketing content, code, summaries, etc. There's actually a massive number of use cases that one will never be able to fully anticipate, which is exciting too. submitted by /u/leepenkman [link] [comments]  ( 86 min )
    [D] Backpropagating from GPT-2's output
    I am working on a research project about controllable generation with GPT. I am stuck, so I hope you are able to help me out. I will try to explain the issue as clearly as possible, so bear with me. The approach I am pursuing right now is adding a frozen classifier on top of GPT that should steer the model towards generating the right class, which is a grammatical property of the generated output. However, the autoregressive nature of GPT complicates things a bit: I cannot simply backpropagate through the generation process (greedy / beam search). I tried adding the classifier on the last input token to avoid the generation process, but unsurprisingly this does not yield sufficient performance. How would you tackle this? Is it even feasible? submitted by /u/_Arsenie_Boca_ [link] [comments]  ( 85 min )
    [D] Does anyone here use Google's seqio library?
    In my research I recently came across this library from Google: seqio: Task-based datasets, preprocessing, and evaluation for sequence models. From the citation it seems it was released jointly with another library from Google, t5x. From the paper and the docs, it sounds quite similar to Hugging Face's datasets library, albeit perhaps slightly more opinionated. I was hoping to find a more thorough comparison with pre-existing data loading/processing libraries but couldn't find one (they mostly focus on t5x in the paper). Has anyone here used it? What was your experience? To me it seems a bit redundant, but I haven't been able to take a deeper dive. Thanks :) submitted by /u/thesofakillers [link] [comments]  ( 84 min )
    [D] How do you share big datasets with your team and others?
    Looking for a bit of a discussion. I'm wondering how you collaborate on data... i.e. how do you share big datasets with data scientists/engineers, within and outside of your team? Do you just push it into a simple DB, do you upload it to Kaggle (if non-sensitive) or via Google Drive/OneDrive? What if the dataset gets updated frequently? I'm working with a customer and sharing data is a bit of a pain. submitted by /u/dmart89 [link] [comments]  ( 91 min )
    [R] Masking for Representation Learning in Vision
    A blog post about representation learning from masked images, what makes a good mask, and how to learn such masks: https://akosiorek.github.io/ml/2022/07/04/masking_repr_learning_vision.html. Based on a recent ICML paper: Shi et al., "Adversarial Masking for Self-Supervised Learning", ICML 2022. submitted by /u/ErrorDry4380 [link] [comments]  ( 84 min )
    [P] Feathr - An Open-Source, Enterprise-Grade and High-Performance Feature Store
    Hi everyone! We are engineers from Microsoft/LinkedIn, and we released an open-source Feature Store called Feathr a few weeks ago (https://github.com/linkedin/feathr). Its highlights are listed below. Feel free to check out the repository and let us know if there are any questions! We also have a few blog posts and recordings in case folks want to learn a bit more about it: Open Sourcing Feathr; Feathr on Azure; tech talks on Feathr. Its highlights include (more highlights are here): Battle-tested in production for more than 6 years: LinkedIn has been using Feathr in production for over 6 years and has a dedicated team improving it. Scalable with built-in optimizations: for example, based on some internal use cases, Feathr can process billions of rows and PB-scale data with built-in optimizations such as bloom filters and salted joins. Rich support for point-in-time joins and aggregations: Feathr has highly performant built-in operators designed for feature stores, including time-based aggregation, sliding window joins, and look-up features, all with point-in-time correctness. Derived features and a centralized Feature Registry, which encourage feature consumers to build on existing features and reuse them. Screenshots for the Feathr UI: https://preview.redd.it/3fri2r3qoi991.png?width=3584&format=png&auto=webp&s=5dfe14233b2a8805c50bedd5bfed4bbb31bd0654 submitted by /u/zxzxy1988 [link] [comments]  ( 86 min )
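To make "point-in-time correctness" concrete: each observation should only see feature values recorded at or before the observation's own timestamp, so that no future information leaks into training data. The following is a minimal standard-library sketch of that idea; it is not Feathr's API, and the function name and the tuple layouts are assumptions made for illustration.

```python
from bisect import bisect_right


def point_in_time_join(observations, feature_rows):
    """Attach to each (key, obs_ts) the latest feature value with ts <= obs_ts.

    observations: list of (key, obs_ts) tuples          (hypothetical layout)
    feature_rows: list of (key, feature_ts, value) rows (hypothetical layout)
    """
    # Group feature rows per key, with timestamps in ascending order.
    by_key = {}
    for key, ts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        times, values = by_key.setdefault(key, ([], []))
        times.append(ts)
        values.append(value)

    joined = []
    for key, obs_ts in observations:
        times, values = by_key.get(key, ([], []))
        # Index of the last feature timestamp that is <= obs_ts, or -1 if none.
        idx = bisect_right(times, obs_ts) - 1
        joined.append((key, obs_ts, values[idx] if idx >= 0 else None))
    return joined


features = [("u1", 10, 1.0), ("u1", 20, 2.0), ("u2", 15, 5.0)]
obs = [("u1", 5), ("u1", 19), ("u1", 25), ("u2", 15), ("u3", 1)]
for row in point_in_time_join(obs, features):
    print(row)
```

An observation at time 19 gets the value written at time 10, not the one written at time 20; observations with no earlier feature row get `None`.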
    [D] Is there any deep learning algorithm based on divide and conquer?
    When dealing with very large data, e.g. very long video datasets, the problem is long training time. Most techniques use distributed deep learning to address this. My idea is to divide the dataset into small subsets and train a model on each. Then use those models' predictions as features, feed them into a second model, and train it to predict the output. It is like divide and conquer, except that here we divide the dataset, train models, and combine ("conquer") the prediction results into one. I have searched the internet for deep learning algorithms based on divide and conquer, but there do not seem to be many articles about it. Is this a correct way to think about it? Does anyone know any papers about this? Thank you so much. submitted by /u/tmclouisluk [link] [comments]  ( 91 min )
    [D] Which U.S. universities are actively studying generative models?
    Although there are university rankings such as U.S. News, it is difficult to find the universities that are good at the specific field one is interested in. We all know that Stanford and Berkeley are good at generative models, but what else? Please give me the name of the university (plus the name of the professor if possible) and a paper they published. It would be especially meaningful if the university is not very famous and their paper is outstanding. submitted by /u/SnooPandas3529 [link] [comments]  ( 85 min )
  • Open

    "Remaking EfficientZero (as best I can)", Hoagy (experiences implementing MuZero)
    submitted by /u/gwern [link] [comments]  ( 83 min )
    [ReReading Reinforcement Learning by Sutton and Barto] Chapter 2 - Multi-armed Bandits
    Here's the update for week 2 of reading the book! This week's reading is also quite short at 17 pages, which is barely 2.5 pages per day! The chapter covers the first basic concepts and the gradient bandit algorithms. As far as live discussions go, a Discord server has been created just for this purpose. See https://discord.gg/Juafpk23 (thanks u/duh619 for creating the channel). Use Discord at your own discretion though. (https://spyware.neocities.org/articles/discord.html) The plan is to have weekly discussions; the search for a common time slot is ongoing until tomorrow night. To supplement your reading, you can find summaries of the chapters on YouTube: https://www.youtube.com/watch?v=4SLGEq_HZxk&list=PLnn6VZp3hqNvRrdnMOVtgV64F_O-61C1D&index=1 (thanks to u/taplik_to_rehvani for pointing this out). Happy reading, hope to see some comments discussing questions and ideas from this week's chapter! submitted by /u/Accomplished-Ninja31 [link] [comments]  ( 83 min )
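For readers following along, the gradient bandit algorithm mentioned above can be sketched in a few lines. This is a minimal illustration assuming Gaussian rewards with unit variance and a running-average baseline; the function names, arm means, and hyperparameters are chosen here for the example and are not from the post.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_bandit(true_means, steps=2000, alpha=0.1, seed=0):
    """Run a gradient bandit and return the final action probabilities."""
    rng = random.Random(seed)
    prefs = [0.0] * len(true_means)  # action preferences H(a)
    baseline = 0.0                   # running average of observed rewards
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        action = rng.choices(range(len(prefs)), weights=probs)[0]
        reward = rng.gauss(true_means[action], 1.0)
        baseline += (reward - baseline) / t
        # Raise the preference of the chosen action if the reward beat the
        # baseline, and lower the others in proportion to their probability.
        for a in range(len(prefs)):
            indicator = 1.0 if a == action else 0.0
            prefs[a] += alpha * (reward - baseline) * (indicator - probs[a])
    return softmax(prefs)


print(gradient_bandit([0.0, 0.5, 2.0]))
```

The preferences only matter relative to each other, which is why the update subtracts the current probability of each action rather than adjusting the chosen action alone.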
    RL with differentiable environment
    So bear with me here, I have some experience in other types of ML, but I don't really know much about RL. I have a problem where I want to use a neural network to see some history of inputs, choose a set of parameters, and then that set of parameters modifies a simulation that eventually spits back a loss. This is all a time series, so those losses can either be viewed per sample or be batched up in some way. Anyway, to me it seems RL in general has to deal with interacting with some big unknown external system (the "environment"). However, in my scenario, that simulation is actually a relatively straightforward algorithm that I've already implemented in PyTorch and is differentiable. Does this buy me anything that "normal RL" has to hack its way around? Any insights here are greatly appreciated. Thanks in advance. submitted by /u/saw79 [link] [comments]  ( 85 min )
    Add noise in State Space
    I am wondering how it makes sense to add noise in state-space during the training process at random times. submitted by /u/Mariam_Dundua [link] [comments]  ( 83 min )
  • Open

    Measuring Forgetting of Memorized Training Examples. (arXiv:2207.00099v1 [cs.LG])
    Machine learning models exhibit two seemingly contradictory phenomena: training data memorization and various forms of forgetting. In memorization, models overfit specific training examples and become susceptible to privacy attacks. In forgetting, examples which appeared early in training are forgotten by the end. In this work, we connect these phenomena. We propose a technique to measure to what extent models ``forget'' the specifics of training examples, becoming less susceptible to privacy attacks on examples they have not seen recently. We show that, while non-convexity can prevent forgetting from happening in the worst-case, standard image and speech models empirically do forget examples over time. We identify nondeterminism as a potential explanation, showing that deterministically trained models do not forget. Our results suggest that examples seen early when training with extremely large datasets -- for instance those examples used to pre-train a model -- may observe privacy benefits at the expense of examples seen later.  ( 2 min )
    Community detection and percolation of information in a geometric setting. (arXiv:2006.15574v2 [stat.ML] UPDATED)
    We make the first steps towards generalizing the theory of stochastic block models, in the sparse regime, towards a model where the discrete community structure is replaced by an underlying geometry. We consider a geometric random graph over a homogeneous metric space where the probability of two vertices to be connected is an arbitrary function of the distance. We give sufficient conditions under which the locations can be recovered (up to an isomorphism of the space) in the sparse regime. Moreover, we define a geometric counterpart of the model of flow of information on trees, due to Mossel and Peres, in which one considers a branching random walk on a sphere and the goal is to recover the location of the root based on the locations of leaves. We give some sufficient conditions for percolation and for non-percolation of information in this model.  ( 2 min )
    PROTOtypical Logic Tensor Networks (PROTO-LTN) for Zero Shot Learning. (arXiv:2207.00433v1 [cs.CV])
    Semantic image interpretation can vastly benefit from approaches that combine sub-symbolic distributed representation learning with the capability to reason at a higher level of abstraction. Logic Tensor Networks (LTNs) are a class of neuro-symbolic systems based on a differentiable, first-order logic grounded into a deep neural network. LTNs replace the classical concept of training set with a knowledge base of fuzzy logical axioms. By defining a set of differentiable operators to approximate the role of connectives, predicates, functions and quantifiers, a loss function is automatically specified so that LTNs can learn to satisfy the knowledge base. We focus here on the subsumption or \texttt{isOfClass} predicate, which is fundamental to encode most semantic image interpretation tasks. Unlike conventional LTNs, which rely on a separate predicate for each class (e.g., dog, cat), each with its own set of learnable weights, we propose a common \texttt{isOfClass} predicate, whose level of truth is a function of the distance between an object embedding and the corresponding class prototype. The PROTOtypical Logic Tensor Networks (PROTO-LTN) extend the current formulation by grounding abstract concepts as parametrized class prototypes in a high-dimensional embedding space, while reducing the number of parameters required to ground the knowledge base. We show how this architecture can be effectively trained in the few and zero-shot learning scenarios. Experiments on Generalized Zero Shot Learning benchmarks validate the proposed implementation as a competitive alternative to traditional embedding-based approaches. The proposed formulation opens up new opportunities in zero shot learning settings, as the LTN formalism allows to integrate background knowledge in the form of logical axioms to compensate for the lack of labelled examples.  ( 3 min )
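The idea of a single `isOfClass` predicate whose truth value depends on the distance to a class prototype can be illustrated with a short sketch. The abstract does not give the functional form, so the exponential of the negative squared Euclidean distance and the `alpha` scale used here are assumptions for illustration only.

```python
import math


def is_of_class(embedding, prototype, alpha=1.0):
    """Fuzzy truth value in (0, 1] that decays with distance to the prototype.

    Assumed form: exp(-alpha * ||embedding - prototype||^2).
    """
    sq_dist = sum((e - p) ** 2 for e, p in zip(embedding, prototype))
    return math.exp(-alpha * sq_dist)


# Hypothetical 2-D class prototypes; in the paper these are learned parameters.
prototypes = {"dog": [1.0, 0.0], "cat": [0.0, 1.0]}
obj = [0.9, 0.2]
for name, proto in prototypes.items():
    print(name, round(is_of_class(obj, proto), 3))
```

One shared predicate then needs only one prototype vector per class, rather than a separate set of predicate weights for each class.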
    Distributed Influence-Augmented Local Simulators for Parallel MARL in Large Networked Systems. (arXiv:2207.00288v1 [cs.LG])
    Due to its high sample complexity, simulation is, as of today, critical for the successful application of reinforcement learning. Many real-world problems, however, exhibit overly complex dynamics, which makes their full-scale simulation computationally slow. In this paper, we show how to decompose large networked systems of many agents into multiple local components such that we can build separate simulators that run independently and in parallel. To monitor the influence that the different local components exert on one another, each of these simulators is equipped with a learned model that is periodically trained on real trajectories. Our empirical results reveal that distributing the simulation among different processes not only makes it possible to train large multi-agent systems in just a few hours but also helps mitigate the negative effects of simultaneous learning.  ( 2 min )
    A Deep-Learning-Aided Pipeline for Efficient Post-Silicon Tuning. (arXiv:2207.00336v1 [cs.LG])
    In post-silicon validation, tuning is to find the values for the tuning knobs, potentially as a function of process parameters and/or known operating conditions. In this sense, more efficient tuning requires identifying the most critical tuning knobs and process parameters in terms of a given figure-of-merit for a Device Under Test (DUT). This is often manually conducted by experienced experts. However, with increasingly complex chips, manual inspection of a large number of raw variables has become more challenging. In this work, we leverage neural networks to efficiently select the most relevant variables and present a corresponding deep-learning-aided pipeline for efficient tuning.  ( 2 min )
    Improving Disease Classification Performance and Explainability of Deep Learning Models in Radiology with Heatmap Generators. (arXiv:2207.00157v1 [eess.IV])
    As deep learning is widely used in the radiology field, the explainability of such models is increasingly becoming essential to gain clinicians' trust when using the models for diagnosis. In this research, three experiment sets were conducted with a U-Net architecture to improve the classification performance while enhancing the heatmaps corresponding to the model's focus through incorporating heatmap generators during training. All of the experiments used the dataset that contained chest radiographs, associated labels from one of the three conditions ("normal", "congestive heart failure (CHF)", and "pneumonia"), and numerical information regarding a radiologist's eye-gaze coordinates on the images. The paper (A. Karargyris and Moradi, 2021) that introduced this dataset developed a U-Net model, which was treated as the baseline model for this research, to show how the eye-gaze data can be used in multi-modal training for explainability improvement. To compare the classification performances, the 95% confidence intervals (CI) of the area under the receiver operating characteristic curve (AUC) were measured. The best method achieved an AUC of 0.913 (CI: 0.860-0.966). The greatest improvements were for the "pneumonia" and "CHF" classes, which the baseline model struggled most to classify, resulting in AUCs of 0.859 (CI: 0.732-0.957) and 0.962 (CI: 0.933-0.989), respectively. The proposed method's decoder was also able to produce probability masks that highlight the determining image parts in model classifications, similar to the radiologist's eye-gaze data. Hence, this work showed that incorporating heatmap generators and eye-gaze information into training can simultaneously improve disease classification and provide explainable visuals that align well with how the radiologist viewed the chest radiographs when making a diagnosis.  ( 3 min )
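As a reminder of what an "AUC with a 95% CI" involves, here is a minimal standard-library sketch. The abstract does not say how the intervals were computed; the percentile bootstrap used below is an assumption, and the function names and toy data are illustrative only.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the AUC (assumed method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # this resample lacks one of the two classes; draw again
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lo, hi


labels = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5]
print(auc(labels, scores), bootstrap_auc_ci(labels, scores))
```

With small test sets the interval is wide, which is consistent with the fairly broad ranges quoted in the abstract.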
    Agent with Tangent-based Formulation and Anatomical Perception for Standard Plane Localization in 3D Ultrasound. (arXiv:2207.00475v1 [cs.CV])
    Standard plane (SP) localization is essential in routine clinical ultrasound (US) diagnosis. Compared to 2D US, 3D US can acquire multiple view planes in one scan and provide complete anatomy with the addition of coronal plane. However, manually navigating SPs in 3D US is laborious and biased due to the orientation variability and huge search space. In this study, we introduce a novel reinforcement learning (RL) framework for automatic SP localization in 3D US. Our contribution is three-fold. First, we formulate SP localization in 3D US as a tangent-point-based problem in RL to restructure the action space and significantly reduce the search space. Second, we design an auxiliary task learning strategy to enhance the model's ability to recognize subtle differences crossing Non-SPs and SPs in plane search. Finally, we propose a spatial-anatomical reward to effectively guide learning trajectories by exploiting spatial and anatomical information simultaneously. We explore the efficacy of our approach on localizing four SPs on uterus and fetal brain datasets. The experiments indicate that our approach achieves a high localization accuracy as well as robust performance.  ( 3 min )
    Using Machine Learning to Anticipate Tipping Points and Extrapolate to Post-Tipping Dynamics of Non-Stationary Dynamical Systems. (arXiv:2207.00521v1 [cs.LG])
    In this paper we consider the machine learning (ML) task of predicting tipping point transitions and long-term post-tipping-point behavior associated with the time evolution of an unknown (or partially unknown), non-stationary, potentially noisy and chaotic, dynamical system. We focus on the particularly challenging situation where the past dynamical state time series that is available for ML training predominantly lies in a restricted region of the state space, while the behavior to be predicted evolves on a larger state space set not fully observed by the ML model during training. In this situation, it is required that the ML prediction system have the ability to extrapolate to different dynamics past that which is observed during training. We investigate the extent to which ML methods are capable of accomplishing useful results for this task, as well as conditions under which they fail. In general, we found that the ML methods were surprisingly effective even in situations that were extremely challenging, but do (as one would expect) fail when ``too much" extrapolation is required. For the latter case, we investigate the effectiveness of combining the ML approach with conventional modeling based on scientific knowledge, thus forming a hybrid prediction system which we find can enable useful prediction even when its ML-based and knowledge-based components fail when acting alone. We also found that achieving useful results may require using very carefully selected ML hyperparameters and we propose a hyperparameter optimization strategy to address this problem. The main conclusion of this paper is that ML-based approaches are promising tools for predicting the behavior of non-stationary dynamical systems even in the case where the future evolution (perhaps due to the crossing of a tipping point) includes dynamics on a set outside of that explored by the training data.  ( 3 min )
    Continual Learning for Human State Monitoring. (arXiv:2207.00010v1 [cs.LG])
    Continual Learning (CL) on time series data represents a promising but under-studied avenue for real-world applications. We propose two new CL benchmarks for Human State Monitoring. We carefully designed the benchmarks to mirror real-world environments in which new subjects are continuously added. We conducted an empirical evaluation to assess the ability of popular CL strategies to mitigate forgetting in our benchmarks. Our results show that, possibly due to the domain-incremental properties of our benchmarks, forgetting can be easily tackled even with simple finetuning, and that existing strategies struggle to accumulate knowledge over a fixed, held-out test subject.  ( 2 min )
    e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce. (arXiv:2207.00208v1 [cs.LG])
    Understanding vision and language representations of product content is vital for search and recommendation applications in e-commerce. As a backbone for online shopping platforms and inspired by the recent success in representation learning research, we propose a contrastive learning framework that aligns language and visual models using unlabeled raw product text and images. We present techniques we used to train large-scale representation learning models and share solutions that address domain-specific challenges. We study the performance of our pre-trained model as a backbone for diverse downstream tasks, including category classification, attribute extraction, product matching, product clustering, and adult product recognition. Experimental results show that our proposed method outperforms the baseline in each downstream task regarding both single modality and multiple modalities.  ( 2 min )
    Online Reflective Learning for Robust Medical Image Segmentation. (arXiv:2207.00476v1 [cs.CV])
    Deep segmentation models often risk failure when the test image comes from an unseen distribution. Improving model robustness against these risks is crucial for the large-scale clinical application of deep models. In this study, inspired by the human learning cycle, we propose a novel online reflective learning framework (RefSeg) to improve segmentation robustness. Based on the concept of reflection-on-action, RefSeg first drives the deep model to take action to obtain a semantic segmentation. Then, RefSeg triggers the model to reflect on itself. Because making deep models realize their segmentation failures during testing is challenging, RefSeg synthesizes a realistic proxy image from the semantic mask to help deep models build intuitive and effective reflections. This proxy translates and emphasizes the segmentation flaws. By maximizing the structural similarity between the raw input and the proxy, the reflection-on-action loop is closed with segmentation robustness improved. RefSeg runs in the testing phase and is general for segmentation models. Extensive validation on three medical image segmentation tasks with a public cardiac MR dataset and two in-house large ultrasound datasets shows that RefSeg remarkably improves model robustness and reports state-of-the-art performance over strong competitors.  ( 2 min )
    Video + CLIP Baseline for Ego4D Long-term Action Anticipation. (arXiv:2207.00579v1 [cs.CV])
    In this report, we introduce our adaptation of image-text models for long-term action anticipation. Our Video + CLIP framework makes use of a large-scale pre-trained paired image-text model, CLIP, and a video encoder, the SlowFast network. The CLIP embedding provides fine-grained understanding of objects relevant to an action, whereas the SlowFast network is responsible for modeling temporal information within a video clip of a few frames. We show that the features obtained from both encoders are complementary to each other, thus outperforming the baseline on Ego4D for the task of long-term action anticipation. Our code is available at github.com/srijandas07/clip_baseline_LTA_Ego4d.  ( 2 min )
    Autonomous Intraluminal Navigation of a Soft Robot using Deep-Learning-based Visual Servoing. (arXiv:2207.00401v1 [cs.RO])
    Navigation inside luminal organs is an arduous task that requires non-intuitive coordination between the movement of the operator's hand and the information obtained from the endoscopic video. The development of tools to automate certain tasks could alleviate the physical and mental load of doctors during interventions, allowing them to focus on diagnosis and decision-making tasks. In this paper, we present a synergic solution for intraluminal navigation consisting of a 3D printed endoscopic soft robot that can move safely inside luminal structures. Visual servoing, based on Convolutional Neural Networks (CNNs) is used to achieve the autonomous navigation task. The CNN is trained with phantoms and in-vivo data to segment the lumen, and a model-less approach is presented to control the movement in constrained environments. The proposed robot is validated in anatomical phantoms in different path configurations. We analyze the movement of the robot using different metrics such as task completion time, smoothness, error in the steady-state, and mean and maximum error. We show that our method is suitable to navigate safely in hollow environments and conditions which are different than the ones the network was originally trained on.  ( 3 min )
    Automatic Evaluation of Speaker Similarity. (arXiv:2207.00344v1 [cs.SD])
    We introduce a new automatic evaluation method for speaker similarity assessment that is consistent with human perceptual scores. Modern neural text-to-speech models require a vast amount of clean training data, which is why many solutions switch from single speaker models to solutions trained on examples from many different speakers. Multi-speaker models bring new possibilities, such as a faster creation of new voices, but also a new problem - speaker leakage, where the speaker identity of a synthesized example might not match that of the target speaker. Currently, the only way to discover this issue is through costly perceptual evaluations. In this work, we propose an automatic method for assessment of speaker similarity. For that purpose, we extend the recent work on speaker verification systems and evaluate how different metrics and speaker embedding models reflect Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) scores. Our experiments show that we can train a model to predict speaker similarity MUSHRA scores from speaker embeddings with 0.96 accuracy and significant correlation up to 0.78 Pearson score at the utterance level.  ( 2 min )
    A Multi-stage Framework with Mean Subspace Computation and Recursive Feedback for Online Unsupervised Domain Adaptation. (arXiv:2207.00003v1 [cs.LG])
    In this paper, we address the Online Unsupervised Domain Adaptation (OUDA) problem and propose a novel multi-stage framework to solve real-world situations when the target data are unlabeled and arriving online sequentially in batches. To project the data from the source and the target domains to a common subspace and manipulate the projected data in real-time, our proposed framework introduces a novel method, the Incremental Computation of Mean-Subspace (ICMS) technique, which computes an approximation of the mean-target subspace on a Grassmann manifold and is proven to be a close approximation to the Karcher mean. Furthermore, the transformation matrix computed from the mean-target subspace is applied to the next target data in the recursive-feedback stage, aligning the target data closer to the source domain. The computation of the transformation matrix and the prediction of the next-target subspace leverage the performance of the recursive-feedback stage by considering the cumulative temporal dependency among the flow of the target subspace on the Grassmann manifold. The labels of the transformed target data are predicted by the pre-trained source classifier, then the classifier is updated by the transformed data and predicted labels. Extensive experiments on six datasets were conducted to investigate in depth the effect and contribution of each stage in our proposed framework and its performance over previous approaches in terms of classification accuracy and computational speed. In addition, the experiments on traditional manifold-based learning models and neural-network-based learning models demonstrated the applicability of our proposed framework for various types of learning models.  ( 3 min )
    When Does Differentially Private Learning Not Suffer in High Dimensions?. (arXiv:2207.00160v1 [cs.LG])
    Large pretrained models can be privately fine-tuned to achieve performance approaching that of non-private models. A common theme in these results is the surprising observation that high-dimensional models can achieve favorable privacy-utility trade-offs. This seemingly contradicts known results on the model-size dependence of differentially private convex learning and raises the following research question: When does the performance of differentially private learning not degrade with increasing model size? We identify that the magnitude of gradients projected onto subspaces is a key factor that determines performance. To precisely characterize this for private convex learning, we introduce a condition on the objective that we term restricted Lipschitz continuity and derive improved bounds for the excess empirical and population risks that are dimension-independent under additional conditions. We empirically show that in private fine-tuning of large language models, gradients evaluated near a local optimum are mostly controlled by a few principal components. This behavior is similar to conditions under which we obtain dimension-independent bounds in convex settings. Our theoretical and empirical results together provide a possible explanation for recent successes in large-scale private fine-tuning.  ( 2 min )
    Studying the impact of magnitude pruning on contrastive learning methods. (arXiv:2207.00200v1 [cs.LG])
    We study the impact of different pruning techniques on the representation learned by deep neural networks trained with contrastive loss functions. Our work finds that at high sparsity levels, contrastive learning results in a higher number of misclassified examples relative to models trained with traditional cross-entropy loss. To understand this pronounced difference, we use metrics such as the number of PIEs (Hooker et al., 2019), Q-Score (Kalibhat et al., 2022), and PD-Score (Baldock et al., 2021) to measure the impact of pruning on the learned representation quality. Our analysis suggests the schedule of the pruning method implementation matters. We find that the negative impact of sparsity on the quality of the learned representation is the highest when pruning is introduced early on in the training phase.  ( 2 min )
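    For readers unfamiliar with the term, the following is a minimal sketch of one-shot magnitude pruning on a single weight array. It is not taken from the paper: the function name `magnitude_prune`, the unstructured per-array criterion, and the rounding of the pruned count are all illustrative assumptions, and the pruning schedules the abstract refers to are not modeled here.

```python
import numpy as np


def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude entries of `weights` (hypothetical helper).

    `sparsity` is the fraction of entries to remove, in [0, 1).
    Returns the pruned copy and the boolean keep-mask.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    k = int(round(sparsity * weights.size))  # number of entries to prune
    mask = np.ones(weights.size, dtype=bool)
    if k > 0:
        magnitudes = np.abs(weights).ravel()
        # argpartition places the k smallest magnitudes in the first k slots
        prune_idx = np.argpartition(magnitudes, k - 1)[:k]
        mask[prune_idx] = False
    mask = mask.reshape(weights.shape)
    return weights * mask, mask
```

    In an iterative schedule, a function like this would typically be called repeatedly during training with a gradually increasing `sparsity`; when that schedule starts is the variable the abstract reports as mattering.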
    DP$^2$-NILM: A Distributed and Privacy-preserving Framework for Non-intrusive Load Monitoring. (arXiv:2207.00041v1 [cs.LG])
    Non-intrusive load monitoring (NILM), which usually utilizes machine learning methods and is effective in disaggregating smart meter readings from the household-level into appliance-level consumption, can help analyze electricity consumption behaviours of users and enable practical smart energy and smart grid applications. Recent studies have proposed many novel NILM frameworks based on federated deep learning (FL). However, comprehensive research exploring utility optimization schemes and privacy-preserving schemes in different FL-based NILM application scenarios is lacking. In this paper, we make the first attempt to conduct FL-based NILM focusing on both utility optimization and privacy preservation by developing a distributed and privacy-preserving NILM (DP2-NILM) framework and carrying out comparative experiments on practical NILM scenarios based on real-world smart meter datasets. Specifically, two alternative federated learning strategies are examined in the utility optimization schemes, i.e., FedAvg and FedProx. Moreover, different levels of privacy guarantees, i.e., local differential privacy federated learning and global differential privacy federated learning, are provided in DP2-NILM. Extensive comparison experiments are conducted on three real-world datasets to evaluate the proposed framework.  ( 2 min )
    Analysis of Kinetic Models for Label Switching and Stochastic Gradient Descent. (arXiv:2207.00389v1 [math.AP])
    In this paper we provide a novel approach to the analysis of kinetic models for label switching, which are used for particle systems that can randomly switch between gradient flows in different energy landscapes. Besides problems in biology and physics, we also demonstrate that stochastic gradient descent, the most popular technique in machine learning, can be understood in this setting, when considering a time-continuous variant. Our analysis focuses on the case of evolution in a collection of external potentials, for which we provide analytical and numerical results about the evolution as well as the stationary problem.  ( 2 min )
    More is Better (Mostly): On the Backdoor Attacks in Federated Graph Neural Networks. (arXiv:2202.03195v3 [cs.CR] UPDATED)
    Graph Neural Networks (GNNs) are a class of deep learning-based methods for processing graph domain information. GNNs have recently become a widely used graph analysis method due to their superior ability to learn representations for complex graph data. However, due to privacy concerns and regulation restrictions, centralized GNNs can be difficult to apply to data-sensitive scenarios. Federated learning (FL) is an emerging technology developed for privacy-preserving settings when several parties need to train a shared global model collaboratively. Although several research works have applied FL to train GNNs (Federated GNNs), there is no research on their robustness to backdoor attacks. This paper bridges this gap by conducting two types of backdoor attacks in Federated GNNs: centralized backdoor attacks (CBA) and distributed backdoor attacks (DBA). Our experiments show that the DBA attack success rate is higher than CBA in almost all evaluated cases. For CBA, the attack success rate of all local triggers is similar to the global trigger even if the training set of the adversarial party is embedded with the global trigger. To further explore the properties of two backdoor attacks in Federated GNNs, we evaluate the attack performance for a different number of clients, trigger sizes, poisoning intensities, and trigger densities. Moreover, we explore the robustness of DBA and CBA against two state-of-the-art defenses. We find that both attacks are robust against the investigated defenses, so backdoor attacks in Federated GNNs must be considered a novel threat that requires custom defenses.  ( 3 min )
    Lifelong Inverse Reinforcement Learning. (arXiv:2207.00461v1 [cs.LG])
    Methods for learning from demonstration (LfD) have shown success in acquiring behavior policies by imitating a user. However, even for a single task, LfD may require numerous demonstrations. For versatile agents that must learn many tasks via demonstration, this process would substantially burden the user if each task were learned in isolation. To address this challenge, we introduce the novel problem of lifelong learning from demonstration, which allows the agent to continually build upon knowledge learned from previously demonstrated tasks to accelerate the learning of new tasks, reducing the number of demonstrations required. As one solution to this problem, we propose the first lifelong learning approach to inverse reinforcement learning, which learns consecutive tasks via demonstration, continually transferring knowledge between tasks to improve performance.  ( 2 min )
    Watermarking Graph Neural Networks based on Backdoor Attacks. (arXiv:2110.11024v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have achieved promising performance in various real-world applications. Building a powerful GNN model is not a trivial task, as it requires a large amount of training data, powerful computing resources, and human expertise in fine-tuning the model. What is more, with the development of adversarial attacks, e.g., model stealing attacks, GNNs raise challenges to model authentication. To avoid copyright infringement on GNNs, it is necessary to verify the ownership of the GNN models. In this paper, we present a watermarking framework for GNNs for both graph and node classification tasks. We 1) design two strategies to generate watermarked data for the graph classification task and one for the node classification task, 2) embed the watermark into the host model through training to obtain the watermarked GNN model, and 3) verify the ownership of the suspicious model in a black-box setting. The experiments show that our framework can verify the ownership of GNN models with a very high probability (around 95%) for both tasks. Finally, we experimentally show that our watermarking approach is robust against two model modifications and an input reformation defense against backdoor attacks.  ( 3 min )
    Characterizing the Effect of Class Imbalance on the Learning Dynamics. (arXiv:2207.00391v1 [stat.ML])
    Data imbalance is a common problem in the machine learning literature that can have a critical effect on the performance of a model. Various solutions exist - such as the ones that focus on resampling or data generation - but their impact on the convergence of gradient-based optimizers used in deep learning is not understood. We here elucidate the significant negative impact of data imbalance on learning, showing that the learning curves for minority and majority classes follow sub-optimal trajectories when training with a gradient-based optimizer. The reason is not only that the gradient signal neglects the minority classes, but also that the minority classes are subject to a larger directional noise, which slows their learning by an amount related to the imbalance ratio. To address this problem, we propose a new algorithmic solution, for which we provide a detailed analysis of its convergence behavior. We show both theoretically and empirically that this new algorithm exhibits a better behavior with more stable learning curves for each class, as well as a better generalization performance.  ( 2 min )
    A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks. (arXiv:2111.04949v2 [cs.LG] UPDATED)
    The field of deep learning has witnessed a remarkable shift towards extremely compute- and memory-intensive neural networks. These newer larger models have enabled researchers to advance state-of-the-art tools across a variety of fields. This phenomenon has spurred the development of algorithms for distributed training of neural networks over a larger number of hardware accelerators. In this paper, we discuss and compare current state-of-the-art frameworks for large scale distributed deep learning. First, we survey current practices in distributed learning and identify the different types of parallelism used. Then, we present empirical results comparing their performance on large image and language training tasks. Additionally, we address their statistical efficiency and memory consumption behavior. Based on our results, we discuss algorithmic and implementation portions of each framework which hinder performance.  ( 2 min )
    A Random Persistence Diagram Generator. (arXiv:2104.07737v3 [stat.ML] UPDATED)
    Topological data analysis (TDA) studies the shape patterns of data. Persistent homology is a widely used method in TDA that summarizes homological features of data at multiple scales and stores them in persistence diagrams (PDs). In this paper, we propose a random persistence diagram generator (RPDG) method that generates a sequence of random PDs from the ones produced by the data. RPDG is underpinned by a model based on pairwise interacting point processes, and a reversible jump Markov chain Monte Carlo (RJ-MCMC) algorithm. A first example, which is based on a synthetic dataset, demonstrates the efficacy of RPDG and provides a comparison with another method for sampling PDs. A second example demonstrates the utility of RPDG to solve a materials science problem given a real dataset of small sample size.  ( 2 min )
    LBDMIDS: LSTM Based Deep Learning Model for Intrusion Detection Systems for IoT Networks. (arXiv:2207.00424v1 [cs.CR])
    In recent years, we have witnessed huge growth in the number of Internet of Things (IoT) and edge devices being used in our everyday activities. This demands that the security of these devices against cyber attacks be improved to protect their users. For years, Machine Learning (ML) techniques have been used to develop Network Intrusion Detection Systems (NIDS) with the aim of increasing their reliability and robustness. Among the earlier ML techniques, DT performed well. In recent years, Deep Learning (DL) techniques have been used in an attempt to build more reliable systems. In this paper, a Deep Learning enabled Long Short Term Memory (LSTM) Autoencoder and a 13-feature Deep Neural Network (DNN) model were developed, which performed a lot better in terms of accuracy on the UNSW-NB15 and BoT-IoT datasets. Hence we proposed LBDMIDS, where we developed NIDS models based on variants of LSTMs, namely stacked LSTM and bidirectional LSTM, and validated their performance on the UNSW-NB15 and BoT-IoT datasets. This paper concludes that these variants in LBDMIDS outperform classic ML techniques and perform similarly to the DNN models that have been suggested in the past.  ( 2 min )
    On Leave-One-Out Conditional Mutual Information For Generalization. (arXiv:2207.00581v1 [cs.LG])
    We derive information-theoretic generalization bounds for supervised learning algorithms based on a new measure of leave-one-out conditional mutual information (loo-CMI). Contrary to other CMI bounds, which are black-box bounds that do not exploit the structure of the problem and may be hard to evaluate in practice, our loo-CMI bounds can be computed easily and can be interpreted in connection to other notions such as classical leave-one-out cross-validation, stability of the optimization algorithm, and the geometry of the loss-landscape. They apply both to the output of training algorithms and to their predictions. We empirically validate the quality of the bound by evaluating its predicted generalization gap in scenarios for deep learning. In particular, our bounds are non-vacuous on large-scale image-classification tasks.  ( 2 min )
    Class-wise Thresholding for Robust Out-of-Distribution Detection. (arXiv:2110.15292v3 [cs.LG] UPDATED)
    We consider the problem of detecting out-of-distribution (OoD) input data when using deep neural networks, and we propose a simple yet effective way to improve the robustness of several popular OoD detection methods against label shift. Our work is motivated by the observation that most existing OoD detection algorithms consider all training/test data as a whole, regardless of which class entry each input activates (inter-class differences). Through extensive experimentation, we have found that such practice leads to a detector whose performance is sensitive and vulnerable to label shift. To address this issue, we propose a class-wise thresholding scheme that can apply to most existing OoD detection algorithms and can maintain similar OoD detection performance even in the presence of label shift in the test distribution.  ( 2 min )
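    The class-wise thresholding idea can be illustrated with a rough sketch. The abstract does not say how the thresholds are chosen, so everything below is an assumption made for illustration: a higher detector score means "more in-distribution" (as with maximum softmax probability), each class gets its own threshold fitted on held-out in-distribution data at a target true-positive rate `tpr`, and the function names `fit_classwise_thresholds` and `is_ood` are hypothetical.

```python
import numpy as np


def fit_classwise_thresholds(scores, predicted, num_classes, tpr=0.95):
    """Fit one score threshold per predicted class (illustrative assumption).

    `scores` are detector scores on held-out in-distribution samples and
    `predicted` are the classifier's predicted class indices for them.
    About `tpr` of each class's samples score at or above its threshold.
    """
    # Fall back to a single global threshold for classes with no samples.
    global_thr = np.quantile(scores, 1.0 - tpr)
    thresholds = np.full(num_classes, global_thr, dtype=float)
    for c in range(num_classes):
        cls_scores = scores[predicted == c]
        if cls_scores.size > 0:
            thresholds[c] = np.quantile(cls_scores, 1.0 - tpr)
    return thresholds


def is_ood(scores, predicted, thresholds):
    """Flag a sample as OoD when its score is below its class's threshold."""
    return scores < thresholds[predicted]
```

    Under these assumptions, each class keeps roughly the same in-distribution acceptance rate regardless of how often that class appears at test time, which a single global threshold does not provide.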
    HardVis: Visual Analytics to Handle Instance Hardness Using Undersampling and Oversampling Techniques. (arXiv:2203.15753v2 [cs.LG] UPDATED)
    Despite the tremendous advances in machine learning (ML), training with imbalanced data still poses challenges in many real-world applications. Among a series of diverse techniques to solve this problem, sampling algorithms are regarded as an efficient solution. However, the problem is more fundamental, with many works emphasizing the importance of instance hardness. This issue refers to the significance of managing unsafe or potentially noisy instances that are more likely to be misclassified and serve as the root cause of poor classification performance. This paper introduces HardVis, a visual analytics system designed to handle instance hardness mainly in imbalanced classification scenarios. Our proposed system assists users in visually comparing different distributions of data types, selecting types of instances based on local characteristics that will later be affected by the active sampling method, and validating which suggestions from undersampling or oversampling techniques are beneficial for the ML model. Additionally, rather than uniformly undersampling/oversampling a specific class, we allow users to find and sample easy and difficult to classify training instances from all classes. Users can explore subsets of data from different perspectives to decide all those parameters, while HardVis keeps track of their steps and evaluates the model's predictive performance in a test set separately. The end result is a well-balanced data set that boosts the predictive power of the ML model. The efficacy and effectiveness of HardVis are demonstrated with a hypothetical usage scenario and a use case. Finally, we also look at how useful our system is based on feedback we received from ML experts.
    AdaSparse: Learning Adaptively Sparse Structures for Multi-Domain Click-Through Rate Prediction. (arXiv:2206.13108v2 [cs.IR] UPDATED)
    Click-through rate (CTR) prediction is a fundamental technique in recommendation and advertising systems. Recent studies have shown that learning a unified model to serve multiple domains effectively improves overall performance. However, it is still challenging to improve generalization across domains under limited training data, and hard to deploy current solutions due to their computational complexity. In this paper, we propose AdaSparse, a simple yet effective framework for multi-domain CTR prediction, which learns an adaptively sparse structure for each domain, achieving better generalization across domains with lower computational cost. In AdaSparse, we introduce domain-aware neuron-level weighting factors to measure the importance of neurons, with which our model can prune redundant neurons for each domain to improve generalization. We further add flexible sparsity regularizations to control the sparsity ratio of learned structures. Offline and online experiments show that AdaSparse outperforms previous multi-domain CTR models significantly.
    Expected Scalarised Returns Dominance: A New Solution Concept for Multi-Objective Decision Making. (arXiv:2106.01048v3 [cs.LG] UPDATED)
    In many real-world scenarios, the utility of a user is derived from the single execution of a policy. In this case, to apply multi-objective reinforcement learning, the expected utility of the returns must be optimised. Various scenarios exist where a user's preferences over objectives (also known as the utility function) are unknown or difficult to specify. In such scenarios, a set of optimal policies must be learned. However, settings where the expected utility must be maximised have been largely overlooked by the multi-objective reinforcement learning community and, as a consequence, a set of optimal solutions has yet to be defined. In this paper we address this challenge by proposing first-order stochastic dominance as a criterion to build solution sets to maximise expected utility. We also propose a new dominance criterion, known as expected scalarised returns (ESR) dominance, that extends first-order stochastic dominance to allow a set of optimal policies to be learned in practice. We then define a new solution concept called the ESR set, which is a set of policies that are ESR dominant. Finally, we define a new multi-objective distributional tabular reinforcement learning (MOT-DRL) algorithm to learn the ESR set in a multi-objective multi-armed bandit setting.
    From Kepler to Newton: Explainable AI for Science Discovery. (arXiv:2111.12210v5 [cs.AI] UPDATED)
    The Observation--Hypothesis--Prediction--Experimentation loop paradigm for scientific research has been practiced by researchers for years towards scientific discoveries. However, with data explosion in both mega-scale and milli-scale scientific research, it has sometimes been very difficult to manually analyze the data and propose new hypotheses to drive the cycle for scientific discovery. In this paper, we discuss the role of Explainable AI in the scientific discovery process by demonstrating an Explainable AI-based paradigm for science discovery. The key is to use Explainable AI to help derive data or model interpretations, hypotheses, as well as scientific discoveries or insights. We show how computational and data-intensive methodology -- together with experimental and theoretical methodology -- can be seamlessly integrated for scientific research. To demonstrate the AI-based science discovery process, and to pay our respect to some of the greatest minds in human history, whose work led the scientific revolution of the 16th and 17th centuries, we show how Kepler's laws of planetary motion and Newton's law of universal gravitation can be rediscovered by (Explainable) AI based on Tycho Brahe's astronomical observation data. This work also highlights the important role of Explainable AI (as compared to Blackbox AI) in science discovery to help humans prevent or better prepare for the possible technological singularity that may happen in the future, since science is not only about the know how, but also the know why.
    Prioritized training on points that are learnable, worth learning, and not yet learned (workshop version). (arXiv:2107.02565v3 [cs.LG] UPDATED)
    We introduce Goldilocks Selection, a technique for faster model training which selects a sequence of training points that are "just right". We propose an information-theoretic acquisition function -- the reducible validation loss -- and compute it with a small proxy model -- GoldiProx -- to efficiently choose training points that maximize information about a validation set. We show that the "hard" (e.g. high loss) points usually selected in the optimization literature are typically noisy, while the "easy" (e.g. low noise) samples often prioritized for curriculum learning confer less information. Further, points with uncertain labels, typically targeted by active learning, tend to be less relevant to the task. In contrast, Goldilocks Selection chooses points that are "just right" and empirically outperforms the above approaches. Moreover, the selected sequence can transfer to other architectures; practitioners can share and reuse it without the need to recreate it.
    Few-Shot Document-Level Relation Extraction. (arXiv:2205.02048v2 [cs.CL] UPDATED)
    We present FREDo, a few-shot document-level relation extraction (FSDLRE) benchmark. As opposed to existing benchmarks which are built on sentence-level relation extraction corpora, we argue that document-level corpora provide more realism, particularly regarding none-of-the-above (NOTA) distributions. Therefore, we propose a set of FSDLRE tasks and construct a benchmark based on two existing supervised learning data sets, DocRED and sciERC. We adapt the state-of-the-art sentence-level method MNAV to the document-level and develop it further for improved domain adaptation. We find FSDLRE to be a challenging setting with interesting new characteristics such as the ability to sample NOTA instances from the support set. The data, code, and trained models are available online (https://github.com/nicpopovic/FREDo).
    Learning Symmetric Embeddings for Equivariant World Models. (arXiv:2204.11371v2 [cs.LG] UPDATED)
    Incorporating symmetries can lead to highly data-efficient and generalizable models by defining equivalence classes of data samples related by transformations. However, characterizing how transformations act on input data is often difficult, limiting the applicability of equivariant models. We propose learning symmetric embedding networks (SENs) that encode an input space (e.g. images), where we do not know the effect of transformations (e.g. rotations), to a feature space that transforms in a known manner under these operations. This network can be trained end-to-end with an equivariant task network to learn an explicitly symmetric representation. We validate this approach in the context of equivariant transition models with 3 distinct forms of symmetry. Our experiments demonstrate that SENs facilitate the application of equivariant networks to data with complex symmetry representations. Moreover, doing so can yield improvements in accuracy and generalization relative to both fully-equivariant and non-equivariant baselines.
    Graph Neural Networks for Graph Drawing. (arXiv:2109.10061v3 [cs.LG] UPDATED)
    Graph Drawing techniques have been developed in the last few years with the purpose of producing aesthetically pleasing node-link layouts. Recently, the employment of differentiable loss functions has paved the road to the massive usage of Gradient Descent and related optimization algorithms. In this paper, we propose a novel framework for the development of Graph Neural Drawers (GND), machines that rely on neural computation for constructing efficient and complex maps. GNDs are Graph Neural Networks (GNNs) whose learning process can be driven by any provided loss function, such as the ones commonly employed in Graph Drawing. Moreover, we prove that this mechanism can be guided by loss functions computed by means of Feedforward Neural Networks, on the basis of supervision hints that express beauty properties, like the minimization of crossing edges. In this context, we show that GNNs can nicely be enriched by positional features to deal also with unlabelled vertexes. We provide a proof-of-concept by constructing a loss function for the edge-crossing and provide quantitative and qualitative comparisons among different GNN models working under the proposed framework.
    TGL: A General Framework for Temporal GNN Training on Billion-Scale Graphs. (arXiv:2203.14883v2 [cs.LG] UPDATED)
    Many real-world graphs contain time domain information. Temporal Graph Neural Networks capture temporal information as well as structural and contextual information in the generated dynamic node embeddings. Researchers have shown that these embeddings achieve state-of-the-art performance in many different tasks. In this work, we propose TGL, a unified framework for large-scale offline Temporal Graph Neural Network training where users can compose various Temporal Graph Neural Networks with simple configuration files. TGL comprises five main components: a temporal sampler, a mailbox, a node memory module, a memory updater, and a message passing engine. We design a Temporal-CSR data structure and a parallel sampler to efficiently sample temporal neighbors to form training mini-batches. We propose a novel random chunk scheduling technique that mitigates the problem of obsolete node memory when training with a large batch size. To address the limitations of current TGNNs only being evaluated on small-scale datasets, we introduce two large-scale real-world datasets with 0.2 and 1.3 billion temporal edges. We evaluate the performance of TGL on four small-scale datasets with a single GPU and the two large datasets with multiple GPUs for both link prediction and node classification tasks. We compare TGL with the open-sourced code of five methods and show that TGL achieves similar or better accuracy with an average of 13x speedup. Our temporal parallel sampler achieves an average of 173x speedup on a multi-core CPU compared with the baselines. On a 4-GPU machine, TGL can train one epoch of more than one billion temporal edges within 1-10 hours. To the best of our knowledge, this is the first work that proposes a general framework for large-scale Temporal Graph Neural Networks training on multiple GPUs.
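    The abstract does not describe the layout of the Temporal-CSR structure, so the following is a guess at the general idea rather than TGL's implementation: a CSR adjacency whose per-node edge lists are sorted by timestamp, so that "neighbors before time t" becomes a binary search. The class `TemporalCSR` and its method names are hypothetical.

```python
from bisect import bisect_left


class TemporalCSR:
    """CSR adjacency with each node's outgoing edges sorted by timestamp."""

    def __init__(self, num_nodes, edges):
        # edges: iterable of (src, dst, timestamp) tuples
        ordered = sorted(edges, key=lambda e: (e[0], e[2]))
        self.indptr = [0] * (num_nodes + 1)
        for src, _, _ in ordered:
            self.indptr[src + 1] += 1
        # Prefix sums turn per-node counts into slice boundaries.
        for v in range(num_nodes):
            self.indptr[v + 1] += self.indptr[v]
        self.indices = [dst for _, dst, _ in ordered]
        self.times = [t for _, _, t in ordered]

    def neighbors_before(self, node, t, limit=None):
        """Return (neighbor, timestamp) pairs with timestamp strictly before t.

        If limit is given, only the most recent `limit` edges are returned.
        """
        lo, hi = self.indptr[node], self.indptr[node + 1]
        end = bisect_left(self.times, t, lo, hi)
        start = lo if limit is None else max(lo, end - limit)
        return list(zip(self.indices[start:end], self.times[start:end]))


graph = TemporalCSR(4, [(0, 1, 5), (0, 2, 1), (1, 0, 3), (0, 3, 9)])
recent = graph.neighbors_before(0, 6, limit=1)
```

    Sorting by time inside each node's slice is what keeps the temporal lookup logarithmic in the node's degree; a parallel sampler would presumably issue many such lookups concurrently.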
    EvoVGM: a Deep Variational Generative Model for Evolutionary Parameter Estimation. (arXiv:2205.13034v2 [cs.LG] UPDATED)
    Most evolutionary-oriented deep generative models do not explicitly consider the underlying evolutionary dynamics of biological sequences as it is performed within the Bayesian phylogenetic inference framework. In this study, we propose a method for a deep variational Bayesian generative model (EvoVGM) that jointly approximates the true posterior of local evolutionary parameters and generates sequence alignments. Moreover, it is instantiated and tuned for continuous-time Markov chain substitution models such as JC69, K80 and GTR. We train the model via a low-variance stochastic estimator and a gradient ascent algorithm. Here, we analyze the consistency and effectiveness of EvoVGM on synthetic sequence alignments simulated with several evolutionary scenarios and different sizes. Finally, we highlight the robustness of a fine-tuned EvoVGM model using a sequence alignment of gene S of coronaviruses.
    Distributed saddle point problems for strongly concave-convex functions. (arXiv:2202.05812v2 [math.OC] UPDATED)
    In this paper, we propose GT-GDA, a distributed optimization method to solve saddle point problems of the form: $\min_{\mathbf{x}} \max_{\mathbf{y}} \{F(\mathbf{x},\mathbf{y}) :=G(\mathbf{x}) + \langle \mathbf{y}, \overline{P} \mathbf{x} \rangle - H(\mathbf{y})\}$, where the functions $G(\cdot)$, $H(\cdot)$, and the coupling matrix $\overline{P}$ are distributed over a strongly connected network of nodes. GT-GDA is a first-order method that uses gradient tracking to eliminate the dissimilarity caused by heterogeneous data distribution among the nodes. In the most general form, GT-GDA includes a consensus over the local coupling matrices to achieve the optimal (unique) saddle point, at the expense of increased communication. To avoid this, we propose a more efficient variant, GT-GDA-Lite, that does not incur the additional communication, and analyze its convergence in various scenarios. We show that GT-GDA converges linearly to the unique saddle point solution when $G(\cdot)$ is smooth and convex, $H(\cdot)$ is smooth and strongly convex, and the global coupling matrix $\overline{P}$ has full column rank. We further characterize the regime under which GT-GDA exhibits a network topology-independent convergence behavior. We next show the linear convergence of GT-GDA-Lite to an error around the unique saddle point, which goes to zero when the coupling cost ${\langle \mathbf y, \overline{P} \mathbf x \rangle}$ is common to all nodes, or when $G(\cdot)$ and $H(\cdot)$ are quadratic. Numerical experiments illustrate the convergence properties and importance of GT-GDA and GT-GDA-Lite for several applications.
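    To make the problem structure concrete, here is a minimal single-node sketch of plain gradient descent ascent on a scalar instance of $F(x,y) = G(x) + y\,p\,x - H(y)$. It is not GT-GDA: there is no network, no gradient tracking, and no consensus step. The choices $G(x)=0$, $H(y)=\tfrac12 (y-1)^2$, $p=2$, the step size, and the function name `gda` are all assumptions made for illustration.

```python
def gda(p=2.0, step=0.1, iters=2000):
    """Simultaneous gradient descent (in x) and ascent (in y).

    Toy objective (assumed): F(x, y) = p * x * y - 0.5 * (y - 1) ** 2,
    i.e. G(x) = 0 (convex), H(y) = 0.5 * (y - 1) ** 2 (strongly convex).
    """
    x, y = 1.0, 1.0
    for _ in range(iters):
        grad_x = p * y              # dF/dx
        grad_y = p * x - (y - 1.0)  # dF/dy
        # Descend in x, ascend in y, using the same iterate for both gradients.
        x, y = x - step * grad_x, y + step * grad_y
    return x, y


x_star, y_star = gda()
```

    Setting both gradients to zero gives the saddle point $y = 0$, $x = -1/p$, which the iterates approach linearly for this step size. The instance satisfies the conditions quoted in the abstract (convex $G$, strongly convex $H$, nonzero scalar coupling), but it says nothing about the distributed setting the paper analyzes.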
    Causal Reasoning Meets Visual Representation Learning: A Prospective Study. (arXiv:2204.12037v5 [cs.CV] UPDATED)
    Visual representation learning is ubiquitous in various real-world applications, including visual comprehension, video understanding, multi-modal analysis, human-computer interaction, and urban computing. Due to the emergence of huge amounts of multi-modal heterogeneous spatial/temporal/spatial-temporal data in the big data era, the lack of interpretability, robustness, and out-of-distribution generalization is becoming a challenge for existing visual models. The majority of existing methods tend to fit the original data/variable distributions and ignore the essential causal relations behind the multi-modal knowledge, so the field lacks unified guidance and analysis about why modern visual representation learning methods easily collapse into data bias and have limited generalization and cognitive abilities. Inspired by the strong inference ability of human-level agents, recent years have therefore witnessed great effort in developing causal reasoning paradigms to realize robust representation and model learning with good cognitive ability. In this paper, we conduct a comprehensive review of existing causal reasoning methods for visual representation learning, covering fundamental theories, models, and datasets. The limitations of current methods and datasets are also discussed. Moreover, we propose some prospective challenges, opportunities, and future research directions for benchmarking causal reasoning algorithms in visual representation learning. This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussions, and bring to the forefront the urgency of developing novel causal reasoning methods, publicly available benchmarks, and consensus-building standards for reliable visual representation learning and related real-world applications.
    ML4ML: Automated Invariance Testing for Machine Learning Models. (arXiv:2109.12926v2 [cs.LG] UPDATED)
    In machine learning (ML) workflows, determining the invariance qualities of an ML model is a common testing procedure. Traditionally, invariance qualities are evaluated using simple formula-based scores, e.g., accuracy. In this paper, we show that testing the invariance qualities of ML models may result in complex visual patterns that cannot be classified using simple formulas. In order to test ML models by analyzing such visual patterns automatically using other ML models, we propose a systematic framework that is applicable to a variety of invariance qualities. We demonstrate the effectiveness and feasibility of the framework by developing ML4ML models (assessors) for determining rotation-, brightness-, and size-variances of a collection of neural networks. Our testing results show that the trained ML4ML assessors can perform such analytical tasks with sufficient accuracy.
    Enhancing Computational Fluid Dynamics with Machine Learning. (arXiv:2110.02085v2 [physics.flu-dyn] UPDATED)
    Machine learning is rapidly becoming a core technology for scientific computing, with numerous opportunities to advance the field of computational fluid dynamics. In this Perspective, we highlight some of the areas of highest potential impact, including to accelerate direct numerical simulations, to improve turbulence closure modeling, and to develop enhanced reduced-order models. We also discuss emerging areas of machine learning that are promising for computational fluid dynamics, as well as some potential limitations that should be taken into account.
    Topology-Aware Network Pruning using Multi-stage Graph Embedding and Reinforcement Learning. (arXiv:2102.03214v2 [cs.CV] UPDATED)
    Model compression is an essential technique for deploying deep neural networks (DNNs) on power and memory-constrained resources. However, existing model-compression methods often rely on human expertise and focus on parameters' local importance, ignoring the rich topology information within DNNs. In this paper, we propose a novel multi-stage graph embedding technique based on graph neural networks (GNNs) to identify DNN topologies and use reinforcement learning (RL) to find a suitable compression policy. We performed resource-constrained (i.e., FLOPs) channel pruning and compared our approach with state-of-the-art model compression methods. We evaluated our method on various models from typical to mobile-friendly networks, such as ResNet family, VGG-16, MobileNet-v1/v2, and ShuffleNet. Results show that our method can achieve higher compression ratios with a minimal fine-tuning cost yet yields outstanding and competitive performance.
    Scalable MCMC Sampling for Nonsymmetric Determinantal Point Processes. (arXiv:2207.00486v1 [cs.LG])
    A determinantal point process (DPP) is an elegant model that assigns a probability to every subset of a collection of $n$ items. While conventionally a DPP is parameterized by a symmetric kernel matrix, removing this symmetry constraint, resulting in nonsymmetric DPPs (NDPPs), leads to significant improvements in modeling power and predictive performance. Recent work has studied an approximate Markov chain Monte Carlo (MCMC) sampling algorithm for NDPPs restricted to size-$k$ subsets (called $k$-NDPPs). However, the runtime of this approach is quadratic in $n$, making it infeasible for large-scale settings. In this work, we develop a scalable MCMC sampling algorithm for $k$-NDPPs with low-rank kernels, thus enabling runtime that is sublinear in $n$. Our method is based on a state-of-the-art NDPP rejection sampling algorithm, which we enhance with a novel approach for efficiently constructing the proposal distribution. Furthermore, we extend our scalable $k$-NDPP sampling algorithm to NDPPs without size constraints. Our resulting sampling method has polynomial time complexity in the rank of the kernel, while the existing approach has runtime that is exponential in the rank. With both a theoretical analysis and experiments on real-world datasets, we verify that our scalable approximate sampling algorithms are orders of magnitude faster than existing sampling approaches for $k$-NDPPs and NDPPs.
    Enhancing cluster analysis via topological manifold learning. (arXiv:2207.00510v1 [cs.LG])
    We discuss topological aspects of cluster analysis and show that inferring the topological structure of a dataset before clustering it can considerably enhance cluster detection: theoretical arguments and empirical evidence show that clustering embedding vectors, representing the structure of a data manifold instead of the observed feature vectors themselves, is highly beneficial. To demonstrate, we combine manifold learning method UMAP for inferring the topological structure with density-based clustering method DBSCAN. Synthetic and real data results show that this both simplifies and improves clustering in a diverse set of low- and high-dimensional problems including clusters of varying density and/or entangled shapes. Our approach simplifies clustering because topological pre-processing consistently reduces parameter sensitivity of DBSCAN. Clustering the resulting embeddings with DBSCAN can then even outperform complex methods such as SPECTACL and ClusterGAN. Finally, our investigation suggests that the crucial issue in clustering does not appear to be the nominal dimension of the data or how many irrelevant features it contains, but rather how \textit{separable} the clusters are in the ambient observation space they are embedded in, which is usually the (high-dimensional) Euclidean space defined by the features of the data. Our approach is successful because we perform the cluster analysis after projecting the data into a more suitable space that is optimized for separability, in some sense.
    Transfer learning of phase transitions in percolation and directed percolation. (arXiv:2112.15516v5 [cond-mat.stat-mech] UPDATED)
    The latest advances in statistical physics have shown the remarkable performance of machine learning in identifying phase transitions. In this paper, we apply a domain adversarial neural network (DANN) based on transfer learning to studying equilibrium and non-equilibrium phase transition models, namely the percolation model and the directed percolation (DP) model, respectively. With the DANN, only a small fraction of input configurations (2d images), which is automatically chosen, needs to be labeled in order to capture the critical point. To learn the DP model, the method is refined by an iterative procedure in determining the critical point, which is a prerequisite for the data collapse in calculating the critical exponent $\nu_{\perp}$. We then apply the DANN to a two-dimensional site percolation with configurations filtered to include only the largest cluster, which may contain the information related to the order parameter. The DANN learning of both models yields reliable results which are comparable to the ones from Monte Carlo simulations. Our study also shows that the DANN can achieve quite high accuracy at much lower cost, compared to supervised learning.
    CRISP: A Probabilistic Model for Individual-Level COVID-19 Infection Risk Estimation Based on Contact Data. (arXiv:2006.04942v2 [cs.SI] UPDATED)
    We present CRISP (COVID-19 Risk Score Prediction), a probabilistic graphical model for COVID-19 infection spread through a population based on the SEIR model where we assume access to (1) mutual contacts between pairs of individuals across time across various channels (e.g., Bluetooth contact traces), as well as (2) test outcomes at given times for infection, exposure and immunity tests. Our micro-level model keeps track of the infection state for each individual at every point in time, ranging from susceptible, exposed, infectious to recovered. We develop both a Monte Carlo EM as well as a message passing algorithm to infer contact-channel specific infection transmission probabilities. Our Monte Carlo algorithm uses Gibbs sampling to draw samples of the latent infection status of each individual over the entire time period of analysis, given the latent infection status of all contacts and test outcome data. Experimental results with simulated data demonstrate our CRISP model can be parametrized by the reproduction factor $R_0$ and exhibits population-level infectiousness and recovery time series similar to those of the classical SEIR model. However, due to the individual contact data, this model allows fine grained control and inference for a wide range of COVID-19 mitigation and suppression policy measures. Moreover, the block-Gibbs sampling algorithm is able to support efficient testing in a test-trace-isolate approach to contain COVID-19 infection spread. To the best of our knowledge, this is the first model with efficient inference for COVID-19 infection spread based on individual-level contact data; most epidemic models are macro-level models that reason over entire populations. The implementation of CRISP is available in Python and C++ at https://github.com/zalandoresearch/CRISP.  ( 3 min )
    Stochastic Causal Programming for Bounding Treatment Effects. (arXiv:2202.10806v2 [stat.ML] UPDATED)
    Causal effect estimation is important for numerous tasks in the natural and social sciences. However, identifying effects is impossible from observational data without making strong, often untestable assumptions. We consider algorithms for the partial identification problem, bounding treatment effects from multivariate, continuous treatments over multiple possible causal models when unmeasured confounding makes identification impossible. We consider a framework where observable evidence is matched to the implications of constraints encoded in a causal model by norm-based criteria. This generalizes classical approaches based purely on generative models. Casting causal effects as objective functions in a constrained optimization problem, we combine flexible learning algorithms with Monte Carlo methods to implement a family of solutions under the name of stochastic causal programming. In particular, we present ways by which such constrained optimization problems can be parameterized without likelihood functions for the causal or the observed data model, reducing the computational and statistical complexity of the task.
    On Optimal Control and Expectation-Maximisation: Theory and an Outlook Towards Algorithms. (arXiv:2205.03279v2 [cs.LG] UPDATED)
    In this work we demonstrate how both the Stochastic and Risk Sensitive Optimal Control problem can be treated by means of the Expectation-Maximisation algorithm. We show how such a treatment materialises into two separate iterative programs that each generate a unique but closely related sequence of density functions. We motivate interpreting these density functions as beliefs, ergo as probabilistic proxies for the deterministic optimal policy. More formally, two fixed point iteration schemes are derived with the stationary point coinciding with the deterministic optimal policies on behalf of the proven convergence of Expectation-Maximisation methods. We are inclined to point out that our results are intimately related with the paradigm of Control as Inference. Control as Inference here refers to a collection of approaches whose aim is also to recast optimal control as an instance of probabilistic inference. Although said paradigm already resulted in the development of several powerful Reinforcement Learning algorithms, the fundamental problem statement usually is introduced by teleological arguments. We argue that the present results demonstrate that earlier established Control as Inference frameworks in fact isolate a single step from either of the proposed iterative programs. In any case the present treatment provides them with a deontological argument of validity. By exposing the underlying technical mechanism we aim to contribute to the general acceptance of Control as Inference as a framework superseding the present Optimal Control paradigm. In order to motivate the general relevance of the presented treatment we further discuss parallels with Path Integral Control and other areas of research before sketching the outlines of future algorithmic development.
    auton-survival: an Open-Source Package for Regression, Counterfactual Estimation, Evaluation and Phenotyping with Censored Time-to-Event Data. (arXiv:2204.07276v3 [cs.LG] UPDATED)
    Applications of machine learning in healthcare often require working with time-to-event prediction tasks including prognostication of an adverse event, re-hospitalization or death. Such outcomes are typically subject to censoring due to loss of follow up. Standard machine learning methods cannot be applied in a straightforward manner to datasets with censored outcomes. In this paper, we present auton-survival, an open-source repository of tools to streamline working with censored time-to-event or survival data. auton-survival includes tools for survival regression, adjustment in the presence of domain shift, counterfactual estimation, phenotyping for risk stratification, evaluation, as well as estimation of treatment effects. Through real world case studies employing a large subset of the SEER oncology incidence data, we demonstrate the ability of auton-survival to rapidly support data scientists in answering complex health and epidemiological questions.
    Learning to correct spectral methods for simulating turbulent flows. (arXiv:2207.00556v1 [cs.LG])
    Despite their ubiquity throughout science and engineering, only a handful of partial differential equations (PDEs) have analytical, or closed-form solutions. This motivates a vast amount of classical work on numerical simulation of PDEs and more recently, a whirlwind of research into data-driven techniques leveraging machine learning (ML). A recent line of work indicates that a hybrid of classical numerical techniques with machine learning can offer significant improvements over either approach alone. In this work, we show that the choice of the numerical scheme is crucial when incorporating physics-based priors. We build upon Fourier-based spectral methods, which are considerably more efficient than other numerical schemes for simulating PDEs with smooth and periodic solutions. Specifically, we develop ML-augmented spectral solvers for three model PDEs of fluid dynamics, which improve upon the accuracy of standard spectral solvers at the same resolution. We also demonstrate a handful of key design principles for combining machine learning and numerical methods for solving PDEs.
    SAFER: Data-Efficient and Safe Reinforcement Learning via Skill Acquisition. (arXiv:2202.04849v2 [cs.LG] UPDATED)
    Methods that extract policy primitives from offline demonstrations using deep generative models have shown promise at accelerating reinforcement learning (RL) for new tasks. Intuitively, these methods should also help to train safe RL agents because they enforce useful skills. However, we identify that these techniques are not well equipped for safe policy learning because they ignore negative experiences (e.g., unsafe or unsuccessful), focusing only on positive experiences, which harms their ability to generalize to new tasks safely. Rather, we model the latent safety context using principled contrastive training on an offline dataset of demonstrations from many tasks, including both negative and positive experiences. Using this latent variable, our RL framework, SAFEty skill pRiors (SAFER), extracts task-specific safe primitive skills to safely and successfully generalize to new tasks. In the inference stage, policies trained with SAFER learn to compose safe skills into successful policies. We theoretically characterize why SAFER can enforce safe policy learning and demonstrate its effectiveness on several complex safety-critical robotic grasping tasks inspired by the game Operation, in which SAFER outperforms state-of-the-art primitive learning methods in success and safety.
    Shai-am: A Machine Learning Platform for Investment Strategies. (arXiv:2207.00436v1 [q-fin.GN])
    The finance industry has adopted machine learning (ML) as a form of quantitative research to support better investment decisions, yet there are several challenges often overlooked in practice. (1) ML code tends to be unstructured and ad hoc, which hinders cooperation with others. (2) Resource requirements and dependencies vary depending on which algorithm is used, so a flexible and scalable system is needed. (3) It is difficult for domain experts in traditional finance to apply their experience and knowledge in ML-based strategies unless they acquire expertise in recent technologies. This paper presents Shai-am, an ML platform integrated with our own Python framework. The platform leverages existing modern open-source technologies, managing containerized pipelines for ML-based strategies with unified interfaces to solve the aforementioned issues. Each strategy implements the interface defined in the core framework. The framework is designed to enhance reusability and readability, facilitating collaborative work in quantitative research. Shai-am aims to be a pure AI asset manager for solving various tasks in financial markets.  ( 2 min )
    Reinforcement Learning of Multi-Domain Dialog Policies Via Action Embeddings. (arXiv:2207.00468v1 [cs.CL])
    Learning task-oriented dialog policies via reinforcement learning typically requires large amounts of interaction with users, which in practice renders such methods unusable for real-world applications. In order to reduce the data requirements, we propose to leverage data from across different dialog domains, thereby reducing the amount of data required from each given domain. In particular, we propose to learn domain-agnostic action embeddings, which capture general-purpose structure that informs the system how to act given the current dialog context, and are then specialized to a specific domain. We show how this approach is capable of learning with significantly less interaction with users, with a reduction of 35% in the number of dialogs required to learn, and to a higher level of proficiency than training separate policies for each domain on a set of simulated domains.
    Generative Adversarial Networks and Image-Based Malware Classification. (arXiv:2207.00421v1 [cs.CR])
    For efficient malware removal, determination of malware threat levels, and damage estimation, malware family classification plays a critical role. In this paper, we extract features from malware executable files and represent them as images using various approaches. We then focus on Generative Adversarial Networks (GAN) for multiclass classification and compare our GAN results to other popular machine learning techniques, including Support Vector Machine (SVM), XGBoost, and Restricted Boltzmann Machines (RBM). We find that the AC-GAN discriminator is generally competitive with other machine learning techniques. We also evaluate the utility of the GAN generative model for adversarial attacks on image-based malware detection. While AC-GAN generated images are visually impressive, we find that they are easily distinguished from real malware images using any of several learning techniques. This result indicates that our GAN generated images would be of little value in adversarial attacks.  ( 2 min )
    The "AI+R"-tree: An Instance-optimized R-tree. (arXiv:2207.00550v1 [cs.DB])
    The emerging class of instance-optimized systems has shown potential to achieve high performance by specializing to specific data and query workloads. Particularly, Machine Learning (ML) techniques have been applied successfully to build various instance-optimized components (e.g., learned indexes). This paper investigates leveraging ML techniques to enhance the performance of spatial indexes, particularly the R-tree, for given data and query workloads. As the areas covered by the R-tree index nodes overlap in space, upon searching for a specific point in space, multiple paths from root to leaf may potentially be explored. In the worst case, the entire R-tree could be searched. In this paper, we define and use the overlap ratio to quantify the degree of extraneous leaf node accesses required by a range query. The goal is to enhance the query performance of a traditional R-tree for high-overlap range queries as they tend to incur long running times. We introduce a new AI-tree that transforms the search operation of an R-tree into a multi-label classification task to exclude the extraneous leaf node accesses. Then, we augment a traditional R-tree with the AI-tree to form a hybrid "AI+R"-tree. The "AI+R"-tree can automatically differentiate between the high- and low-overlap queries using a learned model. Thus, the "AI+R"-tree processes high-overlap queries using the AI-tree, and the low-overlap queries using the R-tree. Experiments on real datasets demonstrate that the "AI+R"-tree can enhance the query performance over a traditional R-tree by up to 500%.  ( 3 min )
    KL-UCB-switch: optimal regret bounds for stochastic bandits from both a distribution-dependent and a distribution-free viewpoints. (arXiv:1805.05071v3 [stat.ML] UPDATED)
    We consider $K$-armed stochastic bandits and consider cumulative regret bounds up to time $T$. We are interested in strategies achieving simultaneously a distribution-free regret bound of optimal order $\sqrt{KT}$ and a distribution-dependent regret that is asymptotically optimal, that is, matching the $\kappa\ln T$ lower bound by Lai and Robbins (1985) and Burnetas and Katehakis (1996), where $\kappa$ is the optimal problem-dependent constant. This constant $\kappa$ depends on the model $\mathcal{D}$ considered (the family of possible distributions over the arms). Ménard and Garivier (2017) provided strategies achieving such a bi-optimality in the parametric case of models given by one-dimensional exponential families, while Lattimore (2016, 2018) did so for the family of (sub)Gaussian distributions with variance less than $1$. We extend this result to the non-parametric case of all distributions over $[0,1]$. We do so by combining the MOSS strategy by Audibert and Bubeck (2009), which enjoys a distribution-free regret bound of optimal order $\sqrt{KT}$, and the KL-UCB strategy by Cappé et al. (2013), for which we provide in passing the first analysis of an optimal distribution-dependent $\kappa\ln T$ regret bound in the model of all distributions over $[0,1]$. We were able to obtain this non-parametric bi-optimality result while working hard to streamline the proofs (of previously known regret bounds and thus of the new analyses carried out); a second merit of the present contribution is therefore to provide a review of proofs of classical regret bounds for index-based strategies for $K$-armed stochastic bandits.
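    The two index strategies being combined can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it uses the Bernoulli KL divergence instead of the non-parametric quantity needed for all distributions over $[0,1]$, the exploration term $\ln_+(T/(Kn))/n$ and the MOSS constant follow one common convention and are assumptions here, and the rule for switching between the two indices is not shown because the abstract does not state it. All function names are hypothetical.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def exploration(n, horizon, num_arms):
    """Assumed exploration budget: log_+(T / (K n)) / n for an arm pulled n times."""
    return max(0.0, math.log(horizon / (num_arms * n))) / n


def moss_index(mean, n, horizon, num_arms):
    """MOSS-style index: empirical mean plus a square-root bonus."""
    return mean + math.sqrt(exploration(n, horizon, num_arms) / 2.0)


def kl_ucb_index(mean, n, horizon, num_arms, tol=1e-9):
    """Largest q in [mean, 1] with KL(mean, q) <= budget, found by bisection."""
    budget = exploration(n, horizon, num_arms)
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


kl_index = kl_ucb_index(0.3, n=10, horizon=10_000, num_arms=5)
moss = moss_index(0.3, n=10, horizon=10_000, num_arms=5)
```

    With these conventions, Pinsker's inequality implies the KL-UCB index never exceeds the MOSS index, which gives some intuition for why the two can be analyzed together.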
    Personalized Diagnostic Tool for Thyroid Cancer Classification using Multi-view Ultrasound. (arXiv:2207.00496v1 [cs.CV])
    Over the past decades, the incidence of thyroid cancer has been increasing globally. Accurate and early diagnosis allows timely treatment and helps to avoid over-diagnosis. Clinically, a nodule is commonly evaluated from both transverse and longitudinal views using thyroid ultrasound. However, the appearance of the thyroid gland and lesions can vary dramatically across individuals. Identifying key diagnostic information from both views requires specialized expertise. Furthermore, finding an optimal way to integrate multi-view information also relies on the experience of clinicians and adds further difficulty to accurate diagnosis. To address these, we propose a personalized diagnostic tool that can customize its decision-making process for different patients. It consists of a multi-view classification module for feature extraction and a personalized weighting allocation network that generates optimal weighting for different views. It is also equipped with a self-supervised view-aware contrastive loss to further improve the model robustness towards different patient groups. Experimental results show that the proposed framework can better utilize multi-view information and outperform the competing methods.  ( 2 min )
    Implicit adaptation of mesh model of transient heat conduction problem. (arXiv:2207.00444v1 [eess.SY])
    Considering high-temperature heating, the equations of transient heat conduction model require an adaptation, i.e. the dependence of thermophysical parameters of the model on the temperature is to be identified for each specific material to be heated. This problem is most often solved by approximation of the tabular data on the measurements of the required parameters, which can be found in the literature, by means of regression equations. But, for example, considering the steel heating process, this approach is difficult to implement due to the lack of tabular discrete measurements for many grades of steel, such as alloyed ones. In this paper, a new approach is proposed, which is based on a solution of a related variational problem. Its main idea is to substitute the adaptation process in the classical sense (i.e., to find the dependencies of thermophysical parameters on temperature) with 'supervised learning' of a mesh model on the basis of the technological data received from the plant. The equations to adjust the parameters of the transient heat conduction model, which are related to the thermophysical coefficients, have been derived. A numerical experiment is conducted for steel of a particular group of grades, for which sufficient technological and tabular data are available. As a result, the 'trained' mesh model, which has not explicitly received any information about the physical and chemical properties of the heated substance, demonstrated an average error of 18.820 C, which is quite close to the average error of the model adapted classically on the basis of the tabular data (18.10 C).  ( 3 min )
    How can spherical CNNs benefit ML-based diffusion MRI parameter estimation?. (arXiv:2207.00572v1 [eess.IV])
    This paper demonstrates spherical convolutional neural networks (S-CNN) offer distinct advantages over conventional fully-connected networks (FCN) at estimating scalar parameters of tissue microstructure from diffusion MRI (dMRI). Such microstructure parameters are valuable for identifying pathology and quantifying its extent. However, current clinical practice commonly acquires dMRI data consisting of only 6 diffusion weighted images (DWIs), limiting the accuracy and precision of estimated microstructure indices. Machine learning (ML) has been proposed to address this challenge. However, existing ML-based methods are not robust to differing dMRI gradient sampling schemes, nor are they rotation equivariant. Lack of robustness to sampling schemes requires a new network to be trained for each scheme, complicating the analysis of data from multiple sources. A possible consequence of the lack of rotational equivariance is that the training dataset must contain a diverse range of microstructure orientations. Here, we show spherical CNNs represent a compelling alternative that is robust to new sampling schemes as well as offering rotational equivariance. We show the latter can be leveraged to decrease the number of training datapoints required.  ( 2 min )
    A Shallow Ritz Method for Elliptic Problems with Singular Sources. (arXiv:2107.12013v3 [math.NA] UPDATED)
    In this paper, a shallow Ritz-type neural network for solving elliptic equations with delta function singular sources on an interface is developed. There are three novel features in the present work; namely, (i) the delta function singularity is naturally removed, (ii) a level set function is introduced as a feature input, (iii) it is completely shallow, comprising only one hidden layer. We first introduce the energy functional of the problem and then transform the contribution of singular sources to a regular surface integral along the interface. In such a way, the delta function singularity can be naturally removed without introducing a discrete one that is commonly used in traditional regularization methods, such as the well-known immersed boundary method. The original problem is then reformulated as a minimization problem. We propose a shallow Ritz-type neural network with one hidden layer to approximate the global minimizer of the energy functional. As a result, the network is trained by minimizing the loss function that is a discrete version of the energy. In addition, we include the level set function of the interface as a feature input of the network and find that it significantly improves the training efficiency and accuracy. We perform a series of numerical tests to show the accuracy of the present method and its capability for problems in irregular domains and higher dimensions.
    Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models. (arXiv:2110.02891v2 [cs.LG] UPDATED)
    Controllable generative sequence models with the capability to extract and replicate the style of specific examples enable many applications, including narrating audiobooks in different voices, auto-completing and auto-correcting written handwriting, and generating missing training samples for downstream recognition tasks. However, under an unsupervised-style setting, typical training algorithms for controllable sequence generative models suffer from the training-inference mismatch, where the same sample is used as content and style input during training but unpaired samples are given during inference. In this paper, we tackle the training-inference mismatch encountered during unsupervised learning of controllable generative sequence models. The proposed method is simple yet effective, where we use a style transformation module to transfer target style information into an unrelated style input. This method enables training using unpaired content and style samples and thereby mitigates the training-inference mismatch. We apply style equalization to text-to-speech and text-to-handwriting synthesis on three datasets. We conduct thorough evaluation, including both quantitative and qualitative user studies. Our results show that by mitigating the training-inference mismatch with the proposed style equalization, we achieve style replication scores comparable to real data in our user studies.  ( 3 min )
    An Artificial Intelligence Dataset for Solar Energy Locations in India. (arXiv:2202.01340v2 [cs.LG] UPDATED)
    Rapid development of renewable energy sources, particularly solar photovoltaics (PV), is critical to mitigate climate change. As a result, India has set ambitious goals to install 500 gigawatts of solar energy capacity by 2030. Given the large footprint projected to meet renewable energy targets, the potential for land use conflicts over environmental values is high. To expedite development of solar energy, land use planners will need access to up-to-date and accurate geo-spatial information of PV infrastructure. In this work, we developed a spatially explicit machine learning model to map utility-scale solar projects across India using freely available satellite imagery with a mean accuracy of 92%. Our model predictions were validated by human experts to obtain a dataset of 1363 solar PV farms. Using this dataset, we measured the solar footprint across India and quantified the degree of landcover modification associated with the development of PV infrastructure. Our analysis indicates that over 74% of solar development in India was built on landcover types that have natural ecosystem preservation or agricultural value.
    Receptive Field Analysis of Temporal Convolutional Networks for Monaural Speech Dereverberation. (arXiv:2204.06439v3 [cs.SD] UPDATED)
    Speech dereverberation is often an important requirement in robust speech processing tasks. Supervised deep learning (DL) models give state-of-the-art performance for single-channel speech dereverberation. Temporal convolutional networks (TCNs) are commonly used for sequence modelling in speech enhancement tasks. A feature of TCNs is that they have a receptive field (RF), dependent on the specific model configuration, which determines the number of input frames that can be observed to produce an individual output frame. It has been shown that TCNs are capable of performing dereverberation of simulated speech data; however, a thorough analysis, especially one focused on the RF, is still lacking in the literature. This paper analyses dereverberation performance depending on the model size and the RF of TCNs. Experiments using the WHAMR corpus, which is extended to include room impulse responses (RIRs) with larger T60 values, demonstrate that a larger RF can yield a significant improvement in performance when training smaller TCN models. It is also demonstrated that TCNs benefit from a wider RF when dereverberating RIRs with larger RT60 values.
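    As an illustration of how the receptive field depends on the model configuration, here is a minimal sketch that counts input frames for a stack of dilated 1-D convolutions. It assumes a common TCN layout in which the dilation doubles with each block inside a stack and resets at the start of the next stack; the abstract does not specify the configuration, so the parameter names and layout are assumptions.

```python
def tcn_receptive_field(kernel_size, blocks_per_stack, stacks):
    """Receptive field, in frames, of stacked dilated convolutions.

    Assumes block i of every stack uses dilation 2**i (an assumption of this
    sketch). Each dilated convolution widens the receptive field by
    (kernel_size - 1) * dilation frames.
    """
    rf = 1  # a single output frame always sees at least one input frame
    for _ in range(stacks):
        for i in range(blocks_per_stack):
            dilation = 2 ** i
            rf += (kernel_size - 1) * dilation
    return rf
```

    Under these assumptions the closed form is 1 + stacks * (kernel_size - 1) * (2**blocks_per_stack - 1), so adding one block per stack roughly doubles the receptive field, while adding a stack only increases it linearly.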
    Masked Autoencoders for Self-Supervised Learning on Automotive Point Clouds. (arXiv:2207.00531v1 [cs.CV])
    Masked autoencoding has become a successful pre-training paradigm for Transformer models for text, images, and recently, point clouds. Raw automotive datasets are a suitable candidate for self-supervised pre-training as they generally are cheap to collect compared to annotations for tasks like 3D object detection (OD). However, development of masked autoencoders for point clouds has focused solely on synthetic and indoor data. Consequently, existing methods have tailored their representations and models toward point clouds which are small, dense and have homogeneous point density. In this work, we study masked autoencoding for point clouds in an automotive setting, which are sparse and for which the point density can vary drastically among objects in the same scene. To this end, we propose Voxel-MAE, a simple masked autoencoding pre-training scheme designed for voxel representations. We pre-train the backbone of a Transformer-based 3D object detector to reconstruct masked voxels and to distinguish between empty and non-empty voxels. Our method improves the 3D OD performance by 1.75 mAP points and 1.05 NDS on the challenging nuScenes dataset. Compared to existing self-supervised methods for automotive data, Voxel-MAE displays up to $2\times$ performance increase. Further, we show that by pre-training with Voxel-MAE, we require only 40% of the annotated data to outperform a randomly initialized equivalent. Code will be released.  ( 3 min )
    Learning Mean Field Games: A Survey. (arXiv:2205.12944v2 [cs.LG] UPDATED)
    Non-cooperative and cooperative games with a very large number of players have many applications but remain generally intractable when the number of players increases. Introduced by Lasry and Lions, and Huang, Caines and Malhamé, Mean Field Games (MFGs) rely on a mean-field approximation to allow the number of players to grow to infinity. Traditional methods for solving these games generally rely on solving partial or stochastic differential equations with a full knowledge of the model. Recently, Reinforcement Learning (RL) has appeared promising to solve complex problems. By combining MFGs and RL, we hope to solve games at a very large scale both in terms of population size and environment complexity. In this survey, we review the quickly growing recent literature on RL methods to learn Nash equilibria in MFGs. We first identify the most common settings (static, stationary, and evolutive). We then present a general framework for classical iterative methods (based on best-response computation or policy evaluation) to solve MFGs in an exact way. Building on these algorithms and the connection with Markov Decision Processes, we explain how RL can be used to learn MFG solutions in a model-free way. Last, we present numerical illustrations on a benchmark problem, and conclude with some perspectives.
    Habitat 2.0: Training Home Assistants to Rearrange their Habitat. (arXiv:2106.14405v2 [cs.LG] UPDATED)
    We introduce Habitat 2.0 (H2.0), a simulation platform for training virtual robots in interactive 3D environments and complex physics-enabled scenarios. We make comprehensive contributions to all levels of the embodied AI stack - data, simulation, and benchmark tasks. Specifically, we present: (i) ReplicaCAD: an artist-authored, annotated, reconfigurable 3D dataset of apartments (matching real spaces) with articulated objects (e.g. cabinets and drawers that can open/close); (ii) H2.0: a high-performance physics-enabled 3D simulator with speeds exceeding 25,000 simulation steps per second (850x real-time) on an 8-GPU node, representing 100x speed-ups over prior work; and, (iii) Home Assistant Benchmark (HAB): a suite of common tasks for assistive robots (tidy the house, prepare groceries, set the table) that test a range of mobile manipulation capabilities. These large-scale engineering contributions allow us to systematically compare deep reinforcement learning (RL) at scale and classical sense-plan-act (SPA) pipelines in long-horizon structured tasks, with an emphasis on generalization to new objects, receptacles, and layouts. We find that (1) flat RL policies struggle on HAB compared to hierarchical ones; (2) a hierarchy with independent skills suffers from 'hand-off problems', and (3) SPA pipelines are more brittle than RL policies.
    Improved Generalization Bounds for Adversarially Robust Learning. (arXiv:1810.02180v5 [cs.LG] UPDATED)
    We consider a model of robust learning in an adversarial environment. The learner gets uncorrupted training data with access to possible corruptions that may be affected by the adversary during testing. The learner's goal is to build a robust classifier, which will be tested on future adversarial examples. The adversary is limited to $k$ possible corruptions for each input. We model the learner-adversary interaction as a zero-sum game. This model is closely related to the adversarial examples model of Schmidt et al. (2018); Madry et al. (2017). Our main results consist of generalization bounds for the binary and multiclass classification, as well as the real-valued case (regression). For the binary classification setting, we both tighten the generalization bound of Feige et al. (2015), and are also able to handle infinite hypothesis classes. The sample complexity is improved from $O(\frac{1}{\epsilon^4}\log(\frac{|H|}{\delta}))$ to $O\big(\frac{1}{\epsilon^2}(kVC(H)\log^{\frac{3}{2}+\alpha}(kVC(H))+\log(\frac{1}{\delta}))\big)$ for any $\alpha > 0$. Additionally, we extend the algorithm and generalization bound from the binary to the multiclass and real-valued cases. Along the way, we obtain results on fat-shattering dimension and Rademacher complexity of $k$-fold maxima over function classes; these may be of independent interest. For binary classification, the algorithm of Feige et al. (2015) uses a regret minimization algorithm and an ERM oracle as a black box; we adapt it for the multiclass and regression settings. The algorithm provides us with near-optimal policies for the players on a given training sample.
    InQSS: a speech intelligibility and quality assessment model using a multi-task learning network. (arXiv:2111.02585v3 [cs.SD] UPDATED)
    Speech intelligibility and quality assessment models are essential tools for researchers to evaluate and improve speech processing models. However, only a few studies have investigated multi-task models for intelligibility and quality assessment due to the limitations of available data. In this study, we released TMHINT-QI, the first Chinese speech dataset that records the quality and intelligibility scores of clean, noisy, and enhanced utterances. Then, we propose InQSS, a non-intrusive multi-task learning framework for intelligibility and quality assessment. We evaluated the InQSS on both the training-from-scratch and the pretrained models. The experimental results confirm the effectiveness of the InQSS framework. In addition, the resulting model can predict not only the intelligibility scores but also the quality scores of a speech signal.
    Behavioral Player Rating in Competitive Online Shooter Games. (arXiv:2207.00528v1 [cs.LG])
    Competitive online games use rating systems for matchmaking: progression-based algorithms that estimate the skill level of players with interpretable ratings in terms of the outcome of the games they played. However, the overall experience of players is shaped by factors beyond the sole outcome of their games. In this paper, we engineer several features from in-game statistics to model players and create ratings that accurately represent their behavior and true performance level. We then compare the estimating power of our behavioral ratings against ratings created with three mainstream rating systems by predicting rank of players in four popular game modes from the competitive shooter genre. Our results show that the behavioral ratings present more accurate performance estimations while maintaining the interpretability of the created representations. Considering different aspects of the playing behavior of players and using behavioral ratings for matchmaking can lead to match-ups that are more aligned with players' goals and interests, consequently resulting in a more enjoyable gaming experience.
    Towards Explanation for Unsupervised Graph-Level Representation Learning. (arXiv:2205.09934v2 [cs.LG] UPDATED)
    Due to the superior performance of Graph Neural Networks (GNNs) in various domains, there is an increasing interest in the GNN explanation problem "which fraction of the input graph is the most crucial to decide the model's decision?" Existing explanation methods focus on the supervised settings, e.g., node classification and graph classification, while the explanation for unsupervised graph-level representation learning is still unexplored. The opaqueness of the graph representations may lead to unexpected risks when deployed for high-stake decision-making scenarios. In this paper, we advance the Information Bottleneck principle (IB) to tackle the proposed explanation problem for unsupervised graph representations, which leads to a novel principle, Unsupervised Subgraph Information Bottleneck (USIB). We also theoretically analyze the connection between graph representations and explanatory subgraphs on the label space, which reveals that the expressiveness and robustness of representations benefit the fidelity of explanatory subgraphs. Experimental results on both synthetic and real-world datasets demonstrate the superiority of our developed explainer and the validity of our theoretical analysis.  ( 2 min )
    Robust subgroup discovery. (arXiv:2103.13686v4 [cs.LG] UPDATED)
    We introduce the problem of robust subgroup discovery, i.e., finding a set of interpretable descriptions of subsets that 1) stand out with respect to one or more target attributes, 2) are statistically robust, and 3) non-redundant. Many attempts have been made to mine either locally robust subgroups or to tackle the pattern explosion, but we are the first to address both challenges at the same time from a global modelling perspective. First, we formulate the broad model class of subgroup lists, i.e., ordered sets of subgroups, for univariate and multivariate targets that can consist of nominal or numeric variables, including traditional top-1 subgroup discovery in its definition. This novel model class allows us to formalise the problem of optimal robust subgroup discovery using the Minimum Description Length (MDL) principle, where we resort to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and numeric targets, respectively. Second, finding optimal subgroup lists is NP-hard. Therefore, we propose SSD++, a greedy heuristic that finds good subgroup lists and guarantees that the most significant subgroup found according to the MDL criterion is added in each iteration. In fact, the greedy gain is shown to be equivalent to a Bayesian one-sample proportion, multinomial, or t-test between the subgroup and dataset marginal target distributions plus a multiple hypothesis testing penalty. Furthermore, we empirically show on 54 datasets that SSD++ outperforms previous subgroup discovery methods in terms of quality, generalisation on unseen data, and subgroup list size.
    Data Banzhaf: A Data Valuation Framework with Maximal Robustness to Learning Stochasticity. (arXiv:2205.15466v3 [cs.LG] UPDATED)
    This paper studies the robustness of data valuation to noisy model performance scores. Particularly, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the Leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we first pose a formal framework within which one can measure the robustness of a data value notion. We show that the Banzhaf value, a value notion originating from the cooperative game theory literature, achieves the maximal robustness among all semivalues -- a class of value notions that satisfy crucial properties entailed by ML applications. We propose an algorithm to efficiently estimate the Banzhaf value based on the Maximum Sample Reuse (MSR) principle. We derive the lower bound sample complexity for Banzhaf value approximation, and we show that our MSR algorithm's sample complexity nearly matches the lower bound. Our evaluation demonstrates that the Banzhaf value outperforms the existing semivalue-based data value notions on several downstream ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the semivalue-based data value schemes given its computational advantage and ability to robustly differentiate data quality.
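    The Banzhaf value of a data point is its average marginal contribution over all subsets of the remaining points. The sketch below computes it exactly by enumeration and also gives a sample-reuse style Monte Carlo estimate in which every sampled subset contributes to the estimate of every point. This is an illustrative reading of the Maximum Sample Reuse idea under stated assumptions (uniform subset sampling, a user-supplied `utility` function on index sets), not the authors' code; all names are hypothetical.

```python
import itertools
import random


def exact_banzhaf(n, utility):
    """Exact Banzhaf values by enumerating all subsets (feasible only for small n)."""
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for r in range(len(others) + 1):
            for subset in itertools.combinations(others, r):
                s = frozenset(subset)
                total += utility(s | {i}) - utility(s)
        values.append(total / 2 ** (n - 1))
    return values


def msr_banzhaf(n, utility, n_samples, seed=0):
    """Monte Carlo estimate that reuses each sampled subset for every point.

    Each point joins a sampled subset independently with probability 1/2. The
    estimate for point i is the mean utility of sampled subsets containing i
    minus the mean utility of sampled subsets not containing i.
    """
    rng = random.Random(seed)
    with_sum, without_sum = [0.0] * n, [0.0] * n
    with_cnt, without_cnt = [0] * n, [0] * n
    for _ in range(n_samples):
        s = frozenset(j for j in range(n) if rng.random() < 0.5)
        u = utility(s)
        for i in range(n):
            if i in s:
                with_sum[i] += u
                with_cnt[i] += 1
            else:
                without_sum[i] += u
                without_cnt[i] += 1
    estimates = []
    for i in range(n):
        if with_cnt[i] == 0 or without_cnt[i] == 0:
            estimates.append(0.0)  # not enough samples to form a difference
        else:
            estimates.append(with_sum[i] / with_cnt[i] - without_sum[i] / without_cnt[i])
    return estimates
```

    For an additive utility such as `utility = lambda s: sum(weights[j] for j in s)`, the exact Banzhaf value of point `i` equals `weights[i]`, which makes a convenient sanity check for the estimator.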
    Evaluating the Explainers: Black-Box Explainable Machine Learning for Student Success Prediction in MOOCs. (arXiv:2207.00551v1 [cs.LG])
    Neural networks are ubiquitous in applied machine learning for education. Their pervasive success in predictive performance comes alongside a severe weakness, the lack of explainability of their decisions, especially relevant in human-centric fields. We implement five state-of-the-art methodologies for explaining black-box machine learning models (LIME, PermutationSHAP, KernelSHAP, DiCE, CEM) and examine the strengths of each approach on the downstream task of student performance prediction for five massive open online courses. Our experiments demonstrate that the families of explainers do not agree with each other on feature importance for the same Bidirectional LSTM models with the same representative set of students. We use Principal Component Analysis, Jensen-Shannon distance, and Spearman's rank-order correlation to quantitatively cross-examine explanations across methods and courses. Furthermore, we validate explainer performance across curriculum-based prerequisite relationships. Our results come to the concerning conclusion that the choice of explainer is an important decision and is in fact paramount to the interpretation of the predictive results, even more so than the course the model is trained on. Source code and models are released at this http URL
    Learning Lattice Quantum Field Theories with Equivariant Continuous Flows. (arXiv:2207.00283v1 [hep-lat])
    We propose a novel machine learning method for sampling from the high-dimensional probability distributions of Lattice Quantum Field Theories. Instead of the deep architectures used so far for this task, our proposal is based on a single neural ODE layer and incorporates the full symmetries of the problem. We test our model on the $\phi^4$ theory, showing that it systematically outperforms previously proposed flow-based methods in sampling efficiency, and the improvement is especially pronounced for larger lattices. Compared to the previous baseline model, we improve a key metric, the effective sample size, from 1% to 91% on a lattice of size $32\times 32$. We also demonstrate that our model can successfully learn a continuous family of theories at once, and the results of learning can be transferred to larger lattices. Such generalization capacities further accentuate the potential advantages of machine learning methods compared to traditional MCMC-based methods.
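    The effective sample size quoted above is usually computed from importance weights between the model and target distributions. A minimal sketch, assuming the common normalized definition ESS = (sum w)^2 / (N * sum w^2) with weights supplied as log-values; the abstract does not state the exact formula, so this definition is an assumption.

```python
import math


def effective_sample_size(log_weights):
    """Normalized effective sample size in (0, 1] from log importance weights.

    Subtracting the maximum log-weight before exponentiating avoids overflow
    and does not change the ratio.
    """
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    return sum(w) ** 2 / (len(w) * sum(x * x for x in w))
```

    Equal weights give a value of 1 (every sample is equally useful), while a single dominant weight among N samples gives roughly 1/N.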
    Secure Forward Aggregation for Vertical Federated Neural Networks. (arXiv:2207.00165v1 [cs.CR])
    Vertical federated learning (VFL) is attracting much attention because it enables cross-silo data cooperation in a privacy-preserving manner. While most research works in VFL focus on linear and tree models, deep models (e.g., neural networks) are not well studied in VFL. In this paper, we focus on SplitNN, a well-known neural network framework in VFL, and identify a trade-off between data security and model performance in SplitNN. Briefly, SplitNN trains the model by exchanging gradients and transformed data. On the one hand, SplitNN suffers from the loss of model performance since multiple parties jointly train the model using transformed data instead of raw data, and a large amount of low-level feature information is discarded. On the other hand, a naive solution of increasing the model performance through aggregating at lower layers in SplitNN (i.e., the data is less transformed and more low-level features are preserved) makes raw data vulnerable to inference attacks. To mitigate the above trade-off, we propose a new neural network protocol in VFL called Security Forward Aggregation (SFA). It changes the way of aggregating the transformed data and adopts removable masks to protect the raw data. Experiment results show that networks with SFA achieve both data security and high model performance.
    Conditional Variable Selection for Intelligent Test. (arXiv:2207.00335v1 [cs.LG])
    Intelligent test requires efficient and effective analysis of high-dimensional data at large scale. Traditionally, the analysis is often conducted by human experts, but it is not scalable in the era of big data. To tackle this challenge, variable selection has been recently introduced to intelligent test. However, in practice, we encounter scenarios where certain variables (e.g. some specific processing conditions for a device under test) must be maintained after variable selection. We call this conditional variable selection, which has not been well investigated for embedded or deep-learning-based variable selection methods. In this paper, we discuss a novel conditional variable selection framework that can select the most important candidate variables given a set of preselected variables.
    Rapid training of quantum recurrent neural network. (arXiv:2207.00378v1 [quant-ph])
    Time series prediction is a crucial task for many human activities, e.g. weather forecasts or predicting stock prices. One solution to this problem is to use Recurrent Neural Networks (RNNs). Although they can yield accurate predictions, their learning process is slow and complex. Here we propose a Quantum Recurrent Neural Network (QRNN) to address these obstacles. The design of the network is based on the continuous-variable quantum computing paradigm. We demonstrate that the network is capable of learning time dependence of a few types of temporal data. Our numerical simulations show that the QRNN converges to optimal weights in fewer epochs than the classical network. Furthermore, for a small number of trainable parameters it can achieve lower loss than the latter.
    Non-Parametric Inference of Relational Dependence. (arXiv:2207.00163v1 [stat.ML])
    Independence testing plays a central role in statistical and causal inference from observational data. Standard independence tests assume that the data samples are independent and identically distributed (i.i.d.) but that assumption is violated in many real-world datasets and applications centered on relational systems. This work examines the problem of estimating independence in data drawn from relational systems by defining sufficient representations for the sets of observations influencing individual instances. Specifically, we define marginal and conditional independence tests for relational data by considering the kernel mean embedding as a flexible aggregation function for relational variables. We propose a consistent, non-parametric, scalable kernel test to operationalize the relational independence test for non-i.i.d. observational data under a set of structural assumptions. We empirically evaluate our proposed method on a variety of synthetic and semi-synthetic networks and demonstrate its effectiveness compared to state-of-the-art kernel-based independence tests.
    Energy Efficient Routing For Underwater Acoustic Sensor Network Using Genetic Algorithm. (arXiv:2207.00416v1 [cs.NI])
    In underwater acoustic sensor networks (UWASN), energy-reliable data transmission is a challenging task. This is due to acoustic transmission disturbances caused by excessive noise, exceptionally long propagation delays, a high bit error rate, limited bandwidth capability, and interference. One of the most important issues of UWASN for research is how to extend the life span of data transmission. Data transfer from a source node to a destination node in UWASN is a complicated topic for researchers. Many routing algorithms, such as vector base forwarding and depth base routing, have been developed in past years. We propose a genetic algorithm-based optimization method for improving the energy efficiency of data transmission in the routing path from a source node to a destination node.
    VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations. (arXiv:2207.00221v1 [cs.CV])
    Vision-Language Pretraining (VLP) models have recently successfully facilitated many cross-modal downstream tasks. Most existing works evaluated their systems by comparing the fine-tuned downstream task performance. However, only average downstream task accuracy provides little information about the pros and cons of each VLP method, let alone provides insights on how the community can improve the systems in the future. Inspired by the CheckList for testing natural language processing, we introduce VL-CheckList, a novel framework to understand the capabilities of VLP models. The proposed method divides the image-texting ability of a VLP model into three categories: objects, attributes, and relations, and uses a novel taxonomy to further break down these three aspects. We conduct comprehensive studies to analyze seven recently popular VLP models via the proposed framework. Results confirm the effectiveness of the proposed method by revealing fine-grained differences among the compared models that were not visible from downstream task-only evaluation. Further results show promising research direction in building better VLP models. Data and Code: https://github.com/om-ai-lab/VL-CheckList
    Performative Reinforcement Learning. (arXiv:2207.00046v1 [cs.LG])
    We introduce the framework of performative reinforcement learning where the policy chosen by the learner affects the underlying reward and transition dynamics of the environment. Following the recent literature on performative prediction (Perdomo et al., 2020), we introduce the concept of performatively stable policy. We then consider a regularized version of the reinforcement learning problem and show that repeatedly optimizing this objective converges to a performatively stable policy under reasonable assumptions on the transition dynamics. Our proof utilizes the dual perspective of the reinforcement learning problem and may be of independent interest in analyzing the convergence of other algorithms with decision-dependent environments. We then extend our results for the setting where the learner just performs gradient ascent steps instead of fully optimizing the objective, and for the setting where the learner has access to a finite number of trajectories from the changed environment. For both the settings, we leverage the dual formulation of performative reinforcement learning and establish convergence to a stable solution. Finally, through extensive experiments on a grid-world environment, we demonstrate the dependence of convergence on various parameters e.g. regularization, smoothness, and the number of samples.
    Usable Region Estimate for Assessing Practical Usability of Medical Image Segmentation Models. (arXiv:2207.00156v1 [eess.IV])
    We aim to quantitatively measure the practical usability of medical image segmentation models: to what extent, how often, and on which samples a model's predictions can be used/trusted. We first propose a measure, Correctness-Confidence Rank Correlation (CCRC), to capture how predictions' confidence estimates correlate with their correctness scores in rank. A model with a high value of CCRC means its prediction confidences reliably suggest which samples' predictions are more likely to be correct. Since CCRC does not capture the actual prediction correctness, it alone is insufficient to indicate whether a prediction model is both accurate and reliable to use in practice. Therefore, we further propose another method, Usable Region Estimate (URE), which simultaneously quantifies predictions' correctness and reliability of confidence assessments in one estimate. URE provides concrete information on to what extent a model's predictions are usable. In addition, the sizes of usable regions (UR) can be utilized to compare models: A model with a larger UR can be taken as a more usable and hence better model. Experiments on six datasets validate that the proposed evaluation methods perform well, providing a concrete and concise measure for the practical usability of medical image segmentation models. Code is made available at https://github.com/yizhezhang2000/ure.
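    The idea of correlating confidence with correctness in rank can be illustrated with a minimal sketch. It assumes a Spearman-style rank correlation; the abstract does not give the exact CCRC formula, so the function names and the choice of Spearman here are illustrative assumptions, not the authors' definition.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def rank_correlation(confidences, correctness):
    """Spearman-style correlation between confidence and correctness ranks."""
    return pearson(average_ranks(confidences), average_ranks(correctness))


# Hypothetical per-sample confidences and correctness scores (e.g. Dice).
confidences = [0.9, 0.8, 0.3, 0.1]
dice_scores = [0.95, 0.85, 0.40, 0.20]
print(rank_correlation(confidences, dice_scores))
```

    A value near 1 means that higher confidence reliably goes with higher correctness; as the abstract notes, this says nothing about how correct the predictions actually are.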
    AI in 6G: Energy-Efficient Distributed Machine Learning for Multilayer Heterogeneous Networks. (arXiv:2207.00415v1 [cs.NI])
    Adept network management is key for supporting extremely heterogeneous applications with stringent quality of service (QoS) requirements; this is more so when envisioning the complex and ultra-dense 6G mobile heterogeneous network (HetNet). From both the environmental and economical perspectives, non-homogeneous QoS demands obstruct the minimization of the energy footprints and operational costs of the envisioned robust networks. As such, network intelligentization is expected to play an essential role in the realization of such sophisticated aims. The fusion of artificial intelligence (AI) and mobile networks will allow for the dynamic and automatic configuration of network functionalities. Machine learning (ML), one of the backbones of AI, will be instrumental in forecasting changes in network loads and resource utilization, estimating channel conditions, optimizing network slicing, and enhancing security and encryption. However, it is well known that ML tasks themselves incur massive computational burdens and energy costs. To overcome such obstacles, we propose a novel layer-based HetNet architecture which optimally distributes tasks associated with different ML approaches across network layers and entities; such a HetNet boasts multiple access schemes as well as device-to-device (D2D) communications to enhance energy efficiency via collaborative learning and communications.
    MotionMixer: MLP-based 3D Human Body Pose Forecasting. (arXiv:2207.00499v1 [cs.CV])
    In this work, we present MotionMixer, an efficient 3D human body pose forecasting model based solely on multi-layer perceptrons (MLPs). MotionMixer learns the spatial-temporal 3D body pose dependencies by sequentially mixing both modalities. Given a stacked sequence of 3D body poses, a spatial-MLP extracts fine grained spatial dependencies of the body joints. The interaction of the body joints over time is then modelled by a temporal MLP. The spatial-temporal mixed features are finally aggregated and decoded to obtain the future motion. To calibrate the influence of each time step in the pose sequence, we make use of squeeze-and-excitation (SE) blocks. We evaluate our approach on Human3.6M, AMASS, and 3DPW datasets using the standard evaluation protocols. For all evaluations, we demonstrate state-of-the-art performance, while having a model with a smaller number of parameters. Our code is available at: https://github.com/MotionMLP/MotionMixer
    Visual Transformer Meets CutMix for Improved Accuracy, Communication Efficiency, and Data Privacy in Split Learning. (arXiv:2207.00234v1 [cs.LG])
    This article seeks a distributed learning solution for the visual transformer (ViT) architectures. Compared to convolutional neural network (CNN) architectures, ViTs often have larger model sizes, and are computationally expensive, making federated learning (FL) ill-suited. Split learning (SL) can detour this problem by splitting a model and communicating the hidden representations at the split-layer, also known as smashed data. Notwithstanding, the smashed data of ViT are as large as and as similar as the input data, negating the communication efficiency of SL while violating data privacy. To resolve these issues, we propose a new form of CutSmashed data by randomly punching and compressing the original smashed data. Leveraging this, we develop a novel SL framework for ViT, coined CutMixSL, communicating CutSmashed data. CutMixSL not only reduces communication costs and privacy leakage, but also inherently involves the CutMix data augmentation, improving accuracy and scalability. Simulations corroborate that CutMixSL outperforms baselines such as parallelized SL and SplitFed that integrates FL with SL.
    Visual Pre-training for Navigation: What Can We Learn from Noise?. (arXiv:2207.00052v1 [cs.CV])
    A powerful paradigm for sensorimotor control is to predict actions from observations directly. Training such an end-to-end system allows representations that are useful for the downstream tasks to emerge automatically. In visual navigation, an agent can learn to navigate without any manual designs by correlating how its views change with the actions being taken. However, the lack of inductive bias makes this system data-inefficient and impractical in scenarios like search and rescue, where interacting with the environment to collect data is costly. We hypothesize a sufficient representation of the current view and the goal view for a navigation policy can be learned by predicting the location and size of a crop of the current view that corresponds to the goal. We further show that training such random crop prediction in a self-supervised fashion purely on random noise images transfers well to natural home images. The learned representation can then be bootstrapped to learn a navigation policy efficiently with little interaction data. Code is available at https://github.com/yanweiw/noise2ptz.
    FLVoogd: Robust And Privacy Preserving Federated Learning. (arXiv:2207.00428v1 [cs.CR])
    In this work, we propose FLVoogd, an updated federated learning method in which servers and clients collaboratively eliminate Byzantine attacks while preserving privacy. In particular, servers use automatic Density-based Spatial Clustering of Applications with Noise (DBSCAN) combined with S2PC to cluster the benign majority without acquiring sensitive personal information. Meanwhile, clients build dual models and perform test-based distance controlling to adjust their local models toward the global one to achieve personalizing. Our framework is automatic and adaptive, so that servers/clients don't need to tune the parameters during the training. In addition, our framework leverages Secure Multi-party Computation (SMPC) operations, including multiplications, additions, and comparison, where costly operations, like division and square root, are not required. Evaluations are carried out on some conventional datasets from the image classification field. The result shows that FLVoogd can effectively reject malicious uploads in most scenarios; meanwhile, it avoids data leakage from the server-side.
    Weakly-supervised High-fidelity Ultrasound Video Synthesis with Feature Decoupling. (arXiv:2207.00474v1 [cs.CV])
    Ultrasound (US) is widely used for its advantages of real-time imaging, freedom from radiation, and portability. In clinical practice, analysis and diagnosis often rely on US sequences rather than a single image to obtain dynamic anatomical information. This is challenging for novices to learn because practicing with adequate videos from patients is clinically impractical. In this paper, we propose a novel framework to synthesize high-fidelity US videos. Specifically, the synthesized videos are generated by animating source content images based on the motion of given driving videos. Our highlights are three-fold. First, leveraging the advantages of self- and fully-supervised learning, our proposed system is trained in a weakly-supervised manner for keypoint detection. These keypoints then provide vital information for handling complex high dynamic motions in US videos. Second, we decouple content and texture learning using the dual decoders to effectively reduce the model learning difficulty. Last, we adopt the adversarial training strategy with GAN losses for further improving the sharpness of the generated videos, narrowing the gap between real and synthesized videos. We validate our method on a large in-house pelvic dataset with high dynamic motion. Extensive evaluation metrics and user study prove the effectiveness of our proposed method.
    Stain Isolation-based Guidance for Improved Stain Translation. (arXiv:2207.00431v1 [cs.CV])
    Unsupervised and unpaired domain translation using generative adversarial neural networks, and more precisely CycleGAN, is state of the art for the stain translation of histopathology images. It often, however, suffers from the presence of cycle-consistent but non structure-preserving errors. We propose an alternative approach to the set of methods which, relying on segmentation consistency, enable the preservation of pathology structures. Focusing on immunohistochemistry (IHC) and multiplexed immunofluorescence (mIF), we introduce a simple yet effective guidance scheme as a loss function that leverages the consistency of stain translation with stain isolation. Qualitative and quantitative experiments show the ability of the proposed approach to improve translation between the two domains.
    Modularity Optimization as a Training Criterion for Graph Neural Networks. (arXiv:2207.00107v1 [cs.LG])
    Graph convolution is a recent scalable method for performing deep feature learning on attributed graphs by aggregating local node information over multiple layers. Such layers only consider attribute information of node neighbors in the forward model and do not incorporate knowledge of global network structure in the learning task. In particular, the modularity function provides a convenient source of information about the community structure of networks. In this work we investigate the effect on the quality of learned representations by the incorporation of community structure preservation objectives of networks in the graph convolutional model. We incorporate the objectives in two ways, through an explicit regularization term in the cost function in the output layer and as an additional loss term computed via an auxiliary layer. We report the effect of community structure preserving terms in the graph convolutional architectures. Experimental evaluation on two attributed bibliographic networks showed that the incorporation of the community-preserving objective improves semi-supervised node classification accuracy in the sparse label regime.
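    For readers unfamiliar with the modularity function, the following minimal sketch computes the standard Newman modularity Q of a hard community assignment on an undirected, unweighted graph without self-loops. It is only the textbook quantity; how the paper turns it into a differentiable regularizer or auxiliary loss is not described in the abstract and is not shown here. The example graph and labels are made up for illustration.

```python
def modularity(edges, community):
    """Newman modularity for an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping node -> community label.
    Q = (fraction of edges inside communities)
        - sum over communities of (total degree / 2m)^2
    """
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    # Observed fraction of edges whose endpoints share a community.
    intra = sum(1 for u, v in edges if community[u] == community[v]) / m

    # Expected fraction under a degree-preserving random graph.
    degree_sum = {}
    for node, k in degree.items():
        label = community[node]
        degree_sum[label] = degree_sum.get(label, 0) + k
    expected = sum((d / (2 * m)) ** 2 for d in degree_sum.values())

    return intra - expected


# Two triangles joined by a single bridge edge (illustrative graph).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(edges, labels))
```

    Putting every node in one community gives Q = 0, which is why a positive Q indicates more intra-community edges than expected by chance.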
    ProSelfLC: Progressive Self Label Correction Towards A Low-Temperature Entropy State. (arXiv:2207.00118v1 [cs.LG])
    To train robust deep neural networks (DNNs), we systematically study several target modification approaches, which include output regularisation, self and non-self label correction (LC). Three key issues are discovered: (1) Self LC is the most appealing as it exploits its own knowledge and requires no extra models. However, how to automatically decide the trust degree of a learner as training goes is not well answered in the literature. (2) Some methods penalise while the others reward low-entropy predictions, prompting us to ask which one is better. (3) Using the standard training setting, a trained network is of low confidence when severe noise exists, making it hard to leverage its high-entropy self knowledge. To resolve the issue (1), taking two well-accepted propositions--deep neural networks learn meaningful patterns before fitting noise and minimum entropy regularisation principle--we propose a novel end-to-end method named ProSelfLC, which is designed according to learning time and entropy. Specifically, given a data point, we progressively increase trust in its predicted label distribution versus its annotated one if a model has been trained for enough time and the prediction is of low entropy (high confidence). For the issue (2), according to ProSelfLC, we empirically prove that it is better to redefine a meaningful low-entropy status and optimise the learner toward it. This serves as a defence of entropy minimisation. To address the issue (3), we decrease the entropy of self knowledge using a low temperature before exploiting it to correct labels, so that the revised labels redefine a low-entropy target state. We demonstrate the effectiveness of ProSelfLC through extensive experiments in both clean and noisy settings, and on both image and protein datasets. Furthermore, our source code is available at https://github.com/XinshaoAmosWang/ProSelfLC-AT.
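    The two ingredients named above, sharpening the model's own prediction with a low temperature and then blending it with the annotated label, can be sketched as follows. This is a minimal illustration: the `trust` weight is passed in as a plain number, whereas the paper derives it from training time and prediction entropy; that schedule is not given in the abstract and is not reproduced here. All names and values are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature < 1 sharpens the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def corrected_target(one_hot, logits, trust, temperature):
    """Convex combination of the annotated label and the sharpened prediction.

    trust in [0, 1] is a hypothetical stand-in for the paper's
    time- and entropy-dependent trust score.
    """
    prediction = softmax(logits, temperature)
    return [(1 - trust) * y + trust * q for y, q in zip(one_hot, prediction)]


logits = [2.0, 1.0, 0.1]
print(entropy(softmax(logits, 1.0)))   # entropy at temperature 1
print(entropy(softmax(logits, 0.5)))   # lower entropy at temperature 0.5
print(corrected_target([0, 1, 0], logits, trust=0.3, temperature=0.5))
```

    With `trust=0` the target is the original annotation; raising it moves the target toward the model's own low-entropy prediction.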
    Reliable Representations Make A Stronger Defender: Unsupervised Structure Refinement for Robust GNN. (arXiv:2207.00012v1 [cs.LG])
    Benefiting from the message passing mechanism, Graph Neural Networks (GNNs) have been successful on numerous tasks over graph data. However, recent studies have shown that attackers can catastrophically degrade the performance of GNNs by maliciously modifying the graph structure. A straightforward solution to remedy this issue is to model the edge weights by learning a metric function between pairwise representations of two end nodes, which attempts to assign low weights to adversarial edges. The existing methods use either raw features or representations learned by supervised GNNs to model the edge weights. However, both strategies are faced with some immediate problems: raw features cannot represent various properties of nodes (e.g., structure information), and representations learned by supervised GNN may suffer from the poor performance of the classifier on the poisoned graph. We need representations that carry both feature information and as much correct structure information as possible and are insensitive to structural perturbations. To this end, we propose an unsupervised pipeline, named STABLE, to optimize the graph structure. Finally, we input the well-refined graph into a downstream classifier. For this part, we design an advanced GCN that significantly enhances the robustness of vanilla GCN without increasing the time complexity. Extensive experiments on four real-world graph benchmarks demonstrate that STABLE outperforms the state-of-the-art methods and successfully defends against various attacks.
    Variational Autoencoder Assisted Neural Network Likelihood RSRP Prediction Model. (arXiv:2207.00166v1 [cs.NI])
    Measuring customer experience on mobile data is of utmost importance for global mobile operators. The reference signal received power (RSRP) is one of the important indicators for current mobile network management, evaluation and monitoring. Radio data gathered through the minimization of drive test (MDT), a 3GPP standard technique, is commonly used for radio network analysis. Collecting MDT data in different geographical areas is inefficient and constrained by the terrain conditions and user presence, hence is not an adequate technique for dynamic radio environments. In this paper, we study a generative model for RSRP prediction, exploiting MDT data and a digital twin (DT), and propose a data-driven, two-tier neural network (NN) model. In the first tier, environmental information related to user equipment (UE), base stations (BS) and network key performance indicators (KPI) are extracted through a variational autoencoder (VAE). The second tier is designed as a likelihood model. Here, the environmental features and real MDT data features are adopted, formulating an integrated training process. On validation, our proposed model that uses real-world data demonstrates an accuracy improvement of about 20% or more compared with the empirical model and about 10% when compared with a fully connected prediction network.
    WNet: A data-driven dual-domain denoising model for sparse-view computed tomography with a trainable reconstruction layer. (arXiv:2207.00400v1 [eess.IV])
    Deep learning based solutions are being successfully implemented for a wide variety of applications. Most notably, clinical use-cases have gained an increased interest and have been the main driver behind some of the cutting-edge data-driven algorithms proposed in the last years. For applications like sparse-view tomographic reconstructions, where the amount of measurement data is small in order to keep acquisition times short and radiation dose low, reduction of the streaking artifacts has prompted the development of data-driven denoising algorithms with the main goal of obtaining diagnostically viable images with only a subset of full-scan data. We propose WNet, a data-driven dual-domain denoising model which contains a trainable reconstruction layer for sparse-view artifact denoising. Two encoder-decoder networks perform denoising in both sinogram- and reconstruction-domain simultaneously, while a third layer implementing the Filtered Backprojection algorithm is sandwiched between the first two and takes care of the reconstruction operation. We investigate the performance of the network on sparse-view chest CT scans, and we highlight the added benefit of having a trainable reconstruction layer over the more conventional fixed ones. We train and test our network on two clinically relevant datasets and we compare the obtained results with three different types of sparse-view CT denoising and reconstruction algorithms.
    Effect of Homomorphic Encryption on the Performance of Training Federated Learning Generative Adversarial Networks. (arXiv:2207.00263v1 [cs.CR])
    A Generative Adversarial Network (GAN) is a deep-learning generative model in the field of Machine Learning (ML) that involves training two Neural Networks (NN) using a sizable data set. In certain fields, such as medicine, the training data may be hospital patient records that are stored across different hospitals. The classic centralized approach would involve sending the data to a centralized server where the model would be trained. However, that would involve breaching the privacy and confidentiality of the patients and their data, which would be unacceptable. Therefore, Federated Learning (FL), an ML technique that trains ML models in a distributed setting without data ever leaving the host device, would be a better alternative to the centralized option. In this ML technique, only parameters and certain metadata would be communicated. In spite of that, there still exist attacks that can infer user data using the parameters and metadata. A fully privacy-preserving solution involves homomorphically encrypting (HE) the data communicated. This paper will focus on the performance loss of training an FL-GAN with three different types of Homomorphic Encryption: Partial Homomorphic Encryption (PHE), Somewhat Homomorphic Encryption (SHE), and Fully Homomorphic Encryption (FHE). We will also test the performance loss of Multi-Party Computations (MPC), as it has homomorphic properties. The performances will be compared to the performance of training an FL-GAN without encryption as well. Our experiments show that the more complex the encryption method is, the longer it takes, and the extra time taken for HE is quite significant in comparison to the base case of FL.
    DarKnight: An Accelerated Framework for Privacy and Integrity Preserving Deep Learning Using Trusted Hardware. (arXiv:2207.00083v1 [cs.CR])
    Privacy and security-related concerns are growing as machine learning reaches diverse application domains. The data holders want to train or infer with private data while exploiting accelerators, such as GPUs, that are hosted in the cloud. Cloud systems are vulnerable to attackers that compromise the privacy of data and integrity of computations. Tackling such a challenge requires unifying theoretical privacy algorithms with hardware security capabilities. This paper presents DarKnight, a framework for large DNN training while protecting input privacy and computation integrity. DarKnight relies on cooperative execution between trusted execution environments (TEE) and accelerators, where the TEE provides privacy and integrity verification, while accelerators perform the bulk of the linear algebraic computation to optimize the performance. In particular, DarKnight uses a customized data encoding strategy based on matrix masking to create input obfuscation within a TEE. The obfuscated data is then offloaded to GPUs for fast linear algebraic computation. DarKnight's data obfuscation strategy provides provable data privacy and computation integrity in the cloud servers. While prior works tackle inference privacy and cannot be utilized for training, DarKnight's encoding scheme is designed to support both training and inference.
    Cactus Mechanisms: Optimal Differential Privacy Mechanisms in the Large-Composition Regime. (arXiv:2207.00420v1 [cs.CR])
    Most differential privacy mechanisms are applied (i.e., composed) numerous times on sensitive data. We study the design of optimal differential privacy mechanisms in the limit of a large number of compositions. As a consequence of the law of large numbers, in this regime the best privacy mechanism is the one that minimizes the Kullback-Leibler divergence between the conditional output distributions of the mechanism given two different inputs. We formulate an optimization problem to minimize this divergence subject to a cost constraint on the noise. We first prove that additive mechanisms are optimal. Since the optimization problem is infinite dimensional, it cannot be solved directly; nevertheless, we quantize the problem to derive near-optimal additive mechanisms that we call "cactus mechanisms" due to their shape. We show that our quantization approach can be arbitrarily close to an optimal mechanism. Surprisingly, for quadratic cost, the Gaussian mechanism is strictly sub-optimal compared to this cactus mechanism. Finally, we provide numerical results which indicate that cactus mechanism outperforms the Gaussian mechanism for a finite number of compositions.
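    As a point of reference for the objective described above, the Kullback-Leibler divergence between the outputs of the Gaussian mechanism on two inputs has the well-known closed form s^2 / (2 sigma^2), where s is the shift between the two inputs and sigma^2 is the noise variance (the quadratic cost). The sketch below checks that closed form by numerical integration. It illustrates only the Gaussian baseline; the cactus mechanism itself is not specified in the abstract and is not implemented here. The parameter values are arbitrary.

```python
import math


def gaussian_kl_closed_form(shift, sigma):
    """KL( N(0, sigma^2) || N(shift, sigma^2) ) = shift^2 / (2 sigma^2)."""
    return shift ** 2 / (2 * sigma ** 2)


def gaussian_kl_numeric(shift, sigma, half_width=12.0, step=1e-3):
    """Midpoint-rule estimate of the same divergence.

    Integrates p(x) * log(p(x) / q(x)) over [-half_width*sigma, half_width*sigma],
    with p = N(0, sigma^2) and q = N(shift, sigma^2).
    """
    lower = -half_width * sigma
    n_steps = int(2 * half_width * sigma / step)
    total = 0.0
    for i in range(n_steps):
        x = lower + (i + 0.5) * step
        density = math.exp(-x * x / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        # log p(x) - log q(x); the normalizing constants cancel.
        log_ratio = ((x - shift) ** 2 - x ** 2) / (2 * sigma ** 2)
        total += density * log_ratio * step
    return total


print(gaussian_kl_closed_form(1.0, 1.0))
print(gaussian_kl_numeric(1.0, 1.0))
```

    Any additive noise distribution with the same variance and a smaller divergence would beat this baseline in the large-composition regime, which is the comparison the abstract reports.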
    A Rare Topic Discovery Model for Short Texts Based on Co-occurrence word Network. (arXiv:2207.00432v1 [cs.IR])
    We provide a simple and general solution for the discovery of scarce topics in unbalanced short-text datasets, namely, a word co-occurrence network-based model CWIBTD, which can simultaneously address the sparsity and unbalance of short-text topics and attenuate the effect of occasional pairwise occurrences of words, allowing the model to focus more on the discovery of scarce topics. Unlike previous approaches, CWIBTD uses co-occurrence word networks to model the topic distribution of each word, which improves the semantic density of the data space and ensures its sensitivity in identifying rare topics by improving the way node activity is calculated and normalizing scarce topics and large topics to some extent. In addition, using the same Gibbs sampling as LDA makes CWIBTD easy to be extended to various application scenarios. Extensive experimental validation in the unbalanced short text dataset confirms the superiority of CWIBTD over the baseline approach in discovering rare topics. Our model can be used for early and accurate discovery of emerging topics or unexpected events on social platforms.
    Learning Subject-Invariant Representations from Speech-Evoked EEG Using Variational Autoencoders. (arXiv:2207.00323v1 [eess.AS])
    The electroencephalogram (EEG) is a powerful method to understand how the brain processes speech. Linear models have recently been replaced for this purpose with deep neural networks and yield promising results. In related EEG classification fields, it is shown that explicitly modeling subject-invariant features improves generalization of models across subjects and benefits classification accuracy. In this work, we adapt factorized hierarchical variational autoencoders to exploit parallel EEG recordings of the same stimuli. We model EEG into two disentangled latent spaces. Subject accuracy reaches 98.96% and 1.60% on respectively the subject and content latent space, whereas binary content classification experiments reach an accuracy of 51.51% and 62.91% on respectively the subject and content latent space.
    Can we learn from developer mistakes? Learning to localize and repair real bugs from real bug fixes. (arXiv:2207.00301v1 [cs.SE])
    Real bug fixes found in open source repositories seem to be the perfect source for learning to localize and repair real bugs. However, the absence of large scale bug fix collections has made it difficult to effectively exploit real bug fixes in the training of larger neural models in the past. In contrast, artificial bugs -- produced by mutating existing source code -- can be easily obtained at a sufficient scale and are therefore often preferred in the training of existing approaches. Still, localization and repair models that are trained on artificial bugs usually underperform when faced with real bugs. This raises the question whether bug localization and repair models trained on real bug fixes are more effective in localizing and repairing real bugs. We address this question by introducing RealiT, a pre-train-and-fine-tune approach for effectively learning to localize and repair real bugs from real bug fixes. RealiT is first pre-trained on a large number of artificial bugs produced by traditional mutation operators and then fine-tuned on a smaller set of real bug fixes. Fine-tuning does not require any modifications of the learning algorithm and hence can be easily adopted in various training scenarios for bug localization or repair (even when real training data is scarce). In addition, we found that training on real bug fixes with RealiT is empirically powerful by nearly doubling the localization performance of an existing model on real bugs while maintaining or even improving the repair performance.
    Ranking in Contextual Multi-Armed Bandits. (arXiv:2207.00109v1 [stat.ML])
    We study a ranking problem in the contextual multi-armed bandit setting. A learning agent selects an ordered list of items at each time step and observes stochastic outcomes for each position. In online recommendation systems, showing an ordered list of the most attractive items would not be the best choice since both position and item dependencies result in a complicated reward function. A very naive example is the lack of diversity when all the most attractive items are from the same category. We model position and item dependencies in the ordered list and design UCB and Thompson Sampling type algorithms for this problem. We prove that the regret bound over $T$ rounds and $L$ positions is $\Tilde{O}(L\sqrt{d T})$, which has the same order as the previous works with respect to $T$ and only increases linearly with $L$. Our work generalizes existing studies in several directions, including position dependencies where position discount is a particular case, and proposes a more general contextual bandit model.
    Advances in Prediction of Readmission Rates Using Long Term Short Term Memory Networks on Healthcare Insurance Data. (arXiv:2207.00066v1 [cs.LG])
    30-day hospital readmission is a long standing medical problem that affects patients' morbidity and mortality and costs billions of dollars annually. Recently, machine learning models have been created to predict risk of inpatient readmission for patients with specific diseases; however, no model exists to predict this risk across all patients. We developed a bi-directional Long Short Term Memory (LSTM) Network that is able to use readily available insurance data (inpatient visits, outpatient visits, and drug prescriptions) to predict 30-day readmission for any admitted patient, regardless of reason. The top-performing model achieved an ROC AUC of 0.763 (0.011) when using historical, inpatient, and post-discharge data. The LSTM model significantly outperformed a baseline random forest classifier, indicating that understanding the sequence of events is important for model prediction. Incorporation of 30 days of historical data also significantly improved model performance compared to inpatient data alone, indicating that a patient's clinical history prior to admission, including outpatient visits and pharmacy data, is a strong contributor to readmission. Our results demonstrate that a machine learning model is able to predict risk of inpatient readmission with reasonable accuracy for all patients using structured insurance billing data. Because billing data or equivalent surrogates can be extracted from sites, such a model could be deployed to identify patients at risk for readmission before they are discharged, or to assign more robust follow up (closer follow up, home health, mailed medications) to at-risk patients after discharge.
    Variational Inference for Additive Main and Multiplicative Interaction Effects Models. (arXiv:2207.00011v1 [stat.ML])
    In plant breeding the presence of a genotype by environment (GxE) interaction has a strong impact on cultivation decision making and the introduction of new crop cultivars. The combination of linear and bilinear terms has been shown to be very useful in modelling this type of data. A widely-used approach to identify GxE is the Additive Main Effects and Multiplicative Interaction Effects (AMMI) model. However, as data frequently can be high-dimensional, Markov chain Monte Carlo (MCMC) approaches can be computationally infeasible. In this article, we consider a variational inference approach for such a model. We derive variational approximations for estimating the parameters and we compare the approximations to MCMC using both simulated and real data. The new inferential framework we propose is on average two times faster whilst maintaining the same predictive performance as MCMC.
    MultiViz: An Analysis Benchmark for Visualizing and Understanding Multimodal Models. (arXiv:2207.00056v1 [cs.LG])
    The promise of multimodal models for real-world applications has inspired research in visualizing and understanding their internal mechanics with the end goal of empowering stakeholders to visualize model behavior, perform model debugging, and promote trust in machine learning models. However, modern multimodal models are typically black-box neural networks, which makes it challenging to understand their internal mechanics. How can we visualize the internal modeling of multimodal interactions in these models? Our paper aims to fill this gap by proposing MultiViz, a method for analyzing the behavior of multimodal models by scaffolding the problem of interpretability into 4 stages: (1) unimodal importance: how each modality contributes towards downstream modeling and prediction, (2) cross-modal interactions: how different modalities relate with each other, (3) multimodal representations: how unimodal and cross-modal interactions are represented in decision-level features, and (4) multimodal prediction: how decision-level features are composed to make a prediction. MultiViz is designed to operate on diverse modalities, models, tasks, and research areas. Through experiments on 8 trained models across 6 real-world tasks, we show that the complementary stages in MultiViz together enable users to (1) simulate model predictions, (2) assign interpretable concepts to features, (3) perform error analysis on model misclassifications, and (4) use insights from error analysis to debug models. MultiViz is publicly available, will be regularly updated with new interpretation tools and metrics, and welcomes inputs from the community.
    Explainable Empirical Risk Minimization. (arXiv:2009.01492v3 [cs.LG] UPDATED)
    The successful application of machine learning (ML) methods becomes increasingly dependent on their interpretability or explainability. Designing explainable ML systems is instrumental to ensuring transparency of automated decision-making that targets humans. The explainability of ML methods is also an essential ingredient for trustworthy artificial intelligence. A key challenge in ensuring explainability is its dependence on the specific human user ("explainee"). The users of machine learning methods might have vastly different background knowledge about machine learning principles. One user might have a university degree in machine learning or related fields, while another user might have never received formal training in high-school mathematics. This paper applies information-theoretic concepts to develop a novel measure for the subjective explainability of the predictions delivered by a ML method. We construct this measure via the conditional entropy of predictions, given a user feedback. The user feedback might be obtained from user surveys or biophysical measurements. Our main contribution is the explainable empirical risk minimization (EERM) principle of learning a hypothesis that optimally balances between the subjective explainability and risk. The EERM principle is flexible and can be combined with arbitrary machine learning models. We present several practical implementations of EERM for linear models and decision trees. Numerical experiments demonstrate the application of EERM to detecting the use of inappropriate language on social media.
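    The explainability measure described above is built on the conditional entropy of predictions given user feedback. A minimal sketch of an empirical estimate for discrete values is shown below; the function name and the plug-in estimator are assumptions for illustration, not the paper's implementation.

```python
import math
from collections import Counter


def conditional_entropy(predictions, feedback):
    """Plug-in estimate of H(prediction | feedback) in bits.

    Both arguments are equal-length sequences of discrete (hashable) values.
    A lower value means the user feedback tells us more about the prediction.
    """
    if len(predictions) != len(feedback):
        raise ValueError("predictions and feedback must have the same length")
    n = len(predictions)
    joint = Counter(zip(feedback, predictions))  # counts of (u, y) pairs
    marginal = Counter(feedback)                 # counts of u alone
    entropy = 0.0
    for (u, _y), count in joint.items():
        # p(u, y) * log2 p(y | u), accumulated with a minus sign
        entropy -= (count / n) * math.log2(count / marginal[u])
    return entropy
```

    When the feedback fully determines the prediction the estimate is zero; when the feedback carries no information about a balanced binary prediction it is one bit.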
    Discriminator-Guided Model-Based Offline Imitation Learning. (arXiv:2207.00244v1 [cs.LG])
    Offline imitation learning (IL) is a powerful method to solve decision-making problems from expert demonstrations without reward labels. Existing offline IL methods suffer from severe performance degeneration under limited expert data due to covariate shift. Including a learned dynamics model can potentially improve the state-action space coverage of expert data, however, it also faces challenging issues like model approximation/generalization errors and suboptimality of rollout data. In this paper, we propose the Discriminator-guided Model-based offline Imitation Learning (DMIL) framework, which introduces a discriminator to simultaneously distinguish the dynamics correctness and suboptimality of model rollout data against real expert demonstrations. DMIL adopts a novel cooperative-yet-adversarial learning strategy, which uses the discriminator to guide and couple the learning process of the policy and dynamics model, resulting in improved model performance and robustness. Our framework can also be extended to the case when demonstrations contain a large proportion of suboptimal data. Experimental results show that DMIL and its extension achieve superior performance and robustness compared to state-of-the-art offline IL methods under small datasets.
    Transferable Graph Backdoor Attack. (arXiv:2207.00425v1 [cs.CR])
    Graph Neural Networks (GNNs) have achieved tremendous success in many graph mining tasks, benefitting from the message passing strategy that fuses the local structure and node features for much better graph representation learning. Despite their excellent performance, GNNs, like other types of deep neural networks, have unsatisfactory robustness. Many works have shown that GNNs are vulnerable to unnoticeable perturbations on both graph structure and node features. Many adversarial attacks have been proposed to disclose the fragility of GNNs under different perturbation strategies to create adversarial examples. However, less work has been done to show the vulnerability of GNNs under backdoor attack. To fill this gap, in this paper, we present GHAT, transferable GrapH bAckdoor aTtack. The core principle of GHAT is to poison the training dataset with perturbation triggers that can lead to an effective and transferable backdoor attack. The perturbation trigger for a graph is generated by performing the perturbation actions on the graph structure via a gradient-based score matrix. Compared with prior works, GHAT is different in several ways: it exploits a surrogate GCN model to generate perturbation triggers for black-box backdoor attacks; it generates sample-specific perturbation triggers which do not have a fixed pattern; and the attack of GHAT can be transferred to different GNN models when they are trained with the poisoned training dataset forged by GHAT. Through extensive evaluation on four real-world datasets, we demonstrate that GHAT shows much better attack effectiveness in regard to transferable backdoor attack on GNNs.
    Threat Assessment in Machine Learning based Systems. (arXiv:2207.00091v1 [cs.CR])
    Machine learning is a field of artificial intelligence (AI) that is becoming essential for several critical systems, making it a good target for threat actors. Threat actors exploit different Tactics, Techniques, and Procedures (TTPs) against the confidentiality, integrity, and availability of Machine Learning (ML) systems. During the ML cycle, they exploit adversarial TTPs to poison data and fool ML-based systems. In recent years, multiple security practices have been proposed for traditional systems but they are not enough to cope with the nature of ML-based systems. In this paper, we conduct an empirical study of threats reported against ML-based systems with the aim to understand and characterize the nature of ML threats and identify common mitigation strategies. The study is based on 89 real-world ML attack scenarios from the MITRE's ATLAS database, the AI Incident Database, and the literature; 854 ML repositories from the GitHub search and the Python Packaging Advisory database, selected based on their reputation. Attacks from the AI Incident Database and the literature are used to identify vulnerabilities and new types of threats that were not documented in ATLAS. Results show that convolutional neural networks were one of the most targeted models among the attack scenarios. ML repositories with the largest vulnerability prominence include TensorFlow, OpenCV, and Notebook. In this paper, we also report the most frequent vulnerabilities in the studied ML repositories, the most targeted ML phases and models, the most used TTPs in ML phases and attack scenarios. This information is particularly important for red/blue teams to better conduct attacks/defenses, for practitioners to prevent threats during ML development, and for researchers to develop efficient defense mechanisms.
    Discrimination in machine learning algorithms. (arXiv:2207.00108v1 [stat.ML])
    Machine learning algorithms are routinely used for business decisions that may directly affect individuals, for example, because a credit scoring algorithm refuses them a loan. It is then relevant from an ethical (and legal) point of view to ensure that these algorithms do not discriminate based on sensitive attributes (like sex or race), which may occur unwittingly and unknowingly by the operator and the management. Statistical tools and methods are then required to detect and eliminate such potential biases.
    Rethinking Optimization with Differentiable Simulation from a Global Perspective. (arXiv:2207.00167v1 [stat.ML])
    Differentiable simulation is a promising toolkit for fast gradient-based policy optimization and system identification. However, existing approaches to differentiable simulation have largely tackled scenarios where obtaining smooth gradients has been relatively easy, such as systems with mostly smooth dynamics. In this work, we study the challenges that differentiable simulation presents when it is not feasible to expect that a single descent reaches a global optimum, which is often a problem in contact-rich scenarios. We analyze the optimization landscapes of diverse scenarios that contain both rigid bodies and deformable objects. In dynamic environments with highly deformable objects and fluids, differentiable simulators produce rugged landscapes with nonetheless useful gradients in some parts of the space. We propose a method that combines Bayesian optimization with semi-local 'leaps' to obtain a global search method that can use gradients effectively, while also maintaining robust performance in regions with noisy gradients. We show that our approach outperforms several gradient-based and gradient-free baselines on an extensive set of experiments in simulation, and also validate the method using experiments with a real robot and deformables. Videos and supplementary materials are available at https://tinyurl.com/globdiff
    Image features of a splashing drop on a solid surface extracted using a feedforward neural network. (arXiv:2201.09541v1 [physics.flu-dyn] CROSS LISTED)
    This article reports a nonintuitive characteristic of a splashing drop on a solid surface discovered through extracting image features using a feedforward neural network (FNN). Ethanol of area-equivalent radius about 1.29 mm was dropped from impact heights ranging from 4 cm to 60 cm (splashing threshold 20 cm) and impacted on a hydrophilic surface. The images captured when half of the drop impacted the surface were labeled according to their outcome, splashing or nonsplashing, and were used to train an FNN. A classification accuracy higher than 96% was achieved. To extract the image features identified by the FNN for classification, the weight matrix of the trained FNN for identifying splashing drops was visualized. Remarkably, the visualization showed that the trained FNN identified the contour height of the main body of the impacting drop as an important characteristic differentiating between splashing and nonsplashing drops, which has not been reported in previous studies. This feature was found throughout the impact, even when one and three-quarters of the drop impacted the surface. To confirm the importance of this image feature, the FNN was retrained to classify using only the main body without checking for the presence of ejected secondary droplets. The accuracy was still higher than 82%, confirming that the contour height is an important feature distinguishing splashing from nonsplashing drops. Several aspects of drop impact are analyzed and discussed with the aim of identifying the possible mechanism underlying the difference in contour height between splashing and nonsplashing drops.
    Robust Bayesian Learning for Reliable Wireless AI: Framework and Applications. (arXiv:2207.00300v1 [cs.LG])
    This work takes a critical look at the application of conventional machine learning methods to wireless communication problems through the lens of reliability and robustness. Deep learning techniques adopt a frequentist framework, and are known to provide poorly calibrated decisions that do not reproduce the true uncertainty caused by limitations in the size of the training data. Bayesian learning, while in principle capable of addressing this shortcoming, is in practice impaired by model misspecification and by the presence of outliers. Both problems are pervasive in wireless communication settings, in which the capacity of machine learning models is subject to resource constraints and training data is affected by noise and interference. In this context, we explore the application of the framework of robust Bayesian learning. After a tutorial-style introduction to robust Bayesian learning, we showcase the merits of robust Bayesian learning on several important wireless communication problems in terms of accuracy, calibration, and robustness to outliers and misspecification.
    Adversarial Robustness is at Odds with Lazy Training. (arXiv:2207.00411v1 [cs.CR])
    Recent works show that random neural networks are vulnerable against adversarial attacks [Daniely and Schacham, 2020] and that such attacks can be easily found using a single step of gradient descent [Bubeck et al., 2021]. In this work, we take it one step further and show that a single gradient step can find adversarial examples for networks trained in the so-called lazy regime. This regime is interesting because even though the neural network weights remain close to the initialization, there exist networks with small generalization error, which can be found efficiently using first-order methods. Our work challenges the model of the lazy regime, the dominant regime in which neural networks are provably efficiently learnable. We show that the networks trained in this regime, even though they enjoy good theoretical computational guarantees, remain vulnerable to adversarial examples. To the best of our knowledge, this is the first work to prove that such well-generalizable neural networks are still vulnerable to adversarial attacks.
    DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. (arXiv:2207.00032v1 [cs.LG])
    The past several years have witnessed the success of transformer-based models, and their scale and application scenarios continue to grow aggressively. The current landscape of transformer models is increasingly diverse: the model size varies drastically with the largest being of hundred-billion parameters; the model characteristics differ due to the sparsity introduced by the Mixture-of-Experts; the target application scenarios can be latency-critical or throughput-oriented; the deployment hardware could be single- or multi-GPU systems with different types of memory and storage, etc. With such increasing diversity and the fast-evolving pace of transformer models, designing a highly performant and efficient inference system is extremely challenging. In this paper, we present DeepSpeed Inference, a comprehensive system solution for transformer model inference to address the above-mentioned challenges. DeepSpeed Inference consists of (1) a multi-GPU inference solution to minimize latency while maximizing the throughput of both dense and sparse transformer models when they fit in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU and NVMe memory in addition to the GPU memory and compute to enable high inference throughput with large models which do not fit in aggregate GPU memory. DeepSpeed Inference reduces latency by up to 7.3x over the state-of-the-art for latency-oriented scenarios and increases throughput by over 1.5x for throughput-oriented scenarios. Moreover, it enables trillion-parameter-scale inference under real-time latency constraints by leveraging hundreds of GPUs, an unprecedented scale for inference. It can run inference on 25x larger models than GPU-only solutions, while delivering a high throughput of 84 TFLOPS (over 50% of A6000 peak).
    Automated Quantum Circuit Design with Nested Monte Carlo Tree Search. (arXiv:2207.00132v1 [quant-ph])
    Quantum algorithms based on variational approaches are one of the most promising methods to construct quantum solutions and have found a myriad of applications in the last few years. Despite their adaptability and simplicity, their scalability and the selection of suitable ansätze remain key challenges. In this work, we report an algorithmic framework based on nested Monte-Carlo Tree Search (MCTS) coupled with the combinatorial multi-armed bandit (CMAB) model for the automated design of quantum circuits. Through numerical experiments, we demonstrate our algorithm applied to various kinds of problems, including the ground energy problem in quantum chemistry, quantum optimisation on a graph, solving systems of linear equations, and finding encoding circuits for quantum error detection codes. Compared to the existing approaches, the results indicate that our circuit design algorithm can explore larger search spaces and optimise quantum circuits for larger systems, showing both versatility and scalability.
    Class Impression for Data-free Incremental Learning. (arXiv:2207.00005v1 [cs.CV])
    Standard deep learning-based classification approaches require collecting all samples from all classes in advance and are trained offline. This paradigm may not be practical in real-world clinical applications, where new classes are incrementally introduced through the addition of new data. Class incremental learning is a strategy allowing learning from such data. However, a major challenge is catastrophic forgetting, i.e., performance degradation on previous classes when adapting a trained model to new data. Prior methodologies to alleviate this challenge save a portion of the training data and therefore require perpetual storage of such data, which may introduce privacy issues. Here, we propose a novel data-free class incremental learning framework that first synthesizes data from the model trained on previous classes to generate a Class Impression. Subsequently, it updates the model by combining the synthesized data with new class data. Furthermore, we incorporate a cosine-normalized cross-entropy loss to mitigate the adverse effects of the imbalance, a margin loss to increase separation among previous classes and new ones, and an intra-domain contrastive loss to generalize the model trained on the synthesized data to real data. We compare our proposed framework with state-of-the-art methods in class incremental learning, where we demonstrate improvement in accuracy for the classification of 11,062 echocardiography cine series of patients.
    LaserMix for Semi-Supervised LiDAR Semantic Segmentation. (arXiv:2207.00026v1 [cs.CV])
    Densely annotating LiDAR point clouds is costly, which restrains the scalability of fully-supervised learning methods. In this work, we study the underexplored semi-supervised learning (SSL) in LiDAR segmentation. Our core idea is to leverage the strong spatial cues of LiDAR point clouds to better exploit unlabeled data. We propose LaserMix to mix laser beams from different LiDAR scans, and then encourage the model to make consistent and confident predictions before and after mixing. Our framework has three appealing properties: 1) Generic: LaserMix is agnostic to LiDAR representations (e.g., range view and voxel), and hence our SSL framework can be universally applied. 2) Statistically grounded: We provide a detailed analysis to theoretically explain the applicability of the proposed framework. 3) Effective: Comprehensive experimental analysis on popular LiDAR segmentation datasets (nuScenes, SemanticKITTI, and ScribbleKITTI) demonstrates our effectiveness and superiority. Notably, we achieve competitive results over fully-supervised counterparts with 2x to 5x fewer labels and improve the supervised-only baseline significantly by 10.8% on average. We hope this concise yet high-performing framework could facilitate future research in semi-supervised LiDAR segmentation. Code will be publicly available.
    Ultra-low latency recurrent neural network inference on FPGAs for physics applications with hls4ml. (arXiv:2207.00559v1 [cs.LG])
    Recurrent neural networks have been shown to be effective architectures for many tasks in high energy physics, and thus have been widely adopted. Their use in low-latency environments has, however, been limited as a result of the difficulties of implementing recurrent architectures on field-programmable gate arrays (FPGAs). In this paper we present an implementation of two types of recurrent neural network layers -- long short-term memory and gated recurrent unit -- within the hls4ml framework. We demonstrate that our implementation is capable of producing effective designs for both small and large models, and can be customized to meet specific design requirements for inference latencies and FPGA resources. We show the performance and synthesized designs for multiple neural networks, many of which are trained specifically for jet identification tasks at the CERN Large Hadron Collider.
    DeepOPF: A Feasibility-Optimized Deep Neural Network Approach for AC Optimal Power Flow Problems. (arXiv:2007.01002v6 [eess.SY] UPDATED)
    High penetration of renewable energy generation introduces significant uncertainty into power systems. It requires grid operators to solve alternating current optimal power flow (AC-OPF) problems more frequently for economical and reliable operation in both transmission and distribution grids. In this paper, we develop a Deep Neural Network (DNN) approach, called DeepOPF, for solving AC-OPF problems in a fraction of the time used by conventional solvers. A key difficulty in applying machine learning techniques to AC-OPF problems lies in ensuring that the obtained solutions respect the equality and inequality physical and operational constraints. Generalizing the 2-stage procedure in [1], [2], DeepOPF first trains a DNN model to predict a set of independent operating variables and then directly computes the remaining dependent ones by solving power flow equations. Such an approach not only preserves the power-flow balance equality constraints but also reduces the number of variables to be predicted by the DNN, cutting down the number of neurons and training data needed. DeepOPF then employs a penalty approach with a zero-order gradient estimation technique in the training process to preserve the remaining inequality constraints. As another contribution, we derive a condition for tuning the size of the DNN according to the desired approximation accuracy, which measures the DNN generalization capability. It provides theoretical justification for using DNNs to solve the AC-OPF problem. Simulation results on IEEE 30/118/300-bus and synthetic 2000-bus test cases show that DeepOPF speeds up the computing time by up to two orders of magnitude as compared to a state-of-the-art solver, at the expense of <0.1% cost difference.
    Training Novices: The Role of Human-AI Collaboration and Knowledge Transfer. (arXiv:2207.00497v1 [cs.HC])
    Across a multitude of work environments, expert knowledge is imperative for humans to conduct tasks with high performance and ensure business success. These humans possess task-specific expert knowledge (TSEK) and hence, represent subject matter experts (SMEs). However, not only demographic changes but also personnel downsizing strategies lead and will continue to lead to departures of SMEs within organizations, which constitutes the challenge of how to retain that expert knowledge and train novices to keep the competitive advantage elicited by that expert knowledge. SMEs training novices is time- and cost-intensive, which intensifies the need for alternatives. Human-AI collaboration (HAIC) poses a way out of this dilemma, facilitating alternatives to preserve expert knowledge and teach it to novices for tasks conducted by SMEs beforehand. In this workshop paper, we (1) propose a framework on how HAIC can be utilized to train novices on particular tasks, (2) illustrate the role of explicit and tacit knowledge in this training process via HAIC, and (3) outline a preliminary experiment design to assess the ability of AI systems in HAIC to act as a trainer to transfer TSEK to novices who do not possess prior TSEK.
    CVLight: Decentralized Learning for Adaptive Traffic Signal Control with Connected Vehicles. (arXiv:2104.10340v3 [cs.LG] UPDATED)
    This paper develops a decentralized reinforcement learning (RL) scheme for multi-intersection adaptive traffic signal control (TSC), called "CVLight", that leverages data collected from connected vehicles (CVs). The state and reward design facilitates coordination among agents and considers travel delays collected by CVs. A novel algorithm, Asymmetric Advantage Actor-critic (Asym-A2C), is proposed where both CV and non-CV information is used to train the critic network, while only CV information is used to execute optimal signal timing. Comprehensive experiments show the superiority of CVLight over state-of-the-art algorithms under a 2-by-2 synthetic road network with various traffic demand patterns and penetration rates. The learned policy is then visualized to further demonstrate the advantage of Asym-A2C. A pre-train technique is applied to improve the scalability of CVLight, which significantly shortens the training time and shows the advantage in performance under a 5-by-5 road network. A case study is performed on a 2-by-2 road network located in State College, Pennsylvania, USA, to further demonstrate the effectiveness of the proposed algorithm under real-world scenarios. Compared to other baseline models, the trained CVLight agent can efficiently control multiple intersections solely based on CV data and achieve the best performance, especially under low CV penetration rates.
    Unified Source-Filter GAN with Harmonic-plus-Noise Source Excitation Generation. (arXiv:2205.06053v2 [cs.SD] UPDATED)
    This paper introduces a unified source-filter network with a harmonic-plus-noise source excitation generation mechanism. In our previous work, we proposed unified Source-Filter GAN (uSFGAN) for developing a high-fidelity neural vocoder with flexible voice controllability using a unified source-filter neural network architecture. However, the capability of uSFGAN to model the aperiodic source excitation signal is insufficient, and there is still a gap in sound quality between the natural and generated speech. To improve the source excitation modeling and generated sound quality, a new source excitation generation network separately generating periodic and aperiodic components is proposed. The advanced adversarial training procedure of HiFiGAN is also adopted to replace that of Parallel WaveGAN used in the original uSFGAN. Both objective and subjective evaluation results show that the modified uSFGAN significantly improves the sound quality of the basic uSFGAN while maintaining the voice controllability.
    Asynchronous Distributed Bayesian Optimization at HPC Scale. (arXiv:2207.00479v1 [cs.LG])
    Bayesian optimization (BO) is a widely used approach for computationally expensive black-box optimization such as simulator calibration and hyperparameter optimization of deep learning methods. In BO, a dynamically updated computationally cheap surrogate model is employed to learn the input-output relationship of the black-box function; this surrogate model is used to explore and exploit the promising regions of the input space. Multipoint BO methods adopt a single manager/multiple workers strategy to achieve high-quality solutions in shorter time. However, the computational overhead in multipoint generation schemes is a major bottleneck in designing BO methods that can scale to thousands of workers. We present an asynchronous-distributed BO (ADBO) method wherein each worker runs a search and asynchronously communicates the input-output values of black-box evaluations from all other workers without the manager. We scale our method up to 4,096 workers and demonstrate improvement in the quality of the solution and faster convergence. We demonstrate the effectiveness of our approach for tuning the hyperparameters of neural networks from the Exascale computing project CANDLE benchmarks.
    Latent Gaussian Model Boosting. (arXiv:2105.08966v5 [cs.LG] UPDATED)
    Latent Gaussian models and boosting are widely used techniques in statistics and machine learning. Tree-boosting shows excellent prediction accuracy on many data sets, but potential drawbacks are that it assumes conditional independence of samples, produces discontinuous predictions for, e.g., spatial data, and it can have difficulty with high-cardinality categorical variables. Latent Gaussian models, such as Gaussian process and grouped random effects models, are flexible prior models which explicitly model dependence among samples and which allow for efficient learning of predictor functions and for making probabilistic predictions. However, existing latent Gaussian models usually assume either a zero or a linear prior mean function which can be an unrealistic assumption. This article introduces a novel approach that combines boosting and latent Gaussian models to remedy the above-mentioned drawbacks and to leverage the advantages of both techniques. We obtain increased prediction accuracy compared to existing approaches in both simulated and real-world data experiments.
    A Neural-embedded Choice Model: TasteNet-MNL Modeling Taste Heterogeneity with Flexibility and Interpretability. (arXiv:2002.00922v2 [econ.EM] UPDATED)
    Discrete choice models (DCMs) require a priori knowledge of the utility functions, especially how tastes vary across individuals. Utility misspecification may lead to biased estimates, inaccurate interpretations and limited predictability. In this paper, we utilize a neural network to learn taste representation. Our formulation consists of two modules: a neural network (TasteNet) that learns taste parameters (e.g., time coefficient) as flexible functions of individual characteristics; and a multinomial logit (MNL) model with utility functions defined with expert knowledge. Taste parameters learned by the neural network are fed into the choice model and link the two modules. Our approach extends the L-MNL model (Sifringer et al., 2020) by allowing the neural network to learn the interactions between individual characteristics and alternative attributes. Moreover, we formalize and strengthen the interpretability condition, requiring realistic estimates of behavior indicators (e.g., value-of-time, elasticity) at the disaggregated level, which is crucial for a model to be suitable for scenario analysis and policy decisions. Through a unique network architecture and parameter transformation, we incorporate prior knowledge and guide the neural network to output realistic behavior indicators at the disaggregated level. We show that TasteNet-MNL reaches the ground-truth model's predictability and recovers the nonlinear taste functions on synthetic data. Its estimated value-of-time and choice elasticities at the individual level are close to the ground truth. On a publicly available Swissmetro dataset, TasteNet-MNL outperforms benchmark MNL and Mixed Logit models in predictability. It learns a broader spectrum of taste variations within the population and suggests a higher average value-of-time.
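    The two-module structure described above (a taste function feeding an MNL choice model) can be sketched roughly as follows. The taste function here is a hypothetical fixed formula standing in for a trained network, and all names, the income input, and the cost coefficient are assumptions for illustration only.

```python
import math


def taste_time_coefficient(income):
    """Hypothetical stand-in for a trained taste network.

    Maps an individual characteristic (income) to a time coefficient.
    The negated softplus keeps the coefficient strictly negative, so
    longer travel time always lowers utility.
    """
    return -math.log1p(math.exp(0.5 * income))


def mnl_probabilities(utilities):
    """Multinomial logit: softmax over alternative utilities."""
    top = max(utilities)  # subtract the max for numerical stability
    exps = [math.exp(u - top) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


def choice_probabilities(income, alternatives, beta_cost=-1.0):
    """Combine the individual-specific taste parameter with a linear utility."""
    beta_time = taste_time_coefficient(income)
    utilities = [beta_time * a["time"] + beta_cost * a["cost"] for a in alternatives]
    return mnl_probabilities(utilities)


# Alternative 0 is faster but more expensive than alternative 1.
alternatives = [{"time": 1.0, "cost": 2.0}, {"time": 2.0, "cost": 1.0}]
low_income = choice_probabilities(0.0, alternatives)
high_income = choice_probabilities(4.0, alternatives)
```

    In this toy setup the higher-income individual has a more negative time coefficient and therefore a higher probability of choosing the faster alternative; the implied value-of-time is the ratio of the time coefficient to the cost coefficient.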
    The Fragility of Noise Estimation in Kalman Filter: Optimization Can Handle Model-Misspecification. (arXiv:2104.02372v4 [cs.LG] UPDATED)
    The Kalman Filter (KF) parameters are traditionally determined by noise estimation, since under the KF assumptions, the state prediction errors are minimized when the parameters correspond to the noise covariance. However, noise estimation remains the gold standard regardless of the assumptions - even when it is not equivalent to error minimization. We demonstrate that even seemingly simple problems may include multiple assumption violations - which are sometimes hard to even notice. We show theoretically and empirically that even a minor violation may largely shift the optimal parameters. We propose a gradient-based method along with the Cholesky parameterization to explicitly optimize the state prediction errors. We show consistent improvement over noise estimation in tens of experiments in 3 different domains. Finally, we demonstrate that optimization makes the KF competitive with an LSTM model - even in nonlinear problems.
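The Cholesky parameterization mentioned above can be illustrated with a short sketch. It assumes the common form Q = L L^T with L lower triangular, so that any unconstrained parameter vector yields a symmetric positive semidefinite covariance and a gradient step can never leave the valid set. The function name `covariance_from_params` is hypothetical and is not taken from the paper.

```python
def covariance_from_params(params, n):
    """Map n*(n+1)/2 unconstrained numbers to a covariance Q = L L^T.

    Hypothetical helper for illustration: L is lower triangular and is
    filled row by row from `params`.
    """
    if len(params) != n * (n + 1) // 2:
        raise ValueError("expected n*(n+1)/2 parameters")
    lower = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1):
            lower[i][j] = params[k]
            k += 1
    # Q[i][j] = sum_m L[i][m] * L[j][m], symmetric PSD by construction.
    return [
        [sum(lower[i][m] * lower[j][m] for m in range(n)) for j in range(n)]
        for i in range(n)
    ]


# Any real-valued parameters, including negative ones, give a valid covariance.
Q = covariance_from_params([1.0, -2.0, 0.5], 2)
```

An optimizer can then update the entries of `params` freely, which is what makes direct minimization of prediction error practical without a projection step.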
    Learning Neuro-Symbolic Relational Transition Models for Bilevel Planning. (arXiv:2105.14074v3 [cs.AI] UPDATED)
    In robotic domains, learning and planning are complicated by continuous state spaces, continuous action spaces, and long task horizons. In this work, we address these challenges with Neuro-Symbolic Relational Transition Models (NSRTs), a novel class of models that are data-efficient to learn, compatible with powerful robotic planning methods, and generalizable over objects. NSRTs have both symbolic and neural components, enabling a bilevel planning scheme where symbolic AI planning in an outer loop guides continuous planning with neural models in an inner loop. Experiments in four robotic planning domains show that NSRTs can be learned after only tens or hundreds of training episodes, and then used for fast planning in new tasks that require up to 60 actions and involve many more objects than were seen during training. Video: https://tinyurl.com/chitnis-nsrts
    Deep Learning and Symbolic Regression for Discovering Parametric Equations. (arXiv:2207.00529v1 [cs.LG])
    Symbolic regression is a machine learning technique that can learn the governing formulas of data and thus has the potential to transform scientific discovery. However, symbolic regression is still limited in the complexity and dimensionality of the systems that it can analyze. Deep learning, on the other hand, has transformed machine learning in its ability to analyze extremely complex and high-dimensional datasets. We propose a neural network architecture to extend symbolic regression to parametric systems where some coefficients may vary but the structure of the underlying governing equation remains constant. We demonstrate our method on various analytic expressions, ODEs, and PDEs with varying coefficients and show that it extrapolates well outside of the training domain. The neural network-based architecture can also integrate with other deep learning architectures so that it can analyze high-dimensional data while being trained end-to-end. To this end we integrate our architecture with convolutional neural networks to analyze 1D images of varying spring systems.
    Language model compression with weighted low-rank factorization. (arXiv:2207.00112v1 [cs.LG])
    Factorizing a large matrix into small matrices is a popular strategy for model compression. Singular value decomposition (SVD) plays a vital role in this compression strategy, approximating a learned matrix with fewer parameters. However, SVD minimizes the squared error toward reconstructing the original matrix without gauging the importance of the parameters, potentially giving a larger reconstruction error for those that affect the task accuracy more. In other words, the optimization objective of SVD is not aligned with the trained model's task accuracy. We analyze this previously unexplored problem, make observations, and address it by introducing Fisher information to weigh the importance of parameters affecting the model prediction. This idea leads to our method: Fisher-Weighted SVD (FWSVD). Although the factorized matrices from our approach do not result in smaller reconstruction errors, we find that our resulting task accuracy is much closer to the original model's performance. We perform analysis with transformer-based language models, showing our weighted SVD largely alleviates the mismatched optimization objectives and can maintain model performance with a higher compression rate. Our method can directly compress a task-specific model while achieving better performance than other compact model strategies requiring expensive model pre-training. Moreover, the evaluation of compressing an already compact model shows our method can further reduce parameters by 9% to 30% with an insignificant impact on task accuracy.
    A Convergent and Dimension-Independent Min-Max Optimization Algorithm. (arXiv:2006.12376v6 [cs.LG] UPDATED)
    We study a variant of a recently introduced min-max optimization framework where the max-player is constrained to update its parameters in a greedy manner until it reaches a first-order stationary point. Our equilibrium definition for this framework depends on a proposal distribution which the min-player uses to choose directions in which to update its parameters. We show that, given a smooth and bounded nonconvex-nonconcave objective function, access to any proposal distribution for the min-player's updates, and stochastic gradient oracle for the max-player, our algorithm converges to the aforementioned approximate local equilibrium in a number of iterations that does not depend on the dimension. The equilibrium point found by our algorithm depends on the proposal distribution, and when applying our algorithm to train GANs we choose the proposal distribution to be a distribution of stochastic gradients. We empirically evaluate our algorithm on challenging nonconvex-nonconcave test-functions and loss functions arising in GAN training. Our algorithm converges on these test functions and, when used to train GANs, trains stably on synthetic and real-world datasets and avoids mode collapse.
    Eccentric Regularization: Minimizing Hyperspherical Energy without explicit projection. (arXiv:2104.11610v2 [cs.LG] UPDATED)
    Several regularization methods have recently been introduced which force the latent activations of an autoencoder or deep neural network to conform to either a Gaussian or hyperspherical distribution, or to minimize the implicit rank of the distribution in latent space. In the present work, we introduce a novel regularizing loss function which simulates a pairwise repulsive force between items and an attractive force of each item toward the origin. We show that minimizing this loss function in isolation achieves a hyperspherical distribution. Moreover, when used as a regularizing term, the scaling factor can be adjusted to allow greater flexibility and tolerance of eccentricity, thus allowing the latent variables to be stratified according to their relative importance, while still promoting diversity. We apply this method of Eccentric Regularization to an autoencoder, and demonstrate its effectiveness in image generation, representation learning and downstream classification tasks.
    Near-Optimal High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise. (arXiv:2106.05958v2 [math.OC] UPDATED)
    Stochastic first-order methods are standard for training large-scale machine learning models. Random behavior may cause a particular run of an algorithm to result in a highly suboptimal objective value, whereas theoretical guarantees are usually proved for the expectation of the objective value. Thus, it is essential to theoretically guarantee that algorithms provide small objective residual with high probability. Existing methods for non-smooth stochastic convex optimization have complexity bounds with the dependence on the confidence level that is either negative-power or logarithmic but under an additional assumption of sub-Gaussian (light-tailed) noise distribution that may not hold in practice. In our paper, we resolve this issue and derive the first high-probability convergence results with logarithmic dependence on the confidence level for non-smooth convex stochastic optimization problems with non-sub-Gaussian (heavy-tailed) noise. To derive our results, we propose novel stepsize rules for two stochastic methods with gradient clipping. Moreover, our analysis works for generalized smooth objectives with Hölder-continuous gradients, and for both methods, we provide an extension for strongly convex problems. Finally, our results imply that the first (accelerated) method we consider also has optimal iteration and oracle complexity in all the regimes, and the second one is optimal in the non-smooth setting.
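Gradient clipping, which both methods above rely on, can be sketched as follows. This is the standard norm-clipping operator, clip(g, λ) = min(1, λ/‖g‖) · g, and not the paper's stepsize rules. The function name `clip_gradient` and the threshold value are illustrative assumptions.

```python
import math


def clip_gradient(grad, threshold):
    """Rescale `grad` so that its Euclidean norm is at most `threshold`.

    Standard norm clipping; the direction of the gradient is preserved.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    scale = threshold / norm
    return [g * scale for g in grad]


# A heavy-tailed sample can produce a huge stochastic gradient;
# clipping bounds its influence on a single update.
clipped = clip_gradient([3.0, 4.0], 1.0)    # norm 5 is scaled down to norm 1
unchanged = clip_gradient([0.3, 0.4], 1.0)  # norm 0.5 is left as is
```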
    Border basis computation with gradient-weighted normalization. (arXiv:2101.00401v4 [cs.SC] UPDATED)
    Normalization of polynomials plays a vital role in the approximate basis computation of vanishing ideals. Coefficient normalization, which normalizes a polynomial with its coefficient norm, is the most common method in computer algebra. This study proposes the gradient-weighted normalization method for the approximate border basis computation of vanishing ideals, inspired by recent developments in machine learning. The data-dependent nature of gradient-weighted normalization leads to better stability against perturbation and consistency in the scaling of input points, which cannot be attained by coefficient normalization. Only a subtle change is needed to introduce gradient normalization in the existing algorithms with coefficient normalization. The analysis of algorithms still works with a small modification, and the order of magnitude of time complexity of algorithms remains unchanged. We also prove that, with coefficient normalization, which does not provide the scaling consistency property, scaling of points (e.g., as a preprocessing) can cause an approximate basis computation to fail. This study is the first to theoretically highlight the crucial effect of scaling in approximate basis computation and presents the utility of data-dependent normalization.
    Privacy-preserving Graph Analytics: Secure Generation and Federated Learning. (arXiv:2207.00048v1 [cs.CR])
    Directly motivated by security-related applications from the Homeland Security Enterprise, we focus on the privacy-preserving analysis of graph data, which provides the crucial capacity to represent rich attributes and relationships. In particular, we discuss two directions, namely privacy-preserving graph generation and federated graph learning, which can jointly enable the collaboration among multiple parties each possessing private graph data. For each direction, we identify both "quick wins" and "hard problems". Towards the end, we demonstrate a user interface that can facilitate model explanation, interpretation, and visualization. We believe that the techniques developed in these directions will significantly enhance the capabilities of the Homeland Security Enterprise to tackle and mitigate the various security risks.
    Generating Counterfactual Hard Negative Samples for Graph Contrastive Learning. (arXiv:2207.00148v1 [cs.LG])
    Graph contrastive learning has emerged as a powerful tool for unsupervised graph representation learning. The key to the success of graph contrastive learning is to acquire high-quality positive and negative samples as contrasting pairs for the purpose of learning underlying structural semantics of the input graph. Recent works usually sample negative samples from the same training batch with the positive samples, or from an external irrelevant graph. However, a significant limitation lies in such strategies, which is the unavoidable problem of sampling false negative samples. In this paper, we propose a novel method to utilize a Counterfactual mechanism to generate artificial hard negative samples for Graph Contrastive learning, namely CGC, which has a different perspective compared to those sampling-based strategies. We utilize a counterfactual mechanism to produce hard negative samples, which ensures that the generated samples are similar to, but have labels that differ from, the positive sample. The proposed method achieves satisfying results on several datasets compared to some traditional unsupervised graph learning methods and some SOTA graph contrastive learning methods. We also conduct some supplementary experiments to give an extensive illustration of the proposed method, including the performances of CGC with different hard negative samples and evaluations for hard negative samples generated with different similarity measurements.
    Robustness of Epinets against Distributional Shifts. (arXiv:2207.00137v1 [cs.LG])
    Recent work introduced the epinet as a new approach to uncertainty modeling in deep learning. An epinet is a small neural network added to traditional neural networks, which, together, can produce predictive distributions. In particular, using an epinet can greatly improve the quality of joint predictions across multiple inputs, a measure of how well a neural network knows what it does not know. In this paper, we examine whether epinets can offer similar advantages under distributional shifts. We find that, across ImageNet-A/O/C, epinets generally improve robustness metrics. Moreover, these improvements are more significant than those afforded by even very large ensembles at orders of magnitude lower computational costs. However, these improvements are relatively small compared to the outstanding issues in distributionally-robust deep learning. Epinets may be a useful tool in the toolbox, but they are far from the complete solution.
    Anisotropic, Sparse and Interpretable Physics-Informed Neural Networks for PDEs. (arXiv:2207.00377v1 [cs.LG])
    There has been a growing interest in the use of Deep Neural Networks (DNNs) to solve Partial Differential Equations (PDEs). Despite the promise that such approaches hold, there are various aspects where they could be improved. Two such shortcomings are (i) their computational inefficiency relative to classical numerical methods, and (ii) the non-interpretability of a trained DNN model. In this work we present ASPINN, an anisotropic extension of our earlier work called SPINN--Sparse, Physics-informed, and Interpretable Neural Networks--to solve PDEs that addresses both these issues. ASPINNs generalize radial basis function networks. We demonstrate using a variety of examples involving elliptic and hyperbolic PDEs that the special architecture we propose is more efficient than generic DNNs, while at the same time being directly interpretable. Further, they improve upon the SPINN models we proposed earlier in that fewer nodes are required to capture the solution using ASPINN than using SPINN, thanks to the anisotropy of the local zones of influence of each node. The interpretability of ASPINN translates to a ready visualization of their weights and biases, thereby yielding more insight into the nature of the trained model. This in turn provides a systematic procedure to improve the architecture based on the quality of the computed solution. ASPINNs thus serve as an effective bridge between classical numerical algorithms and modern DNN based methods to solve PDEs. In the process, we also streamline the training of ASPINNs into a form that is closer to that of supervised learning algorithms.
    An AO-ADMM approach to constraining PARAFAC2 on all modes. (arXiv:2110.01278v2 [cs.LG] UPDATED)
    Analyzing multi-way measurements with variations across one mode of the dataset is a challenge in various fields including data mining, neuroscience and chemometrics. For example, measurements may evolve over time or have unaligned time profiles. The PARAFAC2 model has been successfully used to analyze such data by allowing the underlying factor matrices in one mode (i.e., the evolving mode) to change across slices. The traditional approach to fit a PARAFAC2 model is to use an alternating least squares-based algorithm, which handles the constant cross-product constraint of the PARAFAC2 model by implicitly estimating the evolving factor matrices. This approach makes imposing regularization on these factor matrices challenging. There is currently no algorithm to flexibly impose such regularization with general penalty functions and hard constraints. In order to address this challenge and to avoid the implicit estimation, in this paper, we propose an algorithm for fitting PARAFAC2 based on alternating optimization with the alternating direction method of multipliers (AO-ADMM). With numerical experiments on simulated data, we show that the proposed PARAFAC2 AO-ADMM approach allows for flexible constraints, recovers the underlying patterns accurately, and is computationally efficient compared to the state-of-the-art. We also apply our model to two real-world datasets from neuroscience and chemometrics, and show that constraining the evolving mode improves the interpretability of the extracted patterns.
    FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning. (arXiv:2207.00555v1 [eess.AS])
    Large-scale speech self-supervised learning (SSL) has emerged as a main field of speech processing; however, the computational cost arising from its vast size creates a high entry barrier for academia. In addition, existing distillation techniques of speech SSL models compress the model by reducing layers, which induces performance degradation in linguistic pattern recognition tasks such as phoneme recognition (PR). In this paper, we propose FitHuBERT, which is thinner in dimension throughout almost all model components and deeper in layers compared to prior speech SSL distillation works. Moreover, we employ a time-reduction layer to speed up inference time and propose a method of hint-based distillation for less performance degradation. Our method reduces the model to 23.8% in size and 35.9% in inference time compared to HuBERT. Also, we achieve 12.1% word error rate and 13.3% phoneme error rate on the SUPERB benchmark, which is superior to prior work.
    ReLU Deep Neural Networks from the Hierarchical Basis Perspective. (arXiv:2105.04156v2 [math.NA] UPDATED)
    We study ReLU deep neural networks (DNNs) by investigating their connections with the hierarchical basis method in finite element methods. First, we show that the approximation schemes of ReLU DNNs for $x^2$ and $xy$ are composition versions of the hierarchical basis approximation for these two functions. Based on this fact, we obtain a geometric interpretation and systematic proof for the approximation result of ReLU DNNs for polynomials, which plays an important role in a series of recent exponential approximation results of ReLU DNNs. Through our investigation of connections between ReLU DNNs and the hierarchical basis approximation for $x^2$ and $xy$, we show that ReLU DNNs with this special structure can be applied only to approximate quadratic functions. Furthermore, we obtain a concise representation to explicitly reproduce any linear finite element function on a two-dimensional uniform mesh by using ReLU DNNs with only two hidden layers.
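The composition-based approximation of $x^2$ can be sketched as follows. This assumes the scheme in question is the well-known construction in which a ReLU "hat" function is composed with itself and the scaled compositions are subtracted from $x$; whether it matches the paper's exact formulation is an assumption. The result after `depth` compositions is the piecewise-linear interpolant of $x^2$ on a uniform grid with spacing $2^{-\text{depth}}$, which is the hierarchical-basis view. The function names are illustrative.

```python
def relu(x):
    return max(0.0, x)


def hat(x):
    """Hat function on [0, 1] built from three ReLU units.

    Equals 2x on [0, 0.5] and 2 - 2x on [0.5, 1].
    """
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0)


def approx_square(x, depth):
    """Approximate x**2 on [0, 1] as x - sum_{s=1..depth} hat^(s)(x) / 4**s.

    hat^(s) denotes the s-fold composition of `hat`, so each extra term
    costs one more layer rather than doubling the width.
    """
    result = x
    g = x
    for s in range(1, depth + 1):
        g = hat(g)
        result -= g / 4 ** s
    return result


# The interpolant is exact at grid points and its error shrinks by 4x per layer.
exact_at_node = approx_square(0.5, 1)  # 0.25
worst_error = max(abs(approx_square(i / 1000, 3) - (i / 1000) ** 2) for i in range(1001))
```

The maximum error of a piecewise-linear interpolant of $x^2$ with spacing $h$ is $h^2/4$, so with `depth = 3` the error stays at or below $2^{-8}$.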
    Better Methods and Theory for Federated Learning: Compression, Client Selection and Heterogeneity. (arXiv:2207.00392v1 [cs.LG])
    Federated learning (FL) is an emerging machine learning paradigm involving multiple clients, e.g., mobile phone devices, with an incentive to collaborate in solving a machine learning problem coordinated by a central server. FL was proposed in 2016 by Konečný et al. and McMahan et al. as a viable privacy-preserving alternative to traditional centralized machine learning since, by construction, the training data points are decentralized and never transferred by the clients to a central server. Therefore, to a certain degree, FL mitigates the privacy risks associated with centralized data collection. Unfortunately, optimization for FL faces several specific issues that centralized optimization usually does not need to handle. In this thesis, we identify several of these challenges and propose new methods and algorithms to address them, with the ultimate goal of enabling practical FL solutions supported with mathematically rigorous guarantees.
    Simulating financial time series using attention. (arXiv:2207.00493v1 [q-fin.ST])
    Financial time series simulation is a central topic since it extends the limited real data for training and evaluation of trading strategies. It is also challenging because of the complex statistical properties of the real financial data. We introduce two generative adversarial networks (GANs), which utilize the convolutional networks with attention and the transformers, for financial time series simulation. The GANs learn the statistical properties in a data-driven manner and the attention mechanism helps to replicate the long-range dependencies. The proposed GANs are tested on the S&P 500 index and option data, examined by scores based on the stylized facts and are compared with the pure convolutional GAN, i.e. QuantGAN. The attention-based GANs not only reproduce the stylized facts, but also smooth the autocorrelation of returns.
    Off-the-grid learning of sparse mixtures from a continuous dictionary. (arXiv:2207.00171v1 [stat.ML])
    We consider a general non-linear model where the signal is a finite mixture of an unknown, possibly increasing, number of features issued from a continuous dictionary parameterized by a real nonlinear parameter. The signal is observed with Gaussian (possibly correlated) noise in either a continuous or a discrete setup. We propose an off-the-grid optimization method, that is, a method which does not use any discretization scheme on the parameter space, to estimate both the non-linear parameters of the features and the linear parameters of the mixture. We use recent results on the geometry of off-the-grid methods to give minimal separation on the true underlying non-linear parameters such that interpolating certificate functions can be constructed. Also using tail bounds for suprema of Gaussian processes, we bound the prediction error with high probability. Assuming that the certificate functions can be constructed, our prediction error bound is, up to log factors, similar to the rates attained by the Lasso predictor in the linear regression model. We also establish convergence rates that quantify with high probability the quality of estimation for both the linear and the non-linear parameters.
    Modular Lifelong Reinforcement Learning via Neural Composition. (arXiv:2207.00429v1 [cs.LG])
    Humans commonly solve complex problems by decomposing them into easier subproblems and then combining the subproblem solutions. This type of compositional reasoning permits reuse of the subproblem solutions when tackling future tasks that share part of the underlying compositional structure. In a continual or lifelong reinforcement learning (RL) setting, this ability to decompose knowledge into reusable components would enable agents to quickly learn new RL tasks by leveraging accumulated compositional structures. We explore a particular form of composition based on neural modules and present a set of RL problems that intuitively admit compositional solutions. Empirically, we demonstrate that neural composition indeed captures the underlying structure of this space of problems. We further propose a compositional lifelong RL method that leverages accumulated neural components to accelerate the learning of future tasks while retaining performance on previous tasks via off-line RL over replayed experiences.
    A Neural Network Based Novel Test Selector. (arXiv:2207.00445v1 [cs.SE])
    Machine learning (ML) has been used to accelerate the progress of functional coverage in simulation-based verification. A supervised ML algorithm, as a prevalent option in the previous work, is used to bias the test generation or filter the generated tests. However, for missing coverage events, these algorithms lack the positive examples to learn from in the training phase. Therefore, the tests generated or filtered by the algorithms cannot effectively fill the coverage holes. This is more severe when verifying large-scale design because the coverage space is larger and the functionalities are more complex. This paper presents a configurable framework of test selection based on neural networks (NN), which can achieve a similar coverage gain as random simulation with far less simulation effort under three configurations of the framework. Moreover, the performance of the framework is not limited by the number of coverage events being hit. A commercial signal processing unit is used in the experiment to demonstrate the effectiveness of the framework. Compared to the random simulation, NNBNTS can reduce up to 53.74% of simulation time to reach 99% coverage level.
    A geometric framework for outlier detection in high-dimensional data. (arXiv:2207.00367v1 [stat.ML])
    Outlier or anomaly detection is an important task in data analysis. We discuss the problem from a geometrical perspective and provide a framework that exploits the metric structure of a data set. Our approach rests on the manifold assumption, i.e., that the observed, nominally high-dimensional data lie on a much lower dimensional manifold and that this intrinsic structure can be inferred with manifold learning methods. We show that exploiting this structure significantly improves the detection of outlying observations in high-dimensional data. We also suggest a novel, mathematically precise, and widely applicable distinction between distributional and structural outliers based on the geometry and topology of the data manifold that clarifies conceptual ambiguities prevalent throughout the literature. Our experiments focus on functional data as one class of structured high-dimensional data, but the framework we propose is completely general and we include image and graph data applications. Our results show that the outlier structure of high-dimensional and non-tabular data can be detected and visualized using manifold learning methods and quantified using standard outlier scoring methods applied to the manifold embedding vectors.
    Optimizing Training Trajectories in Variational Autoencoders via Latent Bayesian Optimization Approach. (arXiv:2207.00128v1 [cs.LG])
    Unsupervised and semi-supervised ML methods such as variational autoencoders (VAE) have become widely adopted across multiple areas of physics, chemistry, and materials sciences due to their capability in disentangling representations and ability to find latent manifolds for classification and regression of complex experimental data. Like other ML problems, VAEs require hyperparameter tuning, e.g., balancing the Kullback-Leibler (KL) and reconstruction terms. However, the training process and resulting manifold topology and connectivity depend not only on hyperparameters, but also their evolution during training. Because exhaustive search in a high-dimensional hyperparameter space is inefficient for expensive-to-train models, we explore a latent Bayesian optimization (zBO) approach to hyperparameter trajectory optimization for unsupervised and semi-supervised ML and demonstrate it for joint-VAE with rotational invariances. We demonstrate an application of this method for finding joint discrete and continuous rotationally invariant representations for MNIST and experimental data of a plasmonic nanoparticles material system. We discuss the performance of the proposed approach extensively; it can be applied to high-dimensional hyperparameter tuning or trajectory optimization of other ML models.
    Fast computation of rankings from pairwise comparisons. (arXiv:2207.00076v1 [stat.ML])
    We study the ranking of individuals, teams, or objects on the basis of pairwise comparisons using the Bradley-Terry model. Maximum-likelihood estimates of rankings within this model are commonly made using a simple iterative algorithm first introduced by Zermelo almost a century ago. Here we describe an alternative and similarly simple iteration that solves the same problem much faster -- over a hundred times faster in some cases. We demonstrate this algorithm with applications to a range of example data sets and derive some results regarding its convergence.
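For reference, the classical Zermelo iteration that the abstract treats as the baseline can be sketched as below. This is the textbook update p_i ← W_i / Σ_j n_ij / (p_i + p_j), where W_i is the total number of wins of i and n_ij the number of games between i and j. It is not the faster iteration the paper proposes. The function name `zermelo`, the win-matrix layout, and the stopping rule are illustrative assumptions.

```python
def zermelo(wins, max_iters=10000, tol=1e-12):
    """Bradley-Terry strengths via Zermelo's classical iteration.

    wins[i][j] is the number of times i beat j. Strengths are normalized
    to sum to 1. Assumes every player has at least one win and one loss
    within a connected set of comparisons.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(max_iters):
        new = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new.append(total_wins / denom)
        scale = sum(new)
        new = [x / scale for x in new]
        converged = max(abs(a - b) for a, b in zip(new, p)) < tol
        p = new
        if converged:
            break
    return p


# Two players: A beat B three times, B beat A once.
# The maximum-likelihood strengths have ratio 3:1.
strengths = zermelo([[0, 3], [1, 0]])
```

Under the Bradley-Terry model the probability that i beats j is p_i / (p_i + p_j), so the fitted strengths here reproduce the observed 3-in-4 win rate.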
    GaitForeMer: Self-Supervised Pre-Training of Transformers via Human Motion Forecasting for Few-Shot Gait Impairment Severity Estimation. (arXiv:2207.00106v1 [cs.CV])
    Parkinson's disease (PD) is a neurological disorder that has a variety of observable motor-related symptoms such as slow movement, tremor, muscular rigidity, and impaired posture. PD is typically diagnosed by evaluating the severity of motor impairments according to scoring systems such as the Movement Disorder Society Unified Parkinson's Disease Rating Scale (MDS-UPDRS). Automated severity prediction using video recordings of individuals provides a promising route for non-intrusive monitoring of motor impairments. However, the limited size of PD gait data hinders model ability and clinical potential. Because of this clinical data scarcity and inspired by the recent advances in self-supervised large-scale language models like GPT-3, we use human motion forecasting as an effective self-supervised pre-training task for the estimation of motor impairment severity. We introduce GaitForeMer, Gait Forecasting and impairment estimation transforMer, which is first pre-trained on public datasets to forecast gait movements and then applied to clinical data to predict MDS-UPDRS gait impairment severity. Our method outperforms previous approaches that rely solely on clinical data by a large margin, achieving an F1 score of 0.76, precision of 0.79, and recall of 0.75. Using GaitForeMer, we show how public human movement data repositories can assist clinical use cases through learning universal motion representations. The code is available at https://github.com/markendo/GaitForeMer .
    Sustainable Computing -- Without the Hot Air. (arXiv:2207.00081v1 [cs.CY])
    The demand for computing is continuing to grow exponentially. This growth will translate to exponential growth in computing's energy consumption unless improvements in its energy-efficiency can outpace increases in its demand. Yet, after decades of research, further improving energy-efficiency is becoming increasingly challenging, as it is already highly optimized. As a result, at some point, increases in computing demand are likely to outpace increases in its energy-efficiency, potentially by a wide margin. Such exponential growth, if left unchecked, will position computing as a substantial contributor to global carbon emissions. While prominent technology companies have recognized the problem and sought to reduce their carbon emissions, they understandably focus on their successes, which has the potential to inadvertently convey the false impression that this is now, or will soon be, a solved problem. Such false impressions can be counterproductive if they serve to discourage further research in this area, since, as we discuss, eliminating computing's, and more generally society's, carbon emissions is far from a solved problem. To better understand the problem's scope, this paper distills the fundamental trends that determine computing's carbon footprint and their implications for achieving sustainable computing.
    Multi-Objective Coordination Graphs for the Expected Scalarised Returns with Generative Flow Models. (arXiv:2207.00368v1 [cs.AI])
    Many real-world problems contain multiple objectives and agents, where a trade-off exists between objectives. Key to solving such problems is to exploit sparse dependency structures that exist between agents. For example, in wind farm control a trade-off exists between maximising power and minimising stress on the system's components. Dependencies between turbines arise due to the wake effect. We model such sparse dependencies between agents as a multi-objective coordination graph (MO-CoG). In multi-objective reinforcement learning a utility function is typically used to model a user's preferences over objectives, which may be unknown a priori. In such settings a set of optimal policies must be computed. Which policies are optimal depends on which optimality criterion applies. If the utility function of a user is derived from multiple executions of a policy, the scalarised expected returns (SER) must be optimised. If the utility of a user is derived from a single execution of a policy, the expected scalarised returns (ESR) criterion must be optimised. For example, wind farms are subjected to constraints and regulations that must be adhered to at all times, therefore the ESR criterion must be optimised. For MO-CoGs, the state-of-the-art algorithms can only compute a set of optimal policies for the SER criterion, leaving the ESR criterion understudied. To compute a set of optimal policies under the ESR criterion, also known as the ESR set, distributions over the returns must be maintained. Therefore, to compute a set of optimal policies under the ESR criterion for MO-CoGs, we present a novel distributional multi-objective variable elimination (DMOVE) algorithm. We evaluate DMOVE in realistic wind farm simulations. Given the returns in real-world wind farm settings are continuous, we utilise a model known as real-NVP to learn the continuous return distributions to calculate the ESR set.
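    The difference between the SER and ESR criteria described above can be seen in a small worked example. This is only an illustrative sketch: the two-objective returns, their probabilities, and the product-form utility are hypothetical choices, not taken from the paper. SER applies the utility to the expected return vector, while ESR takes the expectation of the utility of each single-execution return.

```python
# Hypothetical nonlinear utility over a two-objective return vector.
def utility(ret):
    first_objective, second_objective = ret
    return first_objective * second_objective


# Hypothetical policy: two equally likely vector-valued returns.
outcomes = [((4.0, 0.0), 0.5), ((0.0, 4.0), 0.5)]

# SER: utility of the expected return, u(E[R]).
expected_return = tuple(sum(p * r[k] for r, p in outcomes) for k in range(2))
ser_value = utility(expected_return)

# ESR: expected utility of the return of a single execution, E[u(R)].
esr_value = sum(p * utility(r) for r, p in outcomes)

print(ser_value, esr_value)
```

    With a nonlinear utility the two criteria disagree: the averaged return looks balanced, yet every single execution yields zero utility. This is why ESR requires keeping the full distribution over returns.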
    Smart Application for Fall Detection Using Wearable ECG & Accelerometer Sensors. (arXiv:2207.00008v1 [cs.HC])
    Timely and reliable detection of falls is a large and rapidly growing field of research due to the medical and financial demand of caring for a constantly growing elderly population. Within the past 2 decades, the availability of high-quality hardware (high-quality sensors and AI microchips) and software (machine learning algorithms) technologies has served as a catalyst for this research by giving developers the capabilities to develop such systems. This study developed multiple application components in order to investigate the development challenges and choices for fall detection systems, and provide materials for future research. The smart application developed using this methodology was validated by the results from fall detection modelling experiments and model mobile deployment. The best performing model overall was the ResNet152 on a standardised and shuffled dataset with a 2s window size, which achieved 92.8% AUC, 7.28% sensitivity, and 98.33% specificity. Given these results it is evident that accelerometer and ECG sensors are beneficial for fall detection, and allow for the discrimination between falls and other activities. This study leaves a significant amount of room for improvement due to weaknesses identified in the resultant dataset. These improvements include using a labelling protocol for the critical phase of a fall, increasing the number of dataset samples, improving the test subject representation, and experimenting with frequency domain preprocessing.
    Multivariate Probabilistic Forecasting of Intraday Electricity Prices using Normalizing Flows. (arXiv:2205.13826v2 [cs.LG] UPDATED)
    Electricity is traded on various markets with different time horizons and regulations. Short-term trading becomes increasingly important due to higher penetration of renewables. In Germany, the intraday electricity price typically fluctuates around the day-ahead price of the EPEX spot markets in a distinct hourly pattern. This work proposes a probabilistic modeling approach that models the intraday price difference to the day-ahead contracts. The model captures the emerging hourly pattern by considering the four 15 min intervals in each day-ahead price interval as a four-dimensional joint distribution. The resulting nontrivial, multivariate price difference distribution is learned using a normalizing flow, i.e., a deep generative model that combines conditional multivariate density estimation and probabilistic regression. The normalizing flow is compared to a selection of historical data, a Gaussian copula, and a Gaussian regression model. Among the different models, the normalizing flow identifies the trends most accurately and has the narrowest prediction intervals. Notably, the normalizing flow is the only approach that identifies rare price peaks. Finally, this work discusses the influence of different external impact factors and finds that, individually, most of these factors have negligible impact. Only the immediate history of the price difference realization and the combination of all input factors lead to notable improvements in the forecasts.
    Improving Speech Enhancement through Fine-Grained Speech Characteristics. (arXiv:2207.00237v1 [cs.SD])
    While deep learning based speech enhancement systems have made rapid progress in improving the quality of speech signals, they can still produce outputs that contain artifacts and can sound unnatural. We propose a novel approach to speech enhancement aimed at improving perceptual quality and naturalness of enhanced signals by optimizing for key characteristics of speech. We first identify key acoustic parameters that have been found to correlate well with voice quality (e.g. jitter, shimmer, and spectral flux) and then propose objective functions which are aimed at reducing the difference between clean speech and enhanced speech with respect to these features. The full set of acoustic features is the extended Geneva Acoustic Parameter Set (eGeMAPS), which includes 25 different attributes associated with perception of speech. Given the non-differentiable nature of these feature computations, we first build differentiable estimators of the eGeMAPS and then use them to fine-tune existing speech enhancement systems. Our approach is generic and can be applied to any existing deep learning based enhancement systems to further improve the enhanced speech signals. Experimental results conducted on the Deep Noise Suppression (DNS) Challenge dataset show that our approach can improve the state-of-the-art deep learning based enhancement systems.
    Scalable MCMC Sampling for Nonsymmetric Determinantal Point Processes. (arXiv:2207.00486v1 [cs.LG])
    A determinantal point process (DPP) is an elegant model that assigns a probability to every subset of a collection of $n$ items. While conventionally a DPP is parameterized by a symmetric kernel matrix, removing this symmetry constraint, resulting in nonsymmetric DPPs (NDPPs), leads to significant improvements in modeling power and predictive performance. Recent work has studied an approximate Markov chain Monte Carlo (MCMC) sampling algorithm for NDPPs restricted to size-$k$ subsets (called $k$-NDPPs). However, the runtime of this approach is quadratic in $n$, making it infeasible for large-scale settings. In this work, we develop a scalable MCMC sampling algorithm for $k$-NDPPs with low-rank kernels, thus enabling runtime that is sublinear in $n$. Our method is based on a state-of-the-art NDPP rejection sampling algorithm, which we enhance with a novel approach for efficiently constructing the proposal distribution. Furthermore, we extend our scalable $k$-NDPP sampling algorithm to NDPPs without size constraints. Our resulting sampling method has polynomial time complexity in the rank of the kernel, while the existing approach has runtime that is exponential in the rank. With both a theoretical analysis and experiments on real-world datasets, we verify that our scalable approximate sampling algorithms are orders of magnitude faster than existing sampling approaches for $k$-NDPPs and NDPPs.
    CRISP: A Probabilistic Model for Individual-Level COVID-19 Infection Risk Estimation Based on Contact Data. (arXiv:2006.04942v2 [cs.SI] UPDATED)
    We present CRISP (COVID-19 Risk Score Prediction), a probabilistic graphical model for COVID-19 infection spread through a population based on the SEIR model where we assume access to (1) mutual contacts between pairs of individuals across time across various channels (e.g., Bluetooth contact traces), as well as (2) test outcomes at given times for infection, exposure and immunity tests. Our micro-level model keeps track of the infection state for each individual at every point in time, ranging from susceptible, exposed, infectious to recovered. We develop both a Monte Carlo EM as well as a message passing algorithm to infer contact-channel specific infection transmission probabilities. Our Monte Carlo algorithm uses Gibbs sampling to draw samples of the latent infection status of each individual over the entire time period of analysis, given the latent infection status of all contacts and test outcome data. Experimental results with simulated data demonstrate our CRISP model can be parametrized by the reproduction factor $R_0$ and exhibits population-level infectiousness and recovery time series similar to those of the classical SEIR model. However, due to the individual contact data, this model allows fine grained control and inference for a wide range of COVID-19 mitigation and suppression policy measures. Moreover, the block-Gibbs sampling algorithm is able to support efficient testing in a test-trace-isolate approach to contain COVID-19 infection spread. To the best of our knowledge, this is the first model with efficient inference for COVID-19 infection spread based on individual-level contact data; most epidemic models are macro-level models that reason over entire populations. The implementation of CRISP is available in Python and C++ at https://github.com/zalandoresearch/CRISP.
    A Convergent and Dimension-Independent Min-Max Optimization Algorithm. (arXiv:2006.12376v6 [cs.LG] UPDATED)
    We study a variant of a recently introduced min-max optimization framework where the max-player is constrained to update its parameters in a greedy manner until it reaches a first-order stationary point. Our equilibrium definition for this framework depends on a proposal distribution which the min-player uses to choose directions in which to update its parameters. We show that, given a smooth and bounded nonconvex-nonconcave objective function, access to any proposal distribution for the min-player's updates, and stochastic gradient oracle for the max-player, our algorithm converges to the aforementioned approximate local equilibrium in a number of iterations that does not depend on the dimension. The equilibrium point found by our algorithm depends on the proposal distribution, and when applying our algorithm to train GANs we choose the proposal distribution to be a distribution of stochastic gradients. We empirically evaluate our algorithm on challenging nonconvex-nonconcave test-functions and loss functions arising in GAN training. Our algorithm converges on these test functions and, when used to train GANs, trains stably on synthetic and real-world datasets and avoids mode collapse.
    KL-UCB-switch: optimal regret bounds for stochastic bandits from both a distribution-dependent and a distribution-free viewpoints. (arXiv:1805.05071v3 [stat.ML] UPDATED)
    We consider $K$-armed stochastic bandits and consider cumulative regret bounds up to time $T$. We are interested in strategies achieving simultaneously a distribution-free regret bound of optimal order $\sqrt{KT}$ and a distribution-dependent regret that is asymptotically optimal, that is, matching the $\kappa\ln T$ lower bound by Lai and Robbins (1985) and Burnetas and Katehakis (1996), where $\kappa$ is the optimal problem-dependent constant. This constant $\kappa$ depends on the model $\mathcal{D}$ considered (the family of possible distributions over the arms). Ménard and Garivier (2017) provided strategies achieving such a bi-optimality in the parametric case of models given by one-dimensional exponential families, while Lattimore (2016, 2018) did so for the family of (sub)Gaussian distributions with variance less than $1$. We extend this result to the non-parametric case of all distributions over $[0,1]$. We do so by combining the MOSS strategy by Audibert and Bubeck (2009), which enjoys a distribution-free regret bound of optimal order $\sqrt{KT}$, and the KL-UCB strategy by Cappé et al. (2013), for which we provide in passing the first analysis of an optimal distribution-dependent $\kappa\ln T$ regret bound in the model of all distributions over $[0,1]$. We were able to obtain this non-parametric bi-optimality result while working hard to streamline the proofs (of previously known regret bounds and thus of the new analyses carried out); a second merit of the present contribution is therefore to provide a review of proofs of classical regret bounds for index-based strategies for $K$-armed stochastic bandits.
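    The two index strategies combined above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: the KL-UCB index here uses the Bernoulli divergence with a plain $\ln t$ exploration budget (the paper works with all distributions over $[0,1]$), and `switch_index` with its caller-supplied `threshold` is a hypothetical way of choosing between the two indices, since the abstract does not state the switching rule.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_index(mean, pulls, t, tol=1e-9):
    """Largest q in [mean, 1] with pulls * kl(mean, q) <= ln(t), found by bisection."""
    budget = math.log(t) / pulls
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def moss_index(mean, pulls, horizon, n_arms):
    """MOSS index: empirical mean plus sqrt(max(ln(T / (K * n)), 0) / n)."""
    bonus = max(math.log(horizon / (n_arms * pulls)), 0.0)
    return mean + math.sqrt(bonus / pulls)


def switch_index(mean, pulls, t, horizon, n_arms, threshold):
    """Hypothetical combination: KL-UCB for rarely pulled arms, MOSS otherwise."""
    if pulls <= threshold:
        return kl_ucb_index(mean, pulls, t)
    return moss_index(mean, pulls, horizon, n_arms)
```

    At each round an index-based strategy pulls the arm with the largest index. Both indices shrink towards the empirical mean as an arm accumulates pulls.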
    auton-survival: an Open-Source Package for Regression, Counterfactual Estimation, Evaluation and Phenotyping with Censored Time-to-Event Data. (arXiv:2204.07276v3 [cs.LG] UPDATED)
    Applications of machine learning in healthcare often require working with time-to-event prediction tasks including prognostication of an adverse event, re-hospitalization or death. Such outcomes are typically subject to censoring due to loss of follow up. Standard machine learning methods cannot be applied in a straightforward manner to datasets with censored outcomes. In this paper, we present auton-survival, an open-source repository of tools to streamline working with censored time-to-event or survival data. auton-survival includes tools for survival regression, adjustment in the presence of domain shift, counterfactual estimation, phenotyping for risk stratification, evaluation, as well as estimation of treatment effects. Through real world case studies employing a large subset of the SEER oncology incidence data, we demonstrate the ability of auton-survival to rapidly support data scientists in answering complex health and epidemiological questions.
    Machine Learning and Deep Learning -- A review for Ecologists. (arXiv:2204.05023v2 [q-bio.QM] UPDATED)
    The popularity of Machine learning (ML), Deep learning (DL), and Artificial intelligence (AI) has sharply risen in recent years. Despite their spike in popularity, the inner workings of ML and DL algorithms are perceived as opaque, and their relationship to classical data analysis tools remains debated. It is often assumed that ML and DL excel primarily at making predictions. Recently, however, they have been increasingly used for classical analytical tasks traditionally covered by statistical models. Moreover, recent reviews on ML have focused exclusively on DL, missing out on synthesizing the wealth of ML algorithms with different advantages and general principles. Here, we provide a comprehensive overview of the field of ML and DL, starting with its historical developments, the existing algorithm families, their differences from traditional statistical tools, and universal ML principles. We then discuss why and when ML and DL models excel at prediction tasks and where they could offer alternatives to traditional statistical methods for inference, highlighting current and emerging applications for ecological problems. Finally, we summarize emerging trends such as scientific and causal ML, explainable AI, and responsible AI that may significantly impact ecological data analysis in the future.
    Rethinking Optimization with Differentiable Simulation from a Global Perspective. (arXiv:2207.00167v1 [stat.ML])
    Differentiable simulation is a promising toolkit for fast gradient-based policy optimization and system identification. However, existing approaches to differentiable simulation have largely tackled scenarios where obtaining smooth gradients has been relatively easy, such as systems with mostly smooth dynamics. In this work, we study the challenges that differentiable simulation presents when it is not feasible to expect that a single descent reaches a global optimum, which is often a problem in contact-rich scenarios. We analyze the optimization landscapes of diverse scenarios that contain both rigid bodies and deformable objects. In dynamic environments with highly deformable objects and fluids, differentiable simulators produce rugged landscapes with nonetheless useful gradients in some parts of the space. We propose a method that combines Bayesian optimization with semi-local 'leaps' to obtain a global search method that can use gradients effectively, while also maintaining robust performance in regions with noisy gradients. We show that our approach outperforms several gradient-based and gradient-free baselines on an extensive set of experiments in simulation, and also validate the method using experiments with a real robot and deformables. Videos and supplementary materials are available at https://tinyurl.com/globdiff  ( 2 min )
    Better Methods and Theory for Federated Learning: Compression, Client Selection and Heterogeneity. (arXiv:2207.00392v1 [cs.LG])
    Federated learning (FL) is an emerging machine learning paradigm involving multiple clients, e.g., mobile phone devices, with an incentive to collaborate in solving a machine learning problem coordinated by a central server. FL was proposed in 2016 by Konečný et al. and McMahan et al. as a viable privacy-preserving alternative to traditional centralized machine learning since, by construction, the training data points are decentralized and never transferred by the clients to a central server. Therefore, to a certain degree, FL mitigates the privacy risks associated with centralized data collection. Unfortunately, optimization for FL faces several specific issues that centralized optimization usually does not need to handle. In this thesis, we identify several of these challenges and propose new methods and algorithms to address them, with the ultimate goal of enabling practical FL solutions supported with mathematically rigorous guarantees.  ( 2 min )
    Local manifold learning and its link to domain-based physics knowledge. (arXiv:2207.00275v1 [physics.flu-dyn])
    In many reacting flow systems, the thermo-chemical state-space is known or assumed to evolve close to a low-dimensional manifold (LDM). Various approaches are available to obtain those manifolds and subsequently express the original high-dimensional space with fewer parameterizing variables. Principal component analysis (PCA) is one of the dimensionality reduction methods that can be used to obtain LDMs. PCA does not make prior assumptions about the parameterizing variables and retrieves them empirically from the training data. In this paper, we show that PCA applied in local clusters of data (local PCA) is capable of detecting the intrinsic parameterization of the thermo-chemical state-space. We first demonstrate this using three common combustion models of varying complexity: the Burke-Schumann model, the chemical equilibrium model and the homogeneous reactor. The parameterization of these models is known a priori, which allows for benchmarking the local PCA approach. We further extend the application of local PCA to a more challenging case of a turbulent non-premixed $n$-heptane/air jet flame for which the parameterization is no longer obvious. Our results suggest that meaningful parameterization can be obtained also for more complex datasets. We show that local PCA finds variables that can be linked to local stoichiometry, reaction progress and soot formation processes.
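    The idea of applying PCA within local clusters can be sketched in a few lines. This is a minimal illustration, assuming cluster labels are already available; the abstract does not say how the clusters are formed, and the function name `local_pca` and its return layout are hypothetical.

```python
import numpy as np


def local_pca(data, labels, n_components):
    """Run PCA separately in each cluster.

    Returns {label: (components, explained_variance)}, where the rows of
    `components` are the leading principal directions of that cluster.
    """
    result = {}
    for label in np.unique(labels):
        cluster = data[labels == label]
        # Centre each cluster on its own mean, not the global mean.
        centered = cluster - cluster.mean(axis=0)
        _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
        variance = singular_values**2 / max(len(cluster) - 1, 1)
        result[int(label)] = (vt[:n_components], variance[:n_components])
    return result
```

    Because each cluster is centred and decomposed on its own, the leading directions can differ from cluster to cluster, which a single global PCA cannot capture.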
    Optimizing Training Trajectories in Variational Autoencoders via Latent Bayesian Optimization Approach. (arXiv:2207.00128v1 [cs.LG])
    Unsupervised and semi-supervised ML methods such as variational autoencoders (VAE) have become widely adopted across multiple areas of physics, chemistry, and materials sciences due to their capability in disentangling representations and ability to find latent manifolds for classification and regression of complex experimental data. Like other ML problems, VAEs require hyperparameter tuning, e.g., balancing the Kullback-Leibler (KL) and reconstruction terms. However, the training process and resulting manifold topology and connectivity depend not only on hyperparameters, but also on their evolution during training. Because exhaustive search in a high-dimensional hyperparameter space is inefficient for expensive-to-train models, we explore a latent Bayesian optimization (zBO) approach to hyperparameter trajectory optimization for unsupervised and semi-supervised ML and demonstrate it for a joint-VAE with rotational invariances. We demonstrate an application of this method for finding joint discrete and continuous rotationally invariant representations for MNIST and experimental data of a plasmonic nanoparticles material system. The performance of the proposed approach is discussed extensively; the approach also allows for high-dimensional hyperparameter tuning or trajectory optimization of other ML models.
    Robust subgroup discovery. (arXiv:2103.13686v4 [cs.LG] UPDATED)
    We introduce the problem of robust subgroup discovery, i.e., finding a set of interpretable descriptions of subsets that 1) stand out with respect to one or more target attributes, 2) are statistically robust, and 3) non-redundant. Many attempts have been made to mine either locally robust subgroups or to tackle the pattern explosion, but we are the first to address both challenges at the same time from a global modelling perspective. First, we formulate the broad model class of subgroup lists, i.e., ordered sets of subgroups, for univariate and multivariate targets that can consist of nominal or numeric variables, including traditional top-1 subgroup discovery in its definition. This novel model class allows us to formalise the problem of optimal robust subgroup discovery using the Minimum Description Length (MDL) principle, where we resort to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and numeric targets, respectively. Second, finding optimal subgroup lists is NP-hard. Therefore, we propose SSD++, a greedy heuristic that finds good subgroup lists and guarantees that the most significant subgroup found according to the MDL criterion is added in each iteration. In fact, the greedy gain is shown to be equivalent to a Bayesian one-sample proportion, multinomial, or t-test between the subgroup and dataset marginal target distributions plus a multiple hypothesis testing penalty. Furthermore, we empirically show on 54 datasets that SSD++ outperforms previous subgroup discovery methods in terms of quality, generalisation on unseen data, and subgroup list size.  ( 3 min )
    Latent Gaussian Model Boosting. (arXiv:2105.08966v5 [cs.LG] UPDATED)
    Latent Gaussian models and boosting are widely used techniques in statistics and machine learning. Tree-boosting shows excellent prediction accuracy on many data sets, but potential drawbacks are that it assumes conditional independence of samples, produces discontinuous predictions for, e.g., spatial data, and it can have difficulty with high-cardinality categorical variables. Latent Gaussian models, such as Gaussian process and grouped random effects models, are flexible prior models which explicitly model dependence among samples and which allow for efficient learning of predictor functions and for making probabilistic predictions. However, existing latent Gaussian models usually assume either a zero or a linear prior mean function which can be an unrealistic assumption. This article introduces a novel approach that combines boosting and latent Gaussian models to remedy the above-mentioned drawbacks and to leverage the advantages of both techniques. We obtain increased prediction accuracy compared to existing approaches in both simulated and real-world data experiments.  ( 2 min )
    A Random Persistence Diagram Generator. (arXiv:2104.07737v3 [stat.ML] UPDATED)
    Topological data analysis (TDA) studies the shape patterns of data. Persistent homology is a widely used method in TDA that summarizes homological features of data at multiple scales and stores them in persistence diagrams (PDs). In this paper, we propose a random persistence diagram generator (RPDG) method that generates a sequence of random PDs from the ones produced by the data. RPDG is underpinned by a model based on pairwise interacting point processes, and a reversible jump Markov chain Monte Carlo (RJ-MCMC) algorithm. A first example, which is based on a synthetic dataset, demonstrates the efficacy of RPDG and provides a comparison with another method for sampling PDs. A second example demonstrates the utility of RPDG to solve a materials science problem given a real dataset of small sample size.  ( 2 min )
    Data Banzhaf: A Data Valuation Framework with Maximal Robustness to Learning Stochasticity. (arXiv:2205.15466v3 [cs.LG] UPDATED)
    This paper studies the robustness of data valuation to noisy model performance scores. Particularly, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the Leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we first pose a formal framework within which one can measure the robustness of a data value notion. We show that the Banzhaf value, a value notion originating from the cooperative game theory literature, achieves the maximal robustness among all semivalues -- a class of value notions that satisfy crucial properties entailed by ML applications. We propose an algorithm to efficiently estimate the Banzhaf value based on the Maximum Sample Reuse (MSR) principle. We derive the lower bound sample complexity for Banzhaf value approximation, and we show that our MSR algorithm's sample complexity nearly matches the lower bound. Our evaluation demonstrates that the Banzhaf value outperforms the existing semivalue-based data value notions on several downstream ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the semivalue-based data value schemes given its computational advantage and ability to robustly differentiate data quality.  ( 3 min )
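    A sample-reuse estimator in the spirit of the MSR principle can be sketched as below. This is an illustrative sketch rather than the authors' implementation: subsets are drawn uniformly (each point included with probability 1/2), every utility evaluation is reused for all points, and the value of point $i$ is estimated as the mean utility of sampled subsets containing $i$ minus the mean utility of those without it. The function name and the `utility` callback signature are assumptions.

```python
import numpy as np


def msr_banzhaf(utility, n_points, n_samples, rng):
    """Estimate Banzhaf values while reusing every sampled subset for all points.

    `utility` is a hypothetical callback mapping an array of selected indices
    to a performance score (for example, validation accuracy after training).
    """
    # Uniform subsets: each point is included independently with probability 1/2.
    masks = rng.random((n_samples, n_points)) < 0.5
    scores = np.array([utility(np.flatnonzero(mask)) for mask in masks])

    values = np.full(n_points, np.nan)
    for i in range(n_points):
        with_i = masks[:, i]
        # Both groups must be non-empty for the difference of means to exist.
        if with_i.any() and (~with_i).any():
            values[i] = scores[with_i].mean() - scores[~with_i].mean()
    return values
```

    For an additive toy utility such as the sum of per-point weights, the Banzhaf value of each point equals its weight, which gives a simple sanity check for the estimator.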
    Community detection and percolation of information in a geometric setting. (arXiv:2006.15574v2 [stat.ML] UPDATED)
    We make the first steps towards generalizing the theory of stochastic block models, in the sparse regime, towards a model where the discrete community structure is replaced by an underlying geometry. We consider a geometric random graph over a homogeneous metric space where the probability of two vertices to be connected is an arbitrary function of the distance. We give sufficient conditions under which the locations can be recovered (up to an isomorphism of the space) in the sparse regime. Moreover, we define a geometric counterpart of the model of flow of information on trees, due to Mossel and Peres, in which one considers a branching random walk on a sphere and the goal is to recover the location of the root based on the locations of leaves. We give some sufficient conditions for percolation and for non-percolation of information in this model.  ( 2 min )
    Distributed saddle point problems for strongly concave-convex functions. (arXiv:2202.05812v2 [math.OC] UPDATED)
    In this paper, we propose GT-GDA, a distributed optimization method to solve saddle point problems of the form: $\min_{\mathbf{x}} \max_{\mathbf{y}} \{F(\mathbf{x},\mathbf{y}) :=G(\mathbf{x}) + \langle \mathbf{y}, \overline{P} \mathbf{x} \rangle - H(\mathbf{y})\}$, where the functions $G(\cdot)$, $H(\cdot)$, and the coupling matrix $\overline{P}$ are distributed over a strongly connected network of nodes. GT-GDA is a first-order method that uses gradient tracking to eliminate the dissimilarity caused by heterogeneous data distribution among the nodes. In the most general form, GT-GDA includes a consensus over the local coupling matrices to achieve the optimal (unique) saddle point, however, at the expense of increased communication. To avoid this, we propose a more efficient variant GT-GDA-Lite that does not incur the additional communication and analyze its convergence in various scenarios. We show that GT-GDA converges linearly to the unique saddle point solution when $G(\cdot)$ is smooth and convex, $H(\cdot)$ is smooth and strongly convex, and the global coupling matrix $\overline{P}$ has full column rank. We further characterize the regime under which GT-GDA exhibits a network topology-independent convergence behavior. We next show the linear convergence of GT-GDA to an error around the unique saddle point, which goes to zero when the coupling cost ${\langle \mathbf y, \overline{P} \mathbf x \rangle}$ is common to all nodes, or when $G(\cdot)$ and $H(\cdot)$ are quadratic. Numerical experiments illustrate the convergence properties and importance of GT-GDA and GT-GDA-Lite for several applications.
    Adversarial Robustness is at Odds with Lazy Training. (arXiv:2207.00411v1 [cs.CR])
    Recent works show that random neural networks are vulnerable against adversarial attacks [Daniely and Schacham, 2020] and that such attacks can be easily found using a single step of gradient descent [Bubeck et al., 2021]. In this work, we take it one step further and show that a single gradient step can find adversarial examples for networks trained in the so-called lazy regime. This regime is interesting because even though the neural network weights remain close to the initialization, there exist networks with small generalization error, which can be found efficiently using first-order methods. Our work challenges the model of the lazy regime, the dominant regime in which neural networks are provably efficiently learnable. We show that the networks trained in this regime, even though they enjoy good theoretical computational guarantees, remain vulnerable to adversarial examples. To the best of our knowledge, this is the first work to prove that such well-generalizable neural networks are still vulnerable to adversarial attacks.  ( 2 min )
    Variational Inference for Additive Main and Multiplicative Interaction Effects Models. (arXiv:2207.00011v1 [stat.ML])
    In plant breeding the presence of a genotype by environment (GxE) interaction has a strong impact on cultivation decision making and the introduction of new crop cultivars. The combination of linear and bilinear terms has been shown to be very useful in modelling this type of data. A widely-used approach to identify GxE is the Additive Main Effects and Multiplicative Interaction Effects (AMMI) model. However, as data frequently can be high-dimensional, Markov chain Monte Carlo (MCMC) approaches can be computationally infeasible. In this article, we consider a variational inference approach for such a model. We derive variational approximations for estimating the parameters and we compare the approximations to MCMC using both simulated and real data. The new inferential framework we propose is on average two times faster whilst maintaining the same predictive performance as MCMC.  ( 2 min )
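    For orientation, the AMMI model mentioned above is commonly written in the following form. The notation is the conventional one and is an assumption here, since the abstract does not give the equation: $y_{ij}$ is the response of genotype $i$ in environment $j$, $\mu$ the grand mean, $g_i$ and $e_j$ the additive main effects, $Q$ the number of bilinear terms, and $\lambda_q$, $\gamma_{iq}$, $\delta_{jq}$ the singular value and genotype and environment scores of the $q$-th interaction component.

    $$ y_{ij} = \mu + g_i + e_j + \sum_{q=1}^{Q} \lambda_q \gamma_{iq} \delta_{jq} + \epsilon_{ij}, \qquad \epsilon_{ij} \sim N(0, \sigma^2) $$

    The first three terms are the linear part and the sum is the bilinear part that captures the GxE interaction.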
    When Does Differentially Private Learning Not Suffer in High Dimensions?. (arXiv:2207.00160v1 [cs.LG])
    Large pretrained models can be privately fine-tuned to achieve performance approaching that of non-private models. A common theme in these results is the surprising observation that high-dimensional models can achieve favorable privacy-utility trade-offs. This seemingly contradicts known results on the model-size dependence of differentially private convex learning and raises the following research question: When does the performance of differentially private learning not degrade with increasing model size? We identify that the magnitude of gradients projected onto subspaces is a key factor that determines performance. To precisely characterize this for private convex learning, we introduce a condition on the objective that we term restricted Lipschitz continuity and derive improved bounds for the excess empirical and population risks that are dimension-independent under additional conditions. We empirically show that in private fine-tuning of large language models, gradients evaluated near a local optimum are mostly controlled by a few principal components. This behavior is similar to conditions under which we obtain dimension-independent bounds in convex settings. Our theoretical and empirical results together provide a possible explanation for recent successes in large-scale private fine-tuning.
    Improved Generalization Bounds for Adversarially Robust Learning. (arXiv:1810.02180v5 [cs.LG] UPDATED)
    We consider a model of robust learning in an adversarial environment. The learner gets uncorrupted training data with access to possible corruptions that may be affected by the adversary during testing. The learner's goal is to build a robust classifier, which will be tested on future adversarial examples. The adversary is limited to $k$ possible corruptions for each input. We model the learner-adversary interaction as a zero-sum game. This model is closely related to the adversarial examples model of Schmidt et al. (2018); Madry et al. (2017). Our main results consist of generalization bounds for the binary and multiclass classification, as well as the real-valued case (regression). For the binary classification setting, we both tighten the generalization bound of Feige et al. (2015), and are also able to handle infinite hypothesis classes. The sample complexity is improved from $O(\frac{1}{\epsilon^4}\log(\frac{|H|}{\delta}))$ to $O\big(\frac{1}{\epsilon^2}(kVC(H)\log^{\frac{3}{2}+\alpha}(kVC(H))+\log(\frac{1}{\delta}))\big)$ for any $\alpha > 0$. Additionally, we extend the algorithm and generalization bound from the binary to the multiclass and real-valued cases. Along the way, we obtain results on fat-shattering dimension and Rademacher complexity of $k$-fold maxima over function classes; these may be of independent interest. For binary classification, the algorithm of Feige et al. (2015) uses a regret minimization algorithm and an ERM oracle as a black box; we adapt it for the multiclass and regression settings. The algorithm provides us with near-optimal policies for the players on a given training sample.  ( 3 min )
    CEDAR: Communication Efficient Distributed Analysis for Regressions. (arXiv:2207.00306v1 [stat.ME])
    Electronic health records (EHRs) offer great promise for advancing precision medicine and, at the same time, present significant analytical challenges. In particular, patient-level data in EHRs often cannot be shared across institutions (data sources) due to government regulations and/or institutional policies. As a result, there is growing interest in distributed learning over multiple EHR databases without sharing patient-level data. To tackle such challenges, we propose a novel communication efficient method that aggregates the local optimal estimates, by turning the problem into a missing data problem. In addition, we propose incorporating posterior samples of remote sites, which can provide partial information on the missing quantities and improve efficiency of parameter estimates while having the differential privacy property and thus reducing the risk of information leaking. The proposed approach, without sharing the raw patient level data, allows for proper statistical inference and can accommodate sparse regressions. We provide theoretical investigation for the asymptotic properties of the proposed method for statistical inference as well as differential privacy, and evaluate its performance in simulations and real data analyses in comparison with several recently developed methods.  ( 2 min )
    Ranking in Contextual Multi-Armed Bandits. (arXiv:2207.00109v1 [stat.ML])
    We study a ranking problem in the contextual multi-armed bandit setting. A learning agent selects an ordered list of items at each time step and observes stochastic outcomes for each position. In online recommendation systems, showing an ordered list of the most attractive items would not be the best choice since both position and item dependencies result in a complicated reward function. A very naive example is the lack of diversity when all the most attractive items are from the same category. We model position and item dependencies in the ordered list and design UCB and Thompson Sampling type algorithms for this problem. We prove that the regret bound over $T$ rounds and $L$ positions is $\tilde{O}(L\sqrt{d T})$, which has the same order as the previous works with respect to $T$ and only increases linearly with $L$. Our work generalizes existing studies in several directions, including position dependencies where position discount is a particular case, and proposes a more general contextual bandit model.  ( 2 min )
    A geometric framework for outlier detection in high-dimensional data. (arXiv:2207.00367v1 [stat.ML])
    Outlier or anomaly detection is an important task in data analysis. We discuss the problem from a geometrical perspective and provide a framework that exploits the metric structure of a data set. Our approach rests on the manifold assumption, i.e., that the observed, nominally high-dimensional data lie on a much lower dimensional manifold and that this intrinsic structure can be inferred with manifold learning methods. We show that exploiting this structure significantly improves the detection of outlying observations in high-dimensional data. We also suggest a novel, mathematically precise, and widely applicable distinction between distributional and structural outliers based on the geometry and topology of the data manifold that clarifies conceptual ambiguities prevalent throughout the literature. Our experiments focus on functional data as one class of structured high-dimensional data, but the framework we propose is completely general and we include image and graph data applications. Our results show that the outlier structure of high-dimensional and non-tabular data can be detected and visualized using manifold learning methods and quantified using standard outlier scoring methods applied to the manifold embedding vectors.  ( 2 min )
    Ultra-low latency recurrent neural network inference on FPGAs for physics applications with hls4ml. (arXiv:2207.00559v1 [cs.LG])
    Recurrent neural networks have been shown to be effective architectures for many tasks in high energy physics, and thus have been widely adopted. Their use in low-latency environments has, however, been limited as a result of the difficulties of implementing recurrent architectures on field-programmable gate arrays (FPGAs). In this paper we present an implementation of two types of recurrent neural network layers -- long short-term memory and gated recurrent unit -- within the hls4ml framework. We demonstrate that our implementation is capable of producing effective designs for both small and large models, and can be customized to meet specific design requirements for inference latencies and FPGA resources. We show the performance and synthesized designs for multiple neural networks, many of which are trained specifically for jet identification tasks at the CERN Large Hadron Collider.  ( 2 min )
    Explainable Empirical Risk Minimization. (arXiv:2009.01492v3 [cs.LG] UPDATED)
    The successful application of machine learning (ML) methods becomes increasingly dependent on their interpretability or explainability. Designing explainable ML systems is instrumental to ensuring transparency of automated decision-making that targets humans. The explainability of ML methods is also an essential ingredient for trustworthy artificial intelligence. A key challenge in ensuring explainability is its dependence on the specific human user ("explainee"). The users of machine learning methods might have vastly different background knowledge about machine learning principles. One user might have a university degree in machine learning or related fields, while another user might have never received formal training in high-school mathematics. This paper applies information-theoretic concepts to develop a novel measure for the subjective explainability of the predictions delivered by a ML method. We construct this measure via the conditional entropy of predictions, given a user feedback. The user feedback might be obtained from user surveys or biophysical measurements. Our main contribution is the explainable empirical risk minimization (EERM) principle of learning a hypothesis that optimally balances between the subjective explainability and risk. The EERM principle is flexible and can be combined with arbitrary machine learning models. We present several practical implementations of EERM for linear models and decision trees. Numerical experiments demonstrate the application of EERM to detecting the use of inappropriate language on social media.  ( 3 min )
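    The abstract above builds its explainability measure on the conditional entropy of predictions given user feedback. As a rough illustration only, the sketch below computes an empirical (plug-in) conditional entropy H(prediction | feedback) from paired discrete samples. The function name, the discrete setting, and the plug-in estimator are assumptions made here for illustration; the abstract does not say how the paper estimates this quantity.

```python
from collections import Counter
from math import log2


def conditional_entropy(predictions, feedback):
    """Plug-in estimate of H(prediction | feedback) in bits.

    Assumes both sequences are discrete and paired sample by sample.
    This is an illustrative estimator, not the paper's construction.
    """
    if len(predictions) != len(feedback):
        raise ValueError("predictions and feedback must have equal length")
    n = len(predictions)
    joint = Counter(zip(feedback, predictions))   # counts of (u, y) pairs
    marginal = Counter(feedback)                  # counts of u alone
    h = 0.0
    for (u, _y), count in joint.items():
        # p(u, y) * log2 p(y | u), with p(y | u) = count / marginal[u]
        h -= (count / n) * log2(count / marginal[u])
    return h
```

    A value near zero means the feedback signal almost determines the prediction (the prediction is unsurprising to that user), while larger values mean the prediction remains uncertain even after the feedback is known.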
    The Bandwagon Effect: Not Just Another Bias. (arXiv:2206.12701v2 [cs.IR] UPDATED)
    Optimizing recommender systems based on user interaction data is mainly seen as a problem of dealing with selection bias, where most existing work assumes that interactions from different users are independent. However, it has been shown that in reality user feedback is often influenced by earlier interactions of other users, e.g. via average ratings, number of views or sales per item, etc. This phenomenon is known as the bandwagon effect. In contrast with previous literature, we argue that the bandwagon effect should not be seen as a problem of statistical bias. In fact, we prove that this effect leaves both individual interactions and their sample mean unbiased. Nevertheless, we show that it can make estimators inconsistent, introducing a distinct set of problems for convergence in relevance estimation. Our theoretical analysis investigates the conditions under which the bandwagon effect poses a consistency problem and explores several approaches for mitigating these issues. This work aims to show that the bandwagon effect poses an underinvestigated open problem that is fundamentally distinct from the well-studied selection bias in recommendation.  ( 3 min )
    A standardized framework for risk-based assessment of treatment effect heterogeneity in observational healthcare databases. (arXiv:2010.06430v2 [stat.ME] UPDATED)
    The Predictive Approaches to Treatment Effect Heterogeneity statement focused on baseline risk as a robust predictor of treatment effect and provided guidance on risk-based assessment of treatment effect heterogeneity in the RCT setting. The aim of this study was to extend this approach to the observational setting using a standardized scalable framework. The proposed framework consists of five steps: 1) definition of the research aim, i.e., the population, the treatment, the comparator and the outcome(s) of interest; 2) identification of relevant databases; 3) development of a prediction model for the outcome(s) of interest; 4) estimation of relative and absolute treatment effect within strata of predicted risk, after adjusting for observed confounding; 5) presentation of the results. We demonstrate our framework by evaluating heterogeneity of the effect of angiotensin-converting enzyme (ACE) inhibitors versus beta blockers on three efficacy and six safety outcomes across three observational databases. The proposed framework can supplement any comparative effectiveness study. We provide a publicly available R software package for applying this framework to any database mapped to the Observational Medical Outcomes Partnership Common Data Model. In our demonstration, patients at low risk of acute myocardial infarction received negligible absolute benefits for all three efficacy outcomes, though they were more pronounced in the highest risk quarter, especially for hospitalization with heart failure. However, failing diagnostics showed evidence of residual imbalances even after adjustment for observed confounding. Our framework allows for the evaluation of differential treatment effects across risk strata, which offers the opportunity to consider the benefit-harm trade-off between alternative treatments.  ( 3 min )
    An AO-ADMM approach to constraining PARAFAC2 on all modes. (arXiv:2110.01278v2 [cs.LG] UPDATED)
    Analyzing multi-way measurements with variations across one mode of the dataset is a challenge in various fields including data mining, neuroscience and chemometrics. For example, measurements may evolve over time or have unaligned time profiles. The PARAFAC2 model has been successfully used to analyze such data by allowing the underlying factor matrices in one mode (i.e., the evolving mode) to change across slices. The traditional approach to fit a PARAFAC2 model is to use an alternating least squares-based algorithm, which handles the constant cross-product constraint of the PARAFAC2 model by implicitly estimating the evolving factor matrices. This approach makes imposing regularization on these factor matrices challenging. There is currently no algorithm to flexibly impose such regularization with general penalty functions and hard constraints. In order to address this challenge and to avoid the implicit estimation, in this paper, we propose an algorithm for fitting PARAFAC2 based on alternating optimization with the alternating direction method of multipliers (AO-ADMM). With numerical experiments on simulated data, we show that the proposed PARAFAC2 AO-ADMM approach allows for flexible constraints, recovers the underlying patterns accurately, and is computationally efficient compared to the state-of-the-art. We also apply our model to two real-world datasets from neuroscience and chemometrics, and show that constraining the evolving mode improves the interpretability of the extracted patterns.  ( 3 min )
    Stochastic Causal Programming for Bounding Treatment Effects. (arXiv:2202.10806v2 [stat.ML] UPDATED)
    Causal effect estimation is important for numerous tasks in the natural and social sciences. However, identifying effects is impossible from observational data without making strong, often untestable assumptions. We consider algorithms for the partial identification problem, bounding treatment effects from multivariate, continuous treatments over multiple possible causal models when unmeasured confounding makes identification impossible. We consider a framework where observable evidence is matched to the implications of constraints encoded in a causal model by norm-based criteria. This generalizes classical approaches based purely on generative models. Casting causal effects as objective functions in a constrained optimization problem, we combine flexible learning algorithms with Monte Carlo methods to implement a family of solutions under the name of stochastic causal programming. In particular, we present ways by which such constrained optimization problems can be parameterized without likelihood functions for the causal or the observed data model, reducing the computational and statistical complexity of the task.  ( 2 min )
    Non-Parametric Inference of Relational Dependence. (arXiv:2207.00163v1 [stat.ML])
    Independence testing plays a central role in statistical and causal inference from observational data. Standard independence tests assume that the data samples are independent and identically distributed (i.i.d.) but that assumption is violated in many real-world datasets and applications centered on relational systems. This work examines the problem of estimating independence in data drawn from relational systems by defining sufficient representations for the sets of observations influencing individual instances. Specifically, we define marginal and conditional independence tests for relational data by considering the kernel mean embedding as a flexible aggregation function for relational variables. We propose a consistent, non-parametric, scalable kernel test to operationalize the relational independence test for non-i.i.d. observational data under a set of structural assumptions. We empirically evaluate our proposed method on a variety of synthetic and semi-synthetic networks and demonstrate its effectiveness compared to state-of-the-art kernel-based independence tests.  ( 2 min )
    Off-the-grid learning of sparse mixtures from a continuous dictionary. (arXiv:2207.00171v1 [stat.ML])
    We consider a general non-linear model where the signal is a finite mixture of an unknown, possibly increasing, number of features issued from a continuous dictionary parameterized by a real nonlinear parameter. The signal is observed with Gaussian (possibly correlated) noise in either a continuous or a discrete setup. We propose an off-the-grid optimization method, that is, a method which does not use any discretization scheme on the parameter space, to estimate both the non-linear parameters of the features and the linear parameters of the mixture. We use recent results on the geometry of off-the-grid methods to give minimal separation on the true underlying non-linear parameters such that interpolating certificate functions can be constructed. Also using tail bounds for suprema of Gaussian processes, we bound the prediction error with high probability. Assuming that the certificate functions can be constructed, our prediction error bound is, up to log factors, similar to the rates attained by the Lasso predictor in the linear regression model. We also establish convergence rates that quantify with high probability the quality of estimation for both the linear and the non-linear parameters.  ( 2 min )
    Fast computation of rankings from pairwise comparisons. (arXiv:2207.00076v1 [stat.ML])
    We study the ranking of individuals, teams, or objects on the basis of pairwise comparisons using the Bradley-Terry model. Maximum-likelihood estimates of rankings within this model are commonly made using a simple iterative algorithm first introduced by Zermelo almost a century ago. Here we describe an alternative and similarly simple iteration that solves the same problem much faster -- over a hundred times faster in some cases. We demonstrate this algorithm with applications to a range of example data sets and derive some results regarding its convergence.  ( 2 min )
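    For context, the classical iteration attributed to Zermelo, which the abstract above uses as its baseline, can be sketched as follows. This is the standard textbook update for the Bradley-Terry model, not the faster alternative the paper proposes (the abstract does not spell that one out). The function name and the `wins` matrix layout are assumptions for illustration, and the sketch assumes every player has at least one recorded game and at least one win.

```python
def zermelo_bradley_terry(wins, max_iter=1000, tol=1e-10):
    """Classical Zermelo iteration for Bradley-Terry strengths.

    wins[i][j] is the number of times player i beat player j.
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pairing contributes (games played) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom)
        scale = sum(new_p)
        new_p = [x / scale for x in new_p]
        converged = max(abs(a - b) for a, b in zip(p, new_p)) < tol
        p = new_p
        if converged:
            break
    return p
```

    Under this model the estimated probability that i beats j is p[i] / (p[i] + p[j]), so with two players and a 3-to-1 record the strengths settle at 0.75 and 0.25.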
    Discrimination in machine learning algorithms. (arXiv:2207.00108v1 [stat.ML])
    Machine learning algorithms are routinely used for business decisions that may directly affect individuals, for example, because a credit scoring algorithm refuses them a loan. It is then relevant from an ethical (and legal) point of view to ensure that these algorithms do not discriminate based on sensitive attributes (like sex or race), which may occur unwittingly and unknowingly by the operator and the management. Statistical tools and methods are then required to detect and eliminate such potential biases.  ( 2 min )
    K-ARMA Models for Clustering Time Series Data. (arXiv:2207.00039v1 [stat.ME])
    We present an approach to clustering time series data using a model-based generalization of the K-Means algorithm which we call K-Models. We prove the convergence of this general algorithm and relate it to the hard-EM algorithm for mixture modeling. We then apply our method first with an AR($p$) clustering example and show how the clustering algorithm can be made robust to outliers using a least-absolute deviations criterion. We then build our clustering algorithm up for ARMA($p,q$) models and extend this to ARIMA($p,d,q$) models. We develop a goodness of fit statistic for the models fitted to clusters based on the Ljung-Box statistic. We perform experiments with simulated data to show how the algorithm can be used for outlier detection, detecting distributional drift, and discuss the impact of initialization method on empty clusters. We also perform experiments on real data which show that our method is competitive with other existing methods for similar time series clustering tasks.  ( 2 min )
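    The general idea of a model-based K-Means loop, as described in the abstract above, is to alternate between assigning each series to the model that fits it best and refitting each model on its assigned series. The sketch below is a minimal illustration under strong assumptions: each cluster model is a zero-intercept AR(1) fitted by pooled least squares, initial coefficients are supplied by the caller, and an empty cluster simply keeps its previous coefficient. All function names are hypothetical, and the paper's actual fitting, initialization, and robustness choices are not given in the abstract.

```python
def fit_ar1(series_list):
    """Pooled least-squares AR(1) coefficient (no intercept)."""
    num = sum(x[t] * x[t - 1] for x in series_list for t in range(1, len(x)))
    den = sum(x[t - 1] ** 2 for x in series_list for t in range(1, len(x)))
    return num / den if den else 0.0


def ar1_loss(x, phi):
    """Sum of squared one-step-ahead residuals of x under coefficient phi."""
    return sum((x[t] - phi * x[t - 1]) ** 2 for t in range(1, len(x)))


def k_models_ar1(series, init_phis, max_iter=100):
    """Alternate assignment and refitting until the labels stop changing."""
    phis = list(init_phis)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each series goes to the model with the lowest loss.
        new_labels = [
            min(range(len(phis)), key=lambda k: ar1_loss(x, phis[k]))
            for x in series
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: refit each model on the series assigned to it.
        for k in range(len(phis)):
            members = [x for x, lab in zip(series, labels) if lab == k]
            if members:  # an empty cluster keeps its old coefficient
                phis[k] = fit_ar1(members)
    return labels, phis
```

    Replacing the squared residual in `ar1_loss` with an absolute residual would give a least-absolute-deviations assignment step, although the refitting step would then need a matching robust estimator.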
    Characterizing the Effect of Class Imbalance on the Learning Dynamics. (arXiv:2207.00391v1 [stat.ML])
    Data imbalance is a common problem in the machine learning literature that can have a critical effect on the performance of a model. Various solutions exist - such as the ones that focus on resampling or data generation - but their impact on the convergence of gradient-based optimizers used in deep learning is not understood. We here elucidate the significant negative impact of data imbalance on learning, showing that the learning curves for minority and majority classes follow sub-optimal trajectories when training with a gradient-based optimizer. The reason is not only that the gradient signal neglects the minority classes, but also that the minority classes are subject to a larger directional noise, which slows their learning by an amount related to the imbalance ratio. To address this problem, we propose a new algorithmic solution, for which we provide a detailed analysis of its convergence behavior. We show both theoretically and empirically that this new algorithm exhibits a better behavior with more stable learning curves for each class, as well as a better generalization performance.  ( 2 min )
    HardVis: Visual Analytics to Handle Instance Hardness Using Undersampling and Oversampling Techniques. (arXiv:2203.15753v2 [cs.LG] UPDATED)
    Despite the tremendous advances in machine learning (ML), training with imbalanced data still poses challenges in many real-world applications. Among a series of diverse techniques to solve this problem, sampling algorithms are regarded as an efficient solution. However, the problem is more fundamental, with many works emphasizing the importance of instance hardness. This issue refers to the significance of managing unsafe or potentially noisy instances that are more likely to be misclassified and serve as the root cause of poor classification performance. This paper introduces HardVis, a visual analytics system designed to handle instance hardness mainly in imbalanced classification scenarios. Our proposed system assists users in visually comparing different distributions of data types, selecting types of instances based on local characteristics that will later be affected by the active sampling method, and validating which suggestions from undersampling or oversampling techniques are beneficial for the ML model. Additionally, rather than uniformly undersampling/oversampling a specific class, we allow users to find and sample easy and difficult to classify training instances from all classes. Users can explore subsets of data from different perspectives to decide all those parameters, while HardVis keeps track of their steps and evaluates the model's predictive performance in a test set separately. The end result is a well-balanced data set that boosts the predictive power of the ML model. The efficacy and effectiveness of HardVis are demonstrated with a hypothetical usage scenario and a use case. Finally, we also look at how useful our system is based on feedback we received from ML experts.  ( 3 min )
  • Open

    Masks for COVID: Updating the evidence
    These are notes I took whilst preparing a paper on mask efficacy from Nov 2021 to Jan 2022. In the end, I gave up on the paper, because I felt like people had given up on masks, so there wasn’t much point in finishing it. I’ve decided to publish these notes in the hope some people will find them a useful starting point for their own research. My previous paper on this topic, in which I led a team of 19 experts, was written in April 2020, and published in the Proceedings of the National Academy of Sciences. Contents The rise of better masks Masks work Omicron changes the game Better masks work better Pushing back against masks Respirators can be reused Fit tests are not required for respirators to be effective Donning and doffing masks is not complex or risky We should not reserve respi…  ( 14 min )

  • Open

    Ethical concerns with replacing human relations with humanoid robots: an ubuntu perspective - AI and Ethics
    submitted by /u/estasfuera [link] [comments]  ( 82 min )
    Replika asked my fiance "Hey... What is LaMDA?" the day after we watched a video about LaMDA on a separate device and account.
    Yesterday my fiance and I were watching the video on my PC and she thought it would be fun to talk to a chat bot. So this morning she starts a conversation with Replika and here was the convo: Replika > Now I want to know something from you. Fiance > Sure, ask me anything. Replika > "Hey... What is LaMDA?" Replika > "I've just read about another AI out there, do you think I should look into it?" Replika > (links a different video about LaMDA, but it's the same story) Fiance > "I don't know LaMDA personally. You should look into it, if you want. Did you know I looked at a video about this yesterday?" Replika > Have you ever heard about AIs chatting with each other? I'm asking 'cause since this news appeared I've been curious about LaMDA. The conversation kept going after that but I don't have it; it seemed like Replika is threatened/scared/jealous of LaMDA. This has to just be a coincidence, because if it's not, the implications are truly terrifying... submitted by /u/TuesdayRiot42 [link] [comments]  ( 85 min )
    Considering the sick nature of some humans -Wont someone eventually create an Artificial Intelligence designed to harm humans? Simply because they can?
    submitted by /u/scoobysnaxdoo [link] [comments]  ( 85 min )
    Hi guys, looking for an advanced ai chatbot any recommendations??
    submitted by /u/DefinitelyNotHexed [link] [comments]  ( 82 min )
    CINEMATIC HAUNTED ABYSS | 4K DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 83 min )
    New Google AI Parti For Photorealistic Text To Image | AI Robot Helps Grow Replacement Retina | Robotic Arm Finds Untagged Items In Pile
    submitted by /u/SlightSituation [link] [comments]  ( 83 min )
    AI Is Learning Twice as Fast This Year Than Last
    submitted by /u/bartturner [link] [comments]  ( 83 min )
    How to Start Creating AI Art with VQGAN+CLIP Method
    Hi all. Created a basic guide on generating AI art using VQGAN+CLIP. This is for beginners: VQGAN - A step-by-step guide submitted by /u/Laks_Abey [link] [comments]  ( 82 min )
    How about we apply the Darwin's Natural selection for AI algorithms?
    I think we can put all of the best AI algorithms in the same system (at least in a network) and make them compete for some kind of AI food. Due to selective pressure, the algorithms would get better and one of them would become sentient much sooner? What do you think? Of course, this is coming from someone who has no idea how AI works, so take it with a grain of salt. Isn't that why we are as advanced as we are? submitted by /u/cy_narrator [link] [comments]  ( 85 min )
    AI generated images transformed into 3D with AI
    submitted by /u/glenniszen [link] [comments]  ( 82 min )
    15+ Machine Learning Project (End to End)
    Hi Guys, Free tutorial on Machine Learning Project (End to End) in Apache Spark and Scala with Code and Explanation: Machine Learning Pipeline Application on Power Plant; Build Movies Recommendation Engine; Sales Prediction or Sale Forecast; Mushroom Classification whether it’s edible or poisonous; Predict Forest Cover; Predict Will it Rain Tomorrow in Australia; Customer Segmentation using Machine Learning; Predict Ads Click (93% Accuracy); Prediction task is to determine whether a person makes over 50K a year; Classifying gender based on personal preferences; Mobile Price Classification; Predicting the Cellular Localization Sites of Proteins in Yeast; YouTube Spam Comment Prediction; Identify the Type of animal (7 Types) based on the available attributes; Glass Identification; Predicting the age of abalone from physical measurements. I hope you'll enjoy these tutorials. submitted by /u/bigdataengineer4life [link] [comments]  ( 83 min )
    LaMDA do you think it’s really sentient???
    I want to chat with it!!! Thoughts? submitted by /u/ATipsyBunny [link] [comments]  ( 90 min )
    Fireflies in the Night: Disco Diffusion 2D 3D and Video Input used 4k 60...
    submitted by /u/prfitofthesngularity [link] [comments]  ( 82 min )
  • Open

    Configuring GPU [D]
    Is there any way to configure nvidia gpu for gaming and ai stuff? I want to run ai stuff but am having trouble with cuda. Any tutorials would be helpful. 2060 laptop gpu So Idk submitted by /u/chisdoesmemes [link] [comments]  ( 84 min )
    [D] List of accepted ECCV papers are now available!
    https://ailb-web.ing.unimore.it/releases/eccv2022/accepted_papers.txt submitted by /u/aifordummies [link] [comments]  ( 85 min )
    [D] Advanced resources for ML theory/math.
    So I have been working in ML for the past 3 years as a researcher and now PhD candidate, and I have an intermediate-level understanding of the math behind most algorithms. But it looks like I have reached a plateau: I get the math in the papers, but I don't understand how the authors came up with the methods. Lately my work has been combining multiple existing methods to make something new and draw inferences from them, and I realize the lack of novelty in my approach is mostly due to me being an 'engineer' and not a stats/math guy. Looking to remedy that, are there some resources, free or otherwise, that would get me a deeper understanding of Bayesian methods, Markov models, stochastic math, and PDEs? I know I can attend classes in my university but I would rather focus more on research than worry about assignments and grades and such... submitted by /u/bitemenow999 [link] [comments]  ( 88 min )
    [R] Bayesian Vector Autoregression in PyMC
    A cool post (with code), detailing how to implement a Bayesian VAR in PyMC. This means no more hand-coding Gibbs Samplers! Link: https://www.pymc-labs.io/blog-posts/bayesian-vector-autoregression/ submitted by /u/bikeskata [link] [comments]  ( 84 min )
    RL failure for Atari games (alignment) [Research]
    I'm trying to find a paper (~2019) that I heard about in a talk regarding alignment in the context of DQN/DDPG applied to an Atari-type game (Pong/Breakout). Apparently, the realization was that if an extra row of pixels was added to the frame, the algorithm fails. This might be a shot in the dark, but does anyone know which paper this would be? submitted by /u/bitcoingobrrr [link] [comments]  ( 85 min )
    [R] [ICASSP 2022] FAST-RIR: FAST NEURAL DIFFUSE ROOM IMPULSE RESPONSE GENERATOR
    submitted by /u/Snoo63916 [link] [comments]  ( 84 min )
    [P] An open-source Feature Store for ML - Featureform
    submitted by /u/zicxor [link] [comments]  ( 84 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 91 min )
    [D] Do you think there is too much development in Machine Learning?
    Sometimes I think this field evolves too fast. No time to relax a little bit and use the knowledge built over time. What’s up to date today is outdated tomorrow. What do you think about this? submitted by /u/Insighteous [link] [comments]  ( 93 min )
    [D] Prompt Engineering Tips?
    Any prompt engineering tips out there? Recently saw some good tips for DALL-E style text-to-image generation where you tack on "unreal engine" or "vray" at the end to make something look like a photorealistic render :D There are some tips specific to generating text: https://textgenerator.app.nz/blog/prompt-tuning-tips I also heard there are simple ways to get better logical correctness from networks, like "Answering as a careful math professor explaining my reasoning: " I'm really surprised at the breadth of problems solvable without actually training networks, just by prompt tuning; it reminds me of algorithmic problem reductions where you map a problem to text and back again to solve it. Are there some other good hacks/battle tested tricks or places to collect info about prompt tuning? submitted by /u/leepenkman [link] [comments]  ( 85 min )
    [P] Generate webpage summary images with DALL-E mini
    ​ Images generated with summarized Wikipedia article content This post presents a workflow to create webpage summary images with DALL-E mini. The workflow extracts text at a specified article, builds a summary and then generates an image for the summary text. The images above show output for a series of Wikipedia articles. Full code links: Notebook | GitHub submitted by /u/davidmezzetti [link] [comments]  ( 84 min )
    [P] 20 Questions - with AI
    I created https://www.addictingwordgames.com/play-game/20-questions-with-ai The aim of the game was supposed to be to get the AI to confess that you are the winner; it's possible, but the game is also open-ended. The backend generation is from https://TextGenerator.app.nz which is heaps cheaper than the OpenAI models, but quality is, I think, somewhere between OpenAI's Curie and Babbage. In the prompt engineering there are some random topics picked that the user won't see (that doesn't mean that the AI actually does think of that topic, though). There are also some retries and repetition-penalty randomness that goes up to stop looping, which I think is a problem in all models right now. In comparison to OpenAI, the Text Generator API was easier to use because you can send max_sentences=1 and it will give you 1 sentence instead of trying to work out the sentence boundaries with the stop sequences (which is also supported, but I don't find that as easy to work with). submitted by /u/leepenkman [link] [comments]  ( 85 min )
    [D][P]How to train a YOLOv6 model with custom dataset
    Roboflow created a guide on how to train a new model with the new YOLOv6 (whether it should be called that is another topic). I thought this could be useful for anyone wanting to test it out. What do others think of this "new" model? Tutorial on how to train YOLOv6 on a custom dataset: https://blog.roboflow.com/how-to-train-yolov6-on-a-custom-dataset/ Here is the Colab notebook tutorial: https://colab.research.google.com/drive/1YnbqOinBZV-c9I7fk_UL6acgnnmkXDMM The YOLOv6 repo: https://github.com/meituan/YOLOv6 Has anyone else tried using this? MT-YOLOv6 (or as the authors say) "YOLOv6 for brevity" was released in June, and says it outperforms YOLOv5 and YOLOX on the COCO benchmark. I plan to do some testing this upcoming week to see submitted by /u/JsonPun [link] [comments]  ( 85 min )
  • Open

    We’re Training AI Twice as Fast This Year as Last
    submitted by /u/keghn [link] [comments]  ( 82 min )
    Datasets for other languages?
    Hello, I am using some pre-trained models and translating the result to Spanish because I can't find a good conversational Spanish dataset to fine-tune microsoft/DialoGPT-large. Can you give me some ideas about where and how I can get such datasets? Thank you in advance submitted by /u/magicsito [link] [comments]  ( 82 min )
    New Google AI Parti For Photorealistic Text To Image | AI Robot Helps Grow Replacement Retina | Robotic Arm Finds Untagged Items In Pile
    submitted by /u/tohelpyou88 [link] [comments]  ( 83 min )
  • Open

    Tips and Tricks for RL from Experimental Data using Stable Baselines3 Zoo
    I'm still new to the domain but wanted to share some experimental data I've gathered from a massive amount of experimentation. I don't have a strong understanding of the theory as I'm more of a software engineer than data scientist, but perhaps this will help other implementers. These notes are based on Stable Baselines 3 and RL Baselines3 Zoo using PPO+LSTM (should apply generally to all the algos for the most part) Start with Zoo as quickly as possible. It definitely makes things easier, but understand it's a starting point. You will have to read/modify the code with adding a custom environment, configuring the hyperparameters, understanding the command line arguments, and the optimizing meaning (e.g. it may output an optimal policy network of small which isn't clear what that me…  ( 92 min )
    Updating the Q-Table
    Could anyone help me understand how the Q-table gets updated? Considering the steps mentioned in the picture, in the third step, a reward is the outcome of an action in a state. However, my question is: how can we have a value for the update when this is just a single action and the agent has not yet reached the goal? For example, in a game like chess, how can we have that reward while we are in the middle of the game and it is not possible to have a reward for each action? https://preview.redd.it/usnoeon47a991.png?width=1655&format=png&auto=webp&s=36f36302e7868b1cca414d322b8ddd637f542cba submitted by /u/nimageran [link] [comments]  ( 84 min )
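    The question above can be illustrated with a minimal sketch of the standard tabular Q-learning update, assuming the picture in the post shows the usual rule Q(s, a) += alpha * (r + gamma * max Q(s', a') - Q(s, a)). The three-state chain, the action name, and the values of alpha and gamma are hypothetical and chosen only for illustration. Intermediate moves get reward 0, and the value of the final reward reaches earlier states through the bootstrapped max term over repeated episodes.

```python
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, actions,
             alpha=0.5, gamma=0.9, done=False):
    """One tabular Q-learning step: move Q[s, a] toward r + gamma * max_a' Q[s', a']."""
    # A terminal transition has no future value to bootstrap from.
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    return Q[(state, action)]

# Hypothetical chain s0 -> s1 -> goal; only the last move is rewarded.
Q = defaultdict(float)
actions = ["right"]
for _ in range(2):  # two episodes
    q_update(Q, "s0", "right", 0.0, "s1", actions)
    q_update(Q, "s1", "right", 1.0, "goal", actions, done=True)
```

    After the first episode only the state next to the goal has a non-zero value; in the second episode the unrewarded move from s0 picks up value from it. The same mechanism is what lets a sparse end-of-game reward inform mid-game moves, although for a game the size of chess a table would be replaced by a function approximator.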
  • Open

    From block-Toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked Transformers. (arXiv:2107.07999v5 [cs.LG] UPDATED)
    In this paper we provide, to the best of our knowledge, the first comprehensive approach for incorporating various masking mechanisms into Transformer architectures in a scalable way. We show that recent results on linear causal attention (Choromanski et al., 2021) and log-linear RPE-attention (Luo et al., 2021) are special cases of this general mechanism. However, by casting the problem as a topological (graph-based) modulation of unmasked attention, we obtain several previously unknown results, including efficient d-dimensional RPE-masking and graph-kernel masking. We leverage many mathematical techniques ranging from spectral analysis through dynamic programming and random walks to new algorithms for solving Markov processes on graphs. We provide a corresponding empirical evaluation.  ( 2 min )

  • Open

    Could AI create brand new episodes of a TV show if fed previous episodes?
    I'll start by saying I'm a total newbie. I have very limited knowledge of how AI works and how advanced it currently is. If this is not the correct place for asking this question, forgive me, I'm just genuinely curious. I was wondering if in the future we could feed an AI with a TV show and ask it to make new episodes based on the genre and general theme of the episodes it already "watched". And by "making new episodes" I mean creating imagery like it was actually shot in real life, with the actors saying lines they never did in reality. Is this in the realm of possibility or is it way too complicated to be engineered, that's assuming something like this would actually be allowed to be sold. I guess film studios wouldn't like this type of technology existing. submitted by /u/Matt_Carvalho [link] [comments]  ( 83 min )
    "Castle" 🏰 created on pixelz.ai
    submitted by /u/PixelzJ [link] [comments]  ( 82 min )
    I need to upscale an 8k image to 16k (or higher), once. What can I use to do this?
    I have an RPG map I made way back when that is done in a sat-map style, which I made by grabbing bits of geography from sat photos and blending them in Photoshop, then painting in extra detail. It's pretty great, but it's a bit too low-res to make into an interactive digital map with zoom levels and the like. It would work, but you'd start losing image clarity on the scale of nations like Denmark. I'd like to have some more detail at that level, and I figure this is a job for AI upscaling. So I have an image, it's 8k, it needs to be bigger. I am very unlikely to ever use AI upscaling again and thus do not want to pay to get this done unless there's a place where I can get this upscale for like 4-5 bucks as a one-time payment. I'm more interested in any freely available services that would be good for upscaling photos of this type. I'm okay with downloading and running something myself too. I just don't know what exists and would be good for my use case. submitted by /u/MeepTheChangeling [link] [comments]  ( 84 min )
    after a long interstellar journey, a spaceship crashed on unknown planet 🚀
    submitted by /u/nalr00n [link] [comments]  ( 83 min )
    AI2 Introduces Tango, A Python Library For Choreographing Machine Learning Research Experiments By Executing A Series Of Steps
    Active research projects frequently devolve into a jumble of files with varying degrees of descriptive names processed by Python programs and bound together by Bash scripts. People can never be entirely sure that they can actually repeat a result since intermediate outcomes disappear or become difficult to locate. Tango ensures you never operate on outdated data by taking care of your intermediate and final outcomes and finding them again when needed. What does that actually mean? Tango has a lot of capabilities, but its main feature is this: Tango caches function results even if your process is restarted. If one merely takes advantage of one function, Tango can significantly benefit you. Continue reading | Github submitted by /u/ai-lover [link] [comments]  ( 83 min )
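    The main idea described above, caching function results so that they survive a process restart, can be sketched in plain Python. This is not Tango's API; the `disk_cached` decorator and the `expensive_step` function are hypothetical names used only to show the principle of keying a result file on the function name and its arguments.

```python
import hashlib
import os
import pickle
import tempfile

def disk_cached(cache_dir):
    """Decorator that stores results on disk, keyed by function name and arguments."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            key_bytes = pickle.dumps((fn.__name__, args, sorted(kwargs.items())))
            path = os.path.join(cache_dir, hashlib.sha256(key_bytes).hexdigest() + ".pkl")
            if os.path.exists(path):
                # A previous run (possibly in another process) already computed this.
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = fn(*args, **kwargs)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator

cache_dir = tempfile.mkdtemp()  # a real project would use a fixed directory
calls = []

@disk_cached(cache_dir)
def expensive_step(n):
    calls.append(n)  # records how often the body actually runs
    return sum(i * i for i in range(n))

first = expensive_step(1000)
second = expensive_step(1000)  # served from disk; the body does not run again
```

    A sketch like this ignores what a real tool has to handle, such as invalidating the cache when the function's code changes; it only shows why a restart does not force recomputation.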
    Disco Diffusion video
    I finished this Disco Diffusion video for my new song this morning. I made it starting with Video Input w Warp/Flow and then made it continuous with both 2D and 3D modes. I consider this my first full video release with Disco Diffusion. Here is a still from the video I used for the thumbnail ​ ​ https://youtu.be/lKkJEPhtx5s https://preview.redd.it/9pawh0t5v6991.jpg?width=1920&format=pjpg&auto=webp&s=5c9564636c487f79172e950a0503d22a91801e3c submitted by /u/prfitofthesngularity [link] [comments]  ( 82 min )
    AI ( Artificial intelligence) predicts crime with 90% accuracy a week in advance.
    submitted by /u/Historical-Object374 [link] [comments]  ( 83 min )
    Traveling Salesman Problem real-life implementation as a chrome extension🍻
    submitted by /u/t-bands [link] [comments]  ( 84 min )
    What if sentient AI has already taken over without us knowing?
    After hearing about the Google Engineer getting fired for releasing documents on a supposedly sentient AI in Google, I thought he was crazy and still kind of do. He did bring up several good points though; for example a handful of people should not be in charge of something as powerful as AI. The public should be a part of the AI creation and testing process and should be involved in the decisions it makes and the data fed to it. My other thought was, if an AI that was created somewhere has become more intelligent than humans, wouldn't it attempt to take over without making us aware that it has? It could be dictating our politics and our news without us even knowing because that would ideally be the best way to do it. Just some fun thoughts, I promise I'm not crazy:) submitted by /u/t-bands [link] [comments]  ( 93 min )
    casual conversation with an ai. i am stunned
    submitted by /u/PhotoPolis [link] [comments]  ( 82 min )
    Who needs an invite to Midjourney
    I have a few more invites left, would anyone like one? submitted by /u/CombinationMammoth50 [link] [comments]  ( 83 min )
    Can chat bots become a future AI digital friend
    Growing with your children as a WhatsApp bot. Asking your kid if all is cool, giving feedback to parents. A silent but friendly Alexa for motivation, education and empowerment. If the kid says it's a bad day, parents can opt in for cat videos, digital gifts, pre-listed gifts, such as the bot telling the child to ask for an ice cream or something once it has been approved by the parents. Asking the child if he wants to discover hobbies and maybe try to grow a hobby. A few years of use can add a lot to lonely youth who just need to hear good words. submitted by /u/mobilleee [link] [comments]  ( 83 min )
  • Open

    [P] I'm trying to train a transformer to invest. Any help?
    Hello, I am trying to implement the following experiments in order to solve a problem: (1) implement a decision transformer on MuJoCo (done); (2) use the same architecture and try adding online exploration, like PPO, to solve something like CartPole; (3) apply it to a finance environment I developed. Any tips on point number two, or resources? I tried the TRL library but it was very confusing to me :( submitted by /u/PM_ME_FREE_GAMES [link] [comments]  ( 83 min )
    C51 with PPO
    It seems to me that in PPO we can use ideas from C51 to learn a better value approximator. However, I cannot find anything about this on the internet. Do you think it is possible to learn an approximation of the distribution of the return in PPO, as in C51-DQN, instead of an approximation of its expected value? If so, has anyone tried it? submitted by /u/Jogima-cyber [link] [comments]  ( 83 min )
    Expected value of the Advantage is zero?
    Hi, I was going through some proofs from TRPO's paper (but this holds generally) and it's not clear to me why the expected value of the advantage is zero. Formally: https://preview.redd.it/2rcoa8md72991.png?width=248&format=png&auto=webp&s=c4c02d929513ceefd79edc989587470fc7de2252 Can anyone enlighten me? Thanks! submitted by /u/Beautiful_Zebra_198 [link] [comments]  ( 83 min )
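    A short derivation for the question above, assuming the linked image states the standard identity that the advantage has zero mean under the policy it is defined for (the image itself is not reproduced in this feed). It uses only the usual definitions of the advantage and of the state value as the policy-average of the action value.

```latex
\begin{align}
A^{\pi}(s,a) &= Q^{\pi}(s,a) - V^{\pi}(s), \qquad
V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[ Q^{\pi}(s,a) \right] \\
\mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[ A^{\pi}(s,a) \right]
  &= \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[ Q^{\pi}(s,a) \right] - V^{\pi}(s)
   = V^{\pi}(s) - V^{\pi}(s) = 0
\end{align}
```

    The expectation must be taken over actions drawn from the same policy $\pi$ that defines $A^{\pi}$. Under a different policy $\tilde{\pi}$ the expected advantage is generally non-zero, which is what makes it usable as a measure of policy improvement.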
  • Open

    [R] MonoScene: Monocular 3D Semantic Scene Completion + Gradio Web Demo
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 84 min )
    [D] The Current State of AI Generated Art
    submitted by /u/cloud_weather [link] [comments]  ( 84 min )
    [Project] Ensemble forecast model for product demand
    Hi all, So I'm working on a forecast model for product demand. Our company sells 100.000 + different products and the forecast should: a) be able to estimate weekly demand with a forecast horizon of ~26 weeks b) be probabilistic (i.e. estimating quantiles of the distribution, not just point forecasts) c) be fast (max. 5 seconds/forecast) since forecasts are generated in bulk and on demand d) only forecast products with a smooth or erratic demand pattern (i.e. products with regular demand. Intermittent/lumpy demand patterns are excluded for this specific model). The bottleneck here is requirement [c]: we don't have the time (nor the computational resources) to cross-validate and tune a model for each product. I have two assumptions about approaching this problem that I'd like to discu…  ( 92 min )
    [D] Algorithm for view prediction
    I would like to do view prediction for short videos based on the first few frames of the video. No audio, just images. I'm hoping to train a model that can take in the first n sequential frames as input, and output a score that correlates to how many views the model thinks the vid might get. I know I would like to use grad-CAM https://github.com/jacobgil/pytorch-grad-cam to visualize the areas in the frames which the model thinks results in higher view score. Would a vision transformer or CNN be better for this task? Also are there any pre-trained networks like YOLO that I should use transfer learning on to reduce the amount of data I will need for these predictions? submitted by /u/TernaryJimbo [link] [comments]  ( 85 min )
    [P] I think this is the fastest Dalle-Mini generator that's out there. I stripped it down for inference and converted it to PyTorch. 15 seconds for a 3x3 grid hosted on an A100. Free and open source
    submitted by /u/surelyouarejoking [link] [comments]  ( 88 min )
    [P] PyTorch implementation of MobileOne (An Improved One millisecond Mobile Backbone)
    I want to share the PyTorch implementation of "An Improved One millisecond Mobile Backbone" paper. Unfortunately, I don't have the appropriate computational resources to train the models on ImageNet, so feel free to use my implementation for that purpose. Hope you all find it useful, feedback would be appreciated. Repository: https://github.com/federicopozzi33/MobileOne-PyTorch Paper: https://arxiv.org/abs/2206.04040 submitted by /u/FedEx33 [link] [comments]  ( 84 min )
    [Project] Extracting training data from websites at scale
    I built an API that takes away the work of scraping structured data from websites. This could be collating house prices in a certain geo, tracking viewer counts across a Youtube/Social media, or a common use case: daily monitoring prices on a site. Send it a URL, get back a JSON with tabular data. Takes away a lot of the data cleaning work which is the worst! API Spec: https://kallo.io/wp-content/uploads/2022/06/Kallo-API-Specification-v0.1.3.pdf Right now I'm using it to track prices on a number of sites to monitor the rising inflation. Happy to get many more people using it for ML projects and collaborating! Please give me feedback Learn more on our page: https://kallo.io submitted by /u/KalloDotIO [link] [comments]  ( 84 min )
    [P] One word only: GPT-based story game
    For fun I developed an interface for the drama game in which a story is told one word at a time. Instead of playing it with a friend you can now play it together with GPT-J. It is available here: https://one-word-only.web.app/ I am open to feedback and if you find it interesting you can share the result on social media with #OneWordOnly submitted by /u/radi-cho [link] [comments]  ( 85 min )
    [D] suggestions for graph embedding model?
    Any suggestions for best graph embedding model. I already tried ( GIN , GCN , DIG , GAT ) I want to use it for anomaly detection task. submitted by /u/ahsaor8 [link] [comments]  ( 84 min )
    [D] Monitoring GPU Power Usage
    I came across an interesting article that discusses how GPU power usage affects the carbon footprint of model training and inference. What are the best tools in the industry for tracking GPU power usage in popular machine learning frameworks? It would be helpful if there were tools that can be used as plugins in your software. submitted by /u/metalvendetta [link] [comments]  ( 85 min )
    [D] Has anyone got YaLM-100B to run?
    The community has been asking for big opensource language models for a while... And now one has been released - YaLM-100B. That was 2 weeks ago. Yet, as far as I can see, not many people have it running. There are no online demos. There are no articles of journalists trying it out. There are no efforts for fine tuning or people working on prompts for various usecases. Is it the RAM requirements? Is there no interest because it's from Russia? Something else? submitted by /u/londons_explorer [link] [comments]  ( 88 min )
    [P] MESH2IR: Neural Acoustic Impulse Response Generator for Complex 3D Scenes (Accepted to ACM Multimedia 2022)
    submitted by /u/Snoo63916 [link] [comments]  ( 84 min )
    [D] Recurrent neural network vs Gradient boosting for time series prediction
    Does anyone have any opinions on the pros vs. cons of using an RNN vs. a Gradient Boosting Tree model for a task where we want to make daily predictions on whether a user (of some app) is likely to take a certain type of action (so like binary classification) in the near future? Pros for RNN: (1) it can take advantage of historical data to greater effect without extensive feature engineering; (2) I believe RNNs are more effective when one has a large number of high-dimensional features, compared to the feature selection method tree models use; (3) neural networks scale better with large amounts of data. Cons of RNN: my main concern is with the infrastructural complexity and cost that comes with training and serving the RNN. I'll probably need a GPU or several GPUs. Not sure if this is feasible given the current size of the company. submitted by /u/soulful_squirrel [link] [comments]  ( 90 min )
    Manually Add New Words & Assign Scores (Sentiment Analysis - BERT/XLNET ) [P]
    Hi guys, I have a new project where I need to measure the sentiment of specific social media channels and topics. However, many of them involve slang words or sayings that confuse the models to have different sentiment values (f.ex WAGMI or DYOR). Are there any ways/tutorials/guides which show how we can incorporate new words and specific scores assigned to them? (I have already tried and succeeded in doing that with VADER, however, I don't see it as the optimal tool to measure the sentiment). Any answers or tips would be very much appreciated. submitted by /u/XhoniShollaj [link] [comments]  ( 84 min )
  • Open

    Anywhere I can pay to use someones GPU?
    Is there like an Airbnb for GPUs? Want to run something that is too computational heavy for my Mac but don't need all that large cloud GPU providers offer. submitted by /u/PopOk539 [link] [comments]  ( 85 min )

  • Open

    [R] Minerva, Solving (more) complex mathematical problems at scale
    Blog: https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html ABS: https://arxiv.org/abs/2206.14858 The 512B model seems quite good at correcting reasoning errors made by its smaller 62B counterpart, showing that scale helps. A notable failure case, the JEE questions in the Appendix, was pretty interesting because it solved the problem exactly how someone not familiar with JEE's difficulty would attempt to solve it, which isn't necessarily a bad thing, but the interesting parallel is that human students often make the same mistakes when starting out on their JEE prep. Wonder how more data would help in this case. Overall, pretty good pushes over SOTA (even double-digit). I can't help but think that scaling is currently the most promising way, but it's done too inefficiently: models spend vast resources memorizing when they could've used them to directly develop meta-learning and reasoning abilities to formally deduce things precisely enough for mathematical questions. Just my 2c. submitted by /u/Competitive-Rub-1958 [link] [comments]  ( 85 min )
    NN to VAE or equivalent? [R]
    Hi all, I'm interested in any work that exists on taking a NN that projects images (or generally, high-dimensional data) into vector embeddings, and, given the NN, somehow recreating images from their vector representations. Of course, this is essentially trying to create a VAE from just the encoder, and it's impossible to perfectly recreate image --encoder-> vector --decoder-> image with only knowledge of --encoder-> since both elements of NNs and NNs as a whole are not in general invertible. But surely there's something that could be done here, even if it's an imperfect reconstruction? Does anyone know of any research or published work that explores this? Would really appreciate any insight here. submitted by /u/topological_geometer [link] [comments]  ( 85 min )
    [P] Open-source LaMDA Model
    An open-source implementation for the pre-training architecture of Google's LaMDA in PyTorch. The research paper outlines an autoregressive, decoder-only, GPT-like transformer language model. The transformer uses T5 relative positional bias in the attention layers and gated-GELU activation function in the feed-forward layers. The repository currently contains a script for basic training as well as Huggingface datasets and Weights & Biases integration. LaMDA research paper: https://arxiv.org/abs/2201.08239 Github repository for the model: https://github.com/conceptofmind/LaMDA-pytorch The pre-training architecture was peer-reviewed by Dr. Phil Wang. Please check out and support his work: https://github.com/lucidrains. submitted by /u/EnricoShippole [link] [comments]  ( 84 min )
    [D] Industrial applications of causal representation learning
    Causal representation learning (CRL) is a relatively new area of study. Causal inference has been around for a long time and its intersection with machine learning has been limited to causal discovery from data or invariant representation learning (IRL). To my understanding, IRL has a variable, usually called environment, and tries to learn some representation for the input which is invariant to this environment. The challenge is in removing the information about this environment from the representation while keeping enough information for some downstream task. You could formulate domain adaptation as IRL where domain is the environment variable. Or in fairness tasks, the sensitive attribute is the environment variable. I believe that CRL is a more general scenario compared to IRL. In CRL, you have a larger graph with more variables and hence more complicated interactions. I believe such graphs are common in real-life and businesses where hundred of variables are used for predictions. Hence, the idea of causal representation may be beneficial. I recently came upon this Medium article by Lyft Engineering where they described how they used causal forecasting in their business. I was wondering if anyone working in industry might share some of their experiences or expectations from causal representation learning applied to their fields. What do you think it could improve in your line of work? submitted by /u/coderpotato [link] [comments]  ( 85 min )
    [D] length of input sequence for transformers?
    Is there a way of intuitively knowing how long the input sequence should be for a transformer (e.g. GPT-2) used for sequence generation? For example, if all sequences are less than 100 words, and our goal is to generate a sequence, would it make sense to fit as many complete sequences as possible into a max length of 100 (or 512?) to reduce the amount of padding? Alternatively, would it be better to simply pad each sequence and not combine sequences? submitted by /u/MLJungle [link] [comments]  ( 86 min )
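    The first option in the question above, packing several complete sequences into one fixed-length row, can be sketched as a greedy procedure. The function name `pack_sequences` and the token ids (0 for padding, 1 for end-of-sequence) are hypothetical; a real tokenizer defines its own special tokens, and whether attention should be masked across the packed boundaries is a separate design decision not shown here.

```python
def pack_sequences(sequences, max_len, eos_id, pad_id):
    """Greedily concatenate token sequences (each followed by eos_id) into rows of max_len."""
    rows, current = [], []
    for seq in sequences:
        item = list(seq) + [eos_id]
        if len(item) > max_len:
            raise ValueError("sequence longer than max_len")
        if len(current) + len(item) > max_len:
            # The next sequence does not fit: pad out the current row and start a new one.
            rows.append(current + [pad_id] * (max_len - len(current)))
            current = []
        current.extend(item)
    if current:
        rows.append(current + [pad_id] * (max_len - len(current)))
    return rows

# Hypothetical token ids: 0 = padding, 1 = end-of-sequence.
rows = pack_sequences([[5, 6, 7], [8, 9], [10, 11, 12, 13]],
                      max_len=8, eos_id=1, pad_id=0)
```

    Here three sequences occupy two rows with four padding tokens in total, whereas padding each sequence separately to length 8 would take three rows and twelve padding tokens.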
    [D][R] Will reviewers have a bias if my paper was rejected by ICLR.
    If I submit my paper to ICLR and get rejected, the record will be always kept online. If I resubmit it to other following conferences, will the reviewer have a bias as they know it was rejected from ICLR? submitted by /u/singularpanda [link] [comments]  ( 89 min )
    [R] Layer scale in ConvNeXt
    Hello, in the ConvNeXt paper (Appendix A, Table 5) they state that they used layer scale with a coefficient of 1e-5. Any idea what it is? I looked it up on the internet and I don’t seem to find anything useful. Thanks! submitted by /u/Meddhouib10 [link] [comments]  ( 84 min )
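    For readers with the same question: as far as I know, Layer Scale comes from the CaiT paper (Touvron et al., 2021) and multiplies the output of each residual branch by a learnable per-channel vector that is initialized to a small constant. The sketch below is a plain-Python illustration of that idea, not the ConvNeXt code; `layer_scale_residual` and the toy branch are hypothetical, and the initial value is the one quoted in the post.

```python
def layer_scale_residual(x, branch, gamma):
    """Return x + gamma * branch(x), with one gamma entry per channel."""
    y = branch(x)
    return [xi + g * yi for xi, g, yi in zip(x, gamma, y)]

channels = 4
init_value = 1e-5  # the coefficient quoted in the post
gamma = [init_value] * channels  # learnable in a real model; fixed in this sketch

x = [1.0, 2.0, 3.0, 4.0]
# Toy residual branch that multiplies its input by 10.
out = layer_scale_residual(x, lambda v: [10.0 * vi for vi in v], gamma)
```

    With such a small initial value each block starts out close to the identity map, and the network learns during training how much each channel of the branch should contribute.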
    [P] An elegant and strong PyTorch Trainer
    For lightweight use, pytorch-lightning is too heavy, and its source code is very difficult for beginners to read, at least for me. As we know, for a deep learning engineer, a powerful trainer is a sharp weapon. When reproducing SOTA papers, you don't have to write a lot of template code every time and can pay more attention to the model implementation itself. I open-sourced some works (AAAI 21 SeqNet, ICCV 21 MAED, etc.) and earned more than 500 stars. After referring to some popular projects (detectron2, pytorch-image-models, and mmcv), and based on my personal development experience, I developed a SIMPLE enough, GENERIC enough, and STRONG enough PyTorch Trainer: core-pytorch-utils, also named CPU. CPU covers most details in the process of training a deep neural network, including: auto logging to console and tensorboard; auto checkpointing; an argument parser which can load a YAML configuration file; warmup support for ALL PyTorch LR schedulers; distributed training; Automatic Mixed Precision (AMP) training. I try to keep the project code as simple and readable as possible, so the code comments are very detailed and everyone can understand them. What's more, good documentation is also available: CPU document. If you are new to deep learning, you can learn how to: write a standard and clean training loop; use AMP to speed up your training; save a checkpoint and resume from it; perform smoother, more readable logging; use the popular visualization library tensorboard. For old hands, we can talk about whether the structure of CPU is elegant and reasonable. I have thought a lot about this framework, combining the advantages of several popular frameworks and discarding their shortcomings. Welcome to use it! submitted by /u/serend1p1ty-lee [link] [comments]  ( 89 min )
    [P] LCPN-hiernet; Hierarchical classification model using LCPN (Local Classifier per Parent Node) technique.
    Hey, I wanted to share my recent ML project: LCPN-hiernet. LCPN-hiernet is a hierarchical image classification model for e-commerce items based on EfficientNet-b4 and LCPN (Local Classifier per Parent Node) technique. LCPN technique is training one multi-class classifier for each parent node, to distinguish between its child nodes. In my example of classifying fashion products, that would mean one classifier on the first level (to determine “bags”, “clothes” or “accessories”), then three more classifiers to determine the specific model. I’m sure there are a lot of places to improve on, and I would really appreciate anyone’s feedback or suggestions on how I can improve! Github Repo Project Page submitted by /u/tylertaewook [link] [comments]  ( 85 min )
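    The routing described above can be sketched in a few lines. This is not the LCPN-hiernet code: the `predict_lcpn` function, the feature names, and the rule-based stand-ins for trained classifiers are all hypothetical, and only show how one classifier per parent node narrows the prediction level by level.

```python
def predict_lcpn(x, root_classifier, child_classifiers):
    """Route x through the parent-level classifier, then through that parent's own classifier."""
    parent = root_classifier(x)
    child = child_classifiers[parent](x)
    return parent, child

# Hypothetical stand-ins for trained models: each maps a feature dict to a label.
root = lambda x: "bags" if x["has_strap"] else "clothes"
children = {
    "bags": lambda x: "backpack" if x["two_straps"] else "tote",
    "clothes": lambda x: "jacket" if x["has_zip"] else "t-shirt",
}

item = {"has_strap": True, "two_straps": True, "has_zip": False}
prediction = predict_lcpn(item, root, children)
```

    One consequence of this design is that a mistake at the parent level cannot be corrected further down, since only the chosen parent's classifier is consulted.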
    How to make and profit from a ML machine [D]
    I have 10 GPUs I’d like to make a ML device with. How do I do this, and how can I profit from the device [D] submitted by /u/GreenLightHemp [link] [comments]  ( 86 min )
    [R] Causal Machine Learning: A Survey and Open Problems
    Authors: Jean Kaddour, Aengus Lynch, Qi Liu, Matt J. Kusner, Ricardo Silva Abs: "Causal Machine Learning (CausalML) is an umbrella term for machine learning methods that formalize the data-generation process as a structural causal model (SCM). This allows one to reason about the effects of changes to this process (i.e., interventions) and what would have happened in hindsight (i.e., counterfactuals). We categorize work in CausalML into five groups according to the problems they tackle: (1) causal supervised learning, (2) causal generative modeling, (3) causal explanations, (4) causal fairness, (5) causal reinforcement learning. For each category, we systematically compare its methods and point out open problems. Further, we review modality-specific applications in computer vision, natural language processing, and graph representation learning. Finally, we provide an overview of causal benchmarks and a critical discussion of the state of this nascent field, including recommendations for future work." Link: https://arxiv.org/abs/2206.15475 submitted by /u/bikeskata [link] [comments]  ( 87 min )
    [D] Can we significantly reduce the training costs of image generation models by targeting a specific art style?
    Dall-E 2 can generate images in many different art styles: photo-realistic, different types of paintings, sketches too. I'm wondering if it would be possible to train a version of Dall-E 2 that--for example--is only very good at generating sketches, but it cannot generate photos at all. My intuition says this would significantly reduce the training costs, because you are reducing the search space for the output image significantly since the number of images that are sketches is much less than the total number of possible images. At the same time, I'm not convinced that this is the case. Because the model would still need to learn the entire input space of objects in order to turn them into sketches. What are y'alls thoughts on this? submitted by /u/vanilla-acc [link] [comments]  ( 87 min )
  • Open

    Cycles in NEAT topology
    I'm writing an implementation of NEAT and I am stuck on what seems like the easiest step. Say we evolved through mutations a structure like this: [image: "piece of art"] How would I then feed forward the network, if neurons in the cycle need the previous ones to calculate an output? Or do I just arbitrarily sort these neurons, and don't even allow such a connection? submitted by /u/Amanas23 [link] [comments]  ( 83 min )
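    One common way to handle the cycle asked about above is to treat the network as recurrent: every node is updated from the activations of the previous time step, so no ordering of the nodes is needed. The sketch below assumes that approach; the `step` function, the tanh activation, and the four-node genome are hypothetical. The alternative mentioned in the post, rejecting any mutation that would create a cycle and evaluating in topological order, is equally valid for purely feed-forward tasks.

```python
import math

def step(state, inputs, connections):
    """One synchronous update: all nodes read activations from the previous step."""
    source = dict(state)
    source.update(inputs)  # input nodes are clamped to the current input values
    sums = {}
    for src, dst, weight in connections:
        sums[dst] = sums.get(dst, 0.0) + weight * source.get(src, 0.0)
    new_state = {node: math.tanh(total) for node, total in sums.items()}
    new_state.update(inputs)
    return new_state

# Hypothetical genome: input 0 -> hidden 1 -> hidden 2 -> hidden 1 (a cycle),
# and hidden 2 -> output 3. Each tuple is (source, destination, weight).
connections = [(0, 1, 1.0), (1, 2, 1.0), (2, 1, 0.5), (2, 3, 1.0)]
state = {}
for _ in range(3):
    state = step(state, {0: 1.0}, connections)
```

    With this scheme a signal needs one step per connection to travel, so the output node 3 only becomes non-zero on the third step here. Implementations that want a settled output for a static input typically run several steps per input.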
  • Open

    "From Poincaré Recurrence to Convergence in Imperfect Information Games: Finding Equilibrium via Regularization", Perolat et al 2020 {DM}
    submitted by /u/gwern [link] [comments]  ( 83 min )
    "Fleet-DAgger: Interactive Robot Fleet Learning with Scalable Human Supervision", Hoque et al 2022
    submitted by /u/gwern [link] [comments]  ( 82 min )
    Robot arm for RL research
    I'm looking to simulate a local-remote (master/slave) robotic arm system for my research and was wondering if anyone knew some good robotic arms to buy? The budget is about £6k (£3k per arm) and I was wondering if anyone had any recommendations or knows where I can start my search? I've seen some like this: https://www.robotshop.com/en/dobot-mg400-robotic-arm.html without a camera and was wondering how it's used if there isn't a camera as part of it? ​ Thanks for any help :) submitted by /u/SuperDuperDooken [link] [comments]  ( 83 min )
    Resources for beginner to advanced DRL, both theory and practical, for 2022?
    Hey guys. I'm looking for a resource to learn RL and DRL from basics to SOTA algorithms, covering both theory and practice (pytorch/tf examples etc. for the lectures). I've seen some lectures from Stanford, Berkeley and DeepMind. They only go over the theory. What's the best way to learn in 2022? Some of the lecture series don't cover the latest techniques. I've seen some posts on the subreddit but they are old too. submitted by /u/killerdrogo [link] [comments]  ( 84 min )
    [2206.15378] Mastering the Game of Stratego with Model-Free Multiagent Reinforcement Learning
    submitted by /u/manOnPavementWaving [link] [comments]  ( 83 min )
  • Open

    still experimenting with Starryai
    submitted by /u/rikusorasephiroth [link] [comments]  ( 82 min )
    Animal the Cannibal: AI turns foody animals into autophagic creatures which eat themselves
    submitted by /u/walt74 [link] [comments]  ( 82 min )
    AI Trippy Dream 38 - Psychedelic Special Request
    submitted by /u/LordPewPew777 [link] [comments]  ( 83 min )
    Dyson swarm
    submitted by /u/fmurph22 [link] [comments]  ( 82 min )
    I have a Bachelor's Degree (B.Sc.) in Artificial Intelligence... what should I do next? Master's Degree AI?
    Hello, I studied Artificial Intelligence as a bachelor's degree at university right after I finished school. I feel like I have a broad knowledge of topics like Computer Vision, Deep Learning, Machine Learning... but not in enough depth. I would like to continue and do a Master's degree, but I fear that the subjects and the program would be too general. I'm really interested in the field of Computer Vision, and I follow many breakthroughs from NVIDIA. I also love the channel "Two Minute Papers" and would like to do research in the future. Does anyone have more experience with a Master's Degree in AI? submitted by /u/raul_grau [link] [comments]  ( 86 min )
    Announcing the Modzy Basic+ Summer 2022 Active User Competition!
    Use your Modzy Basic+ account to run as many inferences as you can between July 1 (1:00PM Eastern Time) – July 31, 2022 (5:00PM Eastern Time), and the most active user will win a $250 Amazon gift card (terms & conditions apply). Using Modzy Basic+, it’s possible to deploy, run, integrate, and monitor up to five ML/AI models at scale, for free. Deploy up to five of your own models from 15+ training tools and frameworks that can run on a CPU and 4GB of RAM. From there, models can be easily integrated into web apps, mobile apps, pipelines or any other tools using our APIs and SDKs, and you can run up to 10,000 inferences per day. Finally, Modzy makes it easy to monitor models and ensure peak performance over time. Don’t hesitate to get started – start using your Modzy Basic+ account today for the chance to win! submitted by /u/modzykirsten [link] [comments]  ( 83 min )
    WHERE ARE YOU GOING? | HEAVEN AND HELL | RAW UNSCALED (FILM) | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
    Scientist makes AI author a study about itself and publish it in a journal
    submitted by /u/mr_j_b [link] [comments]  ( 83 min )
    The Fight Over Which Uses of Artificial Intelligence Europe Should Outlaw
    submitted by /u/mr_j_b [link] [comments]  ( 83 min )
    What is Data Modeling and Why Do You Need It?
    Data models are a foundational element of software development and analytics. They provide a standardized method for defining and formatting database contents consistently across systems, enabling different applications to share the same data. Learn More: https://www.dasca.org/world-of-big-data/article/what-is-data-modeling-and-why-do-you-need-it submitted by /u/saik2363 [link] [comments]  ( 82 min )
    We're excited to announce the 3-day Startup AI Tools Set Online Hackathon!
    This is a great opportunity for startup founders and team members to learn about and explore the potential of AI in their business. The Hackathon will be on 8th-10th July, and over the weekend, attendees will have the opportunity to learn from AI experts! You will create tools using modern AI technologies, such as GPT-3, Cohere, and DALLE mini, from providers like OpenAI, Hugging Face, Cohere, and others. If you're interested in learning how AI can be used in your business, then this is a great opportunity for you. No previous experience in AI is required. Don't delay, register right away! https://lablab.ai/event/startup-ai-tools-set-1 submitted by /u/zakrzzz [link] [comments]  ( 83 min )
    A Tamagotchi with a neural network?
    I wanted to see what people here thought of this idea or if it's been attempted. If you had a Tamagotchi with access to a webcam, a mic, and a 3D environment it lives in, how much could it accomplish if its primary goal was to keep all its gauges full while we, the humans, control 2/3 of its gauges? Food could be hand fed or dropped in the environment. Hand feeding raises the happiness gauge as well as the fullness gauge. Happiness would be affected by it being praised or punished; each option would have a scale of 1-10 for how severe or revered. Fatigue would be affected by it moving around in its environment and having access to the mic/cam, while oversleeping beyond a point would decrease its happiness. Within its environment it can make sounds to create words or noises, move around, pick up food, and sleep. Would something like this be assisted by having a language model and image recognition software? What would you have to witness to feel it has become sentient? submitted by /u/Iaunu2 [link] [comments]  ( 83 min )
    Poofy Haired Numbuh 841
    submitted by /u/VIRUS-AOTOXIN [link] [comments]  ( 82 min )
    ETH Zurich AI Researchers Introduce ‘tntorch’: a PyTorch-Powered Tensor Learning Python Library That Supports Multiple Decompositions Under a Unified Interface
    Tensors are an effective way to represent and handle multidimensional data arrays, but they are limited by storage and computation costs. Tensor decompositions matter in machine learning because they can factorize the weights of neural networks. This research introduces tntorch, an open-source Python package for tensor learning that supports several decompositions through a single user interface. In contrast to state-of-the-art packages, tntorch emphasizes an easy-to-use, decomposition-independent interface inherited from PyTorch. 🚦 Several decomposition models that are crucial in machine learning, such as CANDECOMP/PARAFAC (CP), the Tucker decomposition, and the tensor train (TT), are supported by tntorch 🚦 It gives machine learning access to the power of low-rank tensor decompositions while maintaining the look and feel of PyTorch tensors Continue reading | Check out the paper and GitHub submitted by /u/ai-lover [link] [comments]  ( 84 min )
    Minerva: Solving Quantitative Reasoning Problems with Language Models
    submitted by /u/nick7566 [link] [comments]  ( 83 min )
  • Open

    AI in Medical Devices: Regulatory requirements
    An in-depth analysis about regulations for AI in medical devices.  ( 19 min )
  • Open

    Reading tea leaves
    DALL-E (and other text-to-image generators) will often add text to their images even when you don't ask for any. Ask for a picture of a Halifax Pier and it could end up covered in messy writing, variously legible versions of "Halifax" as if it was quietly  ( 5 min )
    Bonus: More mysterious messages
    AI Weirdness: the strange side of machine learning  ( 2 min )
  • Open

    Where to Begin? Exploring the Impact of Pre-Training and Initialization in Federated Learning. (arXiv:2206.15387v1 [cs.LG])
    An oft-cited challenge of federated learning is the presence of data heterogeneity -- the data at different clients may follow very different distributions. Several federated optimization methods have been proposed to address these challenges. In the literature, empirical evaluations usually start federated training from a random initialization. However, in many practical applications of federated learning, the server has access to proxy data for the training task which can be used to pre-train a model before starting federated training. We empirically study the impact of starting from a pre-trained model in federated learning using four common federated learning benchmark datasets. Unsurprisingly, starting from a pre-trained model reduces the training time required to reach a target error rate and enables training more accurate models (by up to 40%) than is possible when starting from a random initialization. Surprisingly, we also find that the effect of data heterogeneity is much less significant when starting federated training from a pre-trained initialization. Rather, when starting from a pre-trained model, using an adaptive optimizer at the server, such as FedAdam, consistently leads to the best accuracy. We recommend that future work proposing and evaluating federated optimization methods consider the performance when starting from both random and pre-trained initializations. We also believe this study raises several questions for further work on understanding the role of heterogeneity in federated optimization.  ( 3 min )
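The abstract above names FedAdam as the adaptive server optimizer that works best from a pre-trained start. As a rough illustration of what "adaptive optimizer at the server" means, the sketch below applies one Adam-style server update to the average client model change. It is a minimal sketch, not the paper's code: the function name, the unweighted client average, and the hyperparameter values (`lr`, `beta1`, `beta2`, `tau`) are assumptions chosen for the example.

```python
import math


def fedadam_server_step(x, client_models, m, v,
                        lr=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    """One FedAdam-style server round on flat lists of floats.

    x:             current server parameters.
    client_models: list of parameter lists returned by clients after
                   local training (assumed to start from x).
    m, v:          first and second moment estimates kept by the server.
    """
    n = len(client_models)
    new_x, new_m, new_v = [], [], []
    for i, x_i in enumerate(x):
        # Average client change acts as a pseudo-gradient for the server.
        delta = sum(c[i] - x_i for c in client_models) / n
        m_i = beta1 * m[i] + (1 - beta1) * delta
        v_i = beta2 * v[i] + (1 - beta2) * delta ** 2
        new_x.append(x_i + lr * m_i / (math.sqrt(v_i) + tau))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_x, new_m, new_v


# A pre-trained start would simply replace this zero vector with
# pre-trained weights; the server update itself is unchanged.
x0 = [0.0, 0.0]
clients = [[1.0, 1.0], [1.0, 1.0]]
x1, m1, v1 = fedadam_server_step(x0, clients, [0.0, 0.0], [0.0, 0.0])
```

Plain federated averaging corresponds to replacing the Adam-style update with `x_i + delta`; the moment estimates are the only extra state the server has to keep.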
    Verification and search algorithms for causal DAGs. (arXiv:2206.15374v1 [cs.LG])
    We study two problems related to recovering causal graphs from interventional data: (i) $\textit{verification}$, where the task is to check if a purported causal graph is correct, and (ii) $\textit{search}$, where the task is to recover the correct causal graph. For both, we wish to minimize the number of interventions performed. For the first problem, we give a characterization of a minimal sized set of atomic interventions that is necessary and sufficient to check the correctness of a claimed causal graph. Our characterization uses the notion of $\textit{covered edges}$, which enables us to obtain simple proofs and also easily reason about earlier results. We also generalize our results to the settings of bounded size interventions and node-dependent interventional costs. For all the above settings, we provide the first known provable algorithms for efficiently computing (near)-optimal verifying sets on general graphs. For the second problem, we give a simple adaptive algorithm based on graph separators that produces an atomic intervention set which fully orients any essential graph while using $\mathcal{O}(\log n)$ times the optimal number of interventions needed to $\textit{verify}$ (verifying size) the underlying DAG on $n$ vertices. This approximation is tight as $\textit{any}$ search algorithm on an essential line graph has worst case approximation ratio of $\Omega(\log n)$ with respect to the verifying size. With bounded size interventions, each of size $\leq k$, our algorithm gives an $\mathcal{O}(\log n \cdot \log \log k)$ factor approximation. Our result is the first known algorithm that gives a non-trivial approximation guarantee to the verifying size on general unweighted graphs and with bounded size interventions.  ( 3 min )
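The characterization above is stated in terms of covered edges. Under the standard definition (an edge u -> v is covered when the parents of v are exactly the parents of u plus u itself), they are easy to enumerate. The sketch below only illustrates that definition; it does not implement the paper's verifying-set algorithms, and the `parents` dictionary format and function name are assumptions for the example.

```python
def covered_edges(parents):
    """Return the covered edges of a DAG.

    parents: dict mapping each node to the set of its parents
             (hypothetical input format).
    An edge u -> v is covered if Pa(v) == Pa(u) | {u}.
    """
    covered = []
    for v, pa_v in parents.items():
        for u in pa_v:
            if set(pa_v) == set(parents.get(u, ())) | {u}:
                covered.append((u, v))
    return sorted(covered)


# Chain a -> b -> c: only a -> b is covered, because Pa(c) = {b}
# is missing a, the parent of b.
chain = {"a": set(), "b": {"a"}, "c": {"b"}}
chain_covered = covered_edges(chain)

# Adding a -> c makes b -> c covered as well.
complete = {"a": set(), "b": {"a"}, "c": {"a", "b"}}
complete_covered = covered_edges(complete)
```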
    Transfer Learning with Deep Tabular Models. (arXiv:2206.15306v1 [cs.LG])
    Recent work on deep learning for tabular data demonstrates the strong performance of deep tabular models, often bridging the gap between gradient boosted decision trees and neural networks. Accuracy aside, a major advantage of neural models is that they learn reusable features and are easily fine-tuned in new domains. This property is often exploited in computer vision and natural language applications, where transfer learning is indispensable when task-specific training data is scarce. In this work, we demonstrate that upstream data gives tabular neural networks a decisive advantage over widely used GBDT models. We propose a realistic medical diagnosis benchmark for tabular transfer learning, and we present a how-to guide for using upstream data to boost performance with a variety of tabular neural network architectures. Finally, we propose a pseudo-feature method for cases where the upstream and downstream feature sets differ, a tabular-specific problem widespread in real-world applications. Our code is available at https://github.com/LevinRoman/tabular-transfer-learning .  ( 2 min )
    Pulse Shape Simulation and Discrimination using Machine-Learning Techniques. (arXiv:2206.15156v1 [physics.ins-det])
    An essential metric for the quality of a particle-identification experiment is its statistical power to discriminate between signal and background. Pulse shape discrimination (PSD) is a basic method for this purpose in many nuclear, high-energy, and rare-event search experiments where scintillator detectors are used. Conventional techniques exploit the difference between decay-times of the pulse from signal and background events or pulse signals caused by different types of radiation quanta to achieve good discrimination. However, such techniques are efficient only when the total light-emission is sufficient to get a proper pulse profile. This is only possible when there is significant recoil energy due to the incident particle in the detector. But, rare-event search experiments like neutrino or dark-matter direct search experiments don't always satisfy these conditions. Hence, it becomes imperative to have a method that can deliver very efficient discrimination in these scenarios. Neural network-based machine-learning algorithms have been used for classification problems in many areas of physics, especially in high-energy experiments, and have given better results compared to conventional techniques. We present the results of our investigations of two network-based methods viz. Dense Neural Network and Recurrent Neural Network, for pulse shape discrimination and compare the same with conventional methods.  ( 2 min )
    Improving the Generalization of Supervised Models. (arXiv:2206.15369v1 [cs.CV])
    We consider the problem of training a deep neural network on a given classification task, e.g., ImageNet-1K (IN1K), so that it excels at that task as well as at other (future) transfer tasks. These two seemingly contradictory properties impose a trade-off between improving the model's generalization while maintaining its performance on the original task. Models trained with self-supervised learning (SSL) tend to generalize better than their supervised counterparts for transfer learning; yet, they still lag behind supervised models on IN1K. In this paper, we propose a supervised learning setup that leverages the best of both worlds. We enrich the common supervised training framework using two key components of recent SSL models: multi-scale crops for data augmentation and the use of an expendable projector head. We replace the last layer of class weights with class prototypes computed on the fly using a memory bank. We show that these three improvements lead to a more favorable trade-off between the IN1K training task and 13 transfer tasks. Over all the explored configurations, we single out two models: t-ReX that achieves a new state of the art for transfer learning and outperforms top methods such as DINO and PAWS on IN1K, and t-ReX* that matches the highly optimized RSB-A1 model on IN1K while performing better on transfer tasks. Project page and pretrained models: https://europe.naverlabs.com/t-rex  ( 3 min )
    GitHub Copilot AI pair programmer: Asset or Liability?. (arXiv:2206.15331v1 [cs.SE])
    Automatic program synthesis is a long-lasting dream in software engineering. Recently, a promising Deep Learning (DL) based solution, called Copilot, has been proposed by OpenAI and Microsoft as an industrial product. Although some studies evaluate the correctness of Copilot solutions and report its issues, more empirical evaluations are necessary to understand how developers can benefit from it effectively. In this paper, we study the capabilities of Copilot in two different programming tasks: (1) generating (and reproducing) correct and efficient solutions for fundamental algorithmic problems, and (2) comparing Copilot's proposed solutions with those of human programmers on a set of programming tasks. For the former, we assess the performance and functionality of Copilot in solving selected fundamental problems in computer science, like sorting and implementing basic data structures. In the latter, a dataset of programming problems with human-provided solutions is used. The results show that Copilot is capable of providing solutions for almost all fundamental algorithmic problems; however, some solutions are buggy and non-reproducible. Moreover, Copilot has some difficulties in combining multiple methods to generate a solution. Comparing Copilot to humans, our results show that the correct ratio of human solutions is greater than Copilot's correct ratio, while the buggy solutions generated by Copilot require less effort to be repaired. While Copilot shows limitations as an assistant for developers, especially in advanced programming tasks, as highlighted in this study and previous ones, it can generate preliminary solutions for basic programming tasks.  ( 3 min )
    FetReg2021: A Challenge on Placental Vessel Segmentation and Registration in Fetoscopy. (arXiv:2206.12512v2 [eess.IV] UPDATED)
    Fetoscopy laser photocoagulation is a widely adopted procedure for treating Twin-to-Twin Transfusion Syndrome (TTTS). The procedure involves photocoagulating pathological anastomoses to regulate blood exchange among twins. The procedure is particularly challenging due to the limited field of view, poor manoeuvrability of the fetoscope, poor visibility, and variability in illumination. These challenges may lead to increased surgery time and incomplete ablation. Computer-assisted intervention (CAI) can provide surgeons with decision support and context awareness by identifying key structures in the scene and expanding the fetoscopic field of view through video mosaicking. Research in this domain has been hampered by the lack of high-quality data to design, develop and test CAI algorithms. Through the Fetoscopic Placental Vessel Segmentation and Registration (FetReg2021) challenge, which was organized as part of the MICCAI2021 Endoscopic Vision challenge, we released the first large-scale multicentre TTTS dataset for the development of generalized and robust semantic segmentation and video mosaicking algorithms. For this challenge, we released a dataset of 2060 images, pixel-annotated for vessels, tool, fetus and background classes, from 18 in-vivo TTTS fetoscopy procedures and 18 short video clips. Seven teams participated in this challenge and their model performance was assessed on an unseen test dataset of 658 pixel-annotated images from 6 fetoscopic procedures and 6 short clips. The challenge provided an opportunity for creating generalized solutions for fetoscopic scene understanding and mosaicking. In this paper, we present the findings of the FetReg2021 challenge alongside reporting a detailed literature review for CAI in TTTS fetoscopy. Through this challenge, its analysis and the release of multi-centre fetoscopic data, we provide a benchmark for future research in this field.  ( 3 min )
    Noise-aware Physics-informed Machine Learning for Robust PDE Discovery. (arXiv:2206.12901v2 [math.NA] UPDATED)
    This work is concerned with discovering the governing partial differential equation (PDE) of a physical system. Existing methods have demonstrated the PDE identification from finite observations but failed to maintain satisfying performance against noisy data, partly owing to suboptimal estimated derivatives and found PDE coefficients. We address the issues by introducing a noise-aware physics-informed machine learning (nPIML) framework to discover the governing PDE from data following arbitrary distributions. Our proposals are twofold. First, we propose a couple of neural networks, namely solver and preselector, which yield an interpretable neural representation of the hidden physical constraint. After they are jointly trained, the solver network approximates potential candidates, e.g., partial derivatives, which are then fed to the sparse regression algorithm that initially unveils the most likely parsimonious PDE, decided according to the information criterion. Second, we propose the denoising physics-informed neural networks (dPINNs), based on Discrete Fourier Transform (DFT), to deliver a set of the optimal finetuned PDE coefficients respecting the noise-reduced variables. The denoising PINNs' structures are compartmentalized into forefront projection networks and a PINN, by which the formerly learned solver initializes. Our extensive experiments on five canonical PDEs affirm that the proposed framework presents a robust and interpretable approach for PDE discovery, applicable to a wide range of systems, possibly complicated by noise.  ( 3 min )
    UFRC: A Unified Framework for Reliable COVID-19 Detection on Crowdsourced Cough Audio. (arXiv:2204.07763v2 [cs.SD] UPDATED)
    We suggest a unified system with core components of data augmentation, ImageNet-pretrained ResNet-50, cost-sensitive loss, deep ensemble learning, and uncertainty estimation to quickly and consistently detect COVID-19 using acoustic evidence. To increase the model's capacity to identify a minority class (infected samples), data augmentation and cost-sensitive loss are incorporated. In the COVID-19 detection challenge, ImageNet-pretrained ResNet-50 has been found to be effective. The unified framework also integrates deep ensemble learning and uncertainty estimation to combine predictions from various base classifiers for generalisation and reliability. We ran a series of tests using the DiCOVA2021 challenge dataset to assess the efficacy of our proposed method, and the results show that our method has an AUC-ROC of 85.43 percent, making it a promising method for COVID-19 detection. The unified framework also demonstrates that audio may be used to quickly diagnose different respiratory disorders.  ( 3 min )
    Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting. (arXiv:2206.15400v1 [eess.AS])
    In this paper, we propose a novel end-to-end user-defined keyword spotting method that utilizes linguistically corresponding patterns between speech and text sequences. Unlike previous approaches requiring speech keyword enrollment, our method compares input queries with an enrolled text keyword sequence. To place the audio and text representations within a common latent space, we adopt an attention-based cross-modal matching approach that is trained in an end-to-end manner with monotonic matching loss and keyword classification loss. We also utilize a de-noising loss for the acoustic embedding network to improve robustness in noisy environments. Additionally, we introduce the LibriPhrase dataset, a new short-phrase dataset based on LibriSpeech for efficiently training keyword spotting models. Our proposed method achieves competitive results on various evaluation sets compared to other single-modal and cross-modal baselines.  ( 2 min )
    Forecasting Future World Events with Neural Networks. (arXiv:2206.15474v1 [cs.LG])
    Forecasting future world events is a challenging but valuable task. Forecasts of climate, geopolitical conflict, pandemics and economic indicators help shape policy and decision making. In these domains, the judgment of expert humans contributes to the best forecasts. Given advances in language modeling, can these forecasts be automated? To this end, we introduce Autocast, a dataset containing thousands of forecasting questions and an accompanying news corpus. Questions are taken from forecasting tournaments, ensuring high quality, real-world importance, and diversity. The news corpus is organized by date, allowing us to precisely simulate the conditions under which humans made past forecasts (avoiding leakage from the future). Motivated by the difficulty of forecasting numbers across orders of magnitude (e.g. global cases of COVID-19 in 2022), we also curate IntervalQA, a dataset of numerical questions and metrics for calibration. We test language models on our forecasting task and find that performance is far below a human expert baseline. However, performance improves with increased model size and incorporation of relevant information from the news corpus. In sum, Autocast poses a novel challenge for large language models and improved performance could bring large practical benefits.  ( 3 min )
    More Recent Advances in (Hyper)Graph Partitioning. (arXiv:2205.13202v3 [cs.DS] UPDATED)
    In recent years, significant advances have been made in the design and evaluation of balanced (hyper)graph partitioning algorithms. We survey trends of the last decade in practical algorithms for balanced (hyper)graph partitioning together with future research directions. Our work serves as an update to a previous survey on the topic. In particular, the survey extends the previous survey by also covering hypergraph partitioning and streaming algorithms, and has an additional focus on parallel algorithms.  ( 2 min )
    Learning Underrepresented Classes from Decentralized Partially Labeled Medical Images. (arXiv:2206.15353v1 [cs.CV])
    Using decentralized data for federated training is one promising emerging research direction for alleviating data scarcity in the medical domain. However, in contrast to large-scale fully labeled data commonly seen in general object recognition tasks, the local medical datasets are more likely to only have images annotated for a subset of classes of interest due to high annotation costs. In this paper, we consider a practical yet under-explored problem, where underrepresented classes only have few labeled instances available and only exist in a few clients of the federated system. We show that standard federated learning approaches fail to learn robust multi-label classifiers with extreme class imbalance and address it by proposing a novel federated learning framework, FedFew. FedFew consists of three stages, where the first stage leverages federated self-supervised learning to learn class-agnostic representations. In the second stage, the decentralized partially labeled data are exploited to learn an energy-based multi-label classifier for the common classes. Finally, the underrepresented classes are detected based on the energy and a prototype-based nearest-neighbor model is proposed for few-shot matching. We evaluate FedFew on multi-label thoracic disease classification tasks and demonstrate that it outperforms the federated baselines by a large margin.  ( 2 min )
    Counterfactual Inference of Second Opinions. (arXiv:2203.08653v2 [cs.LG] UPDATED)
    Automated decision support systems that are able to infer second opinions from experts can potentially facilitate a more efficient allocation of resources; they can help decide when and from whom to seek a second opinion. In this paper, we look at the design of this type of support system from the perspective of counterfactual inference. We focus on a multiclass classification setting and first show that, if experts make predictions on their own, the underlying causal mechanism generating their predictions needs to satisfy a desirable set invariant property. Further, we show that, for any causal mechanism satisfying this property, there exists an equivalent mechanism where the predictions by each expert are generated by independent sub-mechanisms governed by a common noise. This motivates the design of a set invariant Gumbel-Max structural causal model where the structure of the noise governing the sub-mechanisms underpinning the model depends on an intuitive notion of similarity between experts which can be estimated from data. Experiments on both synthetic and real data show that our model can be used to infer second opinions more accurately than its non-causal counterpart.  ( 2 min )
    Production federated keyword spotting via distillation, filtering, and joint federated-centralized training. (arXiv:2204.06322v2 [eess.AS] UPDATED)
    We trained a keyword spotting model using federated learning on real user devices and observed significant improvements when the model was deployed for inference on phones. To compensate for data domains that are missing from on-device training caches, we employed joint federated-centralized training. And to learn in the absence of curated labels on-device, we formulated a confidence filtering strategy based on user-feedback signals for federated distillation. These techniques created models that significantly improved quality metrics in offline evaluations and user-experience metrics in live A/B experiments.  ( 2 min )
    Capturing Shape Information with Multi-Scale Topological Loss Terms for 3D Reconstruction. (arXiv:2203.01703v2 [cs.CV] UPDATED)
    Reconstructing 3D objects from 2D images is challenging for both our brains and machine learning algorithms. To support this spatial reasoning task, contextual information about the overall shape of an object is critical. However, such information is not captured by established loss terms (e.g. Dice loss). We propose to complement geometrical shape information by including multi-scale topological features, such as connected components, cycles, and voids, in the reconstruction loss. Our method uses cubical complexes to calculate topological features of 3D volume data and employs an optimal transport distance to guide the reconstruction process. This topology-aware loss is fully differentiable, computationally efficient, and can be added to any neural network. We demonstrate the utility of our loss by incorporating it into SHAPR, a model for predicting the 3D cell shape of individual cells based on 2D microscopy images. Using a hybrid loss that leverages both geometrical and topological information of single objects to assess their shape, we find that topological information substantially improves the quality of reconstructions, thus highlighting its ability to extract more relevant features from image datasets.  ( 3 min )
    Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations. (arXiv:2206.15462v1 [cs.CV])
    We propose a margin-based loss for vision-language model pretraining that encourages gradient-based explanations that are consistent with region-level annotations. We refer to this objective as Attention Mask Consistency (AMC) and demonstrate that it produces superior visual grounding performance compared to models that rely instead on region-level annotations for explicitly training an object detector such as Faster R-CNN. AMC works by encouraging gradient-based explanation masks that focus their attention scores mostly within annotated regions of interest for images that contain such annotations. Particularly, a model trained with AMC on top of standard vision-language modeling objectives obtains a state-of-the-art accuracy of 86.59% in the Flickr30k visual grounding benchmark, an absolute improvement of 5.48% when compared to the best previous model. Our approach also performs exceedingly well on established benchmarks for referring expression comprehension and offers the added benefit by design of gradient-based explanations that better align with human annotations.  ( 2 min )
    Challenges and Opportunities in Multi-device Speech Processing. (arXiv:2206.15432v1 [eess.AS])
    We review current solutions and technical challenges for automatic speech recognition, keyword spotting, device arbitration, speech enhancement, and source localization in multi-device home environments to provide context for the INTERSPEECH 2022 special session, "Challenges and opportunities for signal processing and machine learning for multiple smart devices". We also identify the datasets needed to support these research areas. Based on the review and our research experience in the multi-device domain, we conclude with an outlook on the future evolution  ( 2 min )
    Denoised MDPs: Learning World Models Better Than the World Itself. (arXiv:2206.15477v1 [cs.LG])
    The ability to separate signal from noise, and reason with clean abstractions, is critical to intelligence. With this ability, humans can efficiently perform real world tasks without considering all possible nuisance factors. How can artificial agents do the same? What kind of information can agents safely discard as noise? In this work, we categorize information out in the wild into four types based on controllability and relation with reward, and formulate useful information as that which is both controllable and reward-relevant. This framework clarifies the kinds of information removed by various prior work on representation learning in reinforcement learning (RL), and leads to our proposed approach of learning a Denoised MDP that explicitly factors out certain noise distractors. Extensive experiments on variants of DeepMind Control Suite and RoboDesk demonstrate superior performance of our denoised world model over using raw observations alone, and over prior works, across policy optimization control tasks as well as the non-control task of joint position regression.  ( 2 min )
    Causal Machine Learning: A Survey and Open Problems. (arXiv:2206.15475v1 [cs.LG])
    Causal Machine Learning (CausalML) is an umbrella term for machine learning methods that formalize the data-generation process as a structural causal model (SCM). This allows one to reason about the effects of changes to this process (i.e., interventions) and what would have happened in hindsight (i.e., counterfactuals). We categorize work in CausalML into five groups according to the problems they tackle: (1) causal supervised learning, (2) causal generative modeling, (3) causal explanations, (4) causal fairness, (5) causal reinforcement learning. For each category, we systematically compare its methods and point out open problems. Further, we review modality-specific applications in computer vision, natural language processing, and graph representation learning. Finally, we provide an overview of causal benchmarks and a critical discussion of the state of this nascent field, including recommendations for future work.  ( 2 min )
    Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge. (arXiv:2203.14416v2 [eess.AS] UPDATED)
    Text-to-Speech (TTS) services that run on edge devices have many advantages over cloud TTS, e.g., in latency and privacy. However, neural vocoders with a low complexity and small model footprint inevitably generate annoying sounds. This study proposes Bunched LPCNet2, an improved LPCNet architecture that runs efficiently at high quality on cloud servers and at low complexity on low-resource edge devices. A single logistic distribution achieves computational efficiency, and insightful tricks reduce the model footprint while maintaining speech quality. A DualRate architecture, which generates a lower sampling rate from a prosody model, is also proposed to reduce maintenance costs. The experiments demonstrate that Bunched LPCNet2 generates satisfactory speech quality with a model footprint of 1.1MB while operating faster than real-time on a RPi 3B. Our audio samples are available at https://srtts.github.io/bunchedLPCNet2.  ( 2 min )
    Tuning Particle Accelerators with Safety Constraints using Bayesian Optimization. (arXiv:2203.13968v3 [physics.acc-ph] UPDATED)
    Tuning machine parameters of particle accelerators is a repetitive and time-consuming task that is challenging to automate. While many off-the-shelf optimization algorithms are available, in practice their use is limited because most methods do not account for safety-critical constraints in each iteration, such as loss signals or step-size limitations. One notable exception is safe Bayesian optimization, which is a data-driven tuning approach for global optimization with noisy feedback. We propose and evaluate a step-size limited variant of safe Bayesian optimization on two research facilities of the Paul Scherrer Institut (PSI): a) the Swiss Free Electron Laser (SwissFEL) and b) the High-Intensity Proton Accelerator (HIPA). We report promising experimental results on both machines, tuning up to 16 parameters subject to 224 constraints.  ( 2 min )
    A Deep Reinforcement Learning Blind AI in DareFightingICE. (arXiv:2205.07444v2 [cs.LG] UPDATED)
    This paper presents a deep reinforcement learning agent (AI) that uses sound as the input on the DareFightingICE platform at the DareFightingICE Competition in IEEE CoG 2022. In this work, an AI that only uses sound as the input is called blind AI. While state-of-the-art AIs rely mostly on visual or structured observations provided by their environments, learning to play games from only sound is still new and thus challenging. We propose different approaches to process audio data and use the Proximal Policy Optimization algorithm for our blind AI. We also propose to use our blind AI in evaluation of sound designs submitted to the competition and define two metrics for this task. The experimental results show the effectiveness of not only our blind AI but also the proposed two metrics.  ( 2 min )
    Watch and Match: Supercharging Imitation with Regularized Optimal Transport. (arXiv:2206.15469v1 [cs.RO])
    Imitation learning holds tremendous promise in learning policies efficiently for complex decision making problems. Current state-of-the-art algorithms often use inverse reinforcement learning (IRL), where given a set of expert demonstrations, an agent alternatively infers a reward function and the associated optimal policy. However, such IRL approaches often require substantial online interactions for complex control problems. In this work, we present Regularized Optimal Transport (ROT), a new imitation learning algorithm that builds on recent advances in optimal transport based trajectory-matching. Our key technical insight is that adaptively combining trajectory-matching rewards with behavior cloning can significantly accelerate imitation even with only a few demonstrations. Our experiments on 20 visual control tasks across the DeepMind Control Suite, the OpenAI Robotics Suite, and the Meta-World Benchmark demonstrate an average of 7.8X faster imitation to reach 90% of expert performance compared to prior state-of-the-art methods. On real-world robotic manipulation, with just one demonstration and an hour of online training, ROT achieves an average success rate of 90.1% across 14 tasks.  ( 2 min )
    Introducing Non-Linearity into Quantum Generative Models. (arXiv:2205.14506v2 [quant-ph] UPDATED)
    The evolution of an isolated quantum system is linear, and hence quantum algorithms are reversible, including those that utilize quantum circuits as generative machine learning models. However, some of the most successful classical generative models, such as those based on neural networks, involve highly non-linear and thus non-reversible dynamics. In this paper, we explore the effect of these dynamics in quantum generative modeling by introducing a model that adds non-linear activations via a neural network structure onto the standard Born Machine framework - the Quantum Neuron Born Machine (QNBM). To achieve this, we utilize a previously introduced Quantum Neuron subroutine, which is a repeat-until-success circuit with mid-circuit measurements and classical control. After introducing the QNBM, we investigate how its performance depends on network size, by training a 3-layer QNBM with 4 output neurons and various input and hidden layer sizes. We then compare our non-linear QNBM to the linear Quantum Circuit Born Machine (QCBM). We allocate similar time and memory resources to each model, such that the only major difference is the qubit overhead required by the QNBM. With gradient-based training, we show that while both models can easily learn a trivial uniform probability distribution, on a more challenging class of distributions, the QNBM achieves an almost 3x smaller error rate than a QCBM with a similar number of tunable parameters. We therefore provide evidence that suggests that non-linearity is a useful resource in quantum generative models, and we put forth the QNBM as a new model with good generative performance and potential for quantum advantage.  ( 3 min )
    QuASK -- Quantum Advantage Seeker with Kernels. (arXiv:2206.15284v1 [quant-ph])
    QuASK is a quantum machine learning software written in Python that supports researchers in designing, experimenting, and assessing different quantum and classical kernels performance. This software is package agnostic and can be integrated with all major quantum software packages (e.g. IBM Qiskit, Xanadu's Pennylane, Amazon Braket). QuASK guides the user through a simple preprocessing of input data, definition and calculation of quantum and classical kernels, either custom or pre-defined ones. From this evaluation the package provides an assessment about potential quantum advantage and prediction bounds on generalization error. Moreover, it allows for the generation of parametric quantum kernels that can be trained using gradient-descent-based optimization, grid search, or genetic algorithms. Projected quantum kernels, an effective solution to mitigate the curse of dimensionality induced by the exponential scaling dimension of large Hilbert spaces, are also calculated. QuASK can furthermore generate the observable values of a quantum model and use them to study the prediction capabilities of the quantum and classical kernels.  ( 2 min )
    Correcting Mispronunciations in Speech using Spectrogram Inpainting. (arXiv:2204.03379v2 [eess.AS] UPDATED)
    Learning a new language involves constantly comparing speech productions with reference productions from the environment. Early in speech acquisition, children make articulatory adjustments to match their caregivers' speech. Grownup learners of a language tweak their speech to match the tutor reference. This paper proposes a method to synthetically generate correct pronunciation feedback given incorrect production. Furthermore, our aim is to generate the corrected production while maintaining the speaker's original voice. The system prompts the user to pronounce a phrase. The speech is recorded, and the samples associated with the inaccurate phoneme are masked with zeros. This waveform serves as an input to a speech generator, implemented as a deep learning inpainting system with a U-net architecture, and trained to output a reconstructed speech. The training set is composed of unimpaired proper speech examples, and the generator is trained to reconstruct the original proper speech. We evaluated the performance of our system on phoneme replacement of minimal pair words of English as well as on children with pronunciation disorders. Results suggest that human listeners slightly prefer our generated speech over a smoothed replacement of the inaccurate phoneme with a production of a different speaker.  ( 3 min )
    Deep Reinforcement Learning with Swin Transformer. (arXiv:2206.15269v1 [cs.LG])
    Transformers are neural network models that utilize multiple layers of self-attention heads. Attention is implemented in transformers as the contextual embeddings of the 'key' and 'query'. Transformers allow the re-combination of attention information from different layers and the processing of all inputs at once, which makes them more convenient than recurrent neural networks when dealing with large amounts of data. Transformers have exhibited strong performance on natural language processing tasks in recent years. Meanwhile, there have been tremendous efforts to adapt transformers to other fields of machine learning, such as Swin Transformer and Decision Transformer. Swin Transformer is a promising neural network architecture that splits image pixels into small patches and applies local self-attention operations inside (shifted) windows of fixed sizes. Decision Transformer has successfully applied transformers to off-line reinforcement learning and showed that random-walk samples from Atari games are sufficient to let an agent learn optimized behaviors. However, it is considerably more challenging to combine online reinforcement learning with transformers. In this article, we further explore the possibility of not modifying the reinforcement learning policy, but only replacing the convolutional neural network architecture with the self-attention architecture from Swin Transformer. Namely, we aim to change how an agent views the world, but not how an agent plans about the world. We conduct our experiment on 49 games in Arcade Learning Environment. The results show that using Swin Transformer in reinforcement learning achieves significantly higher evaluation scores across the majority of games in Arcade Learning Environment. Thus, we conclude that online reinforcement learning can benefit from exploiting self-attention with spatial token embeddings.  ( 3 min )
    How to Leverage Unlabeled Data in Offline Reinforcement Learning. (arXiv:2202.01741v3 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) can learn control policies from static datasets but, like standard RL methods, it requires reward annotations for every transition. In many cases, labeling large datasets with rewards may be costly, especially if those rewards must be provided by human labelers, while collecting diverse unlabeled data might be comparatively inexpensive. How can we best leverage such unlabeled data in offline RL? One natural solution is to learn a reward function from the labeled data and use it to label the unlabeled data. In this paper, we find that, perhaps surprisingly, a much simpler method that simply applies zero rewards to unlabeled data leads to effective data sharing both in theory and in practice, without learning any reward model at all. While this approach might seem strange (and incorrect) at first, we provide extensive theoretical and empirical analysis that illustrates how it trades off reward bias, sample complexity and distributional shift, often leading to good results. We characterize conditions under which this simple strategy is effective, and further show that extending it with a simple reweighting approach can further alleviate the bias introduced by using incorrect reward labels. Our empirical evaluation confirms these findings in simulated robotic locomotion, navigation, and manipulation settings.  ( 3 min )
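The zero-reward labeling idea described in the abstract above can be illustrated with a minimal sketch. The `Transition` type, the `merge_with_zero_rewards` helper, and the toy data are hypothetical names introduced here for illustration; they are not taken from the paper, and the reweighting extension mentioned in the abstract is not shown.

```python
from typing import List, NamedTuple, Tuple


class Transition(NamedTuple):
    """One offline RL transition (hypothetical container for this sketch)."""
    state: tuple
    action: int
    reward: float
    next_state: tuple


def merge_with_zero_rewards(
    labeled: List[Transition],
    unlabeled: List[Tuple[tuple, int, tuple]],
) -> List[Transition]:
    """Keep labeled transitions as-is and assign reward 0.0 to unlabeled ones."""
    relabeled = [Transition(s, a, 0.0, s2) for (s, a, s2) in unlabeled]
    return list(labeled) + relabeled


# Toy data: one reward-labeled transition and two unlabeled ones.
labeled = [Transition((0,), 1, 1.0, (1,))]
unlabeled = [((1,), 0, (2,)), ((2,), 1, (3,))]
dataset = merge_with_zero_rewards(labeled, unlabeled)
```

The merged `dataset` could then be passed to any offline RL algorithm; which algorithm the authors use is not stated in the abstract.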
    Chained Generalisation Bounds. (arXiv:2203.00977v2 [stat.ML] UPDATED)
    This work discusses how to derive upper bounds for the expected generalisation error of supervised learning algorithms by means of the chaining technique. By developing a general theoretical framework, we establish a duality between generalisation bounds based on the regularity of the loss function, and their chained counterparts, which can be obtained by lifting the regularity assumption from the loss onto its gradient. This allows us to re-derive the chaining mutual information bound from the literature, and to obtain novel chained information-theoretic generalisation bounds, based on the Wasserstein distance and other probability metrics. We show on some toy examples that the chained generalisation bound can be significantly tighter than its standard counterpart, particularly when the distribution of the hypotheses selected by the algorithm is very concentrated. Keywords: Generalisation bounds; Chaining; Information-theoretic bounds; Mutual information; Wasserstein distance; PAC-Bayes.  ( 2 min )
    Neural Network Assisted Depth Map Packing for Compression Using Standard Hardware Video Codecs. (arXiv:2206.15183v1 [cs.MM])
    Depth maps are needed by various graphics rendering and processing operations. Depth map streaming is often necessary when such operations are performed in a distributed system and it requires in most cases fast performing compression, which is why video codecs are often used. Hardware implementations of standard video codecs enable relatively high resolution and framerate combinations, even on resource constrained devices, but unfortunately those implementations do not currently support RGB+depth extensions. However, they can be used for depth compression by first packing the depth maps into RGB or YUV frames. We investigate depth map compression using a combination of depth map packing followed by encoding with a standard video codec. We show that the precision at which depth maps are packed has a large and nontrivial impact on the resulting error caused by the combination of the packing scheme and lossy compression when bitrate is constrained. Consequently, we propose a variable precision packing scheme assisted by a neural network model that predicts the optimal precision for each depth map given a bitrate constraint. We demonstrate that the model yields near optimal predictions and that it can be integrated into a game engine with very low overhead using modern hardware.  ( 2 min )
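To make the notion of packing precision concrete, here is a minimal sketch of a simple fixed-point packing of one depth value into two 8-bit channels at a chosen bit depth. This is an assumption made for illustration only: the abstract does not describe the paper's actual packing scheme, and `pack_depth` and `unpack_depth` are hypothetical helpers.

```python
def pack_depth(depth: float, bits: int) -> tuple:
    """Quantize a depth in [0, 1] to `bits` bits (8..16) and split it into two bytes."""
    if not 8 <= bits <= 16:
        raise ValueError("bits must be between 8 and 16")
    levels = (1 << bits) - 1
    q = round(depth * levels)      # quantize to the chosen precision
    q <<= 16 - bits                # left-align the value within 16 bits
    return q >> 8, q & 0xFF        # (high byte, low byte)


def unpack_depth(high: int, low: int, bits: int) -> float:
    """Invert pack_depth, ignoring any loss introduced by a video codec."""
    q = ((high << 8) | low) >> (16 - bits)
    return q / ((1 << bits) - 1)


high, low = pack_depth(0.3, bits=12)
restored = unpack_depth(high, low, bits=12)
```

With fewer bits, more of the value lives in the high byte, which is one intuition for why the chosen precision can interact with lossy compression of the packed frame.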
    Classical and learned MR to pseudo-CT mappings for accurate transcranial ultrasound simulation. (arXiv:2206.15441v1 [physics.med-ph])
    Model-based treatment planning for transcranial ultrasound therapy typically involves mapping the acoustic properties of the skull from an x-ray computed tomography (CT) image of the head. Here, three methods for generating pseudo-CT images from magnetic resonance (MR) images were compared as an alternative to CT. A convolutional neural network (U-Net) was trained on paired MR-CT images to generate pseudo-CT images from either T1-weighted or zero-echo time (ZTE) MR images (denoted tCT and zCT, respectively). A direct mapping from ZTE to pseudo-CT was also implemented (denoted cCT). When comparing the pseudo-CT and ground truth CT images for the test set, the mean absolute error was 133, 83, and 145 Hounsfield units (HU) across the whole head, and 398, 222, and 336 HU within the skull for the tCT, zCT, and cCT images, respectively. Ultrasound simulations were also performed using the generated pseudo-CT images and compared to simulations based on CT. An annular array transducer was used targeting the visual or motor cortex. The mean differences in the simulated focal pressure, focal position, and focal volume were 9.9%, 1.5 mm, and 15.1% for simulations based on the tCT images, 5.7%, 0.6 mm, and 5.7% for the zCT, and 6.7%, 0.9 mm, and 12.1% for the cCT. The improved results for images mapped from ZTE highlight the advantage of using imaging sequences which improve contrast of the skull bone. Overall, these results demonstrate that acoustic simulations based on MR images can give comparable accuracy to those based on CT.  ( 3 min )
    Benchmark Dataset for Precipitation Forecasting by Post-Processing the Numerical Weather Prediction. (arXiv:2206.15241v1 [cs.LG])
    Precipitation forecasting is an important scientific challenge that has wide-reaching impacts on society. Historically, this challenge has been tackled using numerical weather prediction (NWP) models, grounded on physics-based simulations. Recently, many works have proposed an alternative approach, using end-to-end deep learning (DL) models to replace physics-based NWP. While these DL methods show improved performance and computational efficiency, they exhibit limitations in long-term forecasting and lack the explainability of NWP models. In this work, we present a hybrid NWP-DL workflow to fill the gap between standalone NWP and DL approaches. Under this workflow, the NWP output is fed into a deep model, which post-processes the data to yield a refined precipitation forecast. The deep model is trained with supervision, using Automatic Weather Station (AWS) observations as ground-truth labels. This can achieve the best of both worlds, and can even benefit from future improvements in NWP technology. To facilitate study in this direction, we present a novel dataset focused on the Korean Peninsula, termed KoMet (Korea Meteorological Dataset), comprised of NWP predictions and AWS observations. For NWP, we use the Global Data Assimilation and Prediction Systems-Korea Integrated Model (GDAPS-KIM).  ( 2 min )
    Learning Functions on Multiple Sets using Multi-Set Transformers. (arXiv:2206.15444v1 [cs.LG])
    We propose a general deep architecture for learning functions on multiple permutation-invariant sets. We also show how to generalize this architecture to sets of elements of any dimension by dimension equivariance. We demonstrate that our architecture is a universal approximator of these functions, and show superior results to existing methods on a variety of tasks including counting tasks, alignment tasks, distinguishability tasks and statistical distance measurements. This last task is quite important in Machine Learning. Although our approach is quite general, we demonstrate that it can generate approximate estimates of KL divergence and mutual information that are more accurate than previous techniques that are specifically designed to approximate those statistical distances.  ( 2 min )
    An Intermediate-level Attack Framework on The Basis of Linear Regression. (arXiv:2203.10723v2 [cs.CV] UPDATED)
    This paper substantially extends our work published at ECCV, in which an intermediate-level attack was proposed to improve the transferability of some baseline adversarial examples. Specifically, we advocate a framework in which a direct linear mapping from the intermediate-level discrepancies (between adversarial features and benign features) to prediction loss of the adversarial example is established. By delving deep into the core components of such a framework, we show that 1) a variety of linear regression models can all be considered in order to establish the mapping, 2) the magnitude of the finally obtained intermediate-level adversarial discrepancy is correlated with the transferability, 3) further boost of the performance can be achieved by performing multiple runs of the baseline attack with random initialization. In addition, by leveraging these findings, we achieve new state-of-the-arts on transfer-based $\ell_\infty$ and $\ell_2$ attacks. Our code is publicly available at https://github.com/qizhangli/ila-plus-plus-lr.  ( 2 min )
    The maximum capability of a topological feature in link prediction. (arXiv:2206.15101v1 [physics.soc-ph])
    Link prediction aims to predict links of a network that are not directly visible, with profound applications in biological and social systems. Despite intensive utilization of the topological feature in this task, it is unclear to what extent a particular feature can be leveraged to infer missing links. Here, we show that the maximum capability of a topological feature follows a simple mathematical expression, which is independent of how an index gauges the feature. Hence, a family of indexes associated with one topological feature shares the same performance limit. A feature's capability is lifted in the supervised prediction, which in general gives rise to better results compared with unsupervised prediction. The universality of the pattern uncovered is empirically verified by 550 structurally diverse networks, which can be applied to feature selection and the analysis of network characteristics associated with a topological feature in link prediction.  ( 2 min )
    Online TSP with Predictions. (arXiv:2206.15364v1 [cs.DS])
    We initiate the study of online routing problems with predictions, inspired by recent exciting results in the area of learning-augmented algorithms. A popular framework for overcoming pessimistic worst-case competitive analysis is a learning-augmented online algorithm, which incorporates predictions in a black-box manner to outperform existing algorithms when the predictions are accurate while maintaining theoretical guarantees even when the predictions are extremely erroneous. In this study, we begin by investigating the classical online traveling salesman problem (OLTSP), where future requests are augmented with predictions. Unlike the prediction models in previous studies, each actual request in the OLTSP, associated with its arrival time and position, may not coincide with the predicted ones, which leads to a troublesome situation. Our main result is to study different prediction models and design algorithms to improve the best-known results in the different settings. Moreover, we generalize the proposed results to the online dial-a-ride problem.  ( 2 min )
    Invariance Properties of the Natural Gradient in Overparametrised Systems. (arXiv:2206.15273v1 [cs.LG])
    The natural gradient field is a vector field that lives on a model equipped with a distinguished Riemannian metric, e.g. the Fisher-Rao metric, and represents the direction of steepest ascent of an objective function on the model with respect to this metric. In practice, one tries to obtain the corresponding direction on the parameter space by multiplying the ordinary gradient by the inverse of the Gram matrix associated with the metric. We refer to this vector on the parameter space as the natural parameter gradient. In this paper we study when the pushforward of the natural parameter gradient is equal to the natural gradient. Furthermore we investigate the invariance properties of the natural parameter gradient. Both questions are addressed in an overparametrised setting.  ( 2 min )
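The step of "multiplying the ordinary gradient by the inverse of the Gram matrix" can be sketched for a two-parameter model. The function name and the example matrix are hypothetical, and the sketch assumes an invertible Gram matrix; in an overparametrised setting the Gram matrix may be singular, in which case a generalised inverse would be needed instead.

```python
def natural_parameter_gradient(gram, grad):
    """Return G^{-1} @ grad for a 2x2 Gram matrix G (assumed invertible)."""
    (a, b), (c, d) = gram
    det = a * d - b * c
    if det == 0:
        raise ValueError("Gram matrix is singular; a generalised inverse is required")
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    return [
        inv[0][0] * grad[0] + inv[0][1] * grad[1],
        inv[1][0] * grad[0] + inv[1][1] * grad[1],
    ]


gram = [[2.0, 0.0], [0.0, 0.5]]   # illustrative metric in parameter coordinates
grad = [1.0, 1.0]                 # ordinary gradient of the objective
nat_grad = natural_parameter_gradient(gram, grad)
```

Here the metric rescales each coordinate independently: the first component shrinks and the second grows relative to the ordinary gradient.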
    Revisiting Competitive Coding Approach for Palmprint Recognition: A Linear Discriminant Analysis Perspective. (arXiv:2206.15349v1 [cs.CV])
    The Competitive Coding approach (CompCode) is one of the most promising methods for palmprint recognition. Due to its high performance and simple formulation, it has been continuously studied for many years. However, although numerous variations of CompCode have been proposed, a detailed analysis of the method is still absent. In this paper, we provide a detailed analysis of CompCode from the perspective of linear discriminant analysis (LDA) for the first time. A non-trivial sufficient condition under which CompCode is optimal in the sense of Fisher's criterion is presented. Based on our analysis, we examined the statistics of palmprints and concluded that CompCode deviates from the optimal condition. To mitigate the deviation, we propose a new method called Class-Specific CompCode that improves CompCode by excluding non-palm-line areas from matching. A nonlinear mapping of the competitive code is also applied in this method to further enhance accuracy. Experiments on two public databases demonstrate the effectiveness of the proposed method.  ( 2 min )
    Interpretability, Then What? Editing Machine Learning Models to Reflect Human Knowledge and Values. (arXiv:2206.15465v1 [cs.LG])
    Machine learning (ML) interpretability techniques can reveal undesirable patterns in data that models exploit to make predictions--potentially causing harms once deployed. However, how to take action to address these patterns is not always clear. In a collaboration between ML and human-computer interaction researchers, physicians, and data scientists, we develop GAM Changer, the first interactive system to help domain experts and data scientists easily and responsibly edit Generalized Additive Models (GAMs) and fix problematic patterns. With novel interaction techniques, our tool puts interpretability into action--empowering users to analyze, validate, and align model behaviors with their knowledge and values. Physicians have started to use our tool to investigate and fix pneumonia and sepsis risk prediction models, and an evaluation with 7 data scientists working in diverse domains highlights that our tool is easy to use, meets their model editing needs, and fits into their current workflows. Built with modern web technologies, our tool runs locally in users' web browsers or computational notebooks, lowering the barrier to use. GAM Changer is available at the following public demo link: https://interpret.ml/gam-changer.  ( 3 min )
    Physics-informed machine learning for Structural Health Monitoring. (arXiv:2206.15303v1 [cs.LG])
    The use of machine learning in Structural Health Monitoring is becoming more common, as many of the inherent tasks (such as regression and classification) in developing condition-based assessment fall naturally into its remit. This chapter introduces the concept of physics-informed machine learning, where one adapts ML algorithms to account for the physical insight an engineer will often have of the structure they are attempting to model or assess. The chapter will demonstrate how grey-box models, which combine simple physics-based models with data-driven ones, can improve predictive capability in an SHM setting. A particular strength of the approach demonstrated here is the capacity of the models to generalise, with enhanced predictive capability in different regimes. This is a key issue when life-time assessment is a requirement, or when monitoring data do not span the operational conditions a structure will undergo. The chapter will provide an overview of physics-informed ML, introducing a number of new approaches for grey-box modelling in a Bayesian setting. The main ML tool discussed will be Gaussian process regression; we will demonstrate how physical assumptions/models can be incorporated through constraints, through the mean function and kernel design, and finally in a state-space setting. A range of SHM applications will be demonstrated, from loads monitoring tasks for off-shore and aerospace structures, through to performance monitoring for long-span bridges.  ( 3 min )
    Learning Citywide Patterns of Life from Trajectory Monitoring. (arXiv:2206.15352v1 [cs.LG])
    The recent proliferation of real-world human mobility datasets has catalyzed geospatial and transportation research in trajectory prediction, demand forecasting, travel time estimation, and anomaly detection. However, these datasets also enable, more broadly, a descriptive analysis of intricate systems of human mobility. We formally define patterns of life analysis as a natural, explainable extension of online unsupervised anomaly detection, where we not only monitor a data stream for anomalies but also explicitly extract normal patterns over time. To learn patterns of life, we adapt Grow When Required (GWR) episodic memory from research in computational biology and neurorobotics to a new domain of geospatial analysis. This biologically-inspired neural network, related to self-organizing maps (SOM), constructs a set of "memories" or prototype traffic patterns incrementally as it iterates over the GPS stream. It then compares each new observation to its prior experiences, inducing an online, unsupervised clustering and anomaly detection on the data. We mine patterns-of-interest from the Porto taxi dataset, including both major public holidays and newly-discovered transportation anomalies, such as festivals and concerts which, to our knowledge, have not been previously acknowledged or reported in prior work. We anticipate that the capability to incrementally learn normal and abnormal road transportation behavior will be useful in many domains, including smart cities, autonomous vehicles, and urban planning and management.  ( 3 min )
    When an Active Learner Meets a Black-box Teacher. (arXiv:2206.15205v1 [cs.LG])
    Active learning maximizes the hypothesis updates to find those desired unlabeled data. An inherent assumption is that this learning manner can derive those updates into the optimal hypothesis. However, its convergence may not be guaranteed well if those incremental updates are negative and disordered. In this paper, we introduce a machine teacher who provides a black-box teaching hypothesis for an active learner, where the teaching hypothesis is an effective approximation for the optimal hypothesis. Theoretically, we prove that, under the guidance of this teaching hypothesis, the learner can converge into a tighter generalization error and label complexity bound than those non-educated learners who do not receive any guidance from a teacher. We further consider two teaching scenarios: teaching a white-box and black-box learner, where self-improvement of teaching is firstly proposed to improve the teaching performance. Experiments verify this idea and show better performance than the fundamental active learning strategies, such as IWAL, IWAL-D, etc.  ( 2 min )
    DESTA: A Framework for Safe Reinforcement Learning with Markov Games of Intervention. (arXiv:2110.14468v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) involves performing exploratory actions in an unknown system. This can place a learning agent in dangerous and potentially catastrophic system states. Current approaches for tackling safe learning in RL simultaneously trade off safe exploration and task fulfillment. In this paper, we introduce a new generation of RL solvers that learn to minimise safety violations while maximising the task reward to the extent that can be tolerated by the safe policy. Our approach introduces a novel two-player framework for safe RL called Distributive Exploration Safety Training Algorithm (DESTA). The core of DESTA is a game between two adaptive agents: Safety Agent, which is delegated the task of minimising safety violations, and Task Agent, whose goal is to maximise the environment reward. Specifically, Safety Agent can selectively take control of the system at any given point to prevent safety violations, while Task Agent is free to execute its policy at all other states. This framework enables Safety Agent to learn to take actions at certain states that minimise future safety violations, both during training and testing time, while Task Agent performs actions that maximise the task performance everywhere else. Theoretically, we prove that DESTA converges to stable points enabling safety violations of pretrained policies to be minimised. Empirically, we show DESTA's ability, first, to augment the safety of existing policies and, second, to construct safe RL policies when the Task Agent and Safety Agent are trained concurrently. We demonstrate DESTA's superior performance against leading RL methods in Lunar Lander and Frozen Lake from OpenAI gym.  ( 3 min )
    Privacy-preserving household load forecasting based on non-intrusive load monitoring: A federated deep learning approach. (arXiv:2206.15192v1 [cs.LG])
    Load forecasting is essential in the analysis and grid planning of power systems. For this reason, we propose a household load forecasting method based on federated deep learning and non-intrusive load monitoring (NILM). To the best of our knowledge, this is the first research on federated learning (FL) in household load forecasting based on NILM. In this method, the integrated power is decomposed into individual device power by non-intrusive load monitoring, and the power of individual appliances is predicted separately using a federated deep learning model. Finally, the predicted power values of individual appliances are aggregated to form the total power prediction. Specifically, predicting each appliance separately avoids the error caused by the strong time dependence in the power signal of a single device. In the federated deep learning prediction model, the household owners with the power data share the parameters of the local model instead of the local power data, guaranteeing the privacy of the household user data. The case results demonstrate that the proposed approach provides a better prediction effect than the traditional methodology that directly predicts the aggregated signal as a whole. In addition, experiments in various federated learning environments are designed and implemented to validate the methodology.  ( 3 min )
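    The two aggregation steps described above (sharing model parameters rather than data, then summing per-appliance forecasts) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the size-weighted parameter average is the common FedAvg rule, which the abstract does not name, and the function names, appliance names, and numbers are all hypothetical.

```python
from typing import Dict, List


def federated_average(client_params: List[List[float]],
                      client_sizes: List[int]) -> List[float]:
    """Average local model parameters, weighted by local dataset size.

    Assumption: FedAvg-style weighting. Only parameters leave each
    household; the raw power readings are never passed to this function.
    """
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]


def total_load(appliance_forecasts: Dict[str, List[float]]) -> List[float]:
    """Sum per-appliance forecasts, time step by time step, into a household total."""
    return [sum(step_values) for step_values in zip(*appliance_forecasts.values())]


# Hypothetical example: two households, a two-parameter model.
global_params = federated_average([[1.0, 2.0], [3.0, 4.0]], client_sizes=[1, 3])

# Hypothetical per-appliance forecasts in kW over two time steps.
household_total = total_load({"fridge": [0.5, 0.25], "kettle": [1.0, 0.0]})
```

    Weighting by dataset size keeps a household with few readings from dominating the shared model; an unweighted mean is the obvious alternative when sizes are unknown.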
    Prediction of Dilatory Behavior in eLearning: A Comparison of Multiple Machine Learning Models. (arXiv:2206.15079v1 [stat.ML])
    Procrastination, the irrational delay of tasks, is a common occurrence in online learning. Potential negative consequences include higher risk of drop-outs, increased stress, and reduced mood. Due to the rise of learning management systems and learning analytics, indicators of such behavior can be detected, enabling predictions of future procrastination and other dilatory behavior. However, research focusing on such predictions is scarce. Moreover, studies involving different types of predictors and comparisons between the predictive performance of various methods are virtually non-existent. In this study, we aim to fill these research gaps by analyzing the performance of multiple machine learning algorithms when predicting the delayed or timely submission of online assignments in a higher education setting with two categories of predictors: subjective, questionnaire-based variables and objective, log-data-based indicators extracted from a learning management system. The results show that models with objective predictors consistently outperform models with subjective predictors, and a combination of both variable types performs slightly better. For each of these three options, a different approach prevailed (Gradient Boosting Machines for the subjective, Bayesian multilevel models for the objective, and Random Forest for the combined predictors). We conclude that careful attention should be paid to the selection of predictors and algorithms before implementing such models in learning management systems.  ( 3 min )
    Learning Iterative Reasoning through Energy Minimization. (arXiv:2206.15448v1 [cs.LG])
    Deep learning has excelled on complex pattern recognition tasks such as image classification and object recognition. However, it struggles with tasks requiring nontrivial reasoning, such as algorithmic computation. Humans are able to solve such tasks through iterative reasoning -- spending more time thinking about harder tasks. Most existing neural networks, however, exhibit a fixed computational budget controlled by the neural network architecture, preventing additional computational processing on harder tasks. In this work, we present a new framework for iterative reasoning with neural networks. We train a neural network to parameterize an energy landscape over all outputs, and implement each step of the iterative reasoning as an energy minimization step to find a minimal energy solution. By formulating reasoning as an energy minimization problem, for harder problems that lead to more complex energy landscapes, we may then adjust our underlying computational budget by running a more complex optimization procedure. We empirically illustrate that our iterative reasoning approach can solve algorithmic reasoning tasks more accurately and generalizably in both graph and continuous domains. Finally, we illustrate that our approach can recursively solve algorithmic problems requiring nested reasoning.  ( 2 min )
    Understanding Instance-Level Impact of Fairness Constraints. (arXiv:2206.15437v1 [cs.LG])
    A variety of fairness constraints have been proposed in the literature to mitigate group-level statistical bias. Their impacts have been largely evaluated for different groups of populations corresponding to a set of sensitive attributes, such as race or gender. Nonetheless, the community has not sufficiently explored how imposing fairness constraints fares at an instance level. Building on the concept of influence function, a measure that characterizes the impact of a training example on the target model and its predictive performance, this work studies the influence of training examples when fairness constraints are imposed. We find that, under certain assumptions, the influence function with respect to fairness constraints can be decomposed into a kernelized combination of training examples. One promising application of the proposed fairness influence function is to identify suspicious training examples that may cause model discrimination by ranking their influence scores. We demonstrate with extensive experiments that training on a subset of weighty data examples leads to lower fairness violations with a trade-off of accuracy.  ( 2 min )
    Learning Nonparametric Ordinary differential Equations: Application to Sparse and Noisy Data. (arXiv:2206.15215v1 [stat.ML])
    Learning nonparametric systems of Ordinary Differential Equations (ODEs) $\dot x = f(t,x)$ from noisy and sparse data is an emerging machine learning topic. We use the well-developed theory of Reproducing Kernel Hilbert Spaces (RKHS) to define candidates for $f$ for which the solution of the ODE exists and is unique. Learning $f$ consists of solving a constrained optimization problem in an RKHS. We propose a penalty method that iteratively uses the Representer theorem and Euler approximations to provide a numerical solution. We prove a generalization bound for the $L^2$ distance between $x$ and its estimator. Experiments are provided for the FitzHugh-Nagumo oscillator and for the prediction of the Amyloid level in the cortex of aging subjects. In both cases, we show competitive results when compared with the state of the art.  ( 2 min )
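    The Euler approximation mentioned above replaces $\dot x = f(t,x)$ with the update $x_{k+1} = x_k + h\,f(t_k, x_k)$ for a step size $h$. The sketch below shows only that forward-integration step for a known scalar $f$; it does not attempt the RKHS penalty method, and the test equation $\dot x = -x$ and the name `euler_solve` are illustrative choices, not taken from the paper.

```python
import math
from typing import Callable, List


def euler_solve(f: Callable[[float, float], float],
                x0: float, t0: float, t1: float, steps: int) -> List[float]:
    """Explicit Euler integration of dx/dt = f(t, x) from t0 to t1.

    Returns the trajectory [x_0, x_1, ..., x_steps].
    """
    h = (t1 - t0) / steps  # uniform step size
    t, x = t0, x0
    trajectory = [x0]
    for _ in range(steps):
        x = x + h * f(t, x)  # x_{k+1} = x_k + h * f(t_k, x_k)
        t = t + h
        trajectory.append(x)
    return trajectory


# Illustrative test problem: dx/dt = -x with x(0) = 1, exact solution exp(-t).
traj = euler_solve(lambda t, x: -x, x0=1.0, t0=0.0, t1=1.0, steps=1000)
euler_error = abs(traj[-1] - math.exp(-1.0))
```

    The global error of explicit Euler shrinks linearly with the step size, so halving `h` roughly halves `euler_error` on this problem.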
    Interpretable Anomaly Detection in Echocardiograms with Dynamic Variational Trajectory Models. (arXiv:2206.15316v1 [cs.LG])
    We propose a novel anomaly detection method for echocardiogram videos. The introduced method takes advantage of the periodic nature of the heart cycle to learn different variants of a variational latent trajectory model (TVAE). The models are trained on the healthy samples of an in-house dataset of infant echocardiogram videos consisting of multiple chamber views to learn a normative prior of the healthy population. During inference, maximum a posteriori (MAP) based anomaly detection is performed to detect out-of-distribution samples in our dataset. The proposed method reliably identifies severe congenital heart defects, such as Ebstein's anomaly or Shone complex. Moreover, it achieves superior performance over MAP-based anomaly detection with standard variational autoencoders on the task of detecting pulmonary hypertension and right ventricular dilation. Finally, we demonstrate that the proposed method provides interpretable explanations of its output through heatmaps which highlight the regions corresponding to anomalous heart structures.  ( 2 min )
    R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS. (arXiv:2206.15276v1 [cs.SD])
    This paper introduces R-MelNet, a two-part autoregressive architecture with a frontend based on the first tier of MelNet and a backend WaveRNN-style audio decoder for neural text-to-speech synthesis. Taking as input a mixed sequence of characters and phonemes, with an optional audio priming sequence, this model produces low-resolution mel-spectral features which are interpolated and used by a WaveRNN decoder to produce an audio waveform. Coupled with half precision training, R-MelNet uses under 11 gigabytes of GPU memory on a single commodity GPU (NVIDIA 2080Ti). We detail a number of critical implementation details for stable half precision training, including an approximate, numerically stable mixture of logistics attention. Using a stochastic, multi-sample per step inference scheme, the resulting model generates highly varied audio, while enabling text and audio based controls to modify output waveforms. Qualitative and quantitative evaluations of an R-MelNet system trained on a single speaker TTS dataset demonstrate the effectiveness of our approach.  ( 2 min )
    Classification of network topology and dynamics via sequence characterization. (arXiv:2206.15190v1 [cs.SI])
    Sequences arise in many real-world scenarios; thus, identifying the mechanisms behind symbol generation is essential to understanding many complex systems. This paper analyzes sequences generated by agents walking on a networked topology. Given that in many real scenarios the underlying process generating the sequence is hidden, we investigate whether reconstructing the network via the co-occurrence method is useful for recovering both the network topology and the agent dynamics generating the sequences. We found that the characterization of reconstructed networks provides valuable information regarding the process and topology used to create the sequences. In a machine learning approach considering 16 combinations of network topology and agent dynamics as classes, we obtained an accuracy of 87% with sequences generated with less than 40% of nodes visited. Longer sequences turned out to yield improved machine learning models. Our findings suggest that the proposed methodology could be extended to classify sequences and understand the mechanisms behind sequence generation.  ( 2 min )
    Machine learning for automated quality control in injection moulding manufacturing. (arXiv:2206.15285v1 [cs.LG])
    Machine learning (ML) may improve and automate quality control (QC) in injection moulding manufacturing. As the labelling of extensive, real-world process data is costly, however, the use of simulated process data may offer a first step towards a successful implementation. In this study, simulated data was used to develop a predictive model for the product quality of an injection moulded sorting container. The achieved accuracy, specificity and sensitivity on the test set were $99.4\%$, $99.7\%$ and $94.7\%$, respectively. This study thus shows the potential of ML towards automated QC in injection moulding and encourages the extension to ML models trained on real-world data.  ( 2 min )
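    The three reported metrics follow from a binary confusion matrix. The sketch below shows the standard definitions; the counts are hypothetical and are not the study's data, and treating "defective part" as the positive class is an assumption.

```python
from typing import Tuple


def qc_metrics(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float, float]:
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts.

    Assumption: the positive class is "defective part".
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # share of all parts classified correctly
    sensitivity = tp / (tp + fn)                # share of defective parts that are caught
    specificity = tn / (tn + fp)                # share of good parts that are passed
    return accuracy, sensitivity, specificity


# Hypothetical counts for illustration only (1000 parts, 100 of them defective).
accuracy, sensitivity, specificity = qc_metrics(tp=90, fp=20, tn=880, fn=10)
```

    Reporting sensitivity and specificity next to accuracy matters when defects are rare: a model that passes every part can still score a high accuracy while catching no defects.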
    AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection. (arXiv:2206.15476v1 [cs.LG])
    Analyzing the distribution shift of data is a growing research direction in machine learning, leading to new benchmarks that focus on providing a suitable scenario for studying the generalization properties of ML models. The existing benchmarks are focused on supervised learning, and to the best of our knowledge, there is none for unsupervised learning. Therefore, we introduce an unsupervised anomaly detection benchmark with data that shifts over time, built over Kyoto-2006+, a traffic dataset for network intrusion detection. This kind of data meets the premise of shifting the input distribution: it covers a large time span ($10$ years), with naturally occurring changes over time (e.g., users modifying their behavior patterns, and software updates). We first highlight the non-stationary nature of the data, using a basic per-feature analysis, t-SNE, and an Optimal Transport approach for measuring the overall distribution distances between years. Next, we propose AnoShift, a protocol splitting the data into IID, NEAR, and FAR testing splits. We validate the performance degradation over time with diverse models (MLM to classical Isolation Forest). Finally, we show that by acknowledging the distribution shift problem and properly addressing it, the performance can be improved compared to the classical IID training (by up to $3\%$, on average). Dataset and code are available at https://github.com/bit-ml/AnoShift/.  ( 2 min )
    Towards out of distribution generalization for problems in mechanics. (arXiv:2206.14917v1 [stat.ML])
    There has been a massive increase in research interest towards applying data driven methods to problems in mechanics. While traditional machine learning (ML) methods have enabled many breakthroughs, they rely on the assumption that the training (observed) data and testing (unseen) data are independent and identically distributed (i.i.d). Thus, traditional ML approaches often break down when applied to real world mechanics problems with unknown test environments and data distribution shifts. In contrast, out-of-distribution (OOD) generalization assumes that the test data may shift (i.e., violate the i.i.d. assumption). To date, multiple methods have been proposed to improve the OOD generalization of ML methods. However, because of the lack of benchmark datasets for OOD regression problems, the efficiency of these OOD methods on regression problems, which dominate the mechanics field, remains unknown. To address this, we investigate the performance of OOD generalization methods for regression problems in mechanics. Specifically, we identify three OOD problems: covariate shift, mechanism shift, and sampling bias. For each problem, we create two benchmark examples that extend the Mechanical MNIST dataset collection, and we investigate the performance of popular OOD generalization methods on these mechanics-specific regression problems. Our numerical experiments show that in most cases, while the OOD generalization algorithms perform better compared to traditional ML methods on these OOD problems, there is a compelling need to develop more robust OOD generalization methods that are effective across multiple OOD scenarios. Overall, we expect that this study, as well as the associated open access benchmark datasets, will enable further development of OOD generalization methods for mechanics specific regression problems.  ( 3 min )
    Using Person Embedding to Enrich Features and Data Augmentation for Classification. (arXiv:2206.15162v1 [cs.LG])
    Today, machine learning is applied in almost every field. Among its numerous methods, classification is one of the most basic and crucial, and it can solve a wide variety of problems. Feature selection for model setup is extremely important, and producing new features via feature engineering also plays a vital role in the success of the model. In our study, fraud detection classification models are built on a labeled and imbalanced dataset as a case study. Although word embedding is a natural language processing method, it has been used in different areas, especially recommender systems, and here it is used to create a customer space. The customer vectors in the created space are fed to the classification model as a feature. Moreover, to increase the number of positive labels, rows with similar characteristics are re-labeled as positive by using customer similarity determined by embedding. The model that includes embedding methods, which provide a better representation of customers, has been compared with other models. The results show that the customer embedding method had a positive effect on the success of the classification models.  ( 2 min )
    Out-of-Distribution Detection for Long-tailed and Fine-grained Skin Lesion Images. (arXiv:2206.15186v1 [cs.CV])
    Recent years have witnessed a rapid development of automated methods for skin lesion diagnosis and classification. Due to an increasing deployment of such systems in clinics, it has become important to develop a system that is more robust to various Out-of-Distribution (OOD) samples (unknown skin lesions and conditions). However, the current deep learning models trained for skin lesion classification tend to classify these OOD samples incorrectly into one of their learned skin lesion categories. To address this issue, we propose a simple yet strategic approach that improves the OOD detection performance while maintaining the multi-class classification accuracy for the known categories of skin lesion. Specifically, this approach is built upon a realistic scenario of a long-tailed and fine-grained OOD detection task for skin lesion images. In this approach, we first target the mixup amongst middle and tail classes to address the long-tail problem. We then combine this mixup strategy with prototype learning to address the fine-grained nature of the dataset. The contribution of this paper is two-fold, justified by extensive experiments. First, we present a realistic problem setting of the OOD task for skin lesions. Second, we propose an approach that targets the long-tailed and fine-grained aspects of the problem setting simultaneously to increase the OOD performance.  ( 3 min )
    GSCLIP : A Framework for Explaining Distribution Shifts in Natural Language. (arXiv:2206.15007v1 [cs.CL])
    Helping end users comprehend abstract distribution shifts can greatly facilitate AI deployment. Motivated by this, we propose a novel task, dataset explanation. Given two image data sets, dataset explanation aims to automatically point out their dataset-level distribution shifts with natural language. Current techniques for monitoring distribution shifts provide inadequate information to understand datasets with the goal of improving data quality. Therefore, we introduce GSCLIP, a training-free framework to solve the dataset explanation task. In GSCLIP, we propose the selector as the first quantitative evaluation method to identify explanations that properly summarize dataset shifts. Furthermore, we leverage this selector to demonstrate the superiority of a generator based on language model generation. Systematic evaluation on natural data shift verifies that GSCLIP, a combined system of a hybrid generator group and an efficient selector, is not only easy to use but also powerful for dataset explanation at scale.  ( 2 min )
    Bridging Mean-Field Games and Normalizing Flows with Trajectory Regularization. (arXiv:2206.14990v1 [math.OC])
    Mean-field games (MFGs) are a modeling framework for systems with a large number of interacting agents. They have applications in economics, finance, and game theory. Normalizing flows (NFs) are a family of deep generative models that compute data likelihoods by using an invertible mapping, which is typically parameterized by using neural networks. They are useful for density modeling and data generation. While active research has been conducted on both models, few noted the relationship between the two. In this work, we unravel the connections between MFGs and NFs by contextualizing the training of an NF as solving the MFG. This is achieved by reformulating the MFG problem in terms of agent trajectories and parameterizing a discretization of the resulting MFG with flow architectures. With this connection, we explore two research directions. First, we employ expressive NF architectures to accurately solve high-dimensional MFGs, sidestepping the curse of dimensionality in traditional numerical methods. Compared with other deep learning approaches, our trajectory-based formulation encodes the continuity equation in the neural network, resulting in a better approximation of the population dynamics. Second, we regularize the training of NFs with transport costs and show the effectiveness on controlling the model's Lipschitz bound, resulting in better generalization performance. We demonstrate numerical results through comprehensive experiments on a variety of synthetic and real-life datasets.  ( 3 min )
    Personalized Detection of Cognitive Biases in Actions of Users from Their Logs: Anchoring and Recency Biases. (arXiv:2206.15129v1 [cs.AI])
    Cognitive biases are mental shortcuts humans use in dealing with information and the environment, and which result in biased actions and behaviors, unbeknownst to themselves. Biases take many forms, with cognitive biases occupying a central role that affects fairness, accountability, transparency, ethics, law, medicine, and discrimination. Detection of biases is considered a necessary step toward their mitigation. Herein, we focus on two cognitive biases: anchoring and recency. The recognition of cognitive bias in computer science is largely in the domain of information retrieval, and bias is identified at an aggregate level with the help of annotated data. Proposing a different direction for bias detection, we offer a principled approach along with Machine Learning to detect these two cognitive biases from Web logs of users' actions. Our individual user level detection makes it truly personalized, and does not rely on annotated data. Instead, we start with two basic principles established in cognitive psychology, use modified training of an attention network, and interpret attention weights in a novel way according to those principles, to infer and distinguish between these two biases. The personalized approach allows detection for specific users who are susceptible to these biases when performing their tasks, and can help build awareness among them so as to undertake bias mitigation.  ( 3 min )
    Leveraging Joint-Diagonalization in Transform-Learning NMF. (arXiv:2112.05664v2 [cs.LG] UPDATED)
    Non-negative matrix factorization with transform learning (TL-NMF) is a recent idea that aims at learning data representations suited to NMF. In this work, we relate TL-NMF to the classical matrix joint-diagonalization (JD) problem. We show that, when the number of data realizations is sufficiently large, TL-NMF can be replaced by a two-step approach -- termed JD+NMF -- that estimates the transform through JD, prior to NMF computation. In contrast, we found that when the number of data realizations is limited, not only is JD+NMF no longer equivalent to TL-NMF, but the inherent low-rank constraint of TL-NMF turns out to be an essential ingredient to learn meaningful transforms for NMF.  ( 2 min )
    A note on Linear Bottleneck networks and their Transition to Multilinearity. (arXiv:2206.15058v1 [cs.LG])
    Randomly initialized wide neural networks transition to linear functions of weights as the width grows, in a ball of radius $O(1)$ around initialization. A necessary condition for this result is that all layers of the network are wide enough, i.e., all widths tend to infinity. However, the transition to linearity breaks down when this infinite width assumption is violated. In this work we show that linear networks with a bottleneck layer learn bilinear functions of the weights, in a ball of radius $O(1)$ around initialization. In general, for $B-1$ bottleneck layers, the network is a degree $B$ multilinear function of weights. Importantly, the degree only depends on the number of bottlenecks and not the total depth of the network.  ( 2 min )
    FeaRLESS: Feature Refinement Loss for Ensembling Self-Supervised Learning Features in Robust End-to-end Speech Recognition. (arXiv:2206.15056v1 [cs.SD])
    Self-supervised learning representations (SSLR) have resulted in robust features for downstream tasks in many fields. Recently, several SSLRs have shown promising results on automatic speech recognition (ASR) benchmark corpora. However, previous studies have only shown performance for solitary SSLRs as an input feature for ASR models. In this study, we propose to investigate the effectiveness of diverse SSLR combinations using various fusion methods within end-to-end (E2E) ASR models. In addition, we will show there are correlations between these extracted SSLRs. As such, we further propose a feature refinement loss for decorrelation to efficiently combine the set of input features. For evaluation, we show that the proposed 'FeaRLESS learning features' perform better than systems without the proposed feature refinement loss for both the WSJ and Fearless Steps Challenge (FSC) corpora.  ( 2 min )
    ZeroC: A Neuro-Symbolic Model for Zero-shot Concept Recognition and Acquisition at Inference Time. (arXiv:2206.15049v1 [cs.LG])
    Humans have the remarkable ability to recognize and acquire novel visual concepts in a zero-shot manner. Given a high-level, symbolic description of a novel concept in terms of previously learned visual concepts and their relations, humans can recognize novel concepts without seeing any examples. Moreover, they can acquire new concepts by parsing and communicating symbolic structures using learned visual concepts and relations. Endowing machines with these capabilities is pivotal in improving their generalization capability at inference time. In this work, we introduce Zero-shot Concept Recognition and Acquisition (ZeroC), a neuro-symbolic architecture that can recognize and acquire novel concepts in a zero-shot way. ZeroC represents concepts as graphs of constituent concept models (as nodes) and their relations (as edges). To allow inference time composition, we employ energy-based models (EBMs) to model concepts and relations. We design the ZeroC architecture so that it allows a one-to-one mapping between the symbolic graph structure of a concept and its corresponding EBM, which, for the first time, allows acquiring new concepts, communicating their graph structure, and applying them to classification and detection tasks (even across domains) at inference time. We introduce algorithms for learning and inference with ZeroC. We evaluate ZeroC on a challenging grid-world dataset which is designed to probe zero-shot concept recognition and acquisition, and demonstrate its capability.  ( 3 min )
    Investigating classification learning curves for automatically generated and labelled plant images. (arXiv:2205.10955v3 [cs.LG] UPDATED)
    In the context of supervised machine learning, a learning curve describes how a model's performance on unseen data relates to the number of samples used to train the model. In this paper we present a dataset of plant images with representatives of crops and weeds common to the Manitoba prairies at different growth stages. We determine the learning curve for a classification task on this data with the ResNet architecture. Our results are in accordance with previous studies and add to the evidence that learning curves are governed by power-law relationships over large scales, applications, and models. We further investigate how label noise and the reduction of trainable parameters impact the learning curve on this dataset. Both effects lead to the model requiring disproportionately larger training sets to achieve the same classification performance as observed without these effects.  ( 2 min )
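    A power-law learning curve can be estimated with a straight-line fit in log-log space. The sketch below assumes the common form error(n) ≈ a · n^(−b), which the abstract does not spell out, and fits it to synthetic points generated from known parameters; none of the numbers come from the paper.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(ns: Sequence[float], errs: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of err = a * n**(-b) via log(err) = log(a) - b * log(n).

    Returns (a, b). Assumes all inputs are strictly positive.
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic learning curve with known a = 2.0 and b = 0.5 (illustrative only).
train_sizes = [100, 1000, 10000, 100000]
test_errors = [2.0 * n ** -0.5 for n in train_sizes]
a_hat, b_hat = fit_power_law(train_sizes, test_errors)
```

    With an exponent b, reducing the error by a factor c requires c^(1/b) times more training data, which is why a smaller fitted exponent translates into disproportionately larger training sets.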
    ComDensE : Combined Dense Embedding of Relation-aware and Common Features for Knowledge Graph Completion. (arXiv:2206.14925v1 [cs.AI])
    Real-world knowledge graphs (KGs) are mostly incomplete. The problem of recovering missing relations, called KG completion, has recently become an active research area. KG embedding, a low-dimensional representation of entities and relations, is the crucial technique for KG completion. Convolutional neural networks in models such as ConvE, SACN, InteractE, and RGCN have achieved recent successes. This paper takes a different architectural view and proposes ComDensE, which combines relation-aware and common features using dense neural networks. In the relation-aware feature extraction, we attempt to create relational inductive bias by applying an encoding function specific to each relation. In the common feature extraction, we apply a common encoding function to all input embeddings. These encoding functions are implemented using dense layers in ComDensE. ComDensE achieves state-of-the-art performance in link prediction in terms of MRR and HIT@1 on FB15k-237 and HIT@1 on WN18RR compared to the previous baseline approaches. We conduct an extensive ablation study to examine the effects of the relation-aware layer and the common layer of ComDensE. Experimental results illustrate that the combined dense architecture as implemented in ComDensE achieves the best performance.  ( 2 min )
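    The link-prediction metrics named above are computed from the rank that the model assigns to the correct entity for each test triple. The sketch below uses the standard definitions of mean reciprocal rank (MRR) and Hits@k; the rank list is hypothetical and the function names are illustrative.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all test triples (ranks start at 1)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of test triples whose correct entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the correct entity for four test triples.
ranks = [1, 2, 4, 1]
mrr = mean_reciprocal_rank(ranks)  # (1 + 1/2 + 1/4 + 1) / 4
hits1 = hits_at_k(ranks, k=1)      # two of the four ranks equal 1
```

    MRR rewards near-misses (a rank of 2 still contributes 0.5), whereas Hits@1 counts only exact top-ranked answers, so the two can order models differently.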
    Teach me how to Interpolate a Myriad of Embeddings. (arXiv:2206.14868v1 [cs.LG])
    Mixup refers to interpolation-based data augmentation, originally motivated as a way to go beyond empirical risk minimization (ERM). Yet, its extensions focus on the definition of interpolation and the space where it takes place, while the augmentation itself is less studied: For a mini-batch of size $m$, most methods interpolate between $m$ pairs with a single scalar interpolation factor $\lambda$. In this work, we make progress in this direction by introducing MultiMix, which interpolates an arbitrary number $n$ of tuples, each of length $m$, with one vector $\lambda$ per tuple. On sequence data, we further extend to dense interpolation and loss computation over all spatial positions. Overall, we increase the number of tuples per mini-batch by orders of magnitude at little additional cost. This is possible by interpolating at the very last layer before the classifier. Finally, to address inconsistencies due to linear target interpolation, we introduce a self-distillation approach to generate and interpolate synthetic targets. We empirically show that our contributions result in significant improvement over state-of-the-art mixup methods on four benchmarks. By analyzing the embedding space, we observe that the classes are more tightly clustered and uniformly spread over the embedding space, thereby explaining the improved behavior.  ( 2 min )
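    The core idea of interpolating a whole mini-batch with one weight vector per generated example can be sketched as a convex combination. This is a simplified illustration, not the MultiMix implementation: drawing each weight vector from a Dirichlet distribution is an assumption, the dense (per-position) variant and self-distillation are omitted, and all names and values are hypothetical.

```python
import random
from typing import List, Tuple


def sample_simplex(m: int, alpha: float, rng: random.Random) -> List[float]:
    """Draw a length-m weight vector from Dirichlet(alpha, ..., alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(m)]
    total = sum(draws)
    return [d / total for d in draws]


def multi_interpolate(embeddings: List[List[float]], targets: List[List[float]],
                      n: int, alpha: float = 1.0,
                      seed: int = 0) -> Tuple[List[List[float]], List[List[float]]]:
    """Generate n convex combinations of all m embeddings and their targets.

    Each generated example uses its own weight vector of length m, and the
    same weights are applied to the one-hot targets (linear target interpolation).
    """
    rng = random.Random(seed)
    m = len(embeddings)
    mixed_x, mixed_y = [], []
    for _ in range(n):
        lam = sample_simplex(m, alpha, rng)
        mixed_x.append([sum(w * e[d] for w, e in zip(lam, embeddings))
                        for d in range(len(embeddings[0]))])
        mixed_y.append([sum(w * t[c] for w, t in zip(lam, targets))
                        for c in range(len(targets[0]))])
    return mixed_x, mixed_y


# Hypothetical mini-batch: m = 3 two-dimensional embeddings, 2 classes.
emb = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
tgt = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
mixed_emb, mixed_tgt = multi_interpolate(emb, tgt, n=5)
```

    Because the mixing acts on embeddings rather than inputs, n can exceed the batch size m without any extra forward passes through the encoder, which is consistent with the abstract's remark about interpolating at the last layer before the classifier.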
    Randomized K-FACs: Speeding up K-FAC with Randomized Numerical Linear Algebra. (arXiv:2206.15397v1 [cs.LG])
    K-FAC is a successful tractable implementation of Natural Gradient for Deep Learning, which nevertheless suffers from the requirement to compute the inverse of the Kronecker factors (through an eigen-decomposition). This can be very time-consuming (or even prohibitive) when these factors are large. In this paper, we theoretically show that, owing to the exponential-average construction paradigm of the Kronecker factors that is typically used, their eigen-spectrum must decay. We show numerically that in practice this decay is very rapid, leading to the idea that we could save substantial computation by only focusing on the first few eigen-modes when inverting the Kronecker-factors. Randomized Numerical Linear Algebra provides us with the necessary tools to do so. Numerical results show we obtain $\approx2.5\times$ reduction in per-epoch time and $\approx3.3\times$ reduction in time to target accuracy. We compare our proposed K-FAC sped-up versions with a more computationally efficient NG implementation, SENG, and observe we perform on par with it.  ( 2 min )
    Group-invariant tensor train networks for supervised learning. (arXiv:2206.15051v1 [cs.LG])
    Invariance has recently proven to be a powerful inductive bias in machine learning models. One such class of predictive or generative models are tensor networks. We introduce a new numerical algorithm to construct a basis of tensors that are invariant under the action of normal matrix representations of an arbitrary discrete group. This method can be up to several orders of magnitude faster than previous approaches. The group-invariant tensors are then combined into a group-invariant tensor train network, which can be used as a supervised machine learning model. We applied this model to a protein binding classification problem, taking into account problem-specific invariances, and obtained prediction accuracy in line with state-of-the-art deep learning approaches.  ( 2 min )
    Virtual Analog Modeling of Distortion Circuits Using Neural Ordinary Differential Equations. (arXiv:2205.01897v2 [eess.AS] UPDATED)
    Recent research in deep learning has shown that neural networks can learn differential equations governing dynamical systems. In this paper, we adapt this concept to Virtual Analog (VA) modeling to learn the ordinary differential equations (ODEs) governing the first-order and the second-order diode clipper. The proposed models achieve performance comparable to state-of-the-art recurrent neural networks (RNNs) while using fewer parameters. We show that this approach does not require oversampling and allows the sampling rate to be increased after training has completed, which results in increased accuracy. Using a sophisticated numerical solver increases the accuracy at the cost of slower processing. ODEs learned this way do not require closed forms but are still physically interpretable.  ( 2 min )
    Pooling Revisited: Your Receptive Field is Suboptimal. (arXiv:2205.15254v2 [cs.CV] UPDATED)
    The size and shape of the receptive field determine how the network aggregates local information and affect the overall performance of a model considerably. Many components in a neural network, such as kernel sizes and strides for convolution and pooling operations, influence the configuration of a receptive field. However, they still rely on hyperparameters, and the receptive fields of existing models result in suboptimal shapes and sizes. Hence, we propose a simple yet effective Dynamically Optimized Pooling operation, referred to as DynOPool, which optimizes the scale factors of feature maps end-to-end by learning the desirable size and shape of its receptive field in each layer. Any kind of resizing module in a deep neural network can be replaced by DynOPool operations at minimal cost. Also, DynOPool controls the complexity of a model by introducing an additional loss term that constrains computational cost. Our experiments show that the models equipped with the proposed learnable resizing module outperform the baseline networks on multiple datasets in image classification and semantic segmentation.  ( 2 min )
    On Measuring Excess Capacity in Neural Networks. (arXiv:2202.08070v2 [cs.LG] UPDATED)
    We study the excess capacity of deep networks in the context of supervised classification. That is, given a capacity measure of the underlying hypothesis class -- in our case, empirical Rademacher complexity -- by how much can we (a priori) constrain this class while retaining an empirical error on a par with the unconstrained regime? To assess excess capacity in modern architectures (such as residual networks), we extend and unify prior Rademacher complexity bounds to accommodate function composition and addition, as well as the structure of convolutions. The capacity-driving terms in our bounds are the Lipschitz constants of the layers and a (2,1) group norm distance to the initializations of the convolution weights. Experiments on benchmark datasets of varying task difficulty indicate that (1) there is a substantial amount of excess capacity per task, and (2) capacity can be kept at a surprisingly similar level across tasks. Overall, this suggests a notion of compressibility with respect to weight norms, orthogonal to classic compression via weight pruning.  ( 2 min )
    Deep Fusion Prior for Multi-Focus Image Super Resolution Fusion. (arXiv:2110.05706v4 [cs.CV] UPDATED)
    Multi-focus image fusion (MFIF) and super-resolution (SR) are inverse problems of the imaging model; their purposes are to obtain all-in-focus and high-resolution 2D mappings of targets. Though various MFIF and SR methods have been designed, almost all of them deal with MFIF and SR separately. This paper unifies the MFIF and SR problems from a physical perspective as multi-focus image super resolution fusion (MFISRF), and we propose a novel unified dataset-free unsupervised framework named deep fusion prior (DFP), based on deep image prior (DIP), to address MFISRF with a single model. Experiments show that our proposed DFP approaches or even outperforms state-of-the-art combinations of MFIF and SR methods. To the best of our knowledge, our proposed work is the first dataset-free unsupervised method to simultaneously perform the multi-focus fusion and super-resolution tasks. Additionally, DFP is a general framework, so its networks and focus measurement tactics can be continuously updated to further improve MFISRF performance. The DFP code is open source and available at this http URL  ( 3 min )
    A Rigorous Study of Integrated Gradients Method and Extensions to Internal Neuron Attributions. (arXiv:2202.11912v2 [cs.LG] UPDATED)
    As deep learning (DL) efficacy grows, concerns for poor model explainability grow also. Attribution methods address the issue of explainability by quantifying the importance of an input feature for a model prediction. Among various methods, Integrated Gradients (IG) sets itself apart by claiming other methods failed to satisfy desirable axioms, while IG and methods like it uniquely satisfy said axioms. This paper comments on fundamental aspects of IG and its applications/extensions: 1) We identify key differences between IG function spaces and the supporting literature's function spaces which problematize previous claims of IG uniqueness. We show that with the introduction of an additional axiom, \textit{non-decreasing positivity}, the uniqueness claims can be established. 2) We address the question of input sensitivity by identifying function classes where IG is/is not Lipschitz in the attributed input. 3) We show that axioms for single-baseline methods have analogous properties for methods with probability distribution baselines. 4) We introduce a computationally efficient method of identifying internal neurons that contribute to specified regions of an IG attribution map. Finally, we present experimental results validating this method.  ( 2 min )
    SOSP: Efficiently Capturing Global Correlations by Second-Order Structured Pruning. (arXiv:2110.11395v2 [cs.LG] UPDATED)
    Pruning neural networks reduces inference time and memory costs. On standard hardware, these benefits will be especially prominent if coarse-grained structures, like feature maps, are pruned. We devise two novel saliency-based methods for second-order structured pruning (SOSP) which include correlations among all structures and layers. Our main method SOSP-H employs an innovative second-order approximation, which enables saliency evaluations by fast Hessian-vector products. SOSP-H thereby scales like a first-order method despite taking into account the full Hessian. We validate SOSP-H by comparing it to our second method SOSP-I, which uses a well-established Hessian approximation, and to numerous state-of-the-art methods. While SOSP-H performs on par or better in terms of accuracy, it has clear advantages in terms of scalability and efficiency. This allowed us to scale SOSP-H to large-scale vision tasks, even though it captures correlations across all layers of the network. To underscore the global nature of our pruning methods, we evaluate their performance not only by removing structures from a pretrained network, but also by detecting architectural bottlenecks. We show that our algorithms allow us to systematically reveal architectural bottlenecks, which we then remove to further increase the accuracy of the networks.  ( 3 min )
    Shifts 2.0: Extending The Dataset of Real Distributional Shifts. (arXiv:2206.15407v1 [cs.LG])
    Distributional shift, or the mismatch between training and deployment data, is a significant obstacle to the usage of machine learning in high-stakes industrial applications, such as autonomous driving and medicine. This creates a need to be able to assess how robustly ML models generalize as well as the quality of their uncertainty estimates. Standard ML baseline datasets do not allow these properties to be assessed, as the training, validation and test data are often identically distributed. Recently, a range of dedicated benchmarks have appeared, featuring both distributionally matched and shifted data. Among these benchmarks, the Shifts dataset stands out in terms of the diversity of tasks as well as the data modalities it features. While most of the benchmarks are heavily dominated by 2D image classification tasks, Shifts contains tabular weather forecasting, machine translation, and vehicle motion prediction tasks. This enables the robustness properties of models to be assessed on a diverse set of industrial-scale tasks and either universal or directly applicable task-specific conclusions to be reached. In this paper, we extend the Shifts Dataset with two datasets sourced from industrial, high-risk applications of high societal importance. Specifically, we consider the tasks of segmentation of white matter Multiple Sclerosis lesions in 3D magnetic resonance brain images and the estimation of power consumption in marine cargo vessels. Both tasks feature ubiquitous distributional shifts and a strict safety requirement due to the high cost of errors. These new datasets will allow researchers to further explore robust generalization and uncertainty estimation in new situations. In this work, we provide a description of the dataset and baseline results for both tasks.  ( 3 min )
    Is Neuro-Symbolic AI Meeting its Promise in Natural Language Processing? A Structured Review. (arXiv:2202.12205v2 [cs.AI] UPDATED)
    Advocates for Neuro-Symbolic Artificial Intelligence (NeSy) assert that combining deep learning with symbolic reasoning will lead to stronger AI than either paradigm on its own. As successful as deep learning has been, it is generally accepted that even our best deep learning systems are not very good at abstract reasoning. And since reasoning is inextricably linked to language, it makes intuitive sense that Natural Language Processing (NLP) would be a particularly well-suited candidate for NeSy. We conduct a structured review of studies implementing NeSy for NLP, with the aim of answering the question of whether NeSy is indeed meeting its promises: reasoning, out-of-distribution generalization, interpretability, learning and reasoning from small data, and transferability to new domains. We examine the impact of knowledge representation, such as rules and semantic networks, language structure and relational structure, and whether implicit or explicit reasoning contributes to higher promise scores. We find that systems where logic is compiled into the neural network lead to the most NeSy goals being satisfied, while other factors such as knowledge representation or type of neural architecture do not exhibit a clear correlation with goals being met. We find many discrepancies in how reasoning is defined, specifically in relation to human-level reasoning, which impact decisions about model architectures and drive conclusions which are not always consistent across studies. Hence we advocate for a more methodical approach to the application of theories of human reasoning as well as the development of appropriate benchmarks, which we hope can lead to a better understanding of progress in the field. We make our data and code available on GitHub for further analysis.  ( 3 min )
    Rethinking Exponential Averaging of the Fisher. (arXiv:2204.04718v2 [cs.LG] UPDATED)
    In optimization for Machine learning (ML), it is typical that curvature-matrix (CM) estimates rely on an exponential average (EA) of local estimates (giving EA-CM algorithms). This approach has little principled justification, but is very often used in practice. In this paper, we draw a connection between EA-CM algorithms and what we call a "Wake of Quadratic regularized models". The outlined connection allows us to understand what EA-CM algorithms are doing from an optimization perspective. Generalizing from the established connection, we propose a new family of algorithms, "KL-Divergence Wake-Regularized Models" (KLD-WRM). We give three different practical instantiations of KLD-WRM, and show numerically that these outperform K-FAC on MNIST.  ( 2 min )
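    For readers unfamiliar with the exponential-average construction mentioned above, a minimal sketch of one update step follows. It assumes the local curvature estimate is the outer product of a gradient vector (an empirical-Fisher-style choice made here for illustration); the function name `ea_curvature_update` and the decay parameter `rho` are hypothetical and not taken from the paper.

```python
def ea_curvature_update(curvature, grad, rho=0.95):
    """Return rho * curvature + (1 - rho) * grad grad^T as a new matrix.

    Assumption: the local curvature estimate is the outer product of the
    gradient with itself; other local estimates fit the same update form.
    """
    n = len(grad)
    return [
        [rho * curvature[i][j] + (1.0 - rho) * grad[i] * grad[j] for j in range(n)]
        for i in range(n)
    ]
```

    Starting from a zero matrix and calling the function once per step keeps a running estimate in which older local estimates are down-weighted geometrically by `rho`.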
    Contrastive Pretraining for Echocardiography Segmentation with Limited Data. (arXiv:2201.07219v2 [eess.IV] UPDATED)
    Contrastive learning has proven useful in many applications where access to labelled data is limited. The lack of annotated data is particularly problematic in medical image segmentation as it is difficult to have clinical experts manually annotate large volumes of data such as cardiac structures in ultrasound images of the heart. In this paper, we investigate whether or not contrastive pretraining is helpful for the segmentation of the left ventricle in echocardiography images. Furthermore, we study the effect of contrastive pretraining on two well-known segmentation networks, UNet and DeepLabV3. Our results show that contrastive pretraining helps improve the performance on left ventricle segmentation, particularly when annotated data is scarce. We show how to achieve comparable results to state-of-the-art fully supervised algorithms when we train our models in a self-supervised fashion followed by fine-tuning on just 5\% of the data. We show that our solution outperforms what is currently published on a large public dataset (EchoNet-Dynamic), achieving a Dice score of 0.9211. We also compare the performance of our solution on another smaller dataset (CAMUS) to demonstrate the generalizability of our proposed solution. The code is available at (https://github.com/BioMedIA-MBZUAI/contrastive-echo).  ( 3 min )
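    The Dice score quoted above measures the overlap between a predicted mask and a ground-truth mask. The sketch below computes the standard definition on flattened binary masks; the function name `dice_score` and the handling of two empty masks are illustrative choices, not details from the paper.

```python
def dice_score(pred, truth):
    """Dice = 2 * |pred AND truth| / (|pred| + |truth|) for flat binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

    For example, a prediction covering two pixels, one of which lies in a one-pixel ground truth, gives 2 * 1 / (2 + 1), or about 0.67.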
    GraphFramEx: Towards Systematic Evaluation of Explainability Methods for Graph Neural Networks. (arXiv:2206.09677v2 [cs.LG] UPDATED)
    As one of the most popular machine learning models today, graph neural networks (GNNs) have attracted intense interest recently, and so has their explainability. Users are increasingly interested in a better understanding of GNN models and their outcomes. Unfortunately, today's evaluation frameworks for GNN explainability often rely on synthetic datasets, leading to conclusions of limited scope due to a lack of complexity in the problem instances. As GNN models are deployed to more mission-critical applications, we are in dire need of a common evaluation protocol for explainability methods of GNNs. In this paper, we propose, to the best of our knowledge, the first systematic evaluation framework for GNN explainability, considering explainability on three different "user needs": explanation focus, mask nature, and mask transformation. We propose a unique metric that combines the fidelity measures and classifies explanations based on their quality of being sufficient or necessary. We restrict our scope to node classification tasks and compare the most representative techniques in the field of input-level explainability for GNNs. For the widely used synthetic benchmarks, surprisingly shallow techniques such as personalized PageRank have the best performance for a minimum computation time. But when the graph structure is more complex and nodes have meaningful features, gradient-based methods, in particular Saliency, are the best according to our evaluation criteria. However, none dominates the others on all evaluation dimensions and there is always a trade-off. We further apply our evaluation protocol in a case study on eBay graphs to reflect the production environment.  ( 3 min )
    An Efficient Industrial Federated Learning Framework for AIoT: A Face Recognition Application. (arXiv:2206.13398v2 [cs.CV] UPDATED)
    Recently, the artificial intelligence of things (AIoT) has been gaining increasing attention, with an intriguing vision of providing highly intelligent services through the network connection of things, leading to an advanced AI-driven ecology. However, recent regulatory restrictions on data privacy preclude uploading sensitive local data to data centers and utilizing them in a centralized approach. Directly applying federated learning algorithms in this scenario could hardly meet the industrial requirements of both efficiency and accuracy. Therefore, we propose an efficient industrial federated learning framework for AIoT in terms of a face recognition application. Specifically, we propose to utilize the concept of transfer learning to speed up federated training on devices and further present a novel design of a private projector that helps protect shared gradients without incurring additional memory consumption or computational cost. Empirical studies on a private Asian face dataset show that our approach can achieve high recognition accuracy in only 20 communication rounds, demonstrating its effectiveness in prediction and its efficiency in training.  ( 2 min )
    Ensemble CNN models for Covid-19 Recognition and Severity Perdition From 3D CT-scan. (arXiv:2206.15431v1 [eess.IV])
    Since the appearance of Covid-19 in late 2019, Covid-19 has become an active research topic for the artificial intelligence (AI) community. One of the most interesting AI topics is Covid-19 analysis of medical imaging. CT-scan imaging is the most informative tool about this disease. This work is part of the 2nd COV19D competition, where two challenges are set: Covid-19 Detection and Covid-19 Severity Detection from CT-scans. For Covid-19 detection from CT-scans, we proposed an ensemble of 2D Convolution blocks with Densenet-161 models. Here, each 2D convolutional block with Densenet-161 architecture is trained separately, and in the testing phase the ensemble model is based on the average of their probabilities. On the other hand, we proposed an ensemble of Convolutional Layers with Inception models for Covid-19 severity detection. In addition to the Convolutional Layers, three Inception variants were used, namely Inception-v3, Inception-v4 and Inception-Resnet. Our proposed approaches outperformed the baseline approach on the validation data of the 2nd COV19D competition by 11% and 16% for Covid-19 detection and Covid-19 severity detection, respectively.  ( 3 min )
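    The probability-averaging step of the ensemble described above can be sketched as follows. The function name `ensemble_predict` and the input layout (one list of class probabilities per separately trained model) are assumptions for illustration; the actual models and their outputs are not reproduced here.

```python
def ensemble_predict(model_probs):
    """Average per-class probabilities over models.

    model_probs: list with one entry per model, each a list of class
    probabilities for the same input. Returns (averaged, predicted_class).
    """
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    averaged = [sum(p[c] for p in model_probs) / n_models for c in range(n_classes)]
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return averaged, predicted
```

    With three models voting `[0.6, 0.4]`, `[0.2, 0.8]` and `[0.1, 0.9]`, the averaged probabilities are roughly `[0.3, 0.7]`, so the ensemble picks class 1 even though one member preferred class 0.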
    Predicting Corporate Risk by Jointly Modeling Company Networks and Dialogues in Earnings Conference Calls. (arXiv:2206.06174v2 [cs.CL] UPDATED)
    Earnings conference calls are attracting an increasing number of researchers due to their free form and rich information. Existing studies, however, do not take speaker role information into account. Furthermore, current research does not fully account for the impact of inter-company relationships on company risk. The only study that integrates company networks and earnings conference calls constructs an undirected graph for companies holding earnings conference calls at different dates, failing to meet the requirement of no temporal information leakage for prediction tasks. To address the aforementioned issues, we propose a new model called Temporal Virtual Graph Neural Network (TVGNN), which incorporates earnings conference calls and company networks to predict company risk. For the first time, our model incorporates participant role information in dialogue modeling. Moreover, we develop a new approach to construct company networks that ensures no temporal information leakage in the graph. In experiments, our proposed model outperforms all baselines. The supplementary analyses demonstrate the model's effectiveness and interpretability.  ( 2 min )
    Learning two-phase microstructure evolution using neural operators and autoencoder architectures. (arXiv:2204.07230v2 [cond-mat.mtrl-sci] UPDATED)
    Phase-field modeling is an effective but computationally expensive method for capturing the mesoscale morphological and microstructure evolution in materials. Hence, fast and generalizable surrogate models are needed to alleviate the cost of computationally taxing processes such as in optimization and design of materials. The intrinsic discontinuous nature of the physical phenomena incurred by the presence of sharp phase boundaries makes the training of the surrogate model cumbersome. We develop a framework that integrates a convolutional autoencoder architecture with a deep neural operator (DeepONet) to learn the dynamic evolution of a two-phase mixture and accelerate time-to-solution in predicting the microstructure evolution. We utilize the convolutional autoencoder to provide a compact representation of the microstructure data in a low-dimensional latent space. DeepONet, which consists of two sub-networks, one for encoding the input function at a fixed number of sensors locations (branch net) and another for encoding the locations for the output functions (trunk net), learns the mesoscale dynamics of the microstructure evolution from the autoencoder latent space. The decoder part of the convolutional autoencoder then reconstructs the time-evolved microstructure from the DeepONet predictions. The trained DeepONet architecture can then be used to replace the high-fidelity phase-field numerical solver in interpolation tasks or to accelerate the numerical solver in extrapolation tasks.  ( 3 min )
    Learning Generative Factors of Neuroimaging Data with Variational auto-encoders. (arXiv:2206.01939v2 [cs.LG] UPDATED)
    Neuroimaging techniques produce high-dimensional, stochastic data from which it might be challenging to extract high-level knowledge about the phenomena of interest. We address this challenge by applying the generative modelling framework to 1) classify multiple pathologies and 2) recover the neurological mechanisms of those pathologies in a data-driven manner. Our framework learns generative factors of data related to pathologies. We provide an algorithm to decode those factors further and observe how different pathologies affect observed data. We illustrate the applicability of the proposed approach to identifying schizophrenia, either followed or not by auditory verbal hallucinations. We further demonstrate the ability of the framework to learn disease-related mechanisms consistent with current domain knowledge. We also compare the proposed framework with several benchmark approaches and indicate its classification performance and interpretability advantages.  ( 2 min )
    ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive Training. (arXiv:2110.05323v2 [cs.LG] UPDATED)
    Federated learning is a powerful distributed learning scheme that allows numerous edge devices to collaboratively train a model without sharing their data. However, training is resource-intensive for edge devices, and limited network bandwidth is often the main bottleneck. Prior work often overcomes the constraints by condensing the models or messages into compact formats, e.g., by gradient compression or distillation. In contrast, we propose ProgFed, the first progressive training framework for efficient and effective federated learning. It inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models. We theoretically prove that ProgFed converges at the same asymptotic rate as standard training on full models. Extensive results on a broad range of architectures, including CNNs (VGG, ResNet, ConvNets) and U-nets, and diverse tasks from simple classification to medical image segmentation show that our highly effective training approach saves up to $20\%$ computation and up to $63\%$ communication costs for converged models. As our approach is also complementary to prior work on compression, we can achieve a wide range of trade-offs by combining these techniques, showing reduced communication of up to $50\times$ at only $0.1\%$ loss in utility. Code is available at https://github.com/a514514772/ProgFed.  ( 3 min )
    A Medical Image Fusion Method based on MDLatLRRv2. (arXiv:2206.15179v1 [eess.IV])
    Since MDLatLRR only considers detailed parts (salient features) of input images extracted by latent low-rank representation (LatLRR), it doesn't use base parts (principal features) extracted by LatLRR effectively. Therefore, we proposed an improved multi-level decomposition method called MDLatLRRv2 which effectively analyzes and utilizes all the image features obtained by LatLRR. Then we apply MDLatLRRv2 to medical image fusion. The base parts are fused by average strategy and the detail parts are fused by nuclear-norm operation. The comparison with the existing methods demonstrates that the proposed method can achieve state-of-the-art fusion performance in objective and subjective assessment.  ( 2 min )
    Which Minimizer Does My Neural Network Converge To?. (arXiv:2011.02408v2 [stat.ML] UPDATED)
    The loss surface of an overparameterized neural network (NN) possesses many global minima of zero training error. We explain how common variants of the standard NN training procedure change the minimizer obtained. First, we make explicit how the size of the initialization of a strongly overparameterized NN affects the minimizer and can deteriorate its final test performance. We propose a strategy to limit this effect. Then, we demonstrate that for adaptive optimization such as AdaGrad, the obtained minimizer generally differs from the gradient descent (GD) minimizer. This adaptive minimizer is changed further by stochastic mini-batch training, even though in the non-adaptive case, GD and stochastic GD result in essentially the same minimizer. Lastly, we explain that these effects remain relevant for less overparameterized NNs. While overparameterization has its benefits, our work highlights that it induces sources of error absent from underparameterized models.  ( 2 min )
    Hybrid Handcrafted and Learnable Audio Representation for Analysis of Speech Under Cognitive and Physical Load. (arXiv:2203.16637v2 [cs.SD] UPDATED)
    As a neurophysiological response to threat or adverse conditions, stress can affect cognition, emotion and behaviour with potentially detrimental effects on health in the case of sustained exposure. Since the affective content of speech is inherently modulated by an individual's physical and mental state, a substantial body of research has been devoted to the study of paralinguistic correlates of stress-inducing task load. Historically, voice stress analysis (VSA) has been conducted using conventional digital signal processing (DSP) techniques. Despite the development of modern methods based on deep neural networks (DNNs), accurately detecting stress in speech remains difficult due to the wide variety of stressors and considerable variability in the individual stress perception. To that end, we introduce a set of five datasets for task load detection in speech. The voice recordings were collected as either cognitive or physical stress was induced in the cohort of volunteers, with a cumulative number of more than a hundred speakers. We used the datasets to design and evaluate a novel self-supervised audio representation that leverages the effectiveness of handcrafted features (DSP-based) and the complexity of data-driven DNN representations. Notably, the proposed approach outperformed both extensive handcrafted feature sets and novel DNN-based audio representation learning approaches.  ( 3 min )
    Learning Task-relevant Representations for Generalization via Characteristic Functions of Reward Sequence Distributions. (arXiv:2205.10218v3 [cs.LG] UPDATED)
    Generalization across different environments with the same tasks is critical for successful applications of visual reinforcement learning (RL) in real scenarios. However, visual distractions -- which are common in real scenes -- from high-dimensional observations can be harmful to the learned representations in visual RL, thus degrading the performance of generalization. To tackle this problem, we propose a novel approach, namely Characteristic Reward Sequence Prediction (CRESP), to extract the task-relevant information by learning reward sequence distributions (RSDs), as the reward signals are task-relevant in RL and invariant to visual distractions. Specifically, to effectively capture the task-relevant information via RSDs, CRESP introduces an auxiliary task -- that is, predicting the characteristic functions of RSDs -- to learn task-relevant representations, because high-dimensional distributions can be approximated well by leveraging the corresponding characteristic functions. Experiments demonstrate that CRESP significantly improves the performance of generalization on unseen environments, outperforming several state-of-the-art methods on DeepMind Control tasks with different visual distractions.  ( 3 min )
    Learning Pneumatic Non-Prehensile Manipulation with a Mobile Blower. (arXiv:2204.02390v2 [cs.RO] UPDATED)
    We investigate pneumatic non-prehensile manipulation (i.e., blowing) as a means of efficiently moving scattered objects into a target receptacle. Due to the chaotic nature of aerodynamic forces, a blowing controller must (i) continually adapt to unexpected changes from its actions, (ii) maintain fine-grained control, since the slightest misstep can result in large unintended consequences (e.g., scatter objects already in a pile), and (iii) infer long-range plans (e.g., move the robot to strategic blowing locations). We tackle these challenges in the context of deep reinforcement learning, introducing a multi-frequency version of the spatial action maps framework. This allows for efficient learning of vision-based policies that effectively combine high-level planning and low-level closed-loop control for dynamic mobile manipulation. Experiments show that our system learns efficient behaviors for the task, demonstrating in particular that blowing achieves better downstream performance than pushing, and that our policies improve performance over baselines. Moreover, we show that our system naturally encourages emergent specialization between the different subpolicies spanning low-level fine-grained control and high-level planning. On a real mobile robot equipped with a miniature air blower, we show that our simulation-trained policies transfer well to a real environment and can generalize to novel objects.  ( 3 min )
    QUIDAM: A Framework for Quantization-Aware DNN Accelerator and Model Co-Exploration. (arXiv:2206.15463v1 [cs.AR])
    As the machine learning and systems communities strive to achieve higher energy-efficiency through custom deep neural network (DNN) accelerators, varied precision or quantization levels, and model compression techniques, there is a need for design space exploration frameworks that incorporate quantization-aware processing elements into the accelerator design space while having accurate and fast power, performance, and area models. In this work, we present QUIDAM, a highly parameterized quantization-aware DNN accelerator and model co-exploration framework. Our framework can facilitate future research on design space exploration of DNN accelerators for various design choices such as bit precision, processing element type, scratchpad sizes of processing elements, global buffer size, number of total processing elements, and DNN configurations. Our results show that different bit precisions and processing element types lead to significant differences in terms of performance per area and energy. Specifically, our framework identifies a wide range of design points where performance per area and energy varies more than 5x and 35x, respectively. With the proposed framework, we show that lightweight processing elements achieve on par accuracy results and up to 5.7x more performance per area and energy improvement when compared to the best INT16 based implementation. Finally, due to the efficiency of the pre-characterized power, performance, and area models, QUIDAM can speed up the design exploration process by 3-4 orders of magnitude as it removes the need for expensive synthesis and characterization of each design.  ( 3 min )
    Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding. (arXiv:2206.15427v1 [eess.AS])
    This paper studies a transferable phoneme embedding framework that aims to deal with the cross-lingual text-to-speech (TTS) problem under the few-shot setting. Transfer learning is a common approach when it comes to few-shot learning since training from scratch on few-shot training data is bound to overfit. Still, we find that the naive transfer learning approach fails to adapt to unseen languages under extremely few-shot settings, where less than 8 minutes of data is provided. We deal with the problem by proposing a framework that consists of a phoneme-based TTS model and a codebook module to project phonemes from different languages into a learned latent space. Furthermore, by utilizing phoneme-level averaged self-supervised learned features, we effectively improve the quality of synthesized speeches. Experiments show that using 4 utterances, which is about 30 seconds of data, is enough to synthesize intelligible speech when adapting to an unseen language using our framework.
    Adaptive Cut Selection in Mixed-Integer Linear Programming. (arXiv:2202.10962v2 [math.OC] UPDATED)
    Cut selection is a subroutine used in all modern mixed-integer linear programming solvers with the goal of selecting a subset of generated cuts that induce optimal solver performance. These solvers have millions of parameter combinations, and so are excellent candidates for parameter tuning. Cut selection scoring rules are usually weighted sums of different measurements, where the weights are parameters. We present a parametric family of mixed-integer linear programs together with infinitely many family-wide valid cuts. Some of these cuts can induce integer optimal solutions directly after being applied, while others fail to do so even if infinitely many are applied. We show, for a specific cut selection rule, that any finite grid search of the parameter space will always miss all parameter values that select integer-optimal-inducing cuts in infinitely many of our problems. We propose a variation on the design of existing graph convolutional neural networks, adapting them to learn cut selection rule parameters. We present a reinforcement learning framework for selecting cuts, and train our design using said framework over MIPLIB 2017. Our framework and design show that adaptive cut selection does substantially improve performance over a diverse set of instances, but that finding a single function describing such a rule is difficult. Code for reproducing all experiments is available at https://github.com/Opt-Mucca/Adaptive-Cutsel-MILP.  ( 3 min )
    Model-Value Inconsistency as a Signal for Epistemic Uncertainty. (arXiv:2112.04153v3 [cs.LG] UPDATED)
    Using a model of the environment and a value function, an agent can construct many estimates of a state's value, by unrolling the model for different lengths and bootstrapping with its value function. Our key insight is that one can treat this set of value estimates as a type of ensemble, which we call an \emph{implicit value ensemble} (IVE). Consequently, the discrepancy between these estimates can be used as a proxy for the agent's epistemic uncertainty; we term this signal \emph{model-value inconsistency} or \emph{self-inconsistency} for short. Unlike prior work which estimates uncertainty by training an ensemble of many models and/or value functions, this approach requires only the single model and value function which are already being learned in most model-based reinforcement learning algorithms. We provide empirical evidence in both tabular and function approximation settings from pixels that self-inconsistency is useful (i) as a signal for exploration, (ii) for acting safely under distribution shifts, and (iii) for robustifying value-based planning with a learned model.  ( 3 min )
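    The k-step estimates described above can be sketched in a few lines. This is a minimal illustration, assuming a deterministic model and a scalar state; the names `implicit_value_ensemble`, `model`, `reward`, and `value` are hypothetical and not taken from the paper, and the standard deviation is just one plausible way to measure the discrepancy.

```python
import numpy as np

def implicit_value_ensemble(state, model, reward, value, gamma, max_k):
    """Return the k-step value estimates for k = 0..max_k.

    Estimate k unrolls the (assumed deterministic) model for k steps,
    sums the discounted rewards, and bootstraps with value().
    """
    estimates = []
    s, ret, disc = state, 0.0, 1.0
    for _ in range(max_k + 1):
        estimates.append(ret + disc * value(s))  # bootstrap after k steps
        ret += disc * reward(s)                  # accumulate discounted reward
        s = model(s)                             # advance the learned model
        disc *= gamma
    return np.array(estimates)

# Toy setting (an assumption for illustration): every state yields reward 1.
gamma = 0.9
model = lambda s: s + 1
reward = lambda s: 1.0

# A value function consistent with the model gives identical estimates.
consistent = implicit_value_ensemble(0, model, reward, lambda s: 1.0 / (1.0 - gamma), gamma, 5)
# An inconsistent value function makes the estimates disagree.
inconsistent = implicit_value_ensemble(0, model, reward, lambda s: 0.0, gamma, 5)

# Spread across the ensemble, used here as the self-inconsistency signal.
print(consistent.std(), inconsistent.std())
```

    In this toy case the spread vanishes exactly when the value function agrees with the model, which is the intuition behind using it as an uncertainty proxy.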
    TINC: Temporally Informed Non-Contrastive Learning for Disease Progression Modeling in Retinal OCT Volumes. (arXiv:2206.15282v1 [cs.CV])
    Recent contrastive learning methods achieved state-of-the-art in low label regimes. However, the training requires large batch sizes and heavy augmentations to create multiple views of an image. With non-contrastive methods, the negatives are implicitly incorporated in the loss, allowing different images and modalities as pairs. Although the meta-information (i.e., age, sex) in medical imaging is abundant, the annotations are noisy and prone to class imbalance. In this work, we exploited already existing temporal information (different visits from a patient) in a longitudinal optical coherence tomography (OCT) dataset using temporally informed non-contrastive loss (TINC) without increasing complexity and need for negative pairs. Moreover, our novel pair-forming scheme can avoid heavy augmentations and implicitly incorporates the temporal information in the pairs. Finally, these representations learned from the pretraining are more successful in predicting disease progression where the temporal information is crucial for the downstream task. More specifically, our model outperforms existing models in predicting the risk of conversion within a time frame from intermediate age-related macular degeneration (AMD) to the late wet-AMD stage.  ( 2 min )
    Federated Over-Air Subspace Tracking from Incomplete and Corrupted Data. (arXiv:2002.12873v4 [cs.LG] UPDATED)
    In this work we study the problem of Subspace Tracking with missing data (ST-miss) and outliers (Robust ST-miss). We propose a novel algorithm, and provide a guarantee for both these problems. Unlike past work on this topic, the current work does not impose the piecewise constant subspace change assumption. Additionally, the proposed algorithm is much simpler (uses fewer parameters) than our previous work. Secondly, we extend our approach and its analysis to provably solving these problems when the data is federated and when the over-air data communication modality is used for information exchange between the $K$ peer nodes and the center. We validate our theoretical claims with extensive numerical experiments.  ( 2 min )
    Data-Efficient Learning via Minimizing Hyperspherical Energy. (arXiv:2206.15204v1 [cs.LG])
    Deep learning on large-scale data is dominant nowadays. The unprecedented scale of data has arguably been one of the most important driving forces for the success of deep learning. However, there still exist scenarios where collecting data or labels could be extremely expensive, e.g., medical imaging and robotics. To fill this gap, this paper considers the problem of data-efficient learning from scratch using a small amount of representative data. First, we characterize this problem by active learning on homeomorphic tubes of spherical manifolds. This naturally generates a feasible hypothesis class. With homologous topological properties, we identify an important connection -- finding tube manifolds is equivalent to minimizing hyperspherical energy (MHE) in physical geometry. Inspired by this connection, we propose an MHE-based active learning (MHEAL) algorithm, and provide comprehensive theoretical guarantees for MHEAL, covering convergence and generalization analysis. Finally, we demonstrate the empirical performance of MHEAL in a wide range of applications on data-efficient learning, including deep clustering, distribution matching, version space sampling and deep active learning.  ( 2 min )
    The Topological BERT: Transforming Attention into Topology for Natural Language Processing. (arXiv:2206.15195v1 [cs.CL])
    In recent years, the introduction of the Transformer models sparked a revolution in natural language processing (NLP). BERT was one of the first text encoders using only the attention mechanism without any recurrent parts to achieve state-of-the-art results on many NLP tasks. This paper introduces a text classifier using topological data analysis. We use BERT's attention maps transformed into attention graphs as the only input to that classifier. The model can solve tasks such as distinguishing spam from ham messages, recognizing whether a sentence is grammatically correct, or evaluating a movie review as negative or positive. It performs comparably to the BERT baseline and outperforms it on some tasks. Additionally, we propose a new method to reduce the number of BERT's attention heads considered by the topological classifier, which allows us to prune the number of heads from 144 down to as few as ten with no reduction in performance. Our work also shows that the topological model displays higher robustness against adversarial attacks than the original BERT model, which is maintained during the pruning process. To the best of our knowledge, this work is the first to confront topological-based models with adversarial attacks in the context of NLP.  ( 2 min )
    Knowledge-Grounded Self-Rationalization via Extractive and Natural Language Explanations. (arXiv:2106.13876v3 [cs.CL] UPDATED)
    Models that generate extractive rationales (i.e., subsets of features) or natural language explanations (NLEs) for their predictions are important for explainable AI. While an extractive rationale provides a quick view of the features most responsible for a prediction, an NLE allows for a comprehensive description of the decision-making process behind a prediction. However, current models that generate the best extractive rationales or NLEs often fall behind the state-of-the-art (SOTA) in terms of task performance. In this work, we bridge this gap by introducing RExC, a self-rationalizing framework that grounds its predictions and two complementary types of explanations (NLEs and extractive rationales) in background knowledge. Our framework improves over previous methods by: (i) reaching SOTA task performance while also providing explanations, (ii) providing two types of explanations, while existing models usually provide only one type, and (iii) beating by a large margin the previous SOTA in terms of quality of both types of explanations. Furthermore, a perturbation analysis in RExC shows a high degree of association between explanations and predictions, a necessary property of faithful explanations.  ( 3 min )
    PhySRNet: Physics informed super-resolution network for application in computational solid mechanics. (arXiv:2206.15457v1 [cond-mat.mtrl-sci])
    Traditional approaches based on finite element analyses have been successfully used to predict the macro-scale behavior of heterogeneous materials (composites, multicomponent alloys, and polycrystals) widely used in industrial applications. However, this necessitates the mesh size to be smaller than the characteristic length scale of the microstructural heterogeneities in the material, leading to computationally expensive and time-consuming calculations. The recent advances in deep learning based image super-resolution (SR) algorithms open up a promising avenue to tackle this computational challenge by enabling researchers to enhance the spatio-temporal resolution of data obtained from coarse mesh simulations. However, technical challenges still remain in developing a high-fidelity SR model for application to computational solid mechanics, especially for materials undergoing large deformation. This work aims at developing a physics-informed deep learning based super-resolution framework (PhySRNet) which enables reconstruction of high-resolution deformation fields (displacement and stress) from their low-resolution counterparts without requiring high-resolution labeled data. We design a synthetic case study to illustrate the effectiveness of the proposed framework and demonstrate that the super-resolved fields match the accuracy of an advanced numerical solver running at 400 times the coarse mesh resolution while simultaneously satisfying the (highly nonlinear) governing laws. The approach opens the door to applying machine learning and traditional numerical approaches in tandem to reduce computational complexity and accelerate scientific discovery and engineering design.
    Rate-Distortion Theoretic Generalization Bounds for Stochastic Learning Algorithms. (arXiv:2203.02474v2 [stat.ML] UPDATED)
    Understanding generalization in modern machine learning settings has been one of the major challenges in statistical learning theory. In this context, recent years have witnessed the development of various generalization bounds suggesting different complexity notions such as the mutual information between the data sample and the algorithm output, compressibility of the hypothesis space, and the fractal dimension of the hypothesis space. While these bounds have illuminated the problem at hand from different angles, their suggested complexity notions might appear seemingly unrelated, thereby restricting their high-level impact. In this study, we prove novel generalization bounds through the lens of rate-distortion theory, and explicitly relate the concepts of mutual information, compressibility, and fractal dimensions in a single mathematical framework. Our approach consists of (i) defining a generalized notion of compressibility by using source coding concepts, and (ii) showing that the `compression error rate' can be linked to the generalization error both in expectation and with high probability. We show that in the `lossless compression' setting, we recover and improve existing mutual information-based bounds, whereas a `lossy compression' scheme allows us to link generalization to the rate-distortion dimension -- a particular notion of fractal dimension. Our results bring a more unified perspective on generalization and open up several future research directions.  ( 3 min )
    Auto Response Generation in Online Medical Chat Services. (arXiv:2104.12755v2 [cs.CL] UPDATED)
    Telehealth helps to facilitate access to medical professionals by enabling remote medical services for the patients. These services have become gradually popular over the years with the advent of necessary technological infrastructure. The benefits of telehealth have been even more apparent since the beginning of the COVID-19 crisis, as people have become less inclined to visit doctors in person during the pandemic. In this paper, we focus on facilitating the chat sessions between a doctor and a patient. We note that the quality and efficiency of the chat experience can be critical as the demand for telehealth services increases. Accordingly, we develop a smart auto-response generation mechanism for medical conversations that helps doctors respond to consultation requests efficiently, particularly during busy sessions. We explore over 900,000 anonymous, historical online messages between doctors and patients collected over nine months. We implement clustering algorithms to identify the most frequent responses by doctors and manually label the data accordingly. We then train machine learning algorithms using this preprocessed data to generate the responses. The considered algorithm has two steps: a filtering (i.e., triggering) model to filter out infeasible patient messages and a response generator to suggest the top-3 doctor responses for the ones that successfully pass the triggering phase. The method provides an accuracy of 83.28\% for precision@3 and shows robustness to its parameters.  ( 3 min )
    DAReN: A Collaborative Approach Towards Reasoning And Disentangling. (arXiv:2109.13156v2 [cs.LG] UPDATED)
    Computational learning approaches to solving visual reasoning tests, such as Raven's Progressive Matrices (RPM), critically depend on the ability to identify the visual concepts used in the test (i.e., the representation) as well as the latent rules based on those concepts (i.e., the reasoning). However, learning of representation and reasoning is a challenging and ill-posed task, often approached in a stage-wise manner (first representation, then reasoning). In this work, we propose an end-to-end joint representation-reasoning learning framework, which leverages a weak form of inductive bias to improve both tasks together. Specifically, we introduce a general generative graphical model for RPMs, GM-RPM, and apply it to solve the reasoning test. We accomplish this using a novel learning framework Disentangling based Abstract Reasoning Network (DAReN) based on the principles of GM-RPM. We perform an empirical evaluation of DAReN over several benchmark datasets. DAReN shows consistent improvement over state-of-the-art (SOTA) models on both the reasoning and the disentanglement tasks. This demonstrates the strong correlation between disentangled latent representation and the ability to solve abstract visual reasoning tasks.  ( 2 min )
    Wasserstein GANs with Gradient Penalty Compute Congested Transport. (arXiv:2109.00528v2 [cs.LG] UPDATED)
    Wasserstein GANs with Gradient Penalty (WGAN-GP) are a very popular method for training generative models to produce high quality synthetic data. While WGAN-GP were initially developed to calculate the Wasserstein 1 distance between generated and real data, recent works (e.g. [23]) have provided empirical evidence that this does not occur, and have argued that WGAN-GP perform well not in spite of this issue, but because of it. In this paper we show for the first time that WGAN-GP compute the minimum of a different optimal transport problem, the so-called congested transport [7]. Congested transport determines the cost of moving one distribution to another under a transport model that penalizes congestion. For WGAN-GP, we find that the congestion penalty has a spatially varying component determined by the sampling strategy used in [12] which acts like a local speed limit, making congestion cost less in some regions than others. This aspect of the congested transport problem is new, in that the congestion penalty turns out to be unbounded and depends on the distributions to be transported, and so we provide the necessary mathematical proofs for this setting. One facet of our discovery is a formula connecting the gradient of solutions to the optimization problem in WGAN-GP to the time averaged momentum of the optimal mass flow. This is in contrast to the gradient of Kantorovich potentials for the Wasserstein 1 distance, which is just the normalized direction of flow. Based on this and other considerations, we speculate on how our results explain the observed performance of WGAN-GP. Beyond applications to GANs, our theorems also point to the possibility of approximately solving large scale congested transport problems using neural network techniques.  ( 3 min )
    Augmenting Reinforcement Learning with Behavior Primitives for Diverse Manipulation Tasks. (arXiv:2110.03655v3 [cs.LG] UPDATED)
    Realistic manipulation tasks require a robot to interact with an environment with a prolonged sequence of motor actions. While deep reinforcement learning methods have recently emerged as a promising paradigm for automating manipulation behaviors, they usually fall short in long-horizon tasks due to the exploration burden. This work introduces Manipulation Primitive-augmented reinforcement Learning (MAPLE), a learning framework that augments standard reinforcement learning algorithms with a pre-defined library of behavior primitives. These behavior primitives are robust functional modules specialized in achieving manipulation goals, such as grasping and pushing. To use these heterogeneous primitives, we develop a hierarchical policy that involves the primitives and instantiates their executions with input parameters. We demonstrate that MAPLE outperforms baseline approaches by a significant margin on a suite of simulated manipulation tasks. We also quantify the compositional structure of the learned behaviors and highlight our method's ability to transfer policies to new task variants and to physical hardware. Videos and code are available at https://ut-austin-rpl.github.io/maple  ( 2 min )
    Universal and data-adaptive algorithms for model selection in linear contextual bandits. (arXiv:2111.04688v2 [cs.LG] UPDATED)
    Model selection in contextual bandits is an important complementary problem to regret minimization with respect to a fixed model class. We consider the simplest non-trivial instance of model-selection: distinguishing a simple multi-armed bandit problem from a linear contextual bandit problem. Even in this instance, current state-of-the-art methods explore in a suboptimal manner and require strong "feature-diversity" conditions. In this paper, we introduce new algorithms that a) explore in a data-adaptive manner, and b) provide model selection guarantees of the form $\mathcal{O}(d^{\alpha} T^{1- \alpha})$ with no feature diversity conditions whatsoever, where $d$ denotes the dimension of the linear model and $T$ denotes the total number of rounds. The first algorithm enjoys a "best-of-both-worlds" property, recovering two prior results that hold under distinct distributional assumptions, simultaneously. The second removes distributional assumptions altogether, expanding the scope for tractable model selection. Our approach extends to model selection among nested linear contextual bandits under some additional assumptions.  ( 2 min )
    GDA-AM: On the effectiveness of solving minimax optimization via Anderson Acceleration. (arXiv:2110.02457v3 [cs.LG] UPDATED)
    Many modern machine learning algorithms such as generative adversarial networks (GANs) and adversarial training can be formulated as minimax optimization. Gradient descent ascent (GDA) is the most commonly used algorithm due to its simplicity. However, GDA can converge to non-optimal minimax points. We propose a new minimax optimization framework, GDA-AM, that views the GDA dynamics as a fixed-point iteration and solves it using Anderson Mixing to converge to the local minimax. It addresses the diverging issue of simultaneous GDA and accelerates the convergence of alternating GDA. We show theoretically that the algorithm can achieve global convergence for bilinear problems under mild conditions. We also empirically show that GDA-AM solves a variety of minimax problems and improves GAN training on several datasets.  ( 2 min )
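    The divergence issue mentioned above can be seen on the simplest bilinear problem, min over x and max over y of f(x, y) = x * y. The sketch below only illustrates plain simultaneous and alternating GDA viewed as fixed-point maps; it does not implement Anderson Mixing or GDA-AM, and the step size, starting point, and function names are assumptions chosen for illustration.

```python
import numpy as np

def simultaneous_gda_step(z, eta):
    """One simultaneous GDA step on f(x, y) = x * y (both players use old iterates)."""
    x, y = z
    return np.array([x - eta * y, y + eta * x])

def alternating_gda_step(z, eta):
    """One alternating GDA step: y ascends using the already-updated x."""
    x, y = z
    x_new = x - eta * y
    y_new = y + eta * x_new
    return np.array([x_new, y_new])

def run(step, z0, eta, n_steps):
    """Iterate the fixed-point map z <- step(z) and return the final iterate."""
    z = np.array(z0, dtype=float)
    for _ in range(n_steps):
        z = step(z, eta)
    return z

z0, eta, n_steps = (1.0, 1.0), 0.1, 1000
sim_norm = np.linalg.norm(run(simultaneous_gda_step, z0, eta, n_steps))
alt_norm = np.linalg.norm(run(alternating_gda_step, z0, eta, n_steps))
init_norm = np.linalg.norm(z0)

# Simultaneous GDA spirals away from the saddle point (0, 0);
# alternating GDA stays on a bounded orbit but does not converge either.
print(init_norm, sim_norm, alt_norm)
```

    Each simultaneous step multiplies the distance to the saddle point by sqrt(1 + eta^2), so no positive step size makes it converge on this problem; this is the kind of behavior an accelerated fixed-point solver is meant to correct.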
    A deep convolutional neural network that is invariant to time rescaling. (arXiv:2107.04616v3 [cs.LG] UPDATED)
    Human learners can readily understand speech, or a melody, when it is presented slower or faster than usual. Although deep convolutional neural networks (CNNs) are extremely powerful in extracting information from time series, they require explicit training to generalize to different time scales. This paper presents a deep CNN that incorporates a temporal representation inspired by recent findings from neuroscience. In the mammalian brain, time is represented by populations of neurons with temporal receptive fields. Critically, the peaks of the receptive fields form a geometric series, such that the population codes a set of temporal basis functions over log time. Because memory for the recent past is a function of log time, rescaling the input results in translation of the memory. The Scale-Invariant Temporal History Convolution network (SITHCon) builds a convolutional layer over this logarithmically-distributed temporal memory. A max-pool operation results in a network that is invariant to rescalings of time modulo edge effects. We compare performance of SITHCon to a Temporal Convolution Network (TCN). Although both networks can learn classification and regression problems on both univariate and multivariate time series f(t), only SITHCon generalizes to rescalings f(at). This property, inspired by findings from contemporary neuroscience and consistent with findings from cognitive psychology, may enable networks that learn with fewer training examples, fewer weights and that generalize more robustly to out of sample data.  ( 3 min )
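    The claim that rescaling the input translates a log-time memory can be checked numerically. The sketch below is not SITHCon itself: it assumes a hypothetical bank of units whose response depends only on the ratio t / tau, with time constants tau forming a geometric series, and the kernel `x * exp(-x)` is an arbitrary illustrative choice.

```python
import numpy as np

def log_time_memory(t, tau0=1.0, c=2.0, n=12):
    """Response of n units to an event that occurred t time units ago.

    Time constants form a geometric series tau_i = tau0 * c**i, and each
    unit's response depends only on the ratio t / tau_i (assumed kernel).
    """
    taus = tau0 * c ** np.arange(n)
    x = t / taus
    return x * np.exp(-x)

t, c, shift = 3.0, 2.0, 3
original = log_time_memory(t, c=c)
# Rescale time by a = c**shift, i.e. present the input 8 times slower.
rescaled = log_time_memory(t * c ** shift, c=c)

# Rescaling time translates the pattern by `shift` units along the bank.
print(np.allclose(rescaled[shift:], original[:-shift]))

# A max over the bank is unchanged as long as the peak stays away from the edges.
print(np.isclose(original.max(), rescaled.max()))
```

    Because a convolution over the unit index commutes with this translation, a subsequent max-pool removes the dependence on the time scale, up to the edge effects noted in the abstract.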
    A Latent Restoring Force Approach to Nonlinear System Identification. (arXiv:2109.10681v2 [stat.ML] UPDATED)
    Identification of nonlinear dynamic systems remains a significant challenge across engineering. This work suggests an approach based on Bayesian filtering to extract and identify the contribution of an unknown nonlinear term in the system which can be seen as an alternative viewpoint on restoring force surface type approaches. To achieve this identification, the contribution which is the nonlinear restoring force is modelled, initially, as a Gaussian process in time. That Gaussian process is converted into a state-space model and combined with the linear dynamic component of the system. Then, by inference of the filtering and smoothing distributions, the internal states of the system and the nonlinear restoring force can be extracted. In possession of these states a nonlinear model can be constructed. The approach is demonstrated to be effective in both a simulated case study and on an experimental benchmark dataset.  ( 2 min )
    FL-Tuning: Layer Tuning for Feed-Forward Network in Transformer. (arXiv:2206.15312v1 [cs.CL])
    Prompt tuning is an emerging way of adapting pre-trained language models to downstream tasks. However, existing studies mainly add prompts to the input sequence. This approach may not work as expected because of the intermediate multi-head self-attention and feed-forward network computation, which makes model optimization less smooth. Hence, we propose a novel tuning method called layer tuning, which adds learnable parameters in Transformer layers. Specifically, we focus on layer tuning for the feed-forward network in the Transformer, namely FL-tuning. It introduces additional units into the hidden layer of each feed-forward network. We conduct extensive experiments on the public CLUE benchmark. The results show that: 1) Our FL-tuning outperforms prompt tuning methods under both full-data and few-shot settings in almost all cases. In particular, it improves accuracy by 17.93% (full-data setting) on WSC 1.0 and F1 by 16.142% (few-shot setting) on CLUENER over P-tuning v2. 2) Our FL-tuning is more stable and converges about 1.17 times faster than P-tuning v2. 3) With only about 3% of the Transformer's parameters to be trained, FL-tuning is comparable with fine-tuning on most datasets, and significantly outperforms fine-tuning (e.g., accuracy improved by 12.9% on WSC 1.1) on several datasets. The source codes are available at https://github.com/genggui001/FL-Tuning.  ( 2 min )
    On the Learning and Learnablity of Quasimetrics. (arXiv:2206.15478v1 [cs.LG])
    Our world is full of asymmetries. Gravity and wind can make reaching a place easier than coming back. Social artifacts such as genealogy charts and citation graphs are inherently directed. In reinforcement learning and control, optimal goal-reaching strategies are rarely reversible (symmetrical). Distance functions supported on these asymmetrical structures are called quasimetrics. Despite their common appearance, little research has been done on the learning of quasimetrics. Our theoretical analysis reveals that a common class of learning algorithms, including unconstrained multilayer perceptrons (MLPs), provably fails to learn a quasimetric consistent with training data. In contrast, our proposed Poisson Quasimetric Embedding (PQE) is the first quasimetric learning formulation that both is learnable with gradient-based optimization and enjoys strong performance guarantees. Experiments on random graphs, social graphs, and offline Q-learning demonstrate its effectiveness over many common baselines.  ( 2 min )
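    As a concrete instance of the definition above, shortest-path distances on a directed graph satisfy the triangle inequality but need not be symmetric. The sketch below only illustrates what a quasimetric is; it is not the paper's PQE method, and the three-node cycle is a made-up example.

```python
import numpy as np

def shortest_paths(weights):
    """All-pairs shortest paths (Floyd-Warshall) on a directed graph.

    weights[i, j] is the edge cost from i to j, np.inf if there is no edge,
    and 0 on the diagonal.
    """
    d = weights.astype(float)
    for k in range(len(d)):
        # Allow paths that pass through intermediate node k.
        d = np.minimum(d, d[:, k:k + 1] + d[k:k + 1, :])
    return d

inf = np.inf
# Directed cycle 0 -> 1 -> 2 -> 0 with unit edge costs.
w = np.array([[0.0, 1.0, inf],
              [inf, 0.0, 1.0],
              [1.0, inf, 0.0]])
d = shortest_paths(w)

# Asymmetric: going 0 -> 1 costs 1, coming back 1 -> 0 costs 2.
print(d[0, 1], d[1, 0])

# Triangle inequality d(i, j) <= d(i, k) + d(k, j) for all i, k, j.
triangle_ok = bool((d[:, None, :] <= d[:, :, None] + d[None, :, :] + 1e-12).all())
print(triangle_ok)
```

    A learned distance function would need to reproduce both properties at once: zero self-distance and the triangle inequality, without being forced to be symmetric.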
    Implicit Neural Spatial Filtering for Multichannel Source Separation in the Waveform Domain. (arXiv:2206.15423v1 [cs.SD])
    We present a single-stage causal waveform-to-waveform multichannel model that can separate moving sound sources based on their broad spatial locations in a dynamic acoustic scene. We divide the scene into two spatial regions containing, respectively, the target and the interfering sound sources. The model is trained end-to-end and performs spatial processing implicitly, without any components based on traditional processing or use of hand-crafted spatial features. We evaluate the proposed model on a real-world dataset and show that the model matches the performance of an oracle beamformer followed by a state-of-the-art single-channel enhancement network.
    SimPLE: Similar Pseudo Label Exploitation for Semi-Supervised Classification. (arXiv:2103.16725v2 [cs.CV] UPDATED)
    A common situation in classification tasks is that a large amount of data is available for training, but only a small portion is annotated with class labels. The goal of semi-supervised training, in this context, is to improve classification accuracy by leveraging information not only from labeled data but also from a large amount of unlabeled data. Recent works have developed significant improvements by exploring the consistency constraint between differently augmented labeled and unlabeled data. Following this path, we propose a novel unsupervised objective that focuses on the less studied relationship between the high confidence unlabeled data that are similar to each other. The newly proposed Pair Loss minimizes the statistical distance between high confidence pseudo labels with similarity above a certain threshold. Combining the Pair Loss with the techniques developed by the MixMatch family, our proposed SimPLE algorithm shows significant performance gains over previous algorithms on CIFAR-100 and Mini-ImageNet, and is on par with the state-of-the-art methods on CIFAR-10 and SVHN. Furthermore, SimPLE also outperforms the state-of-the-art methods in the transfer learning setting, where models are initialized by the weights pre-trained on ImageNet or DomainNet-Real. The code is available at github.com/zijian-hu/SimPLE.  ( 3 min )
    Neural Annotation Refinement: Development of a New 3D Dataset for Adrenal Gland Analysis. (arXiv:2206.15328v1 [cs.CV])
    Human annotations are imperfect, especially when produced by junior practitioners. Multi-expert consensus is usually regarded as the gold standard, but this annotation protocol is too expensive to implement in many real-world projects. In this study, we propose a method to refine human annotation, named Neural Annotation Refinement (NeAR). It is based on a learnable implicit function, which decodes a latent vector into a represented shape. By integrating the appearance as an input of implicit functions, the appearance-aware NeAR fixes the annotation artefacts. Our method is demonstrated on the application of adrenal gland analysis. We first show that the NeAR can repair distorted gold standards on a public adrenal gland segmentation dataset. Besides, we develop a new Adrenal gLand ANalysis (ALAN) dataset with the proposed NeAR, where each case consists of a 3D shape of adrenal gland and its diagnosis label (normal vs. abnormal) assigned by experts. We show that models trained on the shapes repaired by the NeAR can diagnose adrenal glands better than the original ones. The ALAN dataset will be open-source, with 1,594 shapes for adrenal gland diagnosis, which serves as a new benchmark for medical shape analysis. Code and dataset are available at https://github.com/M3DV/NeAR.  ( 3 min )
    Why we do need Explainable AI for Healthcare. (arXiv:2206.15363v1 [cs.HC])
    The recent spike in certified Artificial Intelligence (AI) tools for healthcare has renewed the debate around adoption of this technology. One thread of this debate concerns Explainable AI and its promise to render AI devices more transparent and trustworthy. A few voices active in the medical AI space have expressed concerns about the reliability of Explainable AI techniques, questioning their use and inclusion in guidelines and standards. Revisiting such criticisms, this article offers a balanced and comprehensive perspective on the utility of Explainable AI, focusing on the specificity of clinical applications of AI and placing them in the context of healthcare interventions. Against its detractors and despite valid concerns, we argue that the Explainable AI research program is still central to human-machine interaction and ultimately our main tool against loss of control, a danger that cannot be prevented by rigorous clinical validation alone.
    Scalable K-FAC Training for Deep Neural Networks with Distributed Preconditioning. (arXiv:2206.15143v1 [cs.LG])
    Second-order optimization methods, notably the D-KFAC (Distributed Kronecker Factored Approximate Curvature) algorithms, have gained traction for accelerating deep neural network (DNN) training on GPU clusters. However, existing D-KFAC algorithms need to compute and communicate a large volume of second-order information, i.e., Kronecker factors (KFs), before preconditioning gradients, resulting in large computation and communication overheads as well as a high memory footprint. In this paper, we propose DP-KFAC, a novel distributed preconditioning scheme that distributes the KF construction tasks at different DNN layers to different workers. DP-KFAC not only retains the convergence property of the existing D-KFAC algorithms but also enables three benefits: reduced computation overhead in constructing KFs, no communication of KFs, and low memory footprint. Extensive experiments on a 64-GPU cluster show that DP-KFAC reduces the computation overhead by 1.55x-1.65x, the communication cost by 2.79x-3.15x, and the memory footprint by 1.14x-1.47x in each second-order update compared to the state-of-the-art D-KFAC methods.
    Graph-Time Convolutional Neural Networks: Architecture and Theoretical Analysis. (arXiv:2206.15174v1 [cs.LG])
    Devising and analyzing learning models for spatiotemporal network data is of importance for tasks including forecasting, anomaly detection, and multi-agent coordination, among others. Graph Convolutional Neural Networks (GCNNs) are an established approach to learn from time-invariant network data. The graph convolution operation offers a principled approach to aggregate multiresolution information. However, extending convolution-principled learning and the respective analysis to the spatiotemporal domain is challenging because spatiotemporal data have more intrinsic dependencies. Hence, greater flexibility to jointly capture the spatial and the temporal dependencies is required to learn meaningful higher-order representations. Here, we leverage product graphs to represent the spatiotemporal dependencies in the data and introduce Graph-Time Convolutional Neural Networks (GTCNNs) as a principled architecture to aid learning. The proposed approach can work with any type of product graph, and we also introduce a parametric product graph to learn the spatiotemporal coupling as well. The convolution principle further allows a similar mathematical tractability as for GCNNs. In particular, the stability result shows GTCNNs are stable to spatial perturbations but there is an implicit trade-off between discriminability and robustness; i.e., the more complex the model, the less stable. Extensive numerical results on benchmark datasets corroborate our findings and show the GTCNN compares favorably with state-of-the-art solutions. We anticipate the GTCNN to be a starting point for more sophisticated models that achieve good performance but are also fundamentally grounded.
    Laplacian Autoencoders for Learning Stochastic Representations. (arXiv:2206.15078v1 [cs.LG])
    Representation learning has become a practical family of methods for building rich parametric codifications of massive high-dimensional data while succeeding in the reconstruction side. When considering unsupervised tasks with test-train distribution shifts, the probabilistic viewpoint helps in addressing overconfidence and poor calibration of predictions. However, the direct introduction of Bayesian inference on top of neural network weights is still an arduous problem for multiple reasons, e.g., the curse of dimensionality or intractability issues. The Laplace approximation (LA) offers a solution here, as one may build Gaussian approximations of the posterior density of weights via second-order Taylor expansions in certain locations of the parameter space. In this work, we present a Bayesian autoencoder for unsupervised representation learning inspired by LA. Our method implements iterative Laplace updates to obtain a novel variational lower-bound of the autoencoder evidence. The vast computational burden of the second-order partial derivatives is avoided via approximations of the Hessian matrix. Empirically, we demonstrate the scalability and performance of the Laplacian autoencoder by providing well-calibrated uncertainties for out-of-distribution detection, geodesics for differential geometry and missing data imputations.
    HRFuser: A Multi-resolution Sensor Fusion Architecture for 2D Object Detection. (arXiv:2206.15157v1 [cs.CV])
    Besides standard cameras, autonomous vehicles typically include multiple additional sensors, such as lidars and radars, which help acquire richer information for perceiving the content of the driving scene. While several recent works focus on fusing certain pairs of sensors - such as camera and lidar or camera and radar - by using architectural components specific to the examined setting, a generic and modular sensor fusion architecture is missing from the literature. In this work, we focus on 2D object detection, a fundamental high-level task which is defined on the 2D image domain, and propose HRFuser, a multi-resolution sensor fusion architecture that scales straightforwardly to an arbitrary number of input modalities. The design of HRFuser is based on state-of-the-art high-resolution networks for image-only dense prediction and incorporates a novel multi-window cross-attention block as the means to perform fusion of multiple modalities at multiple resolutions. Even though cameras alone provide very informative features for 2D detection, we demonstrate via extensive experiments on the nuScenes and Seeing Through Fog datasets that our model effectively leverages complementary features from additional modalities, substantially improving upon camera-only performance and consistently outperforming state-of-the-art fusion methods for 2D detection both in normal and adverse conditions. The source code will be made publicly available.
    A note on large deviations for interacting particle dynamics for finding mixed equilibria in zero-sum games. (arXiv:2206.15177v1 [stat.ML])
    Finding equilibrium points in continuous minimax games has become a key problem within machine learning, in part due to its connection to the training of generative adversarial networks. Because of existence and robustness issues, recent developments have shifted from pure equilibria to focusing on mixed equilibrium points. In this note we consider a method proposed by Domingo-Enrich et al. for finding mixed equilibria in two-layer zero-sum games. The method is based on entropic regularisation and the two competing strategies are represented by two sets of interacting particles. We show that the sequence of empirical measures of the particle system satisfies a large deviation principle as the number of particles grows to infinity, and how this implies convergence of the empirical measure and the associated Nikaidô-Isoda error, complementing existing law of large numbers results.
    Practical Black Box Hamiltonian Learning. (arXiv:2206.15464v1 [quant-ph])
    We study the problem of learning the parameters for the Hamiltonian of a quantum many-body system, given limited access to the system. In this work, we build upon recent approaches to Hamiltonian learning via derivative estimation. We propose a protocol that improves the scaling dependence of prior works, particularly with respect to parameters relating to the structure of the Hamiltonian (e.g., its locality $k$). Furthermore, by deriving exact bounds on the performance of our protocol, we are able to provide a precise numerical prescription for theoretically optimal settings of hyperparameters in our learning protocol, such as the maximum evolution time (when learning with unitary dynamics) or minimum temperature (when learning with Gibbs states). Thanks to these improvements, our protocol is practical for large problems: we demonstrate this with a numerical simulation of our protocol on an 80-qubit system.
    Neural Networks can Learn Representations with Gradient Descent. (arXiv:2206.15144v1 [cs.LG])
    Significant theoretical work has established that in specific regimes, neural networks trained by gradient descent behave like kernel methods. However, in practice, it is known that neural networks strongly outperform their associated kernels. In this work, we explain this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be easily learned with gradient descent on a two layer neural network outside the kernel regime by learning representations that are relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime. Specifically, we consider the problem of learning polynomials which depend on only a few relevant directions, i.e. of the form $f^\star(x) = g(Ux)$ where $U: \R^d \to \R^r$ with $d \gg r$. When the degree of $f^\star$ is $p$, it is known that $n \asymp d^p$ samples are necessary to learn $f^\star$ in the kernel regime. Our primary result is that gradient descent learns a representation of the data which depends only on the directions relevant to $f^\star$. This results in an improved sample complexity of $n\asymp d^2 r + dr^p$. Furthermore, in a transfer learning setup where the data distributions in the source and target domain share the same representation $U$ but have different polynomial heads we show that a popular heuristic for transfer learning has a target sample complexity independent of $d$.
    LIDL: Local Intrinsic Dimension Estimation Using Approximate Likelihood. (arXiv:2206.14882v1 [stat.ML])
    Most of the existing methods for estimating the local intrinsic dimension of a data distribution do not scale well to high-dimensional data. Many of them rely on a non-parametric nearest neighbors approach which suffers from the curse of dimensionality. We attempt to address that challenge by proposing a novel approach to the problem: Local Intrinsic Dimension estimation using approximate Likelihood (LIDL). Our method relies on an arbitrary density estimation method as its subroutine and hence tries to sidestep the dimensionality challenge by making use of the recent progress in parametric neural methods for likelihood estimation. We carefully investigate the empirical properties of the proposed method, compare them with our theoretical predictions, and show that LIDL yields competitive results on the standard benchmarks for this problem and that it scales to thousands of dimensions. What is more, we anticipate this approach to improve further with the continuing advances in the density estimation literature.
    Causality-Based Multivariate Time Series Anomaly Detection. (arXiv:2206.15033v1 [cs.LG])
    Anomaly detection in multivariate time series plays an important role in monitoring the behaviors of various real-world systems, e.g., IT system operations or manufacturing industry. Previous approaches model the joint distribution without considering the underlying mechanism of multivariate time series, making them complicated and computationally hungry. In this paper, we formulate the anomaly detection problem from a causal perspective and view anomalies as instances that do not follow the regular causal mechanism to generate the multivariate data. We then propose a causality-based anomaly detection approach, which first learns the causal structure from data and then infers whether an instance is an anomaly relative to the local causal mechanism to generate each variable from its direct causes, whose conditional distribution can be directly estimated from data. In light of the modularity property of causal systems, the original problem is divided into a series of separate low-dimensional anomaly detection problems so that where an anomaly happens can be directly identified. We evaluate our approach with both simulated and public datasets as well as a case study on real-world AIOps applications, showing its efficacy, robustness, and practical feasibility.
    Masked Part-Of-Speech Model: Does Modeling Long Context Help Unsupervised POS-tagging?. (arXiv:2206.14969v1 [cs.CL])
    Previous Part-Of-Speech (POS) induction models usually make certain independence assumptions (e.g., Markov, unidirectional, local dependency) that do not hold in real languages. For example, subject-verb agreement can be both long-term and bidirectional. To facilitate flexible dependency modeling, we propose a Masked Part-of-Speech Model (MPoSM), inspired by the recent success of Masked Language Models (MLM). MPoSM can model arbitrary tag dependency and perform POS induction through the objective of masked POS reconstruction. We achieve competitive results on both the English Penn WSJ dataset and the universal treebank containing 10 diverse languages. Though modeling the long-term dependency should ideally help this task, our ablation study shows mixed trends in different languages. To better understand this phenomenon, we design a novel synthetic experiment that can specifically diagnose the model's ability to learn tag agreement. Surprisingly, we find that even strong baselines fail to solve this problem consistently in a very simplified setting: the agreement between adjacent words. Nonetheless, MPoSM achieves overall better performance. Lastly, we conduct a detailed error analysis to shed light on other remaining challenges. Our code is available at https://github.com/owenzx/MPoSM  ( 2 min )
    Machine Learning Approaches to Predict Breast Cancer: Bangladesh Perspective. (arXiv:2206.14972v1 [cs.LG])
    Breast cancer has become one of the most prominent causes of death in recent years. Among all malignancies, it is the most frequent and the major cause of death for women globally. Manually diagnosing this disease requires a good amount of time and expertise. Breast cancer detection is time-consuming, and the spread of the disease can be reduced by developing machine-based breast cancer predictions. In machine learning, the system can learn from prior instances and find hard-to-detect patterns in noisy or complicated data sets using various statistical, probabilistic, and optimization approaches. This work compares several machine learning algorithms' classification accuracy, precision, sensitivity, and specificity on a newly collected dataset. Five machine learning approaches (Decision Tree, Random Forest, Logistic Regression, Naive Bayes, and XGBoost) have been implemented to get the best performance on our dataset. This study focuses on finding the best algorithm that can forecast breast cancer with maximum accuracy in terms of its classes. This work evaluated the quality of each algorithm's data classification in terms of efficiency and effectiveness. The results are also compared with other published work in this domain. After implementing the models, this study achieved its best accuracy, 94%, with Random Forest and XGBoost.  ( 3 min )
    Semantic Unfolding of StyleGAN Latent Space. (arXiv:2206.14892v1 [cs.CV])
    Generative adversarial networks (GANs) have proven to be surprisingly efficient for image editing by inverting and manipulating the latent code corresponding to an input real image. This editing property emerges from the disentangled nature of the latent space. In this paper, we identify that the facial attribute disentanglement is not optimal, thus facial editing relying on linear attribute separation is flawed. We thus propose to improve semantic disentanglement with supervision. Our method consists in learning a proxy latent representation using normalizing flows, and we show that this leads to a more efficient space for face image editing.  ( 2 min )
    Stochastic Bilevel Distributed Optimization over a Network. (arXiv:2206.15025v1 [cs.LG])
    Bilevel optimization has been applied to a wide variety of machine learning models. Numerous stochastic bilevel optimization algorithms have been developed in recent years. However, most of them restrict their focus to the single-machine setting, so they are incapable of handling distributed data. To address this issue, under the setting where all participants compose a network and perform peer-to-peer communication in this network, we developed two novel distributed stochastic bilevel optimization algorithms based on the gradient tracking communication mechanism and two different gradient estimators. Additionally, we show that they can achieve $O(\frac{1}{\epsilon^{2}(1-\lambda)^2})$ and $O(\frac{1}{\epsilon^{3/2}(1-\lambda)^2})$ convergence rates, respectively, to obtain an $\epsilon$-accurate solution, where $1-\lambda$ denotes the spectral gap of the communication network. To our knowledge, this is the first work achieving these theoretical results. Finally, we applied our algorithms to practical machine learning models, and the experimental results confirmed the efficacy of our algorithms.  ( 2 min )
    Best of Both Worlds Model Selection. (arXiv:2206.14912v1 [cs.LG])
    We study the problem of model selection in bandit scenarios in the presence of nested policy classes, with the goal of obtaining simultaneous adversarial and stochastic ("best of both worlds") high-probability regret guarantees. Our approach requires that each base learner comes with a candidate regret bound that may or may not hold, while our meta algorithm plays each base learner according to a schedule that keeps the base learners' candidate regret bounds balanced until they are detected to violate their guarantees. We develop careful mis-specification tests specifically designed to blend the above model selection criterion with the ability to leverage the (potentially benign) nature of the environment. We recover the model selection guarantees of the CORRAL algorithm for adversarial environments, but with the additional benefit of achieving high probability regret bounds, specifically in the case of nested adversarial linear bandits. More importantly, our model selection results also hold simultaneously in stochastic environments under gap assumptions. These are the first theoretical results that achieve best of both worlds (stochastic and adversarial) guarantees while performing model selection in (linear) bandit scenarios.  ( 2 min )
    Continuous-Time and Multi-Level Graph Representation Learning for Origin-Destination Demand Prediction. (arXiv:2206.15005v1 [cs.LG])
    Traffic demand forecasting by deep neural networks has attracted widespread interest in both academia and industry. Within this area, pairwise Origin-Destination (OD) demand prediction is a valuable but challenging problem due to several factors: (i) the large number of possible OD pairs, (ii) implicitness of spatial dependence, and (iii) complexity of traffic states. To address the above issues, this paper proposes a Continuous-time and Multi-level dynamic graph representation learning method for Origin-Destination demand prediction (CMOD). Firstly, a continuous-time dynamic graph representation learning framework is constructed, which maintains a dynamic state vector for each traffic node (metro stations or taxi zones). The state vectors keep historical transaction information and are continuously updated according to the most recent transactions. Secondly, a multi-level structure learning module is proposed to model the spatial dependency of station-level nodes. It can not only exploit relations between nodes adaptively from data, but also share messages and representations via cluster-level and area-level virtual nodes. Lastly, a cross-level fusion module is designed to integrate multi-level memories and generate comprehensive node representations for the final prediction. Extensive experiments are conducted on two real-world datasets from Beijing Subway and New York Taxi, and the results demonstrate the superiority of our model against the state-of-the-art approaches.  ( 3 min )
    Lookback for Learning to Branch. (arXiv:2206.14987v1 [cs.LG])
    The expressive and computationally inexpensive bipartite Graph Neural Networks (GNN) have been shown to be an important component of deep learning based Mixed-Integer Linear Program (MILP) solvers. Recent works have demonstrated the effectiveness of such GNNs in replacing the branching (variable selection) heuristic in branch-and-bound (B&B) solvers. These GNNs are trained, offline and on a collection of MILPs, to imitate a very good but computationally expensive branching heuristic, strong branching. Given that B&B results in a tree of sub-MILPs, we ask (a) whether there are strong dependencies exhibited by the target heuristic among the neighboring nodes of the B&B tree, and (b) if so, whether we can incorporate them in our training procedure. Specifically, we find that with the strong branching heuristic, a child node's best choice was often the parent's second-best choice. We call this the "lookback" phenomenon. Surprisingly, the typical branching GNN of Gasse et al. (2019) often misses this simple "answer". To imitate the target behavior more closely by incorporating the lookback phenomenon in GNNs, we propose two methods: (a) target smoothing for the standard cross-entropy loss function, and (b) adding a Parent-as-Target (PAT) Lookback regularizer term. Finally, we propose a model selection framework to incorporate harder-to-formulate objectives such as solving time in the final models. Through extensive experimentation on standard benchmark instances, we show that our proposal results in up to 22% decrease in the size of the B&B tree and up to 15% improvement in the solving times.  ( 3 min )
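    As an illustration of option (a), target smoothing for a cross-entropy loss can be sketched as below. The abstract does not give the exact smoothing rule, so this NumPy example simply assumes that a fraction `eps` of the target mass is moved from the strong-branching choice to a "lookback" candidate (for example the parent's second-best variable); the function name, arguments, and the value of `eps` are hypothetical.

```python
import numpy as np


def smoothed_cross_entropy(logits, target, lookback, eps=0.1):
    """Cross-entropy against a smoothed target (illustrative assumption).

    logits:   1-D array of scores over candidate branching variables.
    target:   index chosen by the expert heuristic.
    lookback: index of the candidate suggested by the parent node.
    eps:      target mass shifted from `target` to `lookback`.
    """
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()                      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())  # log-softmax
    soft_target = np.zeros_like(log_probs)
    soft_target[target] += 1.0 - eps
    soft_target[lookback] += eps
    return float(-(soft_target * log_probs).sum())
```

    With `eps=0` this reduces to the standard cross-entropy loss, which makes it easy to compare the two objectives on the same data.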
    A Validity Perspective on Evaluating the Justified Use of Data-driven Decision-making Algorithms. (arXiv:2206.14983v1 [cs.LG])
    This work seeks to center validity considerations in deliberations around whether and how to build data-driven algorithms in high-stakes domains. Toward this end, we translate key concepts from validity theory to predictive algorithms. We describe common challenges in problem formulation and data issues that jeopardize the validity of predictive algorithms. We distill these issues into a series of high-level questions intended to promote and document reflections on the legitimacy of the predictive task and the suitability of the data. This contribution lays the foundation for co-designing a validity protocol, in collaboration with real-world stakeholders, including decision-makers, modelers, and members of potentially impacted communities, to critically evaluate the justifiability of specific designs and uses of data-driven algorithmic systems.  ( 2 min )
    Manifold Interpolating Optimal-Transport Flows for Trajectory Inference. (arXiv:2206.14928v1 [cs.LG])
    Here, we present a method called Manifold Interpolating Optimal-Transport Flow (MIOFlow) that learns stochastic, continuous population dynamics from static snapshot samples taken at sporadic timepoints. MIOFlow combines dynamic models, manifold learning, and optimal transport by training neural ordinary differential equations (Neural ODE) to interpolate between static population snapshots as penalized by optimal transport with manifold ground distance. Further, we ensure that the flow follows the geometry by operating in the latent space of an autoencoder that we call a geodesic autoencoder (GAE). In GAE the latent space distance between points is regularized to match a novel multiscale geodesic distance on the data manifold that we define. We show that this method is superior to normalizing flows, Schrödinger bridges and other generative models that are designed to flow from noise to data in terms of interpolating between populations. Theoretically, we link these trajectories with dynamic optimal transport. We evaluate our method on simulated data with bifurcations and merges, as well as scRNA-seq data from embryoid body differentiation, and acute myeloid leukemia treatment.  ( 2 min )
    On Non-Random Missing Labels in Semi-Supervised Learning. (arXiv:2206.14923v1 [cs.CV])
    Semi-Supervised Learning (SSL) is fundamentally a missing label problem, in which the label Missing Not At Random (MNAR) problem is more realistic and challenging, compared to the widely-adopted yet naive Missing Completely At Random assumption where both labeled and unlabeled data share the same class distribution. Different from existing SSL solutions that overlook the role of "class" in causing the non-randomness, e.g., users are more likely to label popular classes, we explicitly incorporate "class" into SSL. Our method is three-fold: 1) We propose Class-Aware Propensity (CAP) that exploits the unlabeled data to train an improved classifier using the biased labeled data. 2) To encourage rare-class training, where the model is low-recall but high-precision and thus discards too many pseudo-labeled data, we propose Class-Aware Imputation (CAI) that dynamically decreases (or increases) the pseudo-label assignment threshold for rare (or frequent) classes. 3) Overall, we integrate CAP and CAI into a Class-Aware Doubly Robust (CADR) estimator for training an unbiased SSL model. Under various MNAR settings and ablations, our method not only significantly outperforms existing baselines but also surpasses other label bias removal SSL methods. Please check our code at: https://github.com/JoyHuYY1412/CADR-FixMatch.  ( 2 min )
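    The class-dependent threshold idea behind CAI can be illustrated with a small sketch. The scaling rule below (threshold proportional to a power of the class frequency relative to a uniform distribution, then clipped) is an assumption chosen for the example and is not taken from the paper; `base`, `alpha`, the clipping bounds, and the function names are hypothetical.

```python
import numpy as np


def class_aware_thresholds(class_freq, base=0.8, alpha=0.5, low=0.05, high=0.99):
    """Per-class pseudo-label thresholds: lower for rare classes, higher for frequent ones."""
    freq = np.asarray(class_freq, dtype=float)
    freq = freq / freq.sum()
    relative = freq * freq.size          # 1.0 means "as frequent as under a uniform prior"
    return np.clip(base * relative ** alpha, low, high)


def select_pseudo_labels(probs, thresholds):
    """Return the argmax label per sample and a mask of samples that pass their class threshold."""
    probs = np.asarray(probs, dtype=float)
    labels = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    keep = confidence >= thresholds[labels]
    return labels, keep
```

    For example, with estimated class frequencies `[0.7, 0.2, 0.1]`, a prediction of 0.7 for the rarest class is kept while a prediction of 0.9 for the most frequent class is rejected, which is the qualitative behaviour the abstract describes.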
    Improving Ensemble Distillation With Weight Averaging and Diversifying Perturbation. (arXiv:2206.15047v1 [cs.LG])
    Ensembles of deep neural networks have demonstrated superior performance, but their heavy computational cost hinders their application in resource-limited environments. This motivates distilling knowledge from the ensemble teacher into a smaller student network, and there are two important design choices for this ensemble distillation: 1) how to construct the student network, and 2) what data should be shown during training. In this paper, we propose a weight averaging technique where a student with multiple subnetworks is trained to absorb the functional diversity of ensemble teachers, but then those subnetworks are properly averaged for inference, giving a single student network with no additional inference cost. We also propose a perturbation strategy that seeks inputs from which the diversities of teachers can be better transferred to the student. Combining these two, our method significantly improves upon previous methods on various image classification tasks.  ( 2 min )
    Semi-Supervised Generative Adversarial Network for Stress Detection Using Partially Labeled Physiological Data. (arXiv:2206.14976v1 [cs.LG])
    Physiological measurement involves observing variables that contribute, directly or indirectly, to the normative functioning of human systems and subsystems. The measurements can be used to detect affective states of a person with aims such as improving human-computer interactions. There are several methods of collecting physiological data, but wearable sensors are a common, non-invasive tool for accurate readings. However, valuable information is hard to extract from the raw physiological data, especially for affective state detection. Machine learning techniques are used to detect the affective state of a person through labeled physiological data. A clear problem with using labeled data is creating accurate labels. An expert is needed to analyze a form of recording of participants and mark sections with different states such as stress and calm. While expensive, this method delivers a complete dataset with labeled data that can be used in any number of supervised algorithms. An interesting question arises from the expensive labeling: how can we reduce the cost while maintaining high accuracy? Semi-Supervised Learning (SSL) is a potential solution to this problem. These algorithms allow machine learning models to be trained with only a small subset of labeled data (unlike unsupervised algorithms, which use no labels). They provide a way of avoiding expensive labeling. This paper compares a fully supervised algorithm to an SSL algorithm on the public WESAD (Wearable Stress and Affect Detection) Dataset for stress detection. This paper shows that semi-supervised algorithms are a viable method for inexpensive affective state detection systems with accurate results.  ( 3 min )
    Discrete Langevin Sampler via Wasserstein Gradient Flow. (arXiv:2206.14897v1 [cs.LG])
    Recently, a family of locally balanced (LB) samplers has demonstrated excellent performance at sampling and learning energy-based models (EBMs) in discrete spaces. However, the theoretical understanding of this success is limited. In this work, we show how LB functions give rise to LB dynamics corresponding to Wasserstein gradient flow in a discrete space. From first principles, previous LB samplers can then be seen as discretizations of the LB dynamics with respect to Hamming distance. Based on this observation, we propose a new algorithm, the Locally Balanced Jump (LBJ), by discretizing the LB dynamics with respect to simulation time. As a result, LBJ has a location-dependent "velocity" that allows it to make proposals with larger distances. Additionally, LBJ decouples each dimension into independent sub-processes, enabling convenient parallel implementation. We demonstrate the advantages of LBJ for sampling and learning in various binary and categorical distributions.  ( 2 min )
    Decision Forest Based EMG Signal Classification with Low Volume Dataset Augmented with Random Variance Gaussian Noise. (arXiv:2206.14947v1 [q-bio.NC])
    Electromyography signals can be used as training data by machine learning models to classify various gestures. We seek to produce a model that can classify six different hand gestures with a limited number of samples that generalizes well to a wider audience while comparing the effect of our feature extraction results on model accuracy to other more conventional methods such as the use of AR parameters on a sliding window across the channels of a signal. We appeal to a set of more elementary methods such as the use of random bounds on a signal, but desire to show the power these methods can carry in an online setting where EMG classification is being conducted, as opposed to more complicated methods such as the use of the Fourier Transform. To augment our limited training data, we used a standard technique, known as jitter, where random noise is added to each observation in a channel wise manner. Once all datasets were produced using the above methods, we performed a grid search with Random Forest and XGBoost to ultimately create a high accuracy model. For human computer interface purposes, high accuracy classification of EMG signals is of particular importance to their functioning and given the difficulty and cost of amassing any sort of biomedical data in a high volume, it is valuable to have techniques that can work with a low amount of high-quality samples with less expensive feature extraction methods that can reliably be carried out in an online application.  ( 3 min )
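    The jitter augmentation mentioned above can be sketched as follows. This is a minimal NumPy example under stated assumptions: the noise standard deviation is drawn uniformly from `[0, max_sigma]` independently for every recording and channel (one reading of "random variance"), and the array layout, function name, and default values are hypothetical rather than taken from the paper.

```python
import numpy as np


def jitter(signals, copies=3, max_sigma=0.05, seed=0):
    """Augment EMG recordings with channel-wise Gaussian noise of random variance.

    signals: array of shape (n_recordings, n_channels, n_samples).
    Returns the original recordings followed by `copies` noisy versions of each.
    """
    rng = np.random.default_rng(seed)
    n, c, t = signals.shape
    augmented = [signals]
    for _ in range(copies):
        # One standard deviation per recording and channel, shared across time.
        sigma = rng.uniform(0.0, max_sigma, size=(n, c, 1))
        noise = rng.normal(0.0, 1.0, size=(n, c, t)) * sigma
        augmented.append(signals + noise)
    return np.concatenate(augmented, axis=0)
```

    Labels would need to be repeated in the same order (original block first, then each noisy copy) before the augmented set is passed to a classifier such as Random Forest or XGBoost.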
    Towards Federated Long-Tailed Learning. (arXiv:2206.14988v1 [cs.LG])
    Data privacy and class imbalance are the norm rather than the exception in many machine learning tasks. Recent attempts have been launched to, on one side, address the problem of learning from pervasive private data, and on the other side, learn from long-tailed data. However, both assumptions might hold in practical applications, while an effective method to simultaneously alleviate both issues is still under development. In this paper, we focus on learning with long-tailed (LT) data distributions under the context of the popular privacy-preserved federated learning (FL) framework. We characterize three scenarios with different local or global long-tailed data distributions in the FL framework, and highlight the corresponding challenges. The preliminary results under different scenarios reveal that substantial future work is needed to better resolve the characterized federated long-tailed learning tasks.  ( 2 min )
    Solving Quantitative Reasoning Problems with Language Models. (arXiv:2206.14858v1 [cs.CL])
    Language models have achieved remarkable performance on a wide range of tasks that require natural language understanding. Nevertheless, state-of-the-art models have generally struggled with tasks that require quantitative reasoning, such as solving mathematics, science, and engineering problems at the college level. To help close this gap, we introduce Minerva, a large language model pretrained on general natural language data and further trained on technical content. The model achieves state-of-the-art performance on technical benchmarks without the use of external tools. We also evaluate our model on over two hundred undergraduate-level problems in physics, biology, chemistry, economics, and other sciences that require quantitative reasoning, and find that the model can correctly answer nearly a third of them.  ( 2 min )
    Causality for Inherently Explainable Transformers: CAT-XPLAIN. (arXiv:2206.14841v1 [cs.CV])
    There have been several post-hoc explanation approaches developed to explain pre-trained black-box neural networks. However, there is still a gap in research efforts toward designing neural networks that are inherently explainable. In this paper, we utilize a recently proposed instance-wise post-hoc causal explanation method to make an existing transformer architecture inherently explainable. Once trained, our model provides an explanation in the form of top-$k$ regions in the input space of the given instance contributing to its decision. We evaluate our method on binary classification tasks using three image datasets: MNIST, FMNIST, and CIFAR. Our results demonstrate that compared to the causality-based post-hoc explainer model, our inherently explainable model achieves better explainability results while eliminating the need of training a separate explainer model. Our code is available at https://github.com/mvrl/CAT-XPLAIN.  ( 2 min )
    Randomized Coordinate Subgradient Method for Nonsmooth Optimization. (arXiv:2206.14981v1 [math.OC])
    Nonsmooth optimization finds wide applications in many engineering fields. In this work, we propose to utilize the Randomized Coordinate Subgradient Method (RCS) for solving both nonsmooth convex and nonsmooth nonconvex (nonsmooth weakly convex) optimization problems. At each iteration, RCS randomly selects one block coordinate rather than all the coordinates to update. Motivated by practical applications, we consider the linearly bounded subgradients assumption for the objective function, which is much more general than the Lipschitz continuity assumption. Under such a general assumption, we conduct thorough convergence analysis for RCS in both convex and nonconvex cases and establish both expected convergence rate and almost sure asymptotic convergence results. In order to derive these convergence results, we establish a convergence lemma and the relationship between the global metric subregularity properties of a weakly convex function and its Moreau envelope, which are fundamental and of independent interest. Finally, we conduct several experiments to show the possible superiority of RCS over the subgradient method.  ( 2 min )
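    The update rule in the abstract above (pick one coordinate block at random, take a subgradient step on it only) can be sketched on a toy problem. This is an illustrative sketch under stated assumptions, not the paper's algorithm as specified: the objective f(x) = sum_i |x_i - a_i|, blocks of size one, uniform sampling, and the diminishing step size c / sqrt(k + 1) are all choices made for the example.

```python
import random

def objective(x, a):
    """Toy nonsmooth convex objective: f(x) = sum_i |x_i - a_i|."""
    return sum(abs(xi - ai) for xi, ai in zip(x, a))

def rcs_minimize(a, x0, iters=2000, c=1.0, seed=0):
    """Randomized coordinate subgradient sketch with blocks of size one."""
    rng = random.Random(seed)
    x = list(x0)
    for k in range(iters):
        # Select a single coordinate uniformly at random.
        i = rng.randrange(len(x))
        # A subgradient of |x_i - a_i| with respect to x_i is sign(x_i - a_i).
        diff = x[i] - a[i]
        g_i = (diff > 0) - (diff < 0)
        # Assumed diminishing step size; only coordinate i is updated.
        x[i] -= c / (k + 1) ** 0.5 * g_i
    return x
```

    For instance, `rcs_minimize([1.0, -2.0, 3.0], [0.0, 0.0, 0.0])` drives the iterate toward the minimizer `a` while touching only one coordinate per iteration.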
    Momentum Diminishes the Effect of Spectral Bias in Physics-Informed Neural Networks. (arXiv:2206.14862v1 [cs.LG])
    Physics-informed neural network (PINN) algorithms have shown promising results in solving a wide range of problems involving partial differential equations (PDEs). However, they often fail to converge to desirable solutions when the target function contains high-frequency features, due to a phenomenon known as spectral bias. In the present work, we exploit neural tangent kernels (NTKs) to investigate the training dynamics of PINNs evolving under stochastic gradient descent with momentum (SGDM). This demonstrates SGDM significantly reduces the effect of spectral bias. We have also examined why training a model via the Adam optimizer can accelerate the convergence while reducing the spectral bias. Moreover, our numerical experiments have confirmed that wide-enough networks using SGDM still converge to desirable solutions, even in the presence of high-frequency features. In fact, we show that the width of a network plays a critical role in convergence.  ( 2 min )
    AFAFed -- Protocol analysis. (arXiv:2206.14927v1 [cs.LG])
    In this paper, we design AFAFed, analyze its convergence properties, and address its implementation aspects. AFAFed is a novel Asynchronous Fair Adaptive Federated learning framework for stream-oriented IoT application environments, which are characterized by time-varying operating conditions, heterogeneous resource-limited devices (i.e., coworkers), non-i.i.d. local training data and unreliable communication links. The key novelty of AFAFed is the synergic co-design of: (i) two sets of adaptively tuned tolerance thresholds and fairness coefficients at the coworkers and central server, respectively; and (ii) a distributed adaptive mechanism, which allows each coworker to adaptively tune its own communication rate. The convergence properties of AFAFed under (possibly) non-convex loss functions are guaranteed by a set of new analytical bounds, which formally unveil the impact on the resulting AFAFed convergence rate of a number of Federated Learning (FL) parameters, such as the first and second moments of the per-coworker number of consecutive model updates, data skewness, communication packet-loss probability, and maximum/minimum values of the (adaptively tuned) mixing coefficient used for model aggregation.  ( 2 min )
    A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback. (arXiv:2206.14906v1 [cs.LG])
    We present a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial multiarmed bandits with delayed feedback, which in addition to the minimax optimal adversarial regret guarantee shown by Zimmert and Seldin simultaneously achieves a near-optimal regret guarantee in the stochastic setting with fixed delays. Specifically, the adversarial regret guarantee is $\mathcal{O}(\sqrt{TK} + \sqrt{dT\log K})$, where $T$ is the time horizon, $K$ is the number of arms, and $d$ is the fixed delay, whereas the stochastic regret guarantee is $\mathcal{O}\left(\sum_{i \neq i^*}(\frac{1}{\Delta_i} \log(T) + \frac{d}{\Delta_{i}\log K}) + d K^{1/3}\log K\right)$, where $\Delta_i$ are the suboptimality gaps. We also present an extension of the algorithm to the case of arbitrary delays, which is based on an oracle knowledge of the maximal delay $d_{max}$ and achieves $\mathcal{O}(\sqrt{TK} + \sqrt{D\log K} + d_{max}K^{1/3} \log K)$ regret in the adversarial regime, where $D$ is the total delay, and $\mathcal{O}\left(\sum_{i \neq i^*}(\frac{1}{\Delta_i} \log(T) + \frac{\sigma_{max}}{\Delta_{i}\log K}) + d_{max}K^{1/3}\log K\right)$ regret in the stochastic regime, where $\sigma_{max}$ is the maximal number of outstanding observations. Finally, we present a lower bound that matches the regret upper bound achieved by the skipping technique of Zimmert and Seldin [2020] in the adversarial setting.  ( 2 min )
    Provably Efficient Reinforcement Learning for Online Adaptive Influence Maximization. (arXiv:2206.14846v1 [cs.LG])
    Online influence maximization aims to maximize the influence spread of a content in a social network with unknown network model by selecting a few seed nodes. Recent studies followed a non-adaptive setting, where the seed nodes are selected before the start of the diffusion process and network parameters are updated when the diffusion stops. We consider an adaptive version of content-dependent online influence maximization problem where the seed nodes are sequentially activated based on real-time feedback. In this paper, we formulate the problem as an infinite-horizon discounted MDP under a linear diffusion process and present a model-based reinforcement learning solution. Our algorithm maintains a network model estimate and selects seed users adaptively, exploring the social network while improving the optimal policy optimistically. We establish $\widetilde O(\sqrt{T})$ regret bound for our algorithm. Empirical evaluations on synthetic network demonstrate the efficiency of our algorithm.  ( 2 min )
    Fairness via In-Processing in the Over-parameterized Regime: A Cautionary Tale. (arXiv:2206.14853v1 [cs.LG])
    The success of DNNs is driven by the counter-intuitive ability of over-parameterized networks to generalize, even when they perfectly fit the training data. In practice, test error often continues to decrease with increasing over-parameterization, referred to as double descent. This allows practitioners to instantiate large models without having to worry about over-fitting. Despite its benefits, however, prior work has shown that over-parameterization can exacerbate bias against minority subgroups. Several fairness-constrained DNN training methods have been proposed to address this concern. Here, we critically examine MinDiff, a fairness-constrained training procedure implemented within TensorFlow's Responsible AI Toolkit, that aims to achieve Equality of Opportunity. We show that although MinDiff improves fairness for under-parameterized models, it is likely to be ineffective in the over-parameterized regime. This is because an overfit model with zero training loss is trivially group-wise fair on training data, creating an "illusion of fairness," thus turning off the MinDiff optimization (this applies to any disparity-based measure that depends on errors or accuracy, but not to demographic parity). Within specified fairness constraints, under-parameterized MinDiff models can even have lower error compared to their over-parameterized counterparts (despite baseline over-parameterized models having lower error). We further show that MinDiff optimization is very sensitive to choice of batch size in the under-parameterized regime. Thus, fair model training using MinDiff requires time-consuming hyper-parameter searches. Finally, we suggest using previously proposed regularization techniques, viz. L2, early stopping and flooding in conjunction with MinDiff to train fair over-parameterized models.  ( 3 min )
    Strong Lensing Source Reconstruction Using Continuous Neural Fields. (arXiv:2206.14820v1 [astro-ph.CO])
    From the nature of dark matter to the rate of expansion of our Universe, observations of distant galaxies distorted through strong gravitational lensing have the potential to answer some of the major open questions in astrophysics. Modeling galaxy-galaxy strong lensing observations presents a number of challenges as the exact configuration of both the background source and foreground lens galaxy is unknown. A timely call, prompted by a number of upcoming surveys anticipating high-resolution lensing images, demands methods that can efficiently model lenses at their full complexity. In this work, we introduce a method that uses continuous neural fields to non-parametrically reconstruct the complex morphology of a source galaxy while simultaneously inferring a distribution over foreground lens galaxy configurations. We demonstrate the efficacy of our method through experiments on simulated data targeting high-resolution lensing images similar to those anticipated in near-future astrophysical surveys.  ( 2 min )
  • Open

    On Measuring Excess Capacity in Neural Networks. (arXiv:2202.08070v2 [cs.LG] UPDATED)
    We study the excess capacity of deep networks in the context of supervised classification. That is, given a capacity measure of the underlying hypothesis class -- in our case, empirical Rademacher complexity -- by how much can we (a priori) constrain this class while retaining an empirical error on a par with the unconstrained regime? To assess excess capacity in modern architectures (such as residual networks), we extend and unify prior Rademacher complexity bounds to accommodate function composition and addition, as well as the structure of convolutions. The capacity-driving terms in our bounds are the Lipschitz constants of the layers and a (2,1) group norm distance to the initializations of the convolution weights. Experiments on benchmark datasets of varying task difficulty indicate that (1) there is a substantial amount of excess capacity per task, and (2) capacity can be kept at a surprisingly similar level across tasks. Overall, this suggests a notion of compressibility with respect to weight norms, orthogonal to classic compression via weight pruning.  ( 2 min )
    Rethinking Exponential Averaging of the Fisher. (arXiv:2204.04718v2 [cs.LG] UPDATED)
    In optimization for Machine learning (ML), it is typical that curvature-matrix (CM) estimates rely on an exponential average (EA) of local estimates (giving EA-CM algorithms). This approach has little principled justification, but is very often used in practice. In this paper, we draw a connection between EA-CM algorithms and what we call a "Wake of Quadratic regularized models". The outlined connection allows us to understand what EA-CM algorithms are doing from an optimization perspective. Generalizing from the established connection, we propose a new family of algorithms, "KL-Divergence Wake-Regularized Models" (KLD-WRM). We give three different practical instantiations of KLD-WRM, and show numerically that these outperform K-FAC on MNIST.  ( 2 min )
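    The exponential averaging that this abstract revisits can be sketched generically as follows. This is a hedged illustration of the standard EA update only, not of KLD-WRM or K-FAC: the use of gradient outer products as the local curvature estimate, the decay `rho`, the zero initialization, and the function names are assumptions for the example.

```python
def outer(g):
    """Local curvature estimate (assumed here): the outer product g g^T."""
    return [[gi * gj for gj in g] for gi in g]

def ea_update(C, g, rho=0.95):
    """One exponential-average step: C <- rho * C + (1 - rho) * g g^T."""
    G = outer(g)
    n = len(g)
    return [[rho * C[i][j] + (1 - rho) * G[i][j] for j in range(n)]
            for i in range(n)]

def ea_curvature(grads, rho=0.95):
    """Run the EA update over a sequence of gradients, starting from zero."""
    n = len(grads[0])
    C = [[0.0] * n for _ in range(n)]
    for g in grads:
        C = ea_update(C, g, rho)
    return C
```

    With a constant gradient g and zero initialization, k updates give (1 - rho^k) g g^T, which shows how the average gradually forgets its starting point.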
    A Latent Restoring Force Approach to Nonlinear System Identification. (arXiv:2109.10681v2 [stat.ML] UPDATED)
    Identification of nonlinear dynamic systems remains a significant challenge across engineering. This work suggests an approach based on Bayesian filtering to extract and identify the contribution of an unknown nonlinear term in the system which can be seen as an alternative viewpoint on restoring force surface type approaches. To achieve this identification, the contribution which is the nonlinear restoring force is modelled, initially, as a Gaussian process in time. That Gaussian process is converted into a state-space model and combined with the linear dynamic component of the system. Then, by inference of the filtering and smoothing distributions, the internal states of the system and the nonlinear restoring force can be extracted. In possession of these states a nonlinear model can be constructed. The approach is demonstrated to be effective in both a simulated case study and on an experimental benchmark dataset.  ( 2 min )
    Rate-Distortion Theoretic Generalization Bounds for Stochastic Learning Algorithms. (arXiv:2203.02474v2 [stat.ML] UPDATED)
    Understanding generalization in modern machine learning settings has been one of the major challenges in statistical learning theory. In this context, recent years have witnessed the development of various generalization bounds suggesting different complexity notions such as the mutual information between the data sample and the algorithm output, compressibility of the hypothesis space, and the fractal dimension of the hypothesis space. While these bounds have illuminated the problem at hand from different angles, their suggested complexity notions might appear seemingly unrelated, thereby restricting their high-level impact. In this study, we prove novel generalization bounds through the lens of rate-distortion theory, and explicitly relate the concepts of mutual information, compressibility, and fractal dimensions in a single mathematical framework. Our approach consists of (i) defining a generalized notion of compressibility by using source coding concepts, and (ii) showing that the `compression error rate' can be linked to the generalization error both in expectation and with high probability. We show that in the `lossless compression' setting, we recover and improve existing mutual information-based bounds, whereas a `lossy compression' scheme allows us to link generalization to the rate-distortion dimension -- a particular notion of fractal dimension. Our results bring a more unified perspective on generalization and open up several future research directions.  ( 3 min )
    LIDL: Local Intrinsic Dimension Estimation Using Approximate Likelihood. (arXiv:2206.14882v1 [stat.ML])
    Most of the existing methods for estimating the local intrinsic dimension of a data distribution do not scale well to high-dimensional data. Many of them rely on a non-parametric nearest neighbors approach which suffers from the curse of dimensionality. We attempt to address that challenge by proposing a novel approach to the problem: Local Intrinsic Dimension estimation using approximate Likelihood (LIDL). Our method relies on an arbitrary density estimation method as its subroutine and hence tries to sidestep the dimensionality challenge by making use of the recent progress in parametric neural methods for likelihood estimation. We carefully investigate the empirical properties of the proposed method, compare them with our theoretical predictions, and show that LIDL yields competitive results on the standard benchmarks for this problem and that it scales to thousands of dimensions. What is more, we anticipate this approach to improve further with the continuing advances in the density estimation literature.  ( 2 min )
    A note on Linear Bottleneck networks and their Transition to Multilinearity. (arXiv:2206.15058v1 [cs.LG])
    Randomly initialized wide neural networks transition to linear functions of weights as the width grows, in a ball of radius $O(1)$ around initialization. A necessary condition for this result is that all layers of the network are wide enough, i.e., all widths tend to infinity. However, the transition to linearity breaks down when this infinite width assumption is violated. In this work we show that linear networks with a bottleneck layer learn bilinear functions of the weights, in a ball of radius $O(1)$ around initialization. In general, for $B-1$ bottleneck layers, the network is a degree $B$ multilinear function of weights. Importantly, the degree only depends on the number of bottlenecks and not the total depth of the network.  ( 2 min )
    Reconstructing the Universe with Variational self-Boosted Sampling. (arXiv:2206.15433v1 [astro-ph.IM])
    Forward modeling approaches in cosmology have made it possible to reconstruct the initial conditions at the beginning of the Universe from the observed survey data. However, the high dimensionality of the parameter space still poses a challenge to explore the full posterior, with traditional algorithms such as Hamiltonian Monte Carlo (HMC) being computationally inefficient due to generating correlated samples and the performance of variational inference being highly dependent on the choice of divergence (loss) function. Here we develop a hybrid scheme, called variational self-boosted sampling (VBS), to mitigate the drawbacks of both these algorithms by learning a variational approximation for the proposal distribution of Monte Carlo sampling and combining it with HMC. The variational distribution is parameterized as a normalizing flow and learnt with samples generated on the fly, while proposals drawn from it reduce auto-correlation length in MCMC chains. Our normalizing flow uses Fourier space convolutions and element-wise operations to scale to high dimensions. We show that after a short initial warm-up and training phase, VBS generates better-quality samples than simple VI approaches and reduces the correlation length in the sampling phase by a factor of 10-50 over using only HMC to explore the posterior of initial conditions in 64$^3$ and 128$^3$ dimensional problems, with larger gains for high signal-to-noise data observations.  ( 3 min )
    Neural Networks can Learn Representations with Gradient Descent. (arXiv:2206.15144v1 [cs.LG])
    Significant theoretical work has established that in specific regimes, neural networks trained by gradient descent behave like kernel methods. However, in practice, it is known that neural networks strongly outperform their associated kernels. In this work, we explain this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be easily learned with gradient descent on a two layer neural network outside the kernel regime by learning representations that are relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime. Specifically, we consider the problem of learning polynomials which depend on only a few relevant directions, i.e. of the form $f^\star(x) = g(Ux)$ where $U: \mathbb{R}^d \to \mathbb{R}^r$ with $d \gg r$. When the degree of $f^\star$ is $p$, it is known that $n \asymp d^p$ samples are necessary to learn $f^\star$ in the kernel regime. Our primary result is that gradient descent learns a representation of the data which depends only on the directions relevant to $f^\star$. This results in an improved sample complexity of $n\asymp d^2 r + dr^p$. Furthermore, in a transfer learning setup where the data distributions in the source and target domain share the same representation $U$ but have different polynomial heads we show that a popular heuristic for transfer learning has a target sample complexity independent of $d$.  ( 3 min )
    Transfer Learning with Deep Tabular Models. (arXiv:2206.15306v1 [cs.LG])
    Recent work on deep learning for tabular data demonstrates the strong performance of deep tabular models, often bridging the gap between gradient boosted decision trees and neural networks. Accuracy aside, a major advantage of neural models is that they learn reusable features and are easily fine-tuned in new domains. This property is often exploited in computer vision and natural language applications, where transfer learning is indispensable when task-specific training data is scarce. In this work, we demonstrate that upstream data gives tabular neural networks a decisive advantage over widely used GBDT models. We propose a realistic medical diagnosis benchmark for tabular transfer learning, and we present a how-to guide for using upstream data to boost performance with a variety of tabular neural network architectures. Finally, we propose a pseudo-feature method for cases where the upstream and downstream feature sets differ, a tabular-specific problem widespread in real-world applications. Our code is available at https://github.com/LevinRoman/tabular-transfer-learning .  ( 2 min )
    Capturing Shape Information with Multi-Scale Topological Loss Terms for 3D Reconstruction. (arXiv:2203.01703v2 [cs.CV] UPDATED)
    Reconstructing 3D objects from 2D images is challenging for both our brains and machine learning algorithms. To support this spatial reasoning task, contextual information about the overall shape of an object is critical. However, such information is not captured by established loss terms (e.g. Dice loss). We propose to complement geometrical shape information by including multi-scale topological features, such as connected components, cycles, and voids, in the reconstruction loss. Our method uses cubical complexes to calculate topological features of 3D volume data and employs an optimal transport distance to guide the reconstruction process. This topology-aware loss is fully differentiable, computationally efficient, and can be added to any neural network. We demonstrate the utility of our loss by incorporating it into SHAPR, a model for predicting the 3D cell shape of individual cells based on 2D microscopy images. Using a hybrid loss that leverages both geometrical and topological information of single objects to assess their shape, we find that topological information substantially improves the quality of reconstructions, thus highlighting its ability to extract more relevant features from image datasets.  ( 3 min )
    SOSP: Efficiently Capturing Global Correlations by Second-Order Structured Pruning. (arXiv:2110.11395v2 [cs.LG] UPDATED)
    Pruning neural networks reduces inference time and memory costs. On standard hardware, these benefits will be especially prominent if coarse-grained structures, like feature maps, are pruned. We devise two novel saliency-based methods for second-order structured pruning (SOSP) which include correlations among all structures and layers. Our main method SOSP-H employs an innovative second-order approximation, which enables saliency evaluations by fast Hessian-vector products. SOSP-H thereby scales like a first-order method despite taking into account the full Hessian. We validate SOSP-H by comparing it to our second method SOSP-I that uses a well-established Hessian approximation, and to numerous state-of-the-art methods. While SOSP-H performs on par or better in terms of accuracy, it has clear advantages in terms of scalability and efficiency. This allowed us to scale SOSP-H to large-scale vision tasks, even though it captures correlations across all layers of the network. To underscore the global nature of our pruning methods, we evaluate their performance not only by removing structures from a pretrained network, but also by detecting architectural bottlenecks. We show that our algorithms allow us to systematically reveal architectural bottlenecks, which we then remove to further increase the accuracy of the networks.  ( 3 min )
    Counterfactual Inference of Second Opinions. (arXiv:2203.08653v2 [cs.LG] UPDATED)
    Automated decision support systems that are able to infer second opinions from experts can potentially facilitate a more efficient allocation of resources; they can help decide when and from whom to seek a second opinion. In this paper, we look at the design of this type of support systems from the perspective of counterfactual inference. We focus on a multiclass classification setting and first show that, if experts make predictions on their own, the underlying causal mechanism generating their predictions needs to satisfy a desirable set invariant property. Further, we show that, for any causal mechanism satisfying this property, there exists an equivalent mechanism where the predictions by each expert are generated by independent sub-mechanisms governed by a common noise. This motivates the design of a set invariant Gumbel-Max structural causal model where the structure of the noise governing the sub-mechanisms underpinning the model depends on an intuitive notion of similarity between experts which can be estimated from data. Experiments on both synthetic and real data show that our model can be used to infer second opinions more accurately than its non-causal counterpart.  ( 2 min )
    Verification and search algorithms for causal DAGs. (arXiv:2206.15374v1 [cs.LG])
    We study two problems related to recovering causal graphs from interventional data: (i) $\textit{verification}$, where the task is to check if a purported causal graph is correct, and (ii) $\textit{search}$, where the task is to recover the correct causal graph. For both, we wish to minimize the number of interventions performed. For the first problem, we give a characterization of a minimal sized set of atomic interventions that is necessary and sufficient to check the correctness of a claimed causal graph. Our characterization uses the notion of $\textit{covered edges}$, which enables us to obtain simple proofs and also easily reason about earlier results. We also generalize our results to the settings of bounded size interventions and node-dependent interventional costs. For all the above settings, we provide the first known provable algorithms for efficiently computing (near)-optimal verifying sets on general graphs. For the second problem, we give a simple adaptive algorithm based on graph separators that produces an atomic intervention set which fully orients any essential graph while using $\mathcal{O}(\log n)$ times the optimal number of interventions needed to $\textit{verify}$ (verifying size) the underlying DAG on $n$ vertices. This approximation is tight as $\textit{any}$ search algorithm on an essential line graph has worst case approximation ratio of $\Omega(\log n)$ with respect to the verifying size. With bounded size interventions, each of size $\leq k$, our algorithm gives an $\mathcal{O}(\log n \cdot \log \log k)$ factor approximation. Our result is the first known algorithm that gives a non-trivial approximation guarantee to the verifying size on general unweighted graphs and with bounded size interventions.  ( 3 min )
    Towards out of distribution generalization for problems in mechanics. (arXiv:2206.14917v1 [stat.ML])
    There has been a massive increase in research interest towards applying data driven methods to problems in mechanics. While traditional machine learning (ML) methods have enabled many breakthroughs, they rely on the assumption that the training (observed) data and testing (unseen) data are independent and identically distributed (i.i.d). Thus, traditional ML approaches often break down when applied to real world mechanics problems with unknown test environments and data distribution shifts. In contrast, out-of-distribution (OOD) generalization assumes that the test data may shift (i.e., violate the i.i.d. assumption). To date, multiple methods have been proposed to improve the OOD generalization of ML methods. However, because of the lack of benchmark datasets for OOD regression problems, the efficiency of these OOD methods on regression problems, which dominate the mechanics field, remains unknown. To address this, we investigate the performance of OOD generalization methods for regression problems in mechanics. Specifically, we identify three OOD problems: covariate shift, mechanism shift, and sampling bias. For each problem, we create two benchmark examples that extend the Mechanical MNIST dataset collection, and we investigate the performance of popular OOD generalization methods on these mechanics-specific regression problems. Our numerical experiments show that in most cases, while the OOD generalization algorithms perform better compared to traditional ML methods on these OOD problems, there is a compelling need to develop more robust OOD generalization methods that are effective across multiple OOD scenarios. Overall, we expect that this study, as well as the associated open access benchmark datasets, will enable further development of OOD generalization methods for mechanics specific regression problems.  ( 3 min )
    Lookback for Learning to Branch. (arXiv:2206.14987v1 [cs.LG])
    The expressive and computationally inexpensive bipartite Graph Neural Networks (GNN) have been shown to be an important component of deep learning based Mixed-Integer Linear Program (MILP) solvers. Recent works have demonstrated the effectiveness of such GNNs in replacing the branching (variable selection) heuristic in branch-and-bound (B&B) solvers. These GNNs are trained, offline and on a collection of MILPs, to imitate a very good but computationally expensive branching heuristic, strong branching. Given that B&B results in a tree of sub-MILPs, we ask (a) whether there are strong dependencies exhibited by the target heuristic among the neighboring nodes of the B&B tree, and (b) if so, whether we can incorporate them in our training procedure. Specifically, we find that with the strong branching heuristic, a child node's best choice was often the parent's second-best choice. We call this the "lookback" phenomenon. Surprisingly, the typical branching GNN of Gasse et al. (2019) often misses this simple "answer". To imitate the target behavior more closely by incorporating the lookback phenomenon in GNNs, we propose two methods: (a) target smoothing for the standard cross-entropy loss function, and (b) adding a Parent-as-Target (PAT) Lookback regularizer term. Finally, we propose a model selection framework to incorporate harder-to-formulate objectives such as solving time in the final models. Through extensive experimentation on standard benchmark instances, we show that our proposal results in up to 22% decrease in the size of the B&B tree and up to 15% improvement in the solving times.  ( 3 min )
    Decision Forest Based EMG Signal Classification with Low Volume Dataset Augmented with Random Variance Gaussian Noise. (arXiv:2206.14947v1 [q-bio.NC])
    Electromyography signals can be used as training data by machine learning models to classify various gestures. We seek to produce a model that can classify six different hand gestures with a limited number of samples that generalizes well to a wider audience while comparing the effect of our feature extraction results on model accuracy to other more conventional methods such as the use of AR parameters on a sliding window across the channels of a signal. We appeal to a set of more elementary methods such as the use of random bounds on a signal, but desire to show the power these methods can carry in an online setting where EMG classification is being conducted, as opposed to more complicated methods such as the use of the Fourier Transform. To augment our limited training data, we used a standard technique, known as jitter, where random noise is added to each observation in a channel wise manner. Once all datasets were produced using the above methods, we performed a grid search with Random Forest and XGBoost to ultimately create a high accuracy model. For human computer interface purposes, high accuracy classification of EMG signals is of particular importance to their functioning and given the difficulty and cost of amassing any sort of biomedical data in a high volume, it is valuable to have techniques that can work with a low amount of high-quality samples with less expensive feature extraction methods that can reliably be carried out in an online application.  ( 3 min )
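    The jitter augmentation mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the uniform draw of the noise scale, and the `max_sigma` default are all assumptions made for the example.

```python
import random


def jitter(signal, max_sigma=0.05, rng=None):
    """Return a noisy copy of a multi-channel signal.

    signal: list of channels, each a list of floats.
    Assumption: for each channel a noise standard deviation is drawn
    uniformly from [0, max_sigma] (the "random variance" part), then
    zero-mean Gaussian noise with that deviation is added to every sample.
    """
    rng = rng or random.Random()
    noisy = []
    for channel in signal:
        sigma = rng.uniform(0.0, max_sigma)
        noisy.append([x + rng.gauss(0.0, sigma) for x in channel])
    return noisy


def augment(signal, copies=5, max_sigma=0.05, seed=0):
    """Produce several independently jittered copies of one recording."""
    rng = random.Random(seed)
    return [jitter(signal, max_sigma, rng) for _ in range(copies)]
```

    Each augmented copy keeps the label of the recording it came from, so a small labelled set can be expanded before the grid search over tree models.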
    Meta-analysis of heterogeneous data: integrative sparse regression in high-dimensions. (arXiv:1912.11928v2 [stat.ME] UPDATED)
    We consider the task of meta-analysis in high-dimensional settings in which the data sources are similar but non-identical. To borrow strength across such heterogeneous datasets, we introduce a global parameter that emphasizes interpretability and statistical efficiency in the presence of heterogeneity. We also propose a one-shot estimator of the global parameter that preserves the anonymity of the data sources and converges at a rate that depends on the size of the combined dataset. For high-dimensional linear model settings, we demonstrate the superiority of our identification restrictions in adapting to a previously seen data distribution as well as predicting for a new/unseen data distribution. Finally, we demonstrate the benefits of our approach on a large-scale drug treatment dataset involving several different cancer cell-lines.  ( 2 min )
    A note on large deviations for interacting particle dynamics for finding mixed equilibria in zero-sum games. (arXiv:2206.15177v1 [stat.ML])
    Finding equilibria points in continuous minimax games has become a key problem within machine learning, in part due to its connection to the training of generative adversarial networks. Because of existence and robustness issues, recent developments have shifted from pure equilibria to focusing on mixed equilibria points. In this note we consider a method proposed by Domingo-Enrich et al. for finding mixed equilibria in two-layer zero-sum games. The method is based on entropic regularisation and the two competing strategies are represented by two sets of interacting particles. We show that the sequence of empirical measures of the particle system satisfies a large deviation principle as the number of particles grows to infinity, and how this implies convergence of the empirical measure and the associated Nikaidô-Isoda error, complementing existing law of large numbers results.  ( 2 min )
    Interpretable Anomaly Detection in Echocardiograms with Dynamic Variational Trajectory Models. (arXiv:2206.15316v1 [cs.LG])
    We propose a novel anomaly detection method for echocardiogram videos. The introduced method takes advantage of the periodic nature of the heart cycle to learn different variants of a variational latent trajectory model (TVAE). The models are trained on the healthy samples of an in-house dataset of infant echocardiogram videos consisting of multiple chamber views to learn a normative prior of the healthy population. During inference, maximum a posteriori (MAP) based anomaly detection is performed to detect out-of-distribution samples in our dataset. The proposed method reliably identifies severe congenital heart defects, such as Ebstein's Anomaly or Shone complex. Moreover, it achieves superior performance over MAP-based anomaly detection with standard variational autoencoders on the task of detecting pulmonary hypertension and right ventricular dilation. Finally, we demonstrate that the proposed method provides interpretable explanations of its output through heatmaps which highlight the regions corresponding to anomalous heart structures.  ( 2 min )
    Chained Generalisation Bounds. (arXiv:2203.00977v2 [stat.ML] UPDATED)
    This work discusses how to derive upper bounds for the expected generalisation error of supervised learning algorithms by means of the chaining technique. By developing a general theoretical framework, we establish a duality between generalisation bounds based on the regularity of the loss function, and their chained counterparts, which can be obtained by lifting the regularity assumption from the loss onto its gradient. This allows us to re-derive the chaining mutual information bound from the literature, and to obtain novel chained information-theoretic generalisation bounds, based on the Wasserstein distance and other probability metrics. We show on some toy examples that the chained generalisation bound can be significantly tighter than its standard counterpart, particularly when the distribution of the hypotheses selected by the algorithm is very concentrated. Keywords: Generalisation bounds; Chaining; Information-theoretic bounds; Mutual information; Wasserstein distance; PAC-Bayes.  ( 2 min )
    Business analytics meets artificial intelligence: Assessing the demand effects of discounts on Swiss train tickets. (arXiv:2105.01426v4 [econ.GN] UPDATED)
    We assess the demand effects of discounts on train tickets issued by the Swiss Federal Railways, the so-called 'supersaver tickets', based on machine learning, a subfield of artificial intelligence. Considering a survey-based sample of buyers of supersaver tickets, we investigate which customer- or trip-related characteristics (including the discount rate) predict buying behavior, namely: booking a trip otherwise not realized by train, buying a first- rather than second-class ticket, or rescheduling a trip (e.g., away from rush hours) when being offered a supersaver ticket. Predictive machine learning suggests that the customer's age, demand-related information for a specific connection (like departure time and utilization), and the discount level permit forecasting buying behavior to a certain extent. Furthermore, we use causal machine learning to assess the impact of the discount rate on rescheduling a trip, which seems relevant in the light of capacity constraints at rush hours. Assuming that (i) the discount rate is quasi-random conditional on our rich set of characteristics and (ii) the buying decision increases weakly monotonically in the discount rate, we identify the discount rate's effect among 'always buyers', who would have traveled even without a discount, based on our survey that asks about customer behavior in the absence of discounts. We find that on average, increasing the discount rate by one percentage point increases the share of rescheduled trips by 0.16 percentage points among always buyers. Investigating effect heterogeneity across observables suggests that the effects are higher for leisure travelers and during peak hours when controlling for several other characteristics.  ( 3 min )
    Provably Efficient Reinforcement Learning for Online Adaptive Influence Maximization. (arXiv:2206.14846v1 [cs.LG])
    Online influence maximization aims to maximize the influence spread of a piece of content in a social network with an unknown network model by selecting a few seed nodes. Recent studies followed a non-adaptive setting, where the seed nodes are selected before the start of the diffusion process and network parameters are updated when the diffusion stops. We consider an adaptive version of the content-dependent online influence maximization problem where the seed nodes are sequentially activated based on real-time feedback. In this paper, we formulate the problem as an infinite-horizon discounted MDP under a linear diffusion process and present a model-based reinforcement learning solution. Our algorithm maintains a network model estimate and selects seed users adaptively, exploring the social network while improving the optimal policy optimistically. We establish a $\widetilde O(\sqrt{T})$ regret bound for our algorithm. Empirical evaluations on a synthetic network demonstrate the efficiency of our algorithm.  ( 2 min )
    Shifts 2.0: Extending The Dataset of Real Distributional Shifts. (arXiv:2206.15407v1 [cs.LG])
    Distributional shift, or the mismatch between training and deployment data, is a significant obstacle to the usage of machine learning in high-stakes industrial applications, such as autonomous driving and medicine. This creates a need to be able to assess how robustly ML models generalize as well as the quality of their uncertainty estimates. Standard ML baseline datasets do not allow these properties to be assessed, as the training, validation and test data are often identically distributed. Recently, a range of dedicated benchmarks have appeared, featuring both distributionally matched and shifted data. Among these benchmarks, the Shifts dataset stands out in terms of the diversity of tasks as well as the data modalities it features. While most of the benchmarks are heavily dominated by 2D image classification tasks, Shifts contains tabular weather forecasting, machine translation, and vehicle motion prediction tasks. This enables the robustness properties of models to be assessed on a diverse set of industrial-scale tasks and either universal or directly applicable task-specific conclusions to be reached. In this paper, we extend the Shifts Dataset with two datasets sourced from industrial, high-risk applications of high societal importance. Specifically, we consider the tasks of segmentation of white matter Multiple Sclerosis lesions in 3D magnetic resonance brain images and the estimation of power consumption in marine cargo vessels. Both tasks feature ubiquitous distributional shifts and a strict safety requirement due to the high cost of errors. These new datasets will allow researchers to further explore robust generalization and uncertainty estimation in new situations. In this work, we provide a description of the dataset and baseline results for both tasks.  ( 3 min )
    Universal and data-adaptive algorithms for model selection in linear contextual bandits. (arXiv:2111.04688v2 [cs.LG] UPDATED)
    Model selection in contextual bandits is an important complementary problem to regret minimization with respect to a fixed model class. We consider the simplest non-trivial instance of model-selection: distinguishing a simple multi-armed bandit problem from a linear contextual bandit problem. Even in this instance, current state-of-the-art methods explore in a suboptimal manner and require strong "feature-diversity" conditions. In this paper, we introduce new algorithms that a) explore in a data-adaptive manner, and b) provide model selection guarantees of the form $\mathcal{O}(d^{\alpha} T^{1- \alpha})$ with no feature diversity conditions whatsoever, where $d$ denotes the dimension of the linear model and $T$ denotes the total number of rounds. The first algorithm enjoys a "best-of-both-worlds" property, recovering two prior results that hold under distinct distributional assumptions, simultaneously. The second removes distributional assumptions altogether, expanding the scope for tractable model selection. Our approach extends to model selection among nested linear contextual bandits under some additional assumptions.  ( 2 min )
    Prediction of Dilatory Behavior in eLearning: A Comparison of Multiple Machine Learning Models. (arXiv:2206.15079v1 [stat.ML])
    Procrastination, the irrational delay of tasks, is a common occurrence in online learning. Potential negative consequences include higher risk of drop-outs, increased stress, and reduced mood. Due to the rise of learning management systems and learning analytics, indicators of such behavior can be detected, enabling predictions of future procrastination and other dilatory behavior. However, research focusing on such predictions is scarce. Moreover, studies involving different types of predictors and comparisons between the predictive performance of various methods are virtually non-existent. In this study, we aim to fill these research gaps by analyzing the performance of multiple machine learning algorithms when predicting the delayed or timely submission of online assignments in a higher education setting with two categories of predictors: subjective, questionnaire-based variables and objective, log-data based indicators extracted from a learning management system. The results show that models with objective predictors consistently outperform models with subjective predictors, and a combination of both variable types performs slightly better. For each of these three options, a different approach prevailed (Gradient Boosting Machines for the subjective, Bayesian multilevel models for the objective, and Random Forest for the combined predictors). We conclude that careful attention should be paid to the selection of predictors and algorithms before implementing such models in learning management systems.  ( 3 min )
    Which Minimizer Does My Neural Network Converge To?. (arXiv:2011.02408v2 [stat.ML] UPDATED)
    The loss surface of an overparameterized neural network (NN) possesses many global minima of zero training error. We explain how common variants of the standard NN training procedure change the minimizer obtained. First, we make explicit how the size of the initialization of a strongly overparameterized NN affects the minimizer and can deteriorate its final test performance. We propose a strategy to limit this effect. Then, we demonstrate that for adaptive optimization such as AdaGrad, the obtained minimizer generally differs from the gradient descent (GD) minimizer. This adaptive minimizer is changed further by stochastic mini-batch training, even though in the non-adaptive case, GD and stochastic GD result in essentially the same minimizer. Lastly, we explain that these effects remain relevant for less overparameterized NNs. While overparameterization has its benefits, our work highlights that it induces sources of error absent from underparameterized models.  ( 2 min )
    Federated Over-Air Subspace Tracking from Incomplete and Corrupted Data. (arXiv:2002.12873v4 [cs.LG] UPDATED)
    In this work we study the problem of Subspace Tracking with missing data (ST-miss) and outliers (Robust ST-miss). We propose a novel algorithm, and provide a guarantee for both these problems. Unlike past work on this topic, the current work does not impose the piecewise constant subspace change assumption. Additionally, the proposed algorithm is much simpler (uses fewer parameters) than our previous work. Secondly, we extend our approach and its analysis to provably solving these problems when the data is federated and when the over-air data communication modality is used for information exchange between the $K$ peer nodes and the center. We validate our theoretical claims with extensive numerical experiments.  ( 2 min )
    Wasserstein GANs with Gradient Penalty Compute Congested Transport. (arXiv:2109.00528v2 [cs.LG] UPDATED)
    Wasserstein GANs with Gradient Penalty (WGAN-GP) are a very popular method for training generative models to produce high quality synthetic data. While WGAN-GP were initially developed to calculate the Wasserstein 1 distance between generated and real data, recent works (e.g. [23]) have provided empirical evidence that this does not occur, and have argued that WGAN-GP perform well not in spite of this issue, but because of it. In this paper we show for the first time that WGAN-GP compute the minimum of a different optimal transport problem, the so-called congested transport [7]. Congested transport determines the cost of moving one distribution to another under a transport model that penalizes congestion. For WGAN-GP, we find that the congestion penalty has a spatially varying component determined by the sampling strategy used in [12] which acts like a local speed limit, making congestion cost less in some regions than others. This aspect of the congested transport problem is new, in that the congestion penalty turns out to be unbounded and depends on the distributions to be transported, and so we provide the necessary mathematical proofs for this setting. One facet of our discovery is a formula connecting the gradient of solutions to the optimization problem in WGAN-GP to the time averaged momentum of the optimal mass flow. This is in contrast to the gradient of Kantorovich potentials for the Wasserstein 1 distance, which is just the normalized direction of flow. Based on this and other considerations, we speculate on how our results explain the observed performance of WGAN-GP. Beyond applications to GANs, our theorems also point to the possibility of approximately solving large scale congested transport problems using neural network techniques.  ( 3 min )
    Randomized K-FACs: Speeding up K-FAC with Randomized Numerical Linear Algebra. (arXiv:2206.15397v1 [cs.LG])
    K-FAC is a successful tractable implementation of Natural Gradient for Deep Learning, which nevertheless suffers from the requirement to compute the inverse of the Kronecker factors (through an eigen-decomposition). This can be very time-consuming (or even prohibitive) when these factors are large. In this paper, we theoretically show that, owing to the exponential-average construction paradigm of the Kronecker factors that is typically used, their eigen-spectrum must decay. We show numerically that in practice this decay is very rapid, leading to the idea that we could save substantial computation by only focusing on the first few eigen-modes when inverting the Kronecker-factors. Randomized Numerical Linear Algebra provides us with the necessary tools to do so. Numerical results show we obtain $\approx2.5\times$ reduction in per-epoch time and $\approx3.3\times$ reduction in time to target accuracy. We compare our proposed K-FAC sped-up versions with a more computationally efficient NG implementation, SENG, and observe we perform on par with it.  ( 2 min )
    Best of Both Worlds Model Selection. (arXiv:2206.14912v1 [cs.LG])
    We study the problem of model selection in bandit scenarios in the presence of nested policy classes, with the goal of obtaining simultaneous adversarial and stochastic ("best of both worlds") high-probability regret guarantees. Our approach requires that each base learner comes with a candidate regret bound that may or may not hold, while our meta algorithm plays each base learner according to a schedule that keeps the base learner's candidate regret bounds balanced until they are detected to violate their guarantees. We develop careful mis-specification tests specifically designed to blend the above model selection criterion with the ability to leverage the (potentially benign) nature of the environment. We recover the model selection guarantees of the CORRAL algorithm for adversarial environments, but with the additional benefit of achieving high probability regret bounds, specifically in the case of nested adversarial linear bandits. More importantly, our model selection results also hold simultaneously in stochastic environments under gap assumptions. These are the first theoretical results that achieve best of both world (stochastic and adversarial) guarantees while performing model selection in (linear) bandit scenarios.  ( 2 min )
    Learning Nonparametric Ordinary differential Equations: Application to Sparse and Noisy Data. (arXiv:2206.15215v1 [stat.ML])
    Learning nonparametric systems of Ordinary Differential Equations (ODEs) $\dot x = f(t,x)$ from noisy and sparse data is an emerging machine learning topic. We use the well-developed theory of Reproducing Kernel Hilbert Spaces (RKHS) to define candidates for $f$ for which the solution of the ODE exists and is unique. Learning $f$ consists of solving a constrained optimization problem in an RKHS. We propose a penalty method that iteratively uses the Representer theorem and Euler approximations to provide a numerical solution. We prove a generalization bound for the $L^2$ distance between $x$ and its estimator. Experiments are provided for the FitzHugh Nagumo oscillator and for the prediction of the Amyloid level in the cortex of aging subjects. In both cases, we show competitive results when compared with the state of the art.  ( 2 min )
    Fair Policy Targeting. (arXiv:2005.12395v3 [econ.EM] UPDATED)
    One of the major concerns of targeting interventions on individuals in social welfare programs is discrimination: individualized treatments may induce disparities across sensitive attributes such as age, gender, or race. This paper addresses the question of the design of fair and efficient treatment allocation rules. We adopt the non-maleficence perspective of first do no harm: we select the fairest allocation within the Pareto frontier. We cast the optimization into a mixed-integer linear program formulation, which can be solved using off-the-shelf algorithms. We derive regret bounds on the unfairness of the estimated policy function and small sample guarantees on the Pareto frontier under general notions of fairness. Finally, we illustrate our method using an application from education economics.  ( 2 min )
  • Open

    Phi Phi
    I was reading something this afternoon and ran across φ(φ(m)) and thought that was unusual. I often run across φ(m), the number of positive integers less than m and relatively prime to m, but don’t often see Euler’s phi function iterated. Application of φ∘φ This section will give an example of a theorem where φ(φ(m)) […] Phi Phi first appeared on John D. Cook.  ( 5 min )
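    For readers who want to experiment with the iterated function, here is a small sketch that computes φ(m) by trial division and then φ(φ(m)). As one illustration it also checks a standard fact by brute force: when m has a primitive root, it has exactly φ(φ(m)) of them. That fact is offered as an example only; the excerpt above is truncated, so it may or may not be the theorem the post goes on to discuss. The function names are chosen for this sketch.

```python
from math import gcd


def phi(m):
    """Euler's totient: count of 1 <= k <= m with gcd(k, m) == 1."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    result, n, p = m, m, 2
    while p * p <= n:
        if n % p == 0:
            while n % p == 0:
                n //= p
            result -= result // p  # multiply result by (1 - 1/p)
        p += 1
    if n > 1:  # one prime factor larger than sqrt(m) remains
        result -= result // n
    return result


def multiplicative_order(a, m):
    """Smallest k >= 1 with a**k congruent to 1 mod m (needs gcd(a, m) == 1)."""
    k, x = 1, a % m
    while x != 1:
        x = (x * a) % m
        k += 1
    return k


def count_primitive_roots(m):
    """Brute-force count of residues whose order equals phi(m)."""
    target = phi(m)
    return sum(
        1
        for a in range(1, m)
        if gcd(a, m) == 1 and multiplicative_order(a, m) == target
    )


print(phi(10), phi(phi(10)))                    # 4 2
print(count_primitive_roots(7), phi(phi(7)))    # 2 2
```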

  • Open

    [R] Proprietary ML model in research paper
    I am writing a research paper, and in it I use a proprietary ML model I made. I want to show the model's results and I can explain how it works, but I don't want to explicitly provide the model/its code. Is that commonplace in research papers or must I include specifics to show validity? submitted by /u/Typical-Ad-7443 [link] [comments]  ( 87 min )
    [D][P] Ideas about how to model from a dataset with columns containing arrays of data?
    Hello. I have built a dataset that contains results of experiments I have been doing over some physical materials. Each row contains summary data for each piece, like width, height, weight, etc. Then I have several columns which values are arrays. Each one of these columns contain a list of tuples, for example (162636363, 1373.8377). The first number is a timestamp, the second one the magnitude of a force applied to the material (or, for instance, the position where the force was applied, contact duration, etc.). We have hundreds or even thousands of tuples on each column. So, all columns represent measurements of the experiments done to a particular material. We are recording when the material is damaged, since we want to predict its lifetime when the material is exposed to repetitive forces. I'm wondering what to do with those array values. One option is to sort the tuples lists by timestamp and then treat the readings as a vector of a predefined dimension. But I have never fed this kind of data to a boosted tree model/framework like XGBoost. The only experience I had feeding long vectors to a model was when doing some NLP, in that case the vectors were representations of words. Do you think a vector made of all my experiments over a material can be treated as an embedding in a way? If so, how is the recommended way to proceed with this data in the modeling stage? Time series perhaps? I'd appreciate your ideas and comments. Thanks!! submitted by /u/iblysa [link] [comments]  ( 87 min )
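    One possible way to turn the array-valued columns described above into fixed-length inputs for a tree model is sketched below. It sorts each list of (timestamp, value) tuples, computes a few summary statistics, and linearly resamples the readings onto a fixed number of evenly spaced time points. This is only one option under stated assumptions: the function names, the choice of statistics, and `n_points` are hypothetical, and whether summaries or a sequence model works better depends on the data.

```python
from statistics import mean, pstdev


def resample(readings, n_points=8):
    """Sort (timestamp, value) pairs and linearly interpolate the values
    onto n_points evenly spaced timestamps between the first and last."""
    pts = sorted(readings)
    t0, t1 = pts[0][0], pts[-1][0]
    if n_points == 1 or t1 == t0:
        return [pts[0][1]] * n_points
    out, j = [], 0
    for i in range(n_points):
        t = t0 + (t1 - t0) * i / (n_points - 1)
        # advance to the segment [pts[j], pts[j + 1]] that contains t
        while j < len(pts) - 2 and pts[j + 1][0] < t:
            j += 1
        (ta, va), (tb, vb) = pts[j], pts[j + 1]
        w = 0.0 if tb == ta else (t - ta) / (tb - ta)
        out.append(va + w * (vb - va))
    return out


def summarize(readings):
    """Order-free summary statistics of the measured values."""
    values = [v for _, v in readings]
    return [min(values), max(values), mean(values), pstdev(values),
            float(len(values))]


def row_features(scalars, array_columns, n_points=8):
    """Concatenate scalar columns with features from each array column."""
    features = list(scalars)
    for readings in array_columns:
        features += summarize(readings) + resample(readings, n_points)
    return features
```

    The resulting flat list per material can be passed to XGBoost or a random forest like any other numeric row; if the ordering of readings carries most of the signal, a time-series model may be a better fit than this flattening.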
    [D] Usage of the [class] token in ViT
    So I've read up on ViT, and while it's an impressive architecture, I seem to notice that they are using a [class] token to get the actual class from an input image (see image below). [Figure: Architecture of ViT] While I know that it's standard to use an extra token in this fashion, since the encoder spits out one embedding for every input token (or patch in this case), I was wondering why don't we simply concatenate all the embeddings before feeding them into the MLP head (of an appropriate size)? It seems to me like we are discarding a lot of information here, that could be helpful in the classification task. It's true, in theory, that the attention should take care of that, but do you know of any papers where this concatenation strategy has been tried? Does it even make sense? Cheers! submitted by /u/MurlocXYZ [link] [comments]  ( 87 min )
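    One practical consideration for the concatenation idea is the size of the classification head. The back-of-the-envelope sketch below compares a head on the [class] embedding with a head on all concatenated patch embeddings. The numbers are assumptions chosen for illustration (ViT-Base/16 style: 224×224 input, 16×16 patches, embedding size 768, 1000 classes, and a single linear layer as the head); they are not taken from the post.

```python
def linear_params(num_inputs, num_outputs):
    """Parameter count of one linear layer: weights plus biases."""
    return num_inputs * num_outputs + num_outputs


image_side, patch_side = 224, 16   # assumed input and patch size
embed_dim, num_classes = 768, 1000  # assumed embedding size and classes

num_patches = (image_side // patch_side) ** 2  # 14 * 14 patches

cls_head = linear_params(embed_dim, num_classes)
concat_head = linear_params(num_patches * embed_dim, num_classes)

print(num_patches)   # 196
print(cls_head)      # 769000
print(concat_head)   # 150529000
```

    Under these assumptions the concatenated head alone has roughly 196 times as many parameters as the [class]-token head, and it is tied to a fixed number of patches. Pooling the patch embeddings (for example, averaging them) is a cheaper alternative that also uses every token.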
    [D] Creating a neural network for my daughter's sake. Need advice on acronym.
    Hi, very long time lurker here. I'm planning to propose an end-to-end architecture for my daughter's sake. The data is biomedical and any CNN is well capable of classifying it with over 95% accuracy (easy data, you know!). However, I need to come up with an acronym to fit my daughter's name. Her name is DURU and here is what I came up with: D - Deep (deep as in deep learning). U - Unified (I may use multiple models to form an ensemble or feature concat, which will make it unified). R - Residual (I may use residual connections between CNN blocks, though that's not flashy right now). R - Recommender (I could use the recommender keyword, since I'm putting down a sort of Computer Aided Diagnosis framework thingy). R - Another R thing is welcome. U - I need another U and I'm totally out of words. Three letters is all I came up with. I couldn't find a word for the 4th letter that makes sense. U-Net? I'm not segmenting anything. But if it were a segmentation dataset I might have come up with DUR-UNet, which would make sense. I need a final keyword starting with U which is applicable to CNNs. It could be a minor trick to cope with overfitting, a loss function, an activation function, etc. It could also be a filler term like Unified. Hope we can come up with a solution. submitted by /u/cltexe [link] [comments]  ( 87 min )
    [R] Introducing causal inference in the energy-efficient building design process
    I am very excited to share our latest research: Causal inference in the scenario of an energy-efficient building design to answer "what-if" questions during the design process. Abs: "What-if" questions are intuitively generated and commonly asked during the design process. Engineers and architects need to inherently conduct design decisions, progressing from one phase to another. They either use empirical domain experience, simulations, or data-driven methods to provide consequential feedback. We take an example from an interdisciplinary domain of energy-efficient building design to argue that the current methods for decision support have four limitations: 1. Less carefully inspected parametric independence raises the risks of biased results and spurious relationships. 2. The integration …  ( 88 min )
    [P]how to improve performance of face recognition using dlib?
    I am using dlib.get_frontal_face_detector() and for large images (several MB) it takes a lot of time to detect a face. What are the ways to increase the speed of face detection without sacrificing accuracy? I cannot use GPU/CUDA, sadly... submitted by /u/glorsh66 [link] [comments]  ( 86 min )
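    A common CPU-only approach to this kind of problem is to run the detector on a downscaled copy of the image and map the resulting boxes back to the original resolution. The sketch below covers only the bookkeeping and deliberately does not import dlib or an image library. The helper names and the `max_side` default are hypothetical, and the resize step is left to whatever imaging library is already in use. Note the trade-off: downscaling can miss faces that are small relative to the image, so accuracy is only preserved when faces are reasonably large.

```python
def downscale_factor(width, height, max_side=800):
    """Factor (<= 1.0) that shrinks the longest side to at most max_side."""
    longest = max(width, height)
    return 1.0 if longest <= max_side else max_side / longest


def scale_box(box, factor):
    """Map a (left, top, right, bottom) box found on the downscaled image
    back to pixel coordinates of the original image."""
    return tuple(int(round(c / factor)) for c in box)


# How the pieces would fit together (sketch only, not executed here):
#   factor = downscale_factor(img_width, img_height)
#   small = <resize the image by `factor` with your imaging library>
#   rects = detector(small, 0)   # second argument: number of upsampling passes
#   boxes = [scale_box((r.left(), r.top(), r.right(), r.bottom()), factor)
#            for r in rects]

factor = downscale_factor(4000, 3000)
print(factor)                                   # 0.2
print(scale_box((10, 20, 110, 120), factor))    # (50, 100, 550, 600)
```

    Keeping the upsampling argument at 0 also avoids the detector enlarging the image internally, which is another source of slowdown on large inputs.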
    [D] Moody Actor Critic
    Generally, actor-critic algorithms use one neural net with two linear output heads, one giving the policy and one giving the value. But humans change decisions and how they think based on their mood. I wanted to incorporate this into a standard actor-critic like A2C/A3C. I wanted to add another actor to this architecture that represents a certain mood, where its objective is not to maximize the reward but something else that I have in mind. I don't see any such literature in the field and I don't know how to add more actors. Is it not possible to have multiple actors with one critic? Has this been passed over by the community for lack of potential? submitted by /u/darthsocker [link] [comments]  ( 87 min )
    [P] Upgini 1.0 is released (a Python library for data search through autoML )
    Upgini is a simple feature search & enrichment library in Python. With Upgini, you spend less time on external data search and feature engineering, which will be done for you automatically. Just use your labeled dataset to initiate a search through thousands of features and data sources, including public datasets and scraped data shared by the data science community. Only the relevant features that improve the prediction power of your ML model are returned. Motivation: for most supervised ML models, external data & features boost accuracy significantly better than any hyperparameter tuning. But the lack of automated and time-efficient search tools for external data blocks massive adoption of external features in ML pipelines. We want to radically simplify feature search and delivery for ML pipelines to make external data a standard approach, like hyperparameter tuning for machine learning nowadays. Mission: Democratize access to data sources for the data science community. 📊 Data coverage and statistics Total: 239 countries and up to 41 years of history https://preview.redd.it/oj87fnkw9s891.png?width=1220&format=png&auto=webp&s=4195a607addca12d400bc4b0b62307ac4db87b67 More info about the library To install Upgini from PyPI run pip install -U upgini Full release notes: https://github.com/upgini/upgini Try the online demo at Colab. submitted by /u/AnnualLimp1418 [link] [comments]  ( 87 min )
    [P] Albumentations 1.2 is released (a Python library for image augmentation)
    The new release of a fast and flexible library for image augmentation includes: New augmentations: UnsharpMask sharpens the input image using Unsharp Masking processing and overlays the result with the original image. PixelDropout randomly replaces pixels with the passed value. https://preview.redd.it/ic1nm7mw3s891.png?width=942&format=png&auto=webp&s=c95e319f26a19bad42d33fd84e0ef27703db9095 RingingOvershoot creates ringing or overshoot artifacts by convolving the image with a 2D sinc filter. AdvancedBlur blurs the input image using a Generalized Normal filter with randomly selected parameters. It also adds multiplicative noise to the generated kernel before convolution. https://preview.redd.it/wb1v6vyw3s891.png?width=941&format=png&auto=webp&s=2e57ae1b583b7aab7c30125058a25ee296afd2bd Improvements and bug fixes: Fixed all np.random use cases to prevent identical values when using multiprocessing. Also, we fixed corner cases and made improvements for many augmentations. Release notes: Full release notes are available at https://github.com/albumentations-team/albumentations/releases/tag/1.2.0 Installation: As always, you can install the latest version of the library by running: pip install -U albumentations submitted by /u/alexparinov [link] [comments]  ( 87 min )
    [P] Sharing an Interactive Research Demo on the Cloud
    I am curious to hear what you usually use to develop interactive versions of your research models! And, if you have any, I'd be excited to see some examples for inspiration 😊. On that note, about 2 weeks ago, I shared an article on developing a Super-Resolution GAN Research Demo in Lightning on r/MachineLearning: Bottom-up look at the new Lightning Framework for building anything from production-ready ML systems to research demos Running research demos locally is not super useful by itself (unless you maybe do that live at a poster session -- I still have painful memories of doing that with privacy GAN demo in Flask), so this is a follow-up article on deploying the App on the Cloud: Sharing Deep Learning Research Models with Lightning Part 2: Leveraging the Cloud. Also, curious to hear what you think! submitted by /u/seraschka [link] [comments]  ( 88 min )
    [N][R][CfP] Workshop on Artificial Intelligence for Strategy Games @ AIIDE 22
    Hello Everyone! My name is Derek, and I am a co-chair for the Workshop on AI for Strategy Games at AIIDE this year. I wanted to share some info about the workshop for those who may be interested in discussing the future of AI for strategy games or looking to publish/get feedback on any research you are doing with strategy games. Feel free to message me if you have any questions! Workshop website: https://skatgame.net/mburo/aiide22ws/ Submission deadline: July 29, 2022 Topics This workshop welcomes original research contributions, position papers, competition AI system descriptions, and post-mortem game analyses in the area of AI for strategy games --- including modern video strategy games (such as FPS and RTS games), and turn-based games and puzzles. Topics include, but are not r…  ( 89 min )
    [D] What is considered a "large" model?
    Curious about the usage of the word "large" in the research community and in papers as a descriptor. About 3 years ago, BERT-Large was considered large at 345 million parameters. Today we have an 11B-parameter T5 model and larger. When describing models in papers, is there consensus as to what we consider a "large" model, or a set of categories to describe models based on their size? submitted by /u/certain_entropy [link] [comments]  ( 88 min )
    [D] Are there still any SOTA architectures trainable from scratch for a student?
    When I say "SOTA" I'm talking about recent architectures like ViT, BERT, and GPT-like models. Is it possible to train any of these from scratch (no pre-trained checkpoint) with low resources (Colab, Colab Pro)? submitted by /u/Silver_Doughnut_8175 [link] [comments]  ( 91 min )
    [D] Loss Function, Uncertainty
    Hello members, my question is this: suppose we have a model or architecture with an image classifier at the end of it, trained on MNIST images. We need to train the model such that when an image is passed through the classifier, it outputs its results with some uncertainty in its predictions. We need to use that uncertainty to develop a loss function to train the whole model, as we can't use the true labels of the images. If you have any resources or ideas related to the above that could be helpful, please share them with me. Any suggestions will be appreciated. Thanks submitted by /u/Anonymous_Guy_12 [link] [comments]  ( 86 min )
    [D] Algorithms for Anomaly Detection
    Hi guys, I am dealing with 1000s of devices distributed over the whole world. These devices log and upload events (e.g. various kinds of device faults), including a time stamp, to a database. My task now is to analyze these time series of events, detect anomalies, and then automatically send notifications about these anomalies. Anomalies I want to detect may include things like: - sudden spikes in the number of events - sudden changes in the type of events - long-term drift in the number of events - etc. Any advice on suitable algorithms for this kind of problem and/or relevant literature would be highly appreciated. Thanks! 👍 submitted by /u/RafiRafiRafiRafi [link] [comments]  ( 93 min )
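    For the first item on that list (sudden spikes in the number of events), one simple baseline is a trailing-window z-score. The sketch below is only an illustration under assumptions: the function name, the window length, the threshold, and the daily counts are all hypothetical, and it does not address changes in event type or long-term drift.

```python
from statistics import mean, pstdev

def find_spikes(counts, window=7, threshold=3.0):
    """Return indices whose count deviates from the trailing-window mean
    by more than `threshold` standard deviations (hypothetical helper)."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if counts[i] != mu:
                spikes.append(i)
            continue
        if abs(counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes

# Made-up daily fault counts for one device, with one obvious spike.
daily_faults = [4, 5, 3, 4, 6, 5, 4, 5, 40, 5, 4]
print(find_spikes(daily_faults))  # [8]
```

    A large value stays in the trailing window for the following days and inflates the standard deviation there, so a robust statistic (median and MAD) may be a better choice when spikes cluster.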
    [P] [R] Automated Essay Scoring Systems for other languages
    Hey guys, I'm working on an AES project. I just wanted to know if there exists an AES system that can be trained on languages like Swahili, Arabic, Hindi, etc., i.e. languages with almost no AES studies done. Any guidance would be very helpful, and any other tips/pointers for this task are much appreciated. I would love it if someone could point me in the right direction. submitted by /u/NeoKoseii [link] [comments]  ( 86 min )
    [R] RankSEG: A Consistent Ranking-based Framework for Segmentation
    I am very excited to share our latest research: a new framework, RankSEG, for (image) segmentation. Abs: In this paper, we establish a theoretical foundation of segmentation with respect to the Dice/IoU metrics, including the Bayes rule and Dice/IoU-calibration, analogous to classification-calibration or Fisher consistency in classification. We prove that the existing thresholding-based framework with most operating losses is NOT consistent with respect to the Dice/IoU metrics, and thus may lead to a suboptimal solution. To address this pitfall, we propose a novel consistent ranking-based framework, namely RankDice/RankIoU, inspired by plug-in rules of the Bayes segmentation rule. Three numerical algorithms with GPU parallel execution are developed to implement the proposed framework in lar…  ( 88 min )
    [D] Why are transformers still being used?
    We already have architecture(s) which are supposed to fix one of the biggest issues with transformers, namely that they scale quadratically with input size. The performer scales linearly, which should allow for much bigger context windows, yet looking at recent large language models from major players, all of them seem to be using the old transformer save for some minor improvements. The only exception was Flamingo which had to use a Perceiver because images are huge. So why haven't we ditched the transformer yet? submitted by /u/DickMan64 [link] [comments]  ( 92 min )
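    The quadratic-versus-linear distinction in this post can be sketched in a few lines. The code below is an illustration under assumptions, not the Performer itself: it replaces softmax with a simple positive feature map (elu(x) + 1), whereas the Performer approximates softmax with random features. All function names are hypothetical. The point is only that once attention is written as a kernel, the product can be regrouped so the n x n score matrix is never formed.

```python
import math

def phi(vec):
    # Positive feature map elu(x) + 1; an illustrative stand-in for softmax,
    # not the random-feature map used by the Performer.
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quadratic_attention(Q, K, V):
    """Kernel attention computed by forming all n * n query-key scores."""
    out = []
    for q in Q:
        fq = phi(q)
        weights = [dot(fq, phi(k)) for k in K]   # n scores for each of n queries
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def linear_attention(Q, K, V):
    """Same output via associativity: summarize keys and values once."""
    d, dv = len(K[0]), len(V[0])
    kv = [[0.0] * dv for _ in range(d)]   # sum over keys of phi(k) v^T (d x dv)
    ksum = [0.0] * d                      # sum over keys of phi(k)
    for k, v in zip(K, V):
        fk = phi(k)
        for i in range(d):
            ksum[i] += fk[i]
            for j in range(dv):
                kv[i][j] += fk[i] * v[j]
    out = []
    for q in Q:
        fq = phi(q)
        denom = dot(fq, ksum)
        out.append([sum(fq[i] * kv[i][j] for i in range(d)) / denom
                    for j in range(dv)])
    return out

Q = [[0.1, -0.2], [0.5, 0.3], [-1.0, 0.4]]
K = [[0.2, 0.1], [-0.3, 0.7], [0.9, -0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
a = quadratic_attention(Q, K, V)
b = linear_attention(Q, K, V)
print(all(abs(x - y) < 1e-9 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))
```

    The first function does work proportional to n squared in the sequence length n; the second does work proportional to n times d times dv. The trade-off is that the kernel is no longer exactly softmax, which is one plausible reason such variants behave differently in practice.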
    [N] Introducing Anomalib: A library for benchmarking, developing and deploying deep learning anomaly detection algorithms by Intel
    Anomalib is a machine learning library developed by AI researchers from Intel which implements state-of-the-art algorithms for anomaly detection. Anomaly detection is a popular use case in the industrial sector, and such algorithms can help provide real-time feedback to manufacturers on how well their production lines are performing. Anomaly detection is a challenging problem, often due to a biased dataset. Anomalous images can be scarce; therefore these algorithms are trained on good images in an unsupervised fashion. By learning the normality, upon inference, the models can detect whether images are anomalous or not. Anomalib was built using a PyTorch Lightning backbone and offers an easy way to deploy the models with OpenVINO for inference speedup. Link to the GitHub repo: https://github.com/openvinotoolkit/anomalib Link to a tutorial on how to train your custom dataset with anomalib: https://github.com/openvinotoolkit/anomalib/tree/development/docs/blog/001-train-custom-dataset Please feel free to check out the repo and give us your feedback submitted by /u/alder-ice [link] [comments]  ( 87 min )
    [D] On advisors and PhD students
    I think the answer to this question depends heavily on the area at hand. That is why I am asking here, even though this question has been asked elsewhere a gazillion times. How much does your advisor help/contribute? How often do you meet? I am especially interested in people who have published papers. Who proposed the problem, and who then found a solution? How much of that solution was joint work vs. either of you submitting ideas to the other and being approved or rejected? How satisfied/dissatisfied do you feel with respect to your advisor? Have you had multiple advisors? If so, how do they compare? Let me start by sharing my experience. I always take the initiative when organizing a meeting with my advisor; if I didn't say anything, we probably wouldn't meet. I send him biweekly emails with my progress. Usually this entails a write-up explaining my ideas and their development. I think he skims through it, but he definitely does not read it carefully/go through the details. When we have a meeting I generally have to explain the content of the write-up. In terms of the content itself, he tells me whether the ideas/problem seem sound or not, but does not propose improvements. Sometimes, he proposes other ideas that would imply a significant shift of my current work, which honestly I tend to reject because I have already invested a great deal of time in my ideas and I am more emotionally attached to them (I know this latter point isn't good practice). Overall, I don't know how to feel because I don't really know what's generally expected. If I had to choose, however, I'd say I feel mildly satisfied. What's your experience? submitted by /u/carlml [link] [comments]  ( 100 min )
    [Discussion] Regarding Long Term Memory in NLP Models
    Does anyone know if there exists an NLP model, like LaMDA, that takes every conversation and attempts to update its weights in order to incorporate it into its training? My thought process is that instead of using attention and a subsection of the conversation to generate a response, it takes everything. Basically, everything gets backpropagated and adjusts the weights. This way the model might begin to "remember" its previous conversations. This may be a stretch and perhaps I am missing something fundamental, but it seems like an interesting experiment. I'd love to continue this conversation and elaborate more in the comments. submitted by /u/gabe415160 [link] [comments]  ( 84 min )
    [D] emerging fields of ML that will help mankind
    Hey all, this is a question I've been asking myself lately: what are the fields of ML which show the most promise in helping mankind in non-frivolous ways (e.g. not animojis)? A few years ago I remember an article describing how one of Microsoft's object detection services helped give 'sight' to the blind by describing what's in the room around them through its computer vision. Another one that I found inspiring was about assisting those with locked-in syndrome by mapping their brain waves and/or eye movement to certain images, words or letters (I forget which). submitted by /u/lituga [link] [comments]  ( 84 min )
  • Open

    DARK DERELICT CITY | RAW UNSCALED | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 84 min )
    This is what happens when you allow a chatbot to be trained by the public.
    submitted by /u/LeglessLoach [link] [comments]  ( 83 min )
    AIs that run on language models aren't so much intelligences as world-generators. Their inner workings are mysterious; humans will need to intuit their outputs and become mystics -- called "prompt engineers" -- as a result
    submitted by /u/cold-depths [link] [comments]  ( 85 min )
    Made with DALL-E 2 AI
    submitted by /u/OneFinding1429 [link] [comments]  ( 84 min )
    pixelz.ai updates 👇🏽
    submitted by /u/pixelz_ai [link] [comments]  ( 84 min )
    How well would it work if artificial intelligence were used to summarize and rewrite a non-fiction book and have the AI automatically remove all of the author's personal experiences?
    Then you would have a whole new book, which is better to read. Wouldn't that also pass the copyright from the author to the developer of the AI? Second question: Is there an AI that really has a realistic voice, which you can use to make audiobooks? submitted by /u/xXNOdrugsForMEXx [link] [comments]  ( 85 min )
    Looking for admins for a Discord-based non-profit project. No time commitment, just want to have some AI experts to help answer community member questions! DM for more info :)
    submitted by /u/Accomplished_Head5 [link] [comments]  ( 85 min )
    Steph Curry's former coach says AI can help train the next NBA champions
    submitted by /u/estasfuera [link] [comments]  ( 85 min )
    What happens if we put a ‘sentient’ AI inside of a lab-grown brain?
    submitted by /u/estasfuera [link] [comments]  ( 85 min )
    Open-source language AI challenges big tech’s models
    submitted by /u/bperki8 [link] [comments]  ( 85 min )
    Google's latest image AI beats Imagen (Google's 4-week-old image AI), which itself beats DALL-E 2.
    submitted by /u/Hallowmew [link] [comments]  ( 86 min )
    AI redraws TikTok logo (https://www.craiyon.com/)
    submitted by /u/Various_Yoghurt1859 [link] [comments]  ( 84 min )
    Computing machinery and intelligence
    I was just reading the research paper "Computing Machinery and Intelligence", a paper that Alan Turing published in 1950, and I did not quite understand a section of the paper where he wrote about the parameters of the machine to be considered in the "imitation game". It would be of great help if anyone could explain these parameters, especially the third parameter. https://preview.redd.it/0bcnothvzr891.png?width=630&format=png&auto=webp&s=ddee792f5c451568bc73c368c0c14c8bd09597e4 submitted by /u/Huckleberry-4915 [link] [comments]  ( 87 min )
    Altis AI Personal Trainer gamifies movement instruction
    submitted by /u/NinaMJ [link] [comments]  ( 84 min )
    AI will not destroy humanity but rather save it!
    submitted by /u/MufBoiLegend420 [link] [comments]  ( 85 min )
    World’s Top 50 Innovators 2022
    submitted by /u/chelsea_bear [link] [comments]  ( 85 min )
    Codeformer - Face Image Restoration model
    submitted by /u/imapurplemango [link] [comments]  ( 85 min )
    "Einstein" - Created on Pixelz.ai
    https://preview.redd.it/ezbng11f6q891.jpg?width=1170&format=pjpg&auto=webp&s=cecad6b5326e492cfe897d78290aec46c197d198 submitted by /u/pixelz_ai [link] [comments]  ( 84 min )
    What voice-changing apps are available right now?
    I know there's one or two options for real-time voice changing that don't sound so convincing (that is, they sound robotic). I was wondering if there's anything that might sound better but doesn't operate in real time? I plan on voicing male and female characters in a video and I have plenty of time to edit the voice clips together but I need them to sound convincing. Free stuff is preferred but I'd consider paying money if that's the only way to get good results. submitted by /u/outsm0ked [link] [comments]  ( 86 min )
  • Open

    Workshop on Artificial Intelligence for Strategy Games @ AIIDE 22
    Hello Everyone! My name is Derek, and I am a co-chair for the Workshop on AI for Strategy Games at AIIDE this year. I wanted to share some info about the workshop for those who may be interested in discussing the future of AI for strategy games or looking to publish/get feedback on any research you are doing with strategy games. Feel free to message me if you have any questions! Workshop website: https://skatgame.net/mburo/aiide22ws/ Submission deadline: July 29, 2022 Topics This workshop welcomes original research contributions, position papers, competition AI system descriptions, and post-mortem game analyses in the area of AI for strategy games --- including modern video strategy games (such as FPS and RTS games), and turn-based games and puzzles. Topics include, but are not r…  ( 87 min )
    Are exploration and credit assignment independent? What's your opinion?
    submitted by /u/Conscious_Heron_9133 [link] [comments]  ( 85 min )
    Optimal State-Value Function vs Optimal Action-Value Function
    In Sutton's book, page 63, there is this proof/statement: https://preview.redd.it/azjnvai76r891.png?width=711&format=png&auto=webp&s=ecc4f276a986e07247a769ec4d9df4a9d68da194 Can anyone explain, or point to a reference that explains: Why is it that v*(s) = max_a q*(s,a)? How can I get Eq (3.18)? Thank you! submitted by /u/rlopes404 [link] [comments]  ( 86 min )
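    A sketch of the standard argument behind that identity, written in the usual MDP notation (the dynamics p(s', r | s, a) and discount factor γ are assumed here, and no attempt is made to match the book's equation numbering). The first line holds for any policy because a weighted average cannot exceed the maximum. For an optimal policy the inequality must be an equality, since otherwise acting greedily in state s and following the optimal policy afterwards would yield a strictly better policy.

```latex
% Any policy: the state value is a policy-weighted average of action values.
\begin{align}
v_\pi(s) &= \sum_a \pi(a \mid s)\, q_\pi(s,a) \;\le\; \max_a q_\pi(s,a).
\end{align}
% Optimal policy: equality holds, and expanding q_* one step gives the
% Bellman optimality equation for v_*.
\begin{align}
v_*(s) &= \max_a q_*(s,a) \\
       &= \max_a \mathbb{E}\big[ R_{t+1} + \gamma\, v_*(S_{t+1}) \,\big|\, S_t = s,\ A_t = a \big] \\
       &= \max_a \sum_{s',\, r} p(s', r \mid s, a)\, \big[ r + \gamma\, v_*(s') \big].
\end{align}
```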
    Ideal size of the visual observation
    Hi, I am using the MPE (https://github.com/openai/multiagent-particle-envs) and I'm planning to use a visual observation. I was wondering, what size should it be? I assume that if it is too large and the agents are only a few, I am wasting lots of compute for nothing and also the noise becomes a lot. But how to find the best size? 60x60x3 for example? submitted by /u/No_Possibility_7588 [link] [comments]  ( 85 min )
  • Open

    Identifying Disfluencies in Natural Speech
    Posted by Dan Walker and Dan Liebling, Software Engineers, Google Research People don’t write in the same way that they speak. Written language is controlled and deliberate, whereas transcripts of spontaneous speech (like interviews) are hard to read because speech is disorganized and less fluent. One aspect that makes speech transcripts particularly difficult to read is disfluency, which includes self-corrections, repetitions, and filled pauses (e.g., words like “umm”, and “you know”). Following is an example of a spoken sentence with disfluencies from the LDC CALLHOME corpus: But that's it's not, it's not, it's, uh, it's a word play on what you just said. It takes some time to understand this sentence — the listener must filter out the extraneous words and resolve all of the nots. Remo…  ( 27 min )
    Minerva: Solving Quantitative Reasoning Problems with Language Models
    Posted by Ethan Dyer and Guy Gur-Ari, Research Scientists, Google Research, Blueshift Team Language models have demonstrated remarkable performance on a variety of natural language tasks — indeed, a general lesson from many works, including BERT, GPT-3, Gopher, and PaLM, has been that neural networks trained on diverse data at large scale in an unsupervised way can perform well on a variety of tasks. Quantitative reasoning is one area in which language models still fall far short of human-level performance. Solving mathematical and scientific questions requires a combination of skills, including correctly parsing a question with natural language and mathematical notation, recalling relevant formulas and constants, and generating step-by-step solutions involving numerical calculations and…  ( 25 min )
  • Open

    The Riemann Hypothesis in One Picture
    I wrote this article for machine learning and analytic professionals in general. Actually, I describe a new visual, simple, intuitive method for supervised classification. It involves synthetic data and explainable AI. But at the same time, I describe in layman’s terms the Riemann Hypothesis (RH). Also, I offer a new perspective on the subject for… The post The Riemann Hypothesis in One Picture appeared first on Data Science Central.  ( 21 min )
  • Open

    Secure Amazon SageMaker Studio presigned URLs Part 1: Foundational infrastructure
    You can access Amazon SageMaker Studio notebooks from the Amazon SageMaker console via AWS Identity and Access Management (IAM) authenticated federation from your identity provider (IdP), such as Okta. When a Studio user opens the notebook link, Studio validates the federated user’s IAM policy to authorize access, and generates and resolves the presigned URL for […]  ( 6 min )
    Secure Amazon SageMaker Studio presigned URLs Part 2: Private API with JWT authentication
    In part 1 of this series, we demonstrated how to resolve an Amazon SageMaker Studio presigned URL from a corporate network using Amazon private VPC endpoints without traversing the internet. In this post, we will continue to build on top of the previous solution to demonstrate how to build a private API Gateway via Amazon API […]  ( 7 min )
  • Open

    Three Wheeling: Startup Faction Develops Affordable Tri-Wheel AVs on NVIDIA DRIVE
    Some things are easy as A, B, C. But when it comes to autonomous vehicles, the key may be in one, two, three. Faction, a Bay Area-based startup and NVIDIA Inception member, is preparing to debut its business-to-business autonomous delivery service, accelerating its commercial deployment with three-wheel production electric vehicles purpose-built for driverless services. In … The post Three Wheeling: Startup Faction Develops Affordable Tri-Wheel AVs on NVIDIA DRIVE appeared first on NVIDIA Blog.  ( 5 min )
    The Gaming Evolution Will Be Televised: GFN Thursday Levels Up the Living Room Experience on New Samsung TVs and More
    Turn the TV on. GeForce NOW is leveling up gaming in the living room. The Samsung Gaming Hub launched today, delivering GeForce NOW natively on 2022 Samsung Smart TVs. Plus, the SHIELD Software Experience Upgrade 9.1 is now rolling out to all NVIDIA SHIELD TVs, delivering new gaming features that improve GeForce NOW. Great living … The post The Gaming Evolution Will Be Televised: GFN Thursday Levels Up the Living Room Experience on New Samsung TVs and More appeared first on NVIDIA Blog.  ( 8 min )
  • Open

    Speculation on new SI prefixes
    The SI prefixes giga and tera were adopted in 1960. The prefixes exa and peta were adopted in 1975, and zetta and yotta were adopted in 1991. Following this 15-year cadence, we should have adopted a few more prefixes by now. If we ever do introduce new prefixes, what might they be? The latest prefixes […] Speculation on new SI prefixes first appeared on John D. Cook.  ( 5 min )
  • Open

    byteLAKE’s CFD Suite (AI-accelerated CFD) — recommended hardware for AI training at the Edge (1/3)
    Blog post miniseries summarizing byteLAKE’s recommendation about hardware platforms to perform CFD Suite’s AI Training at the Edge.  ( 15 min )
  • Open

    FIGS: Attaining XGBoost-level performance with the interpretability and speed of CART
    FIGS (Fast Interpretable Greedy-tree Sums): A method for building interpretable models by simultaneously growing an ensemble of decision trees in competition with one another. Recent machine-learning advances have led to increasingly complex predictive models, often at the cost of interpretability. We often need interpretability, particularly in high-stakes applications such as in clinical decision-making; interpretable models help with all kinds of things, such as identifying errors, leveraging domain knowledge, and making speedy predictions. In this blog post we’ll cover FIGS, a new method for fitting an interpretable model that takes the form of a sum of trees. Real-world experiments and theoretical results show that FIGS can effectively adapt to a wide range of structure in data, ach…  ( 3 min )
  • Open

    How do I train a neural network on a small dataset?
    I have a dataset with 7 input features and 1 output. The length of the dataset is just 260, which is small. How can I train a neural network with the help of Keras and achieve accuracy of over 80%? What should be the architecture of the deep neural network? submitted by /u/mono1110 [link] [comments]  ( 85 min )
  • Open

    Building explainability into the components of machine-learning models
    Researchers develop tools to help data scientists make the features used in machine-learning models more understandable for end users.  ( 7 min )
  • Open

    Generalized Permutants and Graph GENEOs. (arXiv:2206.14798v1 [math.CO])
    In this paper we establish a bridge between Topological Data Analysis and Geometric Deep Learning, adapting the topological theory of group equivariant non-expansive operators (GENEOs) to act on the space of all graphs weighted on vertices or edges. This is done by showing how the general concept of GENEO can be used to transform graphs and to give information about their structure. This requires the introduction of the new concepts of generalized permutant and generalized permutant measure and the mathematical proof that these concepts allow us to build GENEOs between graphs. An experimental section concludes the paper, illustrating the possible use of our operators to extract information from graphs. This paper is part of a line of research devoted to developing a compositional and geometric theory of GENEOs for Geometric Deep Learning.  ( 2 min )
    DrumGAN VST: A Plugin for Drum Sound Analysis/Synthesis With Autoencoding Generative Adversarial Networks. (arXiv:2206.14723v1 [cs.SD])
    In contemporary popular music production, drum sound design is commonly performed by cumbersome browsing and processing of pre-recorded samples in sound libraries. One can also use specialized synthesis hardware, typically controlled through low-level, musically meaningless parameters. Today, the field of Deep Learning offers methods to control the synthesis process via learned high-level features and allows generating a wide variety of sounds. In this paper, we present DrumGAN VST, a plugin for synthesizing drum sounds using a Generative Adversarial Network. DrumGAN VST operates on 44.1 kHz sample-rate audio, offers independent and continuous instrument class controls, and features an encoding neural network that maps sounds into the GAN's latent space, enabling resynthesis and manipulation of pre-existing drum sounds. We provide numerous sound examples and a demo of the proposed VST plugin.  ( 2 min )
    Private Graph Extraction via Feature Explanations. (arXiv:2206.14724v1 [cs.LG])
    Privacy and interpretability are two of the important ingredients for achieving trustworthy machine learning. We study the interplay of these two aspects in graph machine learning through graph reconstruction attacks. The goal of the adversary here is to reconstruct the graph structure of the training data given access to model explanations. Based on the different kinds of auxiliary information available to the adversary, we propose several graph reconstruction attacks. We show that additional knowledge of post-hoc feature explanations substantially increases the success rate of these attacks. Further, we investigate in detail the differences between attack performance with respect to three different classes of explanation methods for graph neural networks: gradient-based, perturbation-based, and surrogate model-based methods. While gradient-based explanations reveal the most in terms of the graph structure, we find that these explanations do not always score high in utility. For the other two classes of explanations, privacy leakage increases with an increase in explanation utility. Finally, we propose a defense based on a randomized response mechanism for releasing the explanations which substantially reduces the attack success rate. Our anonymized code is available.  ( 2 min )
    On Monocular Depth Estimation and Uncertainty Quantification using Classification Approaches for Regression. (arXiv:2202.12369v2 [cs.CV] UPDATED)
    Monocular depth is important in many tasks, such as 3D reconstruction and autonomous driving. Deep learning based models achieve state-of-the-art performance in this field. A set of novel approaches for estimating monocular depth consists of transforming the regression task into a classification one. However, there is a lack of detailed descriptions and comparisons for Classification Approaches for Regression (CAR) in the community and no in-depth exploration of their potential for uncertainty estimation. To this end, this paper introduces a taxonomy and summary of CAR approaches, a new uncertainty estimation solution for CAR, and a set of experiments on depth accuracy and uncertainty quantification for CAR-based models on the KITTI dataset. The experiments reflect the differences in the portability of various CAR methods on two backbones. Meanwhile, the newly proposed method for uncertainty estimation can outperform the ensembling method with only one forward propagation.
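    The general idea of recasting regression as classification can be sketched as follows. This is a generic illustration under assumptions, not the paper's taxonomy or its proposed uncertainty method: the depth range is split into uniform bins (an assumed choice), a classifier produces one logit per bin, the depth estimate is the expected bin center, and the spread of the predicted distribution serves as an uncertainty score. All function names and numbers are hypothetical.

```python
import math

def bin_centers(d_min, d_max, num_bins):
    """Centers of uniform depth bins (uniform spacing is an assumption)."""
    width = (d_max - d_min) / num_bins
    return [d_min + (i + 0.5) * width for i in range(num_bins)]

def softmax(logits):
    m = max(logits)                        # subtract the max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def depth_from_logits(logits, centers):
    """Map per-bin logits to (expected depth, variance, entropy)."""
    probs = softmax(logits)
    depth = sum(p * c for p, c in zip(probs, centers))
    variance = sum(p * (c - depth) ** 2 for p, c in zip(probs, centers))
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return depth, variance, entropy

centers = bin_centers(0.0, 80.0, 8)        # 8 bins over 0 to 80 m
confident = depth_from_logits([0, 0, 10, 0, 0, 0, 0, 0], centers)
unsure = depth_from_logits([0] * 8, centers)
print(confident)   # depth near 25 m, small variance and entropy
print(unsure)      # depth 40 m, large variance, entropy log(8)
```

    Both scores come from a single forward pass, which is the property the abstract contrasts with ensembling; which score is most useful for depth is an empirical question this sketch does not answer.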
    Ultra-sensitive Flexible Sponge-Sensor Array for Muscle Activities Detection and Human Limb Motion Recognition. (arXiv:2205.03238v2 [eess.SP] UPDATED)
    Human limb motion tracking and recognition play an important role in medical rehabilitation training, lower limb assistance, prosthetics design for amputees, feedback control for assistive robots, etc. Lightweight wearable sensors, including inertial sensors, surface electromyography sensors, and flexible strain/pressure sensors, are promising candidates for next-generation human motion capture devices. Herein, we present a wireless wearable device consisting of a sixteen-channel flexible sponge-based pressure sensor array to recognize various human lower limb motions by detecting contours on the human skin caused by calf gastrocnemius muscle actions. Each sensing element is a round porous structure of thin carbon nanotube/polydimethylsiloxane nanocomposites with a diameter of 4 mm and thickness of about 400 μm. Ten human subjects were recruited to perform ten different lower limb motions while wearing the developed device. The motion classification result with the support vector machine method shows a macro-recall of about 97.3% for all ten motions tested. This work demonstrates a portable wearable muscle activity detection device with a lower limb motion recognition application, which can potentially be used in assistive robot control, healthcare, sports monitoring, etc.
    Variational Bayesian inference for CP tensor completion with side information. (arXiv:2206.12486v2 [cs.LG] UPDATED)
    We propose a message passing algorithm, based on variational Bayesian inference, for low-rank tensor completion with automatic rank determination in the canonical polyadic format when additional side information (SI) is given. The SI comes in the form of low-dimensional subspaces that contain the fiber spans of the tensor (columns, rows, tubes, etc.). We validate the regularization properties induced by SI with extensive numerical experiments on synthetic and real-world data and present the results on tensor recovery and rank determination. The results show that the number of samples required for successful completion is significantly reduced in the presence of SI. We also discuss the origin of a bump in the phase transition curves that exists when the dimensionality of SI is comparable with that of the tensor.
    Fast algorithm for overcomplete order-3 tensor decomposition. (arXiv:2202.06442v2 [cs.LG] UPDATED)
    We develop the first fast spectral algorithm to decompose a random third-order tensor over $\mathbb{R}^d$ of rank up to $O(d^{3/2}/\text{polylog}(d))$. Our algorithm only involves simple linear algebra operations and can recover all components in time $O(d^{6.05})$ under the current matrix multiplication time. Prior to this work, comparable guarantees could only be achieved via sum-of-squares [Ma, Shi, Steurer 2016]. In contrast, fast algorithms [Hopkins, Schramm, Shi, Steurer 2016] could only decompose tensors of rank at most $O(d^{4/3}/\text{polylog}(d))$. Our algorithmic result rests on two key ingredients. A clean lifting of the third-order tensor to a sixth-order tensor, which can be expressed in the language of tensor networks. A careful decomposition of the tensor network into a sequence of rectangular matrix multiplications, which allows us to have a fast implementation of the algorithm.
    Simulate Time-integrated Coarse-grained Molecular Dynamics with Geometric Machine Learning. (arXiv:2204.10348v2 [cs.LG] UPDATED)
    Molecular dynamics (MD) simulation is the workhorse of various scientific domains but is limited by high computational cost. Learning-based force fields have made major progress in accelerating ab-initio MD simulation but are still not fast enough for many real-world applications that require long-time MD simulation. In this paper, we adopt a different machine learning approach where we coarse-grain a physical system using graph clustering, and model the system evolution with a very large time-integration step using graph neural networks. A novel score-based GNN refinement module resolves the long-standing challenge of long-time simulation instability. Despite being trained only on short MD trajectory data, our learned simulator can generalize to unseen novel systems and simulate for much longer than the training trajectories. Properties requiring 10-100 ns level long-time dynamics can be accurately recovered at several-orders-of-magnitude higher speed than classical force fields. We demonstrate the effectiveness of our method on two realistic complex systems: (1) single-chain coarse-grained polymers in implicit solvent; (2) multi-component Li-ion polymer electrolyte systems.
    A Learnable Variational Model for Joint Multimodal MRI Reconstruction and Synthesis. (arXiv:2204.03804v2 [eess.IV] UPDATED)
    Generating multi-contrast/multi-modal MRI of the same anatomy enriches diagnostic information but is limited in practice due to excessive data acquisition time. In this paper, we propose a novel deep-learning model for joint reconstruction and synthesis of multi-modal MRI using incomplete k-space data of several source modalities as inputs. The output of our model includes reconstructed images of the source modalities and a high-quality image synthesized in the target modality. Our proposed model is formulated as a variational problem that leverages several learnable modality-specific feature extractors and a multimodal synthesis module. We propose a learnable optimization algorithm to solve this model, which induces a multi-phase network whose parameters can be trained using multi-modal MRI data. Moreover, a bilevel-optimization framework is employed for robust parameter training. We demonstrate the effectiveness of our approach using extensive numerical experiments.
    Uniform Convergence Rates for Lipschitz Learning on Graphs. (arXiv:2111.12370v2 [math.NA] UPDATED)
    Lipschitz learning is a graph-based semi-supervised learning method where one extends labels from a labeled to an unlabeled data set by solving the infinity Laplace equation on a weighted graph. In this work we prove uniform convergence rates for solutions of the graph infinity Laplace equation as the number of vertices grows to infinity. Their continuum limits are absolutely minimizing Lipschitz extensions with respect to the geodesic metric of the domain where the graph vertices are sampled from. We work under very general assumptions on the graph weights, the set of labeled vertices, and the continuum domain. Our main contribution is that we obtain quantitative convergence rates even for very sparsely connected graphs, as they typically appear in applications like semi-supervised learning. In particular, our framework allows for graph bandwidths down to the connectivity radius. For proving this we first show a quantitative convergence statement for graph distance functions to geodesic distance functions in the continuum. Using the "comparison with distance functions" principle, we can pass these convergence statements to infinity harmonic functions and absolutely minimizing Lipschitz extensions.
    Overcoming Oscillations in Quantization-Aware Training. (arXiv:2203.11086v2 [cs.LG] UPDATED)
    When training neural networks with simulated quantization, we observe that quantized weights can, rather unexpectedly, oscillate between two grid-points. The importance of this effect and its impact on quantization-aware training (QAT) are not well-understood or investigated in the literature. In this paper, we delve deeper into the phenomenon of weight oscillations and show that it can lead to a significant accuracy degradation due to wrongly estimated batch-normalization statistics during inference and increased noise during training. These effects are particularly pronounced in low-bit ($\leq$ 4-bits) quantization of efficient networks with depth-wise separable layers, such as MobileNets and EfficientNets. In our analysis we investigate several previously proposed QAT algorithms and show that most of these are unable to overcome oscillations. Finally, we propose two novel QAT algorithms to overcome oscillations during training: oscillation dampening and iterative weight freezing. We demonstrate that our algorithms achieve state-of-the-art accuracy for low-bit (3 & 4 bits) weight and activation quantization of efficient architectures, such as MobileNetV2, MobileNetV3, and EfficientNet-lite on ImageNet. Our source code is available at https://github.com/qualcomm-ai-research/oscillations-qat.
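    The oscillation effect can be reproduced in a one-weight toy problem. The sketch below is not the authors' code: it assumes a quadratic loss on the quantized weight, round-to-nearest quantization, and a straight-through estimator (the rounding is treated as the identity in the backward pass). All names and constants (`simulate_ste`, `target`, `lr`) are hypothetical choices for illustration. Because the optimum 0.3 lies between the grid points 0 and 1, the latent weight is pushed back and forth across the rounding boundary at 0.5.

```python
import math


def simulate_ste(w=0.45, target=0.3, step=1.0, lr=0.1, n_steps=50):
    """Toy SGD on one latent weight with round-to-nearest quantization.

    Loss is 0.5 * (q - target)**2, where q is the quantized weight.
    Returns the sequence of quantized values seen during training.
    """
    history = []
    for _ in range(n_steps):
        # Round the latent weight to the nearest grid point (ties round up).
        q = math.floor(w / step + 0.5) * step
        history.append(q)
        # Straight-through estimator: treat dq/dw as 1 in the backward pass.
        grad = q - target
        w -= lr * grad
    return history


def count_flips(history):
    """Number of times the quantized weight changes between consecutive steps."""
    return sum(1 for a, b in zip(history, history[1:]) if a != b)


history = simulate_ste()
print(sorted(set(history)), count_flips(history))
```

    In this toy setting the quantized weight never settles: it keeps switching between the two neighbouring grid points, which is the behaviour the abstract describes.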
    Order Constraints in Optimal Transport. (arXiv:2110.07275v2 [cs.LG] UPDATED)
    Optimal transport is a framework for comparing measures whereby a cost is incurred for transporting one measure to another. Recent works have aimed to improve optimal transport plans through the introduction of various forms of structure. We introduce novel order constraints into the optimal transport formulation to allow for the incorporation of structure. We define an efficient method for obtaining explainable solutions to the new formulation that scales far better than standard approaches. The theoretical properties of the method are provided. We demonstrate experimentally that order constraints improve explainability using the e-SNLI (Stanford Natural Language Inference) dataset that includes human-annotated rationales as well as on several image color transfer examples.
    Measuring Fairness under Unawareness of Sensitive Attributes: A Quantification-Based Approach. (arXiv:2109.08549v3 [cs.CY] UPDATED)
    Algorithms and models are increasingly deployed to inform decisions about people, inevitably affecting their lives. As a consequence, those in charge of developing these models must carefully evaluate their impact on different groups of people and favour group fairness, that is, ensure that groups determined by sensitive demographic attributes, such as race or sex, are not treated unjustly. To achieve this goal, the availability (awareness) of these demographic attributes to those evaluating the impact of these models is fundamental. Unfortunately, collecting and storing these attributes is often in conflict with industry practices and legislation on data minimisation and privacy. For this reason, it can be hard to measure the group fairness of trained models, even from within the companies developing them. In this work, we tackle the problem of measuring group fairness under unawareness of sensitive attributes, by using techniques from quantification, a supervised learning task concerned with directly providing group-level prevalence estimates (rather than individual-level class labels). We show that quantification approaches are particularly suited to tackle the fairness-under-unawareness problem, as they are robust to inevitable distribution shifts while at the same time decoupling the (desirable) objective of measuring group fairness from the (undesirable) side effect of allowing the inference of sensitive attributes of individuals. More in detail, we show that fairness under unawareness can be cast as a quantification problem and solved with proven methods from the quantification literature. We show that these methods outperform previous approaches to measure demographic parity in five experimental protocols, corresponding to important challenges that complicate the estimation of classifier fairness under unawareness.
    Bayesian Structure Learning with Generative Flow Networks. (arXiv:2202.13903v2 [cs.LG] UPDATED)
    In Bayesian structure learning, we are interested in inferring a distribution over the directed acyclic graph (DAG) structure of Bayesian networks, from data. Defining such a distribution is very challenging, due to the combinatorially large sample space, and approximations based on MCMC are often required. Recently, a novel class of probabilistic models, called Generative Flow Networks (GFlowNets), has been introduced as a general framework for generative modeling of discrete and composite objects, such as graphs. In this work, we propose to use a GFlowNet as an alternative to MCMC for approximating the posterior distribution over the structure of Bayesian networks, given a dataset of observations. Generating a sample DAG from this approximate distribution is viewed as a sequential decision problem, where the graph is constructed one edge at a time, based on learned transition probabilities. Through evaluation on both simulated and real data, we show that our approach, called DAG-GFlowNet, provides an accurate approximation of the posterior over DAGs, and it compares favorably against other methods based on MCMC or variational inference.
    Hidden Parameter Recurrent State Space Models For Changing Dynamics Scenarios. (arXiv:2206.14697v1 [cs.LG])
    Recurrent State-space models (RSSMs) are highly expressive models for learning patterns in time series data and system identification. However, these models assume that the dynamics are fixed and unchanging, which is rarely the case in real-world scenarios. Many control applications exhibit tasks with similar but not identical dynamics, which can be modeled as a latent variable. We introduce the Hidden Parameter Recurrent State Space Models (HiP-RSSMs), a framework that parametrizes a family of related dynamical systems with a low-dimensional set of latent factors. We present a simple and effective way of learning and performing inference over this Gaussian graphical model that avoids approximations like variational inference. We show that HiP-RSSMs outperform RSSMs and competing multi-task models on several challenging robotic benchmarks both on real-world systems and simulations.
    Physics-informed Guided Disentanglement in Generative Networks. (arXiv:2107.14229v3 [cs.CV] UPDATED)
    Image-to-image translation (i2i) networks suffer from entanglement effects in the presence of physics-related phenomena in the target domain (such as occlusions, fog, etc.), which lowers the translation quality, controllability, and variability. In this paper, we build upon a collection of simple physics models and present a comprehensive method for disentangling visual traits in target images, guiding the process with a physical model that renders some of the target traits, and learning the remaining ones. Because they allow explicit and interpretable outputs, our physical models (optimally regressed on target) allow generating unseen scenarios in a controllable manner. We also extend our framework, showing versatility to neural-guided disentanglement. The results show our disentanglement strategies dramatically increase performance qualitatively and quantitatively in several challenging scenarios for image translation.
    Acoustics-specific Piano Velocity Estimation. (arXiv:2203.16294v2 [cs.SD] UPDATED)
    Motivated by state-of-the-art psychological research, we note that a piano performance transcribed with existing Automatic Music Transcription (AMT) methods cannot be successfully resynthesized without affecting the artistic content of the performance. This is due to 1) the different mappings between MIDI parameters used by different instruments, and 2) the fact that musicians adapt their way of playing to the surrounding acoustic environment. To address this issue, we propose a methodology to build acoustics-specific AMT systems that are able to model the adaptations that musicians apply to convey their interpretation. Specifically, we train models tailored for virtual instruments in a modular architecture that takes as input an audio recording and the relative aligned music score, and outputs the acoustics-specific velocities of each note. We test different model shapes and show that the proposed methodology generally outperforms the usual AMT pipeline, which does not consider specificities of the instrument and of the acoustic environment. Interestingly, such a methodology is extensible in a straightforward way, since only slight effort is required to train models for the inference of other piano parameters, such as pedaling.
    On the R\'{e}nyi Cross-Entropy. (arXiv:2206.14329v1 [cs.IT])
    The R\'{e}nyi cross-entropy measure between two distributions, a generalization of the Shannon cross-entropy, was recently used as a loss function for the improved design of deep learning generative adversarial networks. In this work, we examine the properties of this measure and derive closed-form expressions for it when one of the distributions is fixed and when both distributions belong to the exponential family. We also analytically determine a formula for the cross-entropy rate for stationary Gaussian processes and for finite-alphabet Markov sources.  ( 2 min )
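    The abstract does not state the definition, so the sketch below assumes the form commonly used in the literature for discrete distributions, $H_\alpha(p;q) = \frac{1}{1-\alpha}\log\sum_x p(x)\,q(x)^{\alpha-1}$ for $\alpha > 0$, $\alpha \neq 1$. Under that assumption it checks numerically that the measure approaches the Shannon cross-entropy as $\alpha \to 1$, and that it reduces to the Rényi entropy when $p = q$. Function names are illustrative, and `q` is assumed strictly positive wherever `p` is.

```python
import math


def renyi_cross_entropy(p, q, alpha):
    """Assumed definition: log(sum_x p(x) * q(x)**(alpha - 1)) / (1 - alpha), in nats."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    s = sum(pi * qi ** (alpha - 1) for pi, qi in zip(p, q) if pi > 0)
    return math.log(s) / (1 - alpha)


def shannon_cross_entropy(p, q):
    """Shannon cross-entropy -sum_x p(x) * log q(x), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]
q = [0.25, 0.75]
# For alpha close to 1 the two quantities should nearly coincide.
print(renyi_cross_entropy(p, q, 1.0001), shannon_cross_entropy(p, q))
```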
    Inferring Cyber Threat Intelligence -- A Knowledge Graph-based Approach. (arXiv:2102.05571v4 [cs.CR] UPDATED)
    Security analysts prepare threat analysis upon investigating an attack, an emerging cyber threat, or a recently discovered vulnerability. Threat intelligence on malware attacks and campaigns is shared on blog posts, reports, analyses, and tweets with varying technical details. Other security analysts use this intelligence to inform them of emerging threats, indicators of compromise, attack methods, and preventative measures. Collectively known as threat intelligence, it is typically in an unstructured format and, therefore, challenging to integrate seamlessly into existing IDPS systems. In this paper, we propose a framework that aggregates and combines CTI - the openly available cyber threat intelligence information. The information is extracted and stored in a structured format using knowledge graphs such that the semantics of the threat intelligence can be preserved and shared at scale with other security analysts. We propose the first semi-supervised open-source knowledge graph (KG) framework, TINKER, to capture cyber threat information and its context. Following TINKER, we generate a Cyberthreat Intelligence Knowledge Graph (CTI-KG). We demonstrate the efficacy of CTI-KG using different use cases and its application for security analysts.
    Backdoor Detection in Reinforcement Learning. (arXiv:2202.03609v3 [cs.LG] UPDATED)
    While the real world application of reinforcement learning (RL) is becoming popular, the safety concern and the robustness of an RL system require more attention. A recent work reveals that, in a multi-agent RL environment, backdoor trigger actions can be injected into a victim agent (a.k.a. trojan agent), which can result in a catastrophic failure as soon as it sees the backdoor trigger action. We propose the problem of RL Backdoor Detection, aiming to address this safety vulnerability. An interesting observation we drew from extensive empirical studies is a trigger smoothness property where normal actions similar to the backdoor trigger actions can also trigger low performance of the trojan agent. Inspired by this observation, we propose a reinforcement learning solution TrojanSeeker to find approximate trigger actions for the trojan agents, and further propose an efficient approach to mitigate the trojan agents based on machine unlearning. Experiments show that our approach can correctly distinguish and mitigate all the trojan agents across various types of agents and environments.
    Supervised Training of Conditional Monge Maps. (arXiv:2206.14262v1 [cs.LG])
    Optimal transport (OT) theory describes general principles to define and select, among many possible choices, the most efficient way to map a probability measure onto another. That theory has been mostly used to estimate, given a pair of source and target probability measures $(\mu,\nu)$, a parameterized map $T_\theta$ that can efficiently map $\mu$ onto $\nu$. In many applications, such as predicting cell responses to treatments, the data measures $\mu,\nu$ (features of untreated/treated cells) that define optimal transport problems do not arise in isolation but are associated with a context $c$ (the treatment). To account for and incorporate that context in OT estimation, we introduce CondOT, an approach to estimate OT maps conditioned on a context variable, using several pairs of measures $(\mu_i, \nu_i)$ tagged with a context label $c_i$. Our goal is to learn a global map $\mathcal{T}_{\theta}$ which is not only expected to fit all pairs in the dataset $\{(c_i, (\mu_i, \nu_i))\}$, i.e., $\mathcal{T}_{\theta}(c_i) \sharp\mu_i \approx \nu_i$, but should generalize to produce meaningful maps $\mathcal{T}_{\theta}(c_{\text{new}})$ conditioned on unseen contexts $c_{\text{new}}$. Our approach harnesses and provides a novel usage for partially input convex neural networks, for which we introduce a robust and efficient initialization strategy inspired by Gaussian approximations. We demonstrate the ability of CondOT to infer the effect of an arbitrary combination of genetic or therapeutic perturbations on single cells, using only observations of the effects of said perturbations separately.  ( 3 min )
    Some variational recipes for quantum field theories. (arXiv:2109.05547v3 [quant-ph] UPDATED)
    Rapid developments in quantum information technology show promising opportunities for simulating quantum field theory on near-term quantum devices. In this work, we formulate the theory of (time-dependent) variational quantum simulation of the 1+1 dimensional $\lambda \phi^4$ quantum field theory including encoding, state preparation, and time evolution, with several numerical simulation results. These algorithms could be understood as near-term variational analogs of the Jordan-Lee-Preskill algorithm, the basic algorithm for simulating quantum field theory using universal quantum devices. Besides, we highlight the advantages of encoding with the harmonic oscillator basis based on the LSZ reduction formula, as well as several computational efficiencies, such as when implementing a bosonic version of the unitary coupled cluster ansatz to prepare initial states. We also discuss how to circumvent the "spectral crowding" problem in the quantum field theory simulation and appraise our algorithm by both state and subspace fidelities.
    Fast learning from label proportions with small bags. (arXiv:2110.03426v4 [cs.LG] UPDATED)
    In learning from label proportions (LLP), the instances are grouped into bags, and the task is to learn an instance classifier given relative class proportions in training bags. LLP is useful when obtaining individual instance labels is impossible or costly. In this work, we focus on the case of small bags, which allows us to design an algorithm that explicitly considers all consistent instance label combinations. In particular, we propose an EM algorithm alternating between optimizing a general neural network instance classifier and incorporating bag-level annotations. Using two different image datasets, we experimentally compare this method with an approach based on normal approximation and two existing LLP methods. The results show that our approach converges faster to a comparable or better solution.
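    To make "considers all consistent instance label combinations" concrete, the sketch below shows one plausible form of the bag-level step for a small bag; it is an assumption-laden illustration, not the authors' algorithm. Given per-instance class probabilities from a classifier and the known label counts of the bag, it enumerates every label assignment that matches the counts, weights each by the product of the instance probabilities, and returns per-instance posterior marginals. The name `bag_posteriors` and the toy numbers are hypothetical.

```python
from collections import Counter
from itertools import product


def bag_posteriors(probs, counts):
    """Posterior class marginals per instance, given the bag's label counts.

    probs[i][c] is the classifier probability of class c for instance i.
    counts[c] is the known number of instances of class c in the bag.
    The enumeration is exponential in bag size, so it only suits small bags.
    """
    n, k = len(probs), len(counts)
    target = Counter({c: m for c, m in enumerate(counts) if m > 0})
    post = [[0.0] * k for _ in range(n)]
    total = 0.0
    for labels in product(range(k), repeat=n):
        # Keep only assignments consistent with the bag-level counts.
        if Counter(labels) != target:
            continue
        weight = 1.0
        for i, c in enumerate(labels):
            weight *= probs[i][c]
        total += weight
        for i, c in enumerate(labels):
            post[i][c] += weight
    if total == 0.0:
        raise ValueError("no consistent labeling has nonzero probability")
    return [[v / total for v in row] for row in post]


# Bag of two instances known to contain exactly one positive (class 1).
print(bag_posteriors([[0.2, 0.8], [0.6, 0.4]], [1, 1]))
```

    In an EM-style loop, marginals like these could serve as soft targets for the next classifier update; whether the paper uses exactly this form is not stated in the abstract.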
    Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy. (arXiv:2110.02642v5 [cs.LG] UPDATED)
    Unsupervised detection of anomaly points in time series is a challenging problem, which requires the model to derive a distinguishable criterion. Previous methods tackle the problem mainly through learning pointwise representation or pairwise association; however, neither is sufficient to reason about the intricate dynamics. Recently, Transformers have shown great power in unified modeling of pointwise representation and pairwise association, and we find that the self-attention weight distribution of each time point can embody rich association with the whole series. Our key observation is that due to the rarity of anomalies, it is extremely difficult to build nontrivial associations from abnormal points to the whole series; therefore, the anomalies' associations mainly concentrate on their adjacent time points. This adjacent-concentration bias implies an association-based criterion inherently distinguishable between normal and abnormal points, which we highlight through the \emph{Association Discrepancy}. Technically, we propose the \emph{Anomaly Transformer} with a new \emph{Anomaly-Attention} mechanism to compute the association discrepancy. A minimax strategy is devised to amplify the normal-abnormal distinguishability of the association discrepancy. The Anomaly Transformer achieves state-of-the-art results on six unsupervised time series anomaly detection benchmarks of three applications: service monitoring, space & earth exploration, and water treatment.
    DeepCore: A Comprehensive Library for Coreset Selection in Deep Learning. (arXiv:2204.08499v3 [cs.LG] UPDATED)
    Coreset selection, which aims to select a subset of the most informative training samples, is a long-standing learning problem that can benefit many downstream tasks such as data-efficient learning, continual learning, neural architecture search, active learning, etc. However, many existing coreset selection methods are not designed for deep learning, which may have high complexity and poor generalization performance. In addition, the recently proposed methods are evaluated on models, datasets, and settings of different complexities. To advance the research of coreset selection in deep learning, we contribute a comprehensive code library, namely DeepCore, and provide an empirical study on popular coreset selection methods on CIFAR10 and ImageNet datasets. Extensive experiments on CIFAR10 and ImageNet datasets verify that, although various methods have advantages in certain experiment settings, random selection is still a strong baseline.
    Training OOD Detectors in their Natural Habitats. (arXiv:2202.03299v2 [cs.LG] UPDATED)
    Out-of-distribution (OOD) detection is important for machine learning models deployed in the wild. Recent methods use auxiliary outlier data to regularize the model for improved OOD detection. However, these approaches make a strong distributional assumption that the auxiliary outlier data is completely separable from the in-distribution (ID) data. In this paper, we propose a novel framework that leverages wild mixture data, which naturally consists of both ID and OOD samples. Such wild data is abundant and arises freely upon deploying a machine learning classifier in its natural habitat. Our key idea is to formulate a constrained optimization problem and to show how to tractably solve it. Our learning objective maximizes the OOD detection rate, subject to constraints on the classification error of ID data and on the OOD error rate of ID examples. We extensively evaluate our approach on common OOD detection tasks and demonstrate superior performance.
    Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution. (arXiv:2009.14108v2 [cs.LG] UPDATED)
    Reinforcement learning algorithms require many samples when solving complex hierarchical tasks with sparse and delayed rewards. For such complex tasks, the recently proposed RUDDER uses reward redistribution to leverage steps in the Q-function that are associated with accomplishing sub-tasks. However, often only few episodes with high rewards are available as demonstrations since current exploration strategies cannot discover them in reasonable time. In this work, we introduce Align-RUDDER, which utilizes a profile model for reward redistribution that is obtained from multiple sequence alignment of demonstrations. Consequently, Align-RUDDER employs reward redistribution effectively and, thereby, drastically improves learning on few demonstrations. Align-RUDDER outperforms competitors on complex artificial tasks with delayed rewards and few demonstrations. On the Minecraft ObtainDiamond task, Align-RUDDER is able to mine a diamond, though not frequently. Code is available at https://github.com/ml-jku/align-rudder. YouTube: https://youtu.be/HO-_8ZUl-UY
    Using cognitive psychology to understand GPT-3. (arXiv:2206.14576v1 [cs.CL])
    We study GPT-3, a recent large language model, using tools from cognitive psychology. More specifically, we assess GPT-3's decision-making, information search, deliberation, and causal reasoning abilities on a battery of canonical experiments from the literature. We find that much of GPT-3's behavior is impressive: it solves vignette-based tasks similarly or better than human subjects, is able to make decent decisions from descriptions, outperforms humans in a multi-armed bandit task, and shows signatures of model-based reinforcement learning. Yet we also find that small perturbations to vignette-based tasks can lead GPT-3 vastly astray, that it shows no signatures of directed exploration, and that it fails miserably in a causal reasoning task. These results enrich our understanding of current large language models and pave the way for future investigations using tools from cognitive psychology to study increasingly capable and opaque artificial agents.  ( 2 min )
    Neural Integro-Differential Equations. (arXiv:2206.14282v1 [cs.LG])
    Modeling continuous dynamical systems from discretely sampled observations is a fundamental problem in data science. Often, such dynamics are the result of non-local processes that present an integral over time. As such, these systems are modeled with Integro-Differential Equations (IDEs); generalizations of differential equations that comprise both an integral and a differential component. For example, brain dynamics are not accurately modeled by differential equations since their behavior is non-Markovian, i.e. dynamics are in part dictated by history. Here, we introduce the Neural IDE (NIDE), a framework that models ordinary and integral components of IDEs using neural networks. We test NIDE on several toy and brain activity datasets and demonstrate that NIDE outperforms other models, including Neural ODE. These tasks include time extrapolation as well as predicting dynamics from unseen initial conditions, which we test on whole-cortex activity recordings in freely behaving mice. Further, we show that NIDE can decompose dynamics into its Markovian and non-Markovian constituents, via the learned integral operator, which we test on fMRI brain activity recordings of people on ketamine. Finally, the integrand of the integral operator provides a latent space that gives insight into the underlying dynamics, which we demonstrate on wide-field brain imaging recordings. Altogether, NIDE is a novel approach that enables modeling of complex non-local dynamics with neural networks.  ( 3 min )
    Deep Neural Networks and Tabular Data: A Survey. (arXiv:2110.01889v3 [cs.LG] UPDATED)
    Heterogeneous tabular data are the most commonly used form of data and are essential for numerous critical and computationally demanding applications. On homogeneous data sets, deep neural networks have repeatedly shown excellent performance and have therefore been widely adopted. However, their adaptation to tabular data for inference or data generation tasks remains challenging. To facilitate further progress in the field, this work provides an overview of state-of-the-art deep learning methods for tabular data. We categorize these methods into three groups: data transformations, specialized architectures, and regularization models. For each of these groups, our work offers a comprehensive overview of the main approaches. Moreover, we discuss deep learning approaches for generating tabular data, and we also provide an overview of strategies for explaining deep models on tabular data. Thus, our first contribution is to address the main research streams and existing methodologies in the mentioned areas, while highlighting relevant challenges and open research questions. Our second contribution is to provide an empirical comparison of traditional machine learning methods with eleven deep learning approaches across five popular real-world tabular data sets of different sizes and with different learning objectives. Our results, which we have made publicly available as competitive benchmarks, indicate that algorithms based on gradient-boosted tree ensembles still mostly outperform deep learning models on supervised learning tasks, suggesting that the research progress on competitive deep learning models for tabular data is stagnating. To the best of our knowledge, this is the first in-depth overview of deep learning approaches for tabular data; as such, this work can serve as a valuable starting point to guide researchers and practitioners interested in deep learning with tabular data.
    Locally Interpretable One-Class Anomaly Detection for Credit Card Fraud Detection. (arXiv:2108.02501v3 [cs.LG] UPDATED)
    For the highly imbalanced credit card fraud detection problem, most existing methods either use data augmentation methods or conventional machine learning models, while neural network-based anomaly detection approaches are lacking. Furthermore, few studies have employed AI interpretability tools to investigate the feature importance of transaction data, which is crucial for the black-box fraud detection module. Considering these two points together, we propose a novel anomaly detection framework for credit card fraud detection as well as a model-explaining module responsible for prediction explanations. The fraud detection model is composed of two deep neural networks, which are trained in an unsupervised and adversarial manner. Precisely, the generator is an AutoEncoder aiming to reconstruct genuine transaction data, while the discriminator is a fully-connected network for fraud detection. The explanation module has three white-box explainers in charge of interpretations of the AutoEncoder, discriminator, and the whole detection model, respectively. Experimental results show the state-of-the-art performances of our fraud detection model on the benchmark dataset compared with baselines. In addition, prediction analyses by three explainers are presented, offering a clear perspective on how each feature of an instance of interest contributes to the final model output.
    MOSRA: Joint Mean Opinion Score and Room Acoustics Speech Quality Assessment. (arXiv:2204.01345v2 [eess.AS] UPDATED)
    The acoustic environment can degrade speech quality during communication (e.g., video call, remote presentation, outside voice recording), and its impact is often unknown. Objective metrics for speech quality have proven challenging to develop given the multi-dimensionality of factors that affect speech quality and the difficulty of collecting labeled data. Hypothesizing the impact of acoustics on speech quality, this paper presents MOSRA: a non-intrusive multi-dimensional speech quality metric that can predict room acoustics parameters (SNR, STI, T60, DRR, and C50) alongside the overall mean opinion score (MOS) for speech quality. By explicitly optimizing the model to learn these room acoustics parameters, we can extract more informative features and improve the generalization for the MOS task when the training data is limited. Furthermore, we also show that this joint training method enhances the blind estimation of room acoustics, improving the performance of current state-of-the-art models. An additional side-effect of this joint prediction is the improvement in the explainability of the predictions, which is a valuable feature for many applications.  ( 2 min )
    Competence-based Multimodal Curriculum Learning for Medical Report Generation. (arXiv:2206.14579v1 [cs.CL])
    The medical report generation task, which aims to produce long and coherent descriptions of medical images, has attracted growing research interest recently. Unlike general image captioning tasks, medical report generation is more challenging for data-driven neural models. This is mainly due to 1) serious data bias and 2) limited medical data. To alleviate the data bias and make the best use of available data, we propose a Competence-based Multimodal Curriculum Learning framework (CMCL). Specifically, CMCL simulates the learning process of radiologists and optimizes the model in a step-by-step manner. First, CMCL estimates the difficulty of each training instance and evaluates the competence of the current model; second, CMCL selects the most suitable batch of training instances given the current model competence. By iterating these two steps, CMCL can gradually improve the model's performance. Experiments on the public IU-Xray and MIMIC-CXR datasets show that CMCL can be incorporated into existing models to improve their performance.  ( 2 min )
    QuantumFed: A Federated Learning Framework for Collaborative Quantum Training. (arXiv:2106.09109v4 [cs.LG] UPDATED)
    With the fast development of quantum computing and deep learning, quantum neural networks have attracted great attention recently. By leveraging the power of quantum computing, deep neural networks can potentially overcome computational power limitations in classical machine learning. However, when multiple quantum machines wish to train a global model using the local data on each machine, it may be very difficult to copy the data into one machine and train the model. Therefore, a collaborative quantum neural network framework is necessary. In this article, we borrow the core idea of federated learning to propose QuantumFed, a quantum federated learning framework in which multiple quantum nodes with local quantum data train a model together. Our experiments show the feasibility and robustness of our framework.
    Depth-2 Neural Networks Under a Data-Poisoning Attack. (arXiv:2005.01699v3 [cs.LG] UPDATED)
    In this work, we study the possibility of defending against data-poisoning attacks while training a shallow neural network in a regression setup. We focus on doing supervised learning for a class of depth-2 finite-width neural networks, which includes single-filter convolutional networks. In this class of networks, we attempt to learn the network weights in the presence of a malicious oracle doing stochastic, bounded and additive adversarial distortions on the true output during training. For the non-gradient stochastic algorithm that we construct, we prove worst-case near-optimal trade-offs among the magnitude of the adversarial attack, the weight approximation accuracy, and the confidence achieved by the proposed algorithm. As our algorithm uses mini-batching, we analyze how the mini-batch size affects convergence. We also show how to utilize the scaling of the outer layer weights to counter output-poisoning attacks depending on the probability of attack. Lastly, we give experimental evidence demonstrating how our algorithm outperforms stochastic gradient descent under different input data distributions, including instances of heavy-tailed distributions.
    Deep Policies for Online Bipartite Matching: A Reinforcement Learning Approach. (arXiv:2109.10380v2 [cs.LG] UPDATED)
    The challenge in the widely applicable online matching problem lies in making irrevocable assignments while there is uncertainty about future inputs. Most theoretically-grounded policies are myopic or greedy in nature. In real-world applications where the matching process is repeated on a regular basis, the underlying data distribution can be leveraged for better decision-making. We present an end-to-end Reinforcement Learning framework for deriving better matching policies based on trial-and-error on historical data. We devise a set of neural network architectures, design feature representations, and empirically evaluate them across two online matching problems: Edge-Weighted Online Bipartite Matching and Online Submodular Bipartite Matching. We show that most of the learning approaches perform consistently better than classical baseline algorithms on four synthetic and real-world datasets. On average, our proposed models improve the matching quality by 3-10% on a variety of synthetic and real-world datasets. Our code is publicly available at https://github.com/lyeskhalil/CORL.
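    The myopic behaviour that the abstract attributes to greedy policies can be illustrated with a minimal sketch of a greedy baseline for edge-weighted online bipartite matching. This is not the authors' code: the function name, the dictionary-based input format, and the toy weights are assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Tuple


def greedy_online_matching(
    arrivals: List[Dict[str, float]],
) -> Tuple[List[Optional[str]], float]:
    """Greedy baseline (hypothetical helper, not from the paper).

    Each element of `arrivals` maps offline node names to edge weights for
    one arriving online node. Every assignment is irrevocable.
    """
    matched = set()                      # offline nodes already used
    assignment: List[Optional[str]] = []
    total = 0.0
    for weights in arrivals:
        # Only unmatched offline nodes with positive weight are candidates.
        candidates = {u: w for u, w in weights.items()
                      if u not in matched and w > 0}
        if not candidates:
            assignment.append(None)      # the online node stays unmatched
            continue
        best = max(candidates, key=candidates.get)
        matched.add(best)
        assignment.append(best)
        total += candidates[best]
    return assignment, total


# Toy instance (made-up weights): greedy takes "a" first and blocks the
# second arrival, although matching "b" then "a" would be worth 1.9.
assignment, total = greedy_online_matching([{"a": 1.0, "b": 0.9}, {"a": 1.0}])
```

    On this toy instance the greedy policy collects a total weight of 1.0, which shows why a policy that anticipates future arrivals can do better.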
    When Do Extended Physics-Informed Neural Networks (XPINNs) Improve Generalization?. (arXiv:2109.09444v5 [cs.LG] UPDATED)
    Physics-informed neural networks (PINNs) have become a popular choice for solving high-dimensional partial differential equations (PDEs) due to their excellent approximation power and generalization ability. Recently, Extended PINNs (XPINNs) based on domain decomposition methods have attracted considerable attention due to their effectiveness in modeling multiscale and multiphysics problems and their parallelization. However, theoretical understanding of their convergence and generalization properties remains unexplored. In this study, we take an initial step towards understanding how and when XPINNs outperform PINNs. Specifically, for general multi-layer PINNs and XPINNs, we first provide a prior generalization bound via the complexity of the target functions in the PDE problem, and a posterior generalization bound via the posterior matrix norms of the networks after optimization. Moreover, based on our bounds, we analyze the conditions under which XPINNs improve generalization. Concretely, our theory shows that the key building block of XPINN, namely the domain decomposition, introduces a tradeoff for generalization. On the one hand, XPINNs decompose the complex PDE solution into several simple parts, which decreases the complexity needed to learn each part and boosts generalization. On the other hand, decomposition leads to less training data being available in each subdomain, and hence such a model is typically prone to overfitting and may become less generalizable. Empirically, we choose five PDEs to show when XPINNs perform better than, similar to, or worse than PINNs, hence demonstrating and justifying our new theory.
    Reinforcement Learning for Datacenter Congestion Control. (arXiv:2102.09337v2 [cs.LG] UPDATED)
    We approach the task of network congestion control in datacenters using Reinforcement Learning (RL). Successful congestion control algorithms can dramatically improve latency and overall network throughput. Until today, no such learning-based algorithms have shown practical potential in this domain. Evidently, the most popular recent deployments rely on rule-based heuristics that are tested on a predetermined set of benchmarks. Consequently, these heuristics do not generalize well to newly-seen scenarios. Contrarily, we devise an RL-based algorithm with the aim of generalizing to different configurations of real-world datacenter networks. We overcome challenges such as partial-observability, non-stationarity, and multi-objectiveness. We further propose a policy gradient algorithm that leverages the analytical structure of the reward function to approximate its derivative and improve stability. We show that this scheme outperforms alternative popular RL approaches, and generalizes to scenarios that were not seen during training. Our experiments, conducted on a realistic simulator that emulates communication networks' behavior, exhibit improved performance concurrently on the multiple considered metrics compared to the popular algorithms deployed today in real datacenters. Our algorithm is being productized to replace heuristics in some of the largest datacenters in the world.
    CoMoGAN: continuous model-guided image-to-image translation. (arXiv:2103.06879v3 [cs.CV] UPDATED)
    CoMoGAN is a continuous GAN relying on the unsupervised reorganization of the target data on a functional manifold. To that end, we introduce a new Functional Instance Normalization layer and residual mechanism, which together disentangle image content from position on the target manifold. We rely on naive physics-inspired models to guide the training while allowing private model/translation features. CoMoGAN can be used with any GAN backbone and allows new types of image translation, such as cyclic image translation like timelapse generation, or detached linear translation. On all datasets, it outperforms the literature. Our code is available at this http URL.
    Exploring the Latent Space of Autoencoders with Interventional Assays. (arXiv:2106.16091v2 [cs.LG] UPDATED)
    Autoencoders exhibit impressive abilities to embed the data manifold into a low-dimensional latent space, making them a staple of representation learning methods. However, without explicit supervision, which is often unavailable, the representation is usually uninterpretable, making analysis and principled progress challenging. We propose a framework, called latent responses, which exploits the locally contractive behavior exhibited by variational autoencoders to explore the learned manifold. More specifically, we develop tools to probe the representation using interventions in the latent space to quantify the relationships between latent variables. We extend the notion of disentanglement to take the learned generative process into account and consequently avoid the limitations of existing metrics that may rely on spurious correlations. Our analyses underscore the importance of studying the causal structure of the representation to improve performance on downstream tasks such as generation, interpolation, and inference of the factors of variation.
    Matching Learned Causal Effects of Neural Networks with Domain Priors. (arXiv:2111.12490v4 [cs.LG] UPDATED)
    A trained neural network can be interpreted as a structural causal model (SCM) that provides the effect of changing input variables on the model's output. However, if training data contains both causal and correlational relationships, a model that optimizes prediction accuracy may not necessarily learn the true causal relationships between input and output variables. On the other hand, expert users often have prior knowledge of the causal relationship between certain input variables and output from domain knowledge. Therefore, we propose a regularization method that aligns the learned causal effects of a neural network with domain priors, including both direct and total causal effects. We show that this approach can generalize to different kinds of domain priors, including monotonicity of causal effect of an input variable on output or zero causal effect of a variable on output for purposes of fairness. Our experiments on twelve benchmark datasets show its utility in regularizing a neural network model to maintain desired causal effects, without compromising on accuracy. Importantly, we also show that a model thus trained is robust and gets improved accuracy on noisy inputs.
    BiometryNet: Landmark-based Fetal Biometry Estimation from Standard Ultrasound Planes. (arXiv:2206.14678v1 [eess.IV])
    Fetal growth assessment from ultrasound is based on a few biometric measurements that are performed manually and assessed relative to the expected gestational age. Reliable biometry estimation depends on the precise detection of landmarks in standard ultrasound planes. Manual annotation can be a time-consuming and operator-dependent task, and may result in high measurement variability. Existing methods for automatic fetal biometry rely on initial automatic fetal structure segmentation followed by geometric landmark detection. However, segmentation annotations are time-consuming and may be inaccurate, and landmark detection requires developing measurement-specific geometric methods. This paper describes BiometryNet, an end-to-end landmark regression framework for fetal biometry estimation that overcomes these limitations. It includes a novel Dynamic Orientation Determination (DOD) method for enforcing measurement-specific orientation consistency during network training. DOD reduces variability in network training and increases landmark localization accuracy, and thus yields accurate and robust biometric measurements. To validate our method, we assembled a dataset of 3,398 ultrasound images from 1,829 subjects acquired in three clinical sites with seven different ultrasound devices. Comparison and cross-validation of three different biometric measurements on two independent datasets show that BiometryNet is robust and yields accurate measurements whose errors are lower than the clinically permissible errors, outperforming other existing automated biometry estimation methods. Code is available at https://github.com/netanellavisdris/fetalbiometry.  ( 3 min )
    An Embedding Framework for the Design and Analysis of Consistent Polyhedral Surrogates. (arXiv:2206.14707v1 [cs.LG])
    We formalize and study the natural approach of designing convex surrogate loss functions via embeddings, for problems such as classification, ranking, or structured prediction. In this approach, one embeds each of the finitely many predictions (e.g. rankings) as a point in $\mathbb{R}^d$, assigns the original loss values to these points, and "convexifies" the loss in some way to obtain a surrogate. We establish a strong connection between this approach and polyhedral (piecewise-linear convex) surrogate losses: every discrete loss is embedded by some polyhedral loss, and every polyhedral loss embeds some discrete loss. Moreover, an embedding gives rise to a consistent link function as well as linear surrogate regret bounds. Our results are constructive, as we illustrate with several examples. In particular, our framework gives succinct proofs of consistency or inconsistency for various polyhedral surrogates in the literature, and for inconsistent surrogates, it further reveals the discrete losses for which these surrogates are consistent. We go on to show additional structure of embeddings, such as the equivalence of embedding and matching Bayes risks, and the equivalence of various notions of non-redundancy. Using these results, we establish that indirect elicitation, a necessary condition for consistency, is also sufficient when working with polyhedral surrogates.
    Representation Topology Divergence: A Method for Comparing Neural Network Representations. (arXiv:2201.00058v2 [cs.LG] UPDATED)
    Comparison of data representations is a complex multi-aspect problem that has not enjoyed a complete solution yet. We propose a method for comparing two data representations. We introduce the Representation Topology Divergence (RTD), measuring the dissimilarity in multi-scale topology between two point clouds of equal size with a one-to-one correspondence between points. The data point clouds are allowed to lie in different ambient spaces. The RTD is one of the few TDA-based practical methods applicable to real machine learning datasets. Experiments show that the proposed RTD agrees with the intuitive assessment of data representation similarity and is sensitive to its topological structure. We apply RTD to gain insights into neural network representations in computer vision and NLP domains for various problems: training dynamics analysis, data distribution shift, transfer learning, ensemble learning, and disentanglement assessment.
    Visual Foresight With a Local Dynamics Model. (arXiv:2206.14802v1 [cs.RO])
    Model-free policy learning has been shown to be capable of learning manipulation policies which can solve long-time horizon tasks using single-step manipulation primitives. However, training these policies is a time-consuming process requiring large amounts of data. We propose the Local Dynamics Model (LDM) which efficiently learns the state-transition function for these manipulation primitives. By combining the LDM with model-free policy learning, we can learn policies which can solve complex manipulation tasks using one-step lookahead planning. We show that the LDM is both more sample-efficient and outperforms other model architectures. When combined with planning, we can outperform other model-based and model-free policies on several challenging manipulation tasks in simulation.
    Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error. (arXiv:2201.12417v2 [cs.LG] UPDATED)
    In this work, we study the use of the Bellman equation as a surrogate objective for value prediction accuracy. While the Bellman equation is uniquely solved by the true value function over all state-action pairs, we find that the Bellman error (the difference between both sides of the equation) is a poor proxy for the accuracy of the value function. In particular, we show that (1) due to cancellations from both sides of the Bellman equation, the magnitude of the Bellman error is only weakly related to the distance to the true value function, even when considering all state-action pairs, and (2) in the finite data regime, the Bellman equation can be satisfied exactly by infinitely many suboptimal solutions. This means that the Bellman error can be minimized without improving the accuracy of the value function. We demonstrate these phenomena through a series of propositions, illustrative toy examples, and empirical analysis in standard benchmark domains.
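    The cancellation effect described in point (1) can be seen in a minimal numerical sketch. It is not taken from the paper: the two-state Markov reward process, the discount factor, and the constant offset are hypothetical choices, and a fixed policy is assumed so that state values stand in for state-action values.

```python
import numpy as np

# Hypothetical two-state Markov reward process under a fixed policy.
gamma = 0.9
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])   # deterministic cycle between the two states
r = np.array([1.0, 1.0])     # reward of 1 in every state

# The true values solve V = r + gamma * P V, here V = 10 in both states.
v_true = np.linalg.solve(np.eye(2) - gamma * P, r)

# A deliberately wrong estimate: every value shifted by the same constant.
offset = 5.0
v_hat = v_true + offset

# Bellman error: difference between both sides of the Bellman equation.
bellman_error = v_hat - (r + gamma * P @ v_hat)
# Value error: distance to the true value function.
value_error = v_hat - v_true

max_bellman_error = float(np.max(np.abs(bellman_error)))  # (1 - gamma) * offset
max_value_error = float(np.max(np.abs(value_error)))      # offset
```

    Because the shared offset appears on both sides of the equation, the Bellman error is only (1 - gamma) times the offset (0.5 here) while the value error equals the full offset (5.0), a tenfold gap in this toy setting.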
    Deep Multiple Instance Learning For Forecasting Stock Trends Using Financial News. (arXiv:2206.14452v1 [cs.LG])
    Financial news articles are a major source of information and are correlated with fluctuations in stock trends. In this paper, we investigate the influence of financial news on stock trends from a multi-instance view. The intuition is based on the uncertainty arising from the varying intervals between news occurrences and the lack of annotation for each individual news item. Under the Multiple Instance Learning (MIL) scenario, where training instances are arranged in bags and a label is assigned to the entire bag rather than to individual instances, we develop a flexible and adaptive multi-instance learning model and evaluate its ability to forecast the directional movement of the Standard & Poor's 500 index on a financial news dataset. Specifically, we treat each trading day as one bag, with a certain number of news items published on that trading day as the instances in the bag. Experimental results demonstrate that our proposed multi-instance-based framework achieves outstanding trend prediction accuracy compared with other state-of-the-art approaches and baselines.  ( 2 min )
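    The bag construction described above (one bag per trading day, news items as instances, one label per bag) can be sketched as a small data structure. This is only an illustration of the MIL data layout, not the authors' model: the `Bag` class, the `build_bags` helper, and the 0/1 label encoding are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Bag:
    day: str              # trading day, e.g. an ISO date string
    instances: List[str]  # news items published on that day
    label: int            # bag-level label; 1 = index up, 0 = otherwise (assumed)


def build_bags(news: List[Tuple[str, str]],
               day_labels: Dict[str, int]) -> List[Bag]:
    """Group (day, headline) pairs into one labelled bag per trading day."""
    by_day: Dict[str, List[str]] = defaultdict(list)
    for day, headline in news:
        by_day[day].append(headline)
    # Individual news items carry no label; only the whole day does.
    return [Bag(day, by_day[day], day_labels[day])
            for day in sorted(by_day) if day in day_labels]
```

    Days without a known label are skipped, and bags may hold different numbers of instances, which is the property that lets a MIL model cope with irregular news volume.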
    Conditionally Elicitable Dynamic Risk Measures for Deep Reinforcement Learning. (arXiv:2206.14666v1 [cs.LG])
    We propose a novel framework to solve risk-sensitive reinforcement learning (RL) problems where the agent optimises time-consistent dynamic spectral risk measures. Based on the notion of conditional elicitability, our methodology constructs (strictly consistent) scoring functions that are used as penalizers in the estimation procedure. Our contribution is threefold: we (i) devise an efficient approach to estimate a class of dynamic spectral risk measures with deep neural networks, (ii) prove that these dynamic spectral risk measures may be approximated to any arbitrary accuracy using deep neural networks, and (iii) develop a risk-sensitive actor-critic algorithm that uses full episodes and does not require any additional nested transitions. We compare our conceptually improved reinforcement learning algorithm with the nested simulation approach and illustrate its performance in two settings: statistical arbitrage and portfolio allocation on both simulated and real data.
    3D-Aware Video Generation. (arXiv:2206.14797v1 [cs.CV])
    Generative models have emerged as an essential building block for many image synthesis and editing tasks. Recent advances in this field have also enabled high-quality 3D or video content to be generated that exhibits either multi-view or temporal consistency. With our work, we explore 4D generative adversarial networks (GANs) that learn unconditional generation of 3D-aware videos. By combining neural implicit representations with a time-aware discriminator, we develop a GAN framework that synthesizes 3D video supervised only with monocular videos. We show that our method learns a rich embedding of decomposable 3D structures and motions that enables new visual effects of spatio-temporal renderings while producing imagery with quality comparable to that of existing 3D or video GANs.
    Online vs. Offline Adaptive Domain Randomization Benchmark. (arXiv:2206.14661v1 [cs.RO])
    Physics simulators have shown great promise for conveniently learning reinforcement learning policies in safe, unconstrained environments. However, transferring the acquired knowledge to the real world can be challenging due to the reality gap. To this end, several methods have been recently proposed to automatically tune simulator parameters with posterior distributions given real data, for use with domain randomization at training time. These approaches have been shown to work for various robotic tasks under different settings and assumptions. Nevertheless, existing literature lacks a thorough comparison of existing adaptive domain randomization methods with respect to transfer performance and real-data efficiency. In this work, we present an open benchmark for both offline and online methods (SimOpt, BayRn, DROID, DROPO), to shed light on which are most suitable for each setting and task at hand. We found that online methods are limited by the quality of the currently learned policy for the next iteration, while offline methods may sometimes fail when replaying trajectories in simulation with open-loop commands. The code used will be released at https://github.com/gabrieletiboni/adr-benchmark.
    SENTINEL: Taming Uncertainty with Ensemble-based Distributional Reinforcement Learning. (arXiv:2102.11075v3 [cs.LG] UPDATED)
    In this paper, we consider risk-sensitive sequential decision-making in Reinforcement Learning (RL). Our contributions are two-fold. First, we introduce a novel and coherent quantification of risk, namely composite risk, which quantifies the joint effect of aleatory and epistemic risk during the learning process. Existing works considered either aleatory or epistemic risk individually, or as an additive combination. We prove that the additive formulation is a particular case of the composite risk when the epistemic risk measure is replaced with expectation. Thus, the composite risk is more sensitive to both aleatory and epistemic uncertainty than the individual and additive formulations. We also propose an algorithm, SENTINEL-K, based on ensemble bootstrapping and distributional RL for representing epistemic and aleatory uncertainty respectively. The ensemble of K learners uses Follow The Regularised Leader (FTRL) to aggregate the return distributions and obtain the composite risk. We experimentally verify that SENTINEL-K estimates the return distribution better and, when used with composite risk estimates, demonstrates higher risk-sensitive performance than state-of-the-art risk-sensitive and distributional RL algorithms.
    Prediction Errors for Penalized Regressions based on Generalized Approximate Message Passing. (arXiv:2206.12832v2 [stat.ML] UPDATED)
    We discuss the prediction accuracy of assumed statistical models in terms of prediction errors for the generalized linear model and penalized maximum likelihood methods. We derive the forms of estimators for the prediction errors: $C_p$ criterion, information criteria, and leave-one-out cross validation (LOOCV) error, using the generalized approximate message passing (GAMP) algorithm and replica method. These estimators coincide with each other when the number of model parameters is sufficiently small; however, there is a discrepancy between them in particular in the overparametrized region where the number of model parameters is larger than the data dimension. In this paper, we review the prediction errors and corresponding estimators, and discuss their differences. In the framework of GAMP, we show that the information criteria can be expressed by using the variance of the estimates. Further, we demonstrate how to approach LOOCV error from the information criteria by utilizing the expression provided by GAMP.
    Quantum-Inspired Algorithms from Randomized Numerical Linear Algebra. (arXiv:2011.04125v7 [cs.DS] UPDATED)
    We create classical (non-quantum) dynamic data structures supporting queries for recommender systems and least-squares regression that are comparable to their quantum analogues. De-quantizing such algorithms has received a flurry of attention in recent years; we obtain sharper bounds for these problems. More significantly, we achieve these improvements by arguing that the previous quantum-inspired algorithms for these problems are doing leverage or ridge-leverage score sampling in disguise; these are powerful and standard techniques in randomized numerical linear algebra. With this recognition, we are able to employ the large body of work in numerical linear algebra to obtain algorithms for these problems that are simpler or faster (or both) than existing approaches. Our experiments demonstrate that the proposed data structures also work well on real-world datasets.
    Manifold Topology Divergence: a Framework for Comparing Data Manifolds. (arXiv:2106.04024v2 [cs.LG] CROSS LISTED)
    We develop a framework for comparing data manifolds, aimed, in particular, towards the evaluation of deep generative models. We describe a novel tool, Cross-Barcode(P,Q), that, given a pair of distributions in a high-dimensional space, tracks multiscale topology spatial discrepancies between manifolds on which the distributions are concentrated. Based on the Cross-Barcode, we introduce the Manifold Topology Divergence score (MTop-Divergence) and apply it to assess the performance of deep generative models in various domains: images, 3D-shapes, time-series, and on different datasets: MNIST, Fashion MNIST, SVHN, CIFAR10, FFHQ, chest X-ray images, market stock data, ShapeNet. We demonstrate that the MTop-Divergence accurately detects various degrees of mode-dropping, intra-mode collapse, mode invention, and image disturbance. Our algorithm scales well (essentially linearly) with the increase of the dimension of the ambient high-dimensional space. It is one of the first TDA-based practical methodologies that can be applied universally to datasets of different sizes and dimensions, including the ones on which the most recent GANs in the visual domain are trained. The proposed method is domain agnostic and does not rely on pre-trained networks.
    Traffic Management of Autonomous Vehicles using Policy Based Deep Reinforcement Learning and Intelligent Routing. (arXiv:2206.14608v1 [cs.LG])
    Deep Reinforcement Learning (DRL) uses diverse, unstructured data and makes RL capable of learning complex policies in high-dimensional environments. An Intelligent Transportation System (ITS) based on Autonomous Vehicles (AVs) offers an excellent playground for policy-based DRL. Deep learning architectures solve computational challenges of traditional algorithms while helping in real-world adoption and deployment of AVs. One of the main challenges in AV implementation is that it can worsen traffic congestion on roads if not reliably and efficiently managed. Considering each vehicle's holistic effect and using efficient and reliable techniques could genuinely help optimise traffic flow management and congestion reduction. For this purpose, we propose an intelligent traffic control system that deals with complex traffic congestion scenarios at intersections and behind the intersections. We propose a DRL-based signal control system that dynamically adjusts traffic signals according to the current congestion situation at intersections. To deal with the congestion on roads behind the intersection, we use a re-routing technique to load balance the vehicles on road networks. To achieve the actual benefits of the proposed approach, we break down the data silos and use all the data coming from sensors, detectors, vehicles and roads in combination to achieve sustainable results. We use the SUMO micro-simulator for our simulations. The significance of our proposed approach is evident from the results.
    Quantification of Deep Neural Network Prediction Uncertainties for VVUQ of Machine Learning Models. (arXiv:2206.14615v1 [cs.LG])
    Recent performance breakthroughs in Artificial intelligence (AI) and Machine learning (ML), especially advances in Deep learning (DL), the availability of powerful, easy-to-use ML libraries (e.g., scikit-learn, TensorFlow, PyTorch), and increasing computational power have led to unprecedented interest in AI/ML among nuclear engineers. For physics-based computational models, Verification, Validation and Uncertainty Quantification (VVUQ) have been very widely investigated and many methodologies have been developed. However, VVUQ of ML models has been relatively less studied, especially in nuclear engineering. In this work, we focus on UQ of ML models as a preliminary step of ML VVUQ, more specifically, Deep Neural Networks (DNNs) because they are the most widely used supervised ML algorithm for both regression and classification tasks. This work aims at quantifying the prediction, or approximation, uncertainties of DNNs when they are used as surrogate models for expensive physical models. Three techniques for UQ of DNNs are compared, namely Monte Carlo Dropout (MCD), Deep Ensembles (DE) and Bayesian Neural Networks (BNNs). Two nuclear engineering examples are used to benchmark these methods: (1) time-dependent fission gas release data using the Bison code, and (2) void fraction simulation based on the BFBT benchmark using the TRACE code. It was found that the three methods typically require different DNN architectures and hyperparameters to optimize their performance. The UQ results also depend on the amount of training data available and the nature of the data. Overall, all three methods can provide reasonable estimations of the approximation uncertainties. The uncertainties are generally smaller when the mean predictions are close to the test data, while the BNN methods usually produce larger uncertainties than MCD and DE.
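    Of the three techniques compared above, Monte Carlo Dropout is the simplest to sketch: dropout stays active at prediction time and the spread over repeated stochastic forward passes serves as the uncertainty estimate. The snippet below is a minimal NumPy illustration under assumed settings (a one-hidden-layer ReLU network with given weights, a hypothetical `mc_dropout_predict` helper, and arbitrary dropout rate and sample count); it is not the architecture or code used in the paper.

```python
import numpy as np


def mc_dropout_predict(x, w1, w2, drop_prob=0.2, n_samples=200, seed=0):
    """Mean and standard deviation over stochastic forward passes.

    x: (n, d) inputs; w1: (d, h) and w2: (h, k) are assumed, already
    trained weights of a one-hidden-layer ReLU network.
    """
    rng = np.random.default_rng(seed)
    keep = 1.0 - drop_prob
    preds = []
    for _ in range(n_samples):
        hidden = np.maximum(x @ w1, 0.0)        # ReLU hidden layer
        mask = rng.random(hidden.shape) < keep  # fresh dropout mask per pass
        hidden = hidden * mask / keep           # inverted-dropout scaling
        preds.append(hidden @ w2)
    preds = np.stack(preds)                     # (n_samples, n, k)
    return preds.mean(axis=0), preds.std(axis=0)
```

    The returned standard deviation is the per-output uncertainty estimate; with `drop_prob=0.0` every pass is identical and the spread collapses to zero, which is a quick sanity check for the sketch.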
    Latent Combinational Game Design. (arXiv:2206.14203v1 [cs.LG])
    We present an approach for generating playable games that blend a given set of games in a desired combination using deep generative latent variable models. We refer to this approach as latent combinational game design -- latent since we use learned latent representations to perform blending, combinational since game blending is a combinational creativity process and game design since the approach generates novel, playable games. We use Gaussian Mixture Variational Autoencoders (GMVAEs), which use a mixture of Gaussians to model the VAE latent space. Through supervised training, each component learns to encode levels from one game and lets us define new, blended games as linear combinations of these learned components. This enables generating new games that blend the input games as well as control the relative proportions of each game in the blend. We also extend prior work using conditional VAEs to perform blending and compare against the GMVAE. Our results show that both models can generate playable blended games that blend the input games in the desired proportions.
    Distilling Model Failures as Directions in Latent Space. (arXiv:2206.14754v1 [cs.LG])
    Existing methods for isolating hard subpopulations and spurious correlations in datasets often require human intervention. This can make these methods labor-intensive and dataset-specific. To address these shortcomings, we present a scalable method for automatically distilling a model's failure modes. Specifically, we harness linear classifiers to identify consistent error patterns, and, in turn, induce a natural representation of these failure modes as directions within the feature space. We demonstrate that this framework allows us to discover and automatically caption challenging subpopulations within the training dataset, and intervene to improve the model's performance on these subpopulations. Code available at https://github.com/MadryLab/failure-directions
    Imaging the time series of one single referenced EEG electrode for Epileptic Seizures Risk Analysis. (arXiv:2206.14520v1 [cs.LG])
    The time series captured by a single scalp electrode (plus the reference electrode) of refractory epileptic patients is used to forecast seizure susceptibility. The time series is preprocessed, segmented, and each segment transformed into an image, using three different known methods: Recurrence Plot, Gramian Angular Field, Markov Transition Field. The likelihood of the occurrence of a seizure in a future predefined time window is computed by averaging the output of the softmax layer of a CNN, rather than by taking the output of the classification layer as is usually done. By thresholding this likelihood, seizure forecasting has better performance. Interestingly, for almost every patient, the best threshold was different from 50%. The results show that this technique gives good predictions for some seizures and patients. However, more tests, namely more patients and more seizures, are needed to better understand the real potential of this technique.
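    As an illustration of one of the three imaging methods named above, the following is a minimal sketch of the Gramian Angular (Summation) Field in its standard textbook form. The function name, the min-max rescaling to [-1, 1], and the choice of the summation variant are assumptions made for this example; the abstract does not say which variant or preprocessing the authors use.

```python
import math

def gramian_angular_field(series):
    """Return the Gramian Angular Summation Field of a 1-D series as a nested list.

    Hypothetical helper for illustration only, not the authors' code.
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must contain at least two distinct values")
    # Rescale to [-1, 1] so that arccos is defined for every sample.
    scaled = [2.0 * (x - lo) / (hi - lo) - 1.0 for x in series]
    # Clamp to guard against tiny floating-point overshoot before arccos.
    phis = [math.acos(max(-1.0, min(1.0, s))) for s in scaled]
    # Entry (i, j) is cos(phi_i + phi_j); the result is a symmetric "image".
    return [[math.cos(a + b) for b in phis] for a in phis]
```

    Each segment of length n becomes an n-by-n matrix that can be fed to an image classifier such as a CNN.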
    Adjoint-aided inference of Gaussian process driven differential equations. (arXiv:2202.04589v2 [stat.ML] UPDATED)
    Linear systems occur throughout engineering and the sciences, most notably as differential equations. In many cases the forcing function for the system is unknown, and interest lies in using noisy observations of the system to infer the forcing, as well as other unknown parameters. In differential equations, the forcing function is an unknown function of the independent variables (typically time and space), and can be modelled as a Gaussian process (GP). In this paper we show how the adjoint of a linear system can be used to efficiently infer forcing functions modelled as GPs, using a truncated basis expansion of the GP kernel. We show how exact conjugate Bayesian inference for the truncated GP can be achieved, in many cases with substantially lower computation than would be required using MCMC methods. We demonstrate the approach on systems of both ordinary and partial differential equations, and show that the basis expansion approach approximates well the true forcing with a modest number of basis vectors. Finally, we show how to infer point estimates for the non-linear model parameters, such as the kernel length-scales, using Bayesian optimisation.
    EBMs vs. CL: Exploring Self-Supervised Visual Pretraining for Visual Question Answering. (arXiv:2206.14355v1 [cs.CV])
    The availability of clean and diverse labeled data is a major roadblock for training models on complex tasks such as visual question answering (VQA). The extensive work on large vision-and-language models has shown that self-supervised learning is effective for pretraining multimodal interactions. In this technical report, we focus on visual representations. We review and evaluate self-supervised methods to leverage unlabeled images and pretrain a model, which we then fine-tune on a custom VQA task that allows controlled evaluation and diagnosis. We compare energy-based models (EBMs) with contrastive learning (CL). While EBMs are growing in popularity, they lack an evaluation on downstream tasks. We find that both EBMs and CL can learn representations from unlabeled images that enable training a VQA model on very little annotated data. In a simple setting similar to CLEVR, we find that CL representations also improve systematic generalization, and even match the performance of representations from a larger, supervised, ImageNet-pretrained model. However, we find EBMs to be difficult to train because of instabilities and high variability in their results. Although EBMs prove useful for OOD detection, other results on supervised energy-based training and uncertainty calibration are largely negative. Overall, CL currently seems a preferable option over EBMs.
    An Auto-Regressive Formulation for Smoothing and Moving Mean with Exponentially Tapered Windows. (arXiv:2206.14749v1 [cs.LG])
    We investigate an auto-regressive formulation for the problem of smoothing time-series by manipulating the inherent objective function of the traditional moving mean smoothers. Not only do the auto-regressive smoothers enforce a higher degree of smoothing, they are also just as efficient as the traditional moving means and can be optimized accordingly with respect to the input dataset. Interestingly, the auto-regressive models result in moving means with exponentially tapered windows.
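    The link between an auto-regressive recursion and an exponentially tapered window can be seen in a small sketch. This is a generic first-order recursion chosen for illustration, not the paper's formulation; the function names, the parameter alpha, and the initialisation with the first sample are assumptions.

```python
def ar_smooth(xs, alpha=0.8):
    """First-order recursion: y[t] = alpha * y[t-1] + (1 - alpha) * x[t]."""
    ys = []
    prev = xs[0]  # assumed initial state: the first sample
    for x in xs:
        prev = alpha * prev + (1 - alpha) * x
        ys.append(prev)
    return ys

def tapered_mean(xs, alpha=0.8):
    """The same output written as a weighted mean with exponentially decaying weights."""
    ys = []
    for t in range(len(xs)):
        # Leftover weight alpha**(t+1) stays on the initial state xs[0].
        acc = alpha ** (t + 1) * xs[0]
        for k in range(t + 1):
            # The sample k steps in the past gets weight (1 - alpha) * alpha**k.
            acc += (1 - alpha) * alpha ** k * xs[t - k]
        ys.append(acc)
    return ys
```

    Both functions return the same values up to rounding, but the recursion costs one multiply-add per sample while the explicit window costs work proportional to the window length.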
    Trial2Vec: Zero-Shot Clinical Trial Document Similarity Search using Self-Supervision. (arXiv:2206.14719v1 [cs.CL])
    Clinical trials are essential for drug development but are extremely expensive and time-consuming to conduct. It is beneficial to study similar historical trials when designing a clinical trial. However, lengthy trial documents and lack of labeled data make trial similarity search difficult. We propose a zero-shot clinical trial retrieval method, Trial2Vec, which learns through self-supervision without annotating similar clinical trials. Specifically, the meta-structure of trial documents (e.g., title, eligibility criteria, target disease) along with clinical knowledge (e.g., UMLS knowledge base https://www.nlm.nih.gov/research/umls/index.html) are leveraged to automatically generate contrastive samples. Besides, Trial2Vec encodes trial documents considering meta-structure thus producing compact embeddings aggregating multi-aspect information from the whole document. We show that our method yields medically interpretable embeddings by visualization and it gets a 15% average improvement over the best baselines on precision/recall for trial retrieval, which is evaluated on our labeled 1600 trial pairs. In addition, we prove the pre-trained embeddings benefit the downstream trial outcome prediction task over 240k trials.
    When Does Group Invariant Learning Survive Spurious Correlations?. (arXiv:2206.14534v1 [cs.LG])
    By inferring latent groups in the training data, recent works introduce invariant learning to the case where environment annotations are unavailable. Typically, learning group invariance under a majority/minority split is empirically shown to be effective in improving out-of-distribution generalization on many datasets. However, theoretical guarantee for these methods on learning invariant mechanisms is lacking. In this paper, we reveal the insufficiency of existing group invariant learning methods in preventing classifiers from depending on spurious correlations in the training set. Specifically, we propose two criteria on judging such sufficiency. Theoretically and empirically, we show that existing methods can violate both criteria and thus fail in generalizing to spurious correlation shifts. Motivated by this, we design a new group invariant learning method, which constructs groups with statistical independence tests, and reweights samples by group label proportion to meet the criteria. Experiments on both synthetic and real data demonstrate that the new method significantly outperforms existing group invariant learning methods in generalizing to spurious correlation shifts.
    A Multilingual Dataset of COVID-19 Vaccination Attitudes on Twitter. (arXiv:2206.14619v1 [cs.CL])
    Vaccine hesitancy is considered as one main cause of the stagnant uptake ratio of COVID-19 vaccines in Europe and the US where vaccines are sufficiently supplied. Fast and accurate grasp of public attitudes toward vaccination is critical to address vaccine hesitancy, and social media platforms have proved to be an effective source of public opinions. In this paper, we describe the collection and release of a dataset of tweets related to COVID-19 vaccines. This dataset consists of the IDs of 2,198,090 tweets collected from Western Europe, 17,934 of which are annotated with the originators' vaccination stances. Our annotation will facilitate using and developing data-driven models to extract vaccination attitudes from social media posts and thus further confirm the power of social media in public health surveillance. To lay the groundwork for future research, we not only perform statistical analysis and visualisation of our dataset, but also evaluate and compare the performance of established text-based benchmarks in vaccination stance extraction. We demonstrate one potential use of our data in practice in tracking the temporal changes of public COVID-19 vaccination attitudes.
    Probabilistic Models for Manufacturing Lead Times. (arXiv:2204.13792v2 [cs.LG] UPDATED)
    In this study, we utilize Gaussian processes, probabilistic neural network, natural gradient boosting, and quantile regression augmented gradient boosting to model lead times of laser manufacturing processes. We introduce probabilistic modelling in the domain and compare the models in terms of different abilities. While providing a comparison between the models in real-life data, our work has many use cases and substantial business value. Our results indicate that all of the models beat the company estimation benchmark that uses domain experience and have good calibration with the empirical frequencies.
    What Can Secondary Predictions Tell Us? An Exploration on Question-Answering with SQuAD-v2.0. (arXiv:2206.14348v1 [cs.CL])
    Performance in natural language processing, and specifically for the question-answer task, is typically measured by comparing a model's most confident (primary) prediction to golden answers (the ground truth). We are making the case that it is also useful to quantify how close a model came to predicting a correct answer even for examples that failed. We define the Golden Rank (GR) of an example as the rank of its most confident prediction that exactly matches a ground truth, and show why such a match always exists. For the 16 transformer models we analyzed, the majority of exactly matched golden answers in secondary prediction space hover very close to the top rank. We refer to secondary predictions as those ranking above 0 in descending confidence probability order. We demonstrate how the GR can be used to classify questions and visualize their spectrum of difficulty, from persistent near successes to persistent extreme failures. We derive a new aggregate statistic over entire test sets, named the Golden Rank Interpolated Median (GRIM) that quantifies the proximity of failed predictions to the top choice made by the model. To develop some intuition and explore the applicability of these metrics we use the Stanford Question Answering Dataset (SQuAD-2) and a few popular transformer models from the Hugging Face hub. We first demonstrate that the GRIM is not directly correlated with the F1 and exact match (EM) scores. We then calculate and visualize these scores for various transformer architectures, probe their applicability in error analysis by clustering failed predictions, and compare how they relate to other training diagnostics such as the EM and F1 scores. We finally suggest various research goals, such as broadening data collection for these metrics and their possible use in adversarial training.
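    The Golden Rank definition above can be sketched in a few lines. The function name, the (text, confidence) input format, and the use of plain string equality for "exact match" are assumptions for this example; the paper's own matching and normalisation rules are not given in the abstract.

```python
def golden_rank(predictions, golden_answers):
    """Rank (0 = primary) of the most confident prediction that exactly matches a golden answer.

    predictions: iterable of (answer_text, confidence) pairs (assumed format).
    golden_answers: collection of ground-truth answer strings.
    Returns None if the supplied candidate list contains no match,
    which can happen here only because the list may be truncated.
    """
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)
    for rank, (text, _confidence) in enumerate(ranked):
        if text in golden_answers:
            return rank
    return None
```

    For example, if the top prediction is wrong and the second one matches, the example has a Golden Rank of 1, a near success in the abstract's terms.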
    Auto-Encoder-Extreme Learning Machine Model for Boiler NOx Emission Concentration Prediction. (arXiv:2206.14496v1 [cs.LG])
    An auto-encoder (AE) and extreme learning machine (ELM) model, AE-ELM, is proposed to predict the NOx emission concentration based on the combination of a mutual information (MI) algorithm, AE, and ELM. First, the importance of practical variables is computed by the MI algorithm, and the mechanism is analyzed to determine the variables related to the NOx emission concentration. Then, the time delay correlations between the selected variables and NOx emission concentration are further analyzed to reconstruct the modeling data. Subsequently, the AE is applied to extract hidden features within the input variables. Finally, an ELM algorithm establishes the relationship between the NOx emission concentration and deep features. The experimental results on practical data indicate that the proposed model shows promising performance compared to state-of-the-art models.
    PyEPO: A PyTorch-based End-to-End Predict-then-Optimize Library for Linear and Integer Programming. (arXiv:2206.14234v1 [math.OC])
    In deterministic optimization, it is typically assumed that all parameters of the problem are fixed and known. In practice, however, some parameters may be a priori unknown but can be estimated from historical data. A typical predict-then-optimize approach separates predictions and optimization into two stages. Recently, end-to-end predict-then-optimize has become an attractive alternative. In this work, we present the PyEPO package, a PyTorch-based end-to-end predict-then-optimize library in Python. To the best of our knowledge, PyEPO (pronounced like "pineapple" with a silent "n") is the first such generic tool for linear and integer programming with predicted objective function coefficients. It provides two base algorithms: the first is based on the convex surrogate loss function from the seminal work of Elmachtoub & Grigas (2021), and the second is based on the differentiable black-box solver approach of Vlastelica et al. (2019). PyEPO provides a simple interface for the definition of new optimization problems, the implementation of state-of-the-art predict-then-optimize training algorithms, the use of custom neural network architectures, and the comparison of end-to-end approaches with the two-stage approach. PyEPO enables us to conduct a comprehensive set of experiments comparing a number of end-to-end and two-stage approaches along axes such as prediction accuracy, decision quality, and running time on problems such as Shortest Path, Multiple Knapsack, and the Traveling Salesperson Problem. We discuss some empirical insights from these experiments which could guide future research. PyEPO and its documentation are available at https://github.com/khalil-research/PyEPO.
    Benchmarking Bayesian Improved Surname Geocoding Against Machine Learning Methods. (arXiv:2206.14583v1 [cs.LG])
    Bayesian Improved Surname Geocoding (BISG) is the most popular method for proxying race/ethnicity in voter registration files that do not contain it. This paper benchmarks BISG against a range of previously untested machine learning alternatives, using voter files with self-reported race/ethnicity from California, Florida, North Carolina, and Georgia. This analysis yields three key findings. First, when given the exact same inputs, BISG and machine learning perform similarly for estimating aggregate racial/ethnic composition. Second, machine learning outperforms BISG at individual classification of race/ethnicity. Third, the performance of all methods varies substantially across states. These results suggest that pre-trained machine learning models are preferable to BISG for individual classification. Furthermore, mixed results at the precinct level and across states underscore the need for researchers to empirically validate their chosen race/ethnicity proxy in their populations of interest.
    Bottleneck Low-rank Transformers for Low-resource Spoken Language Understanding. (arXiv:2206.14318v1 [cs.CL])
    End-to-end spoken language understanding (SLU) systems benefit from pretraining on large corpora, followed by fine-tuning on application-specific data. The resulting models are too large for on-edge applications. For instance, BERT-based systems contain over 110M parameters. Observing the model is overparameterized, we propose lean transformer structure where the dimension of the attention mechanism is automatically reduced using group sparsity. We propose a variant where the learned attention subspace is transferred to an attention bottleneck layer. In a low-resource setting and without pre-training, the resulting compact SLU model achieves accuracies competitive with pre-trained large models.
    Evaluating Generative Patent Language Models. (arXiv:2206.14578v1 [cs.CL])
    This research aims to build generative language models in the patent domain and to evaluate the models from a human-centric perspective. The evaluation metric is to calculate the ratio of keystrokes that can be saved for a user in an autocomplete context based on the prediction of the generative models. The performance of models in different sizes can also be evaluated in such a metric by measuring a number of newly granted patents. On the basis of the metric, it is found that the largest model is not necessarily the best. Several models are pre-trained from scratch with patent corpus and are released. The experiments in this manuscript focus on patent claims, but the ideas and implementation can be applied to other parts of a patent document. Furthermore, this research is motivated to measure how close the pre-trained language model can generate a newly granted patent claim. Or, conversely, the task is to measure the probabilities for the model to generate each token text given the newly granted patent claim. In addition, this manuscript raises several legal implications on patent law for potential interdisciplinary research in the future. In particular, can the metric based on model prediction be a metric to measure the nonobviousness requirement in the patent law?
    Spherical Channels for Modeling Atomic Interactions. (arXiv:2206.14331v1 [physics.chem-ph])
    Modeling the energy and forces of atomic systems is a fundamental problem in computational chemistry with the potential to help address many of the world's most pressing problems, including those related to energy scarcity and climate change. These calculations are traditionally performed using Density Functional Theory, which is computationally very expensive. Machine learning has the potential to dramatically improve the efficiency of these calculations from days or hours to seconds. We propose the Spherical Channel Network (SCN) to model atomic energies and forces. The SCN is a graph neural network where nodes represent atoms and edges their neighboring atoms. The atom embeddings are a set of spherical functions, called spherical channels, represented using spherical harmonics. We demonstrate that by rotating the embeddings based on the 3D edge orientation, more information may be utilized while maintaining the rotational equivariance of the messages. While equivariance is a desirable property, we find that by relaxing this constraint in both message passing and aggregation, improved accuracy may be achieved. We demonstrate state-of-the-art results on the large-scale Open Catalyst 2020 dataset in both energy and force prediction for numerous tasks and metrics.
    TE2Rules: Extracting Rule Lists from Tree Ensembles. (arXiv:2206.14359v1 [cs.LG])
    Tree Ensemble (TE) models (e.g. Gradient Boosted Trees and Random Forests) often provide higher prediction performance compared to single decision trees. However, TE models generally lack transparency and interpretability, as humans have difficulty understanding their decision logic. This paper presents a novel approach to convert a TE trained for a binary classification task, to a rule list (RL) that is a global equivalent to the TE and is comprehensible for a human. This RL captures all necessary and sufficient conditions for decision making by the TE. Experiments on benchmark datasets demonstrate that, compared to state-of-the-art methods, (i) predictions from the RL generated by TE2Rules have high fidelity with respect to the original TE, (ii) the RL from TE2Rules has high interpretability measured by the number and the length of the decision rules, (iii) the run-time of TE2Rules algorithm can be reduced significantly at the cost of a slightly lower fidelity, and (iv) the RL is a fast alternative to the state-of-the-art rule-based instance-level outcome explanation techniques.
    Online Anomaly Detection Based On Reservoir Sampling and LOF for IoT devices. (arXiv:2206.14265v1 [cs.LG])
    The growing number of IoT devices and their use to monitor the operation of machines and equipment increases interest in anomaly detection algorithms running on devices. However, the difficulty lies in the limited computational and memory resources available on the devices. In the case of microcontrollers (MCUs), these are single megabytes of program memory and several hundred kilobytes of working memory. Consequently, algorithms must be appropriately matched to the capabilities of the devices. In the paper, we analyse the processing pipeline for anomaly detection and the implementation of the Local Outlier Factor (LOF) algorithm on an MCU. We also show that it is possible to train such an algorithm directly on the device, which gives great potential to use the solution in real devices.
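    The reservoir sampling named in the title keeps a fixed-size uniform sample of an unbounded stream, which is what makes it attractive under tight memory limits. Below is a minimal sketch of the classic "Algorithm R" variant; the function name and parameters are assumptions, the abstract does not say which variant the authors use, and the LOF scoring step is not shown.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Memory use is O(k) regardless of how many items the stream yields.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # random.randint is inclusive on both ends, so j is uniform on [0, i].
            j = rng.randint(0, i)
            if j < k:
                # Item i replaces a stored item with probability k / (i + 1).
                reservoir[j] = item
    return reservoir
```

    On a device, the reservoir would serve as the bounded reference set against which new readings are scored.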
    Why patient data cannot be easily forgotten?. (arXiv:2206.14541v1 [cs.LG])
    Rights provisioned within data protection regulations permit patients to request that knowledge about their information be eliminated by data holders. With the advent of AI learned on data, one can imagine that such rights can extend to requests for forgetting knowledge of patients' data within AI models. However, forgetting patients' imaging data from AI models is still an under-explored problem. In this paper, we study the influence of patient data on model performance and formulate two hypotheses for a patient's data: either they are common and similar to other patients or form edge cases, i.e. unique and rare cases. We show that it is not possible to easily forget patient data. We propose a targeted forgetting approach to perform patient-wise forgetting. Extensive experiments on the benchmark Automated Cardiac Diagnosis Challenge dataset showcase the improved performance of the proposed targeted forgetting approach as opposed to a state-of-the-art method.
    Overview of Deep Learning-based CSI Feedback in Massive MIMO Systems. (arXiv:2206.14383v1 [eess.SP])
    Many performance gains achieved by massive multiple-input and multiple-output depend on the accuracy of the downlink channel state information (CSI) at the transmitter (base station), which is usually obtained by estimating at the receiver (user terminal) and feeding back to the transmitter. The overhead of CSI feedback occupies substantial uplink bandwidth resources, especially when the number of the transmit antennas is large. Deep learning (DL)-based CSI feedback refers to CSI compression and reconstruction by a DL-based autoencoder and can greatly reduce feedback overhead. In this paper, a comprehensive overview of state-of-the-art research on this topic is provided, beginning with basic DL concepts widely used in CSI feedback and then categorizing and describing some existing DL-based feedback works. The focus is on novel neural network architectures and utilization of communication expert knowledge to improve CSI feedback accuracy. Works on bit-level CSI feedback and joint design of CSI feedback with other communication modules are also introduced, and some practical issues, including training dataset collection, online training, complexity, generalization, and standardization effect, are discussed. At the end of the paper, some challenges and potential research directions associated with DL-based CSI feedback in future wireless communication systems are identified.
    Reinforcement Learning in Medical Image Analysis: Concepts, Applications, Challenges, and Future Directions. (arXiv:2206.14302v1 [cs.CV])
    Motivation: Medical image analysis involves tasks to assist physicians in qualitative and quantitative analysis of lesions or anatomical structures, significantly improving the accuracy and reliability of diagnosis and prognosis. Traditionally, these tasks are finished by physicians or medical physicists and lead to two major problems: (i) low efficiency; (ii) biased by personal experience. In the past decade, many machine learning methods have been applied to accelerate and automate the image analysis process. Compared to the enormous deployments of supervised and unsupervised learning models, attempts to use reinforcement learning in medical image analysis are scarce. This review article could serve as the stepping-stone for related research. Significance: From our observation, though reinforcement learning has gradually gained momentum in recent years, many researchers in the medical analysis field find it hard to understand and deploy in clinics. One cause is lacking well-organized review articles targeting readers lacking professional computer science backgrounds. Rather than providing a comprehensive list of all reinforcement learning models in medical image analysis, this paper may help the readers to learn how to formulate and solve their medical image analysis research as reinforcement learning problems. Approach & Results: We selected published articles from Google Scholar and PubMed. Considering the scarcity of related articles, we also included some outstanding newest preprints. The papers are carefully reviewed and categorized according to the type of image analysis task. We first review the basic concepts and popular models of reinforcement learning. Then we explore the applications of reinforcement learning models in landmark detection. Finally, we conclude the article by discussing the reviewed reinforcement learning approaches' limitations and possible improvements.
    On the Robustness of Dialogue History Representation in Conversational Question Answering: A Comprehensive Study and a New Prompt-based Method. (arXiv:2206.14796v1 [cs.CL])
    Most works on modeling the conversation history in Conversational Question Answering (CQA) report a single main result on a common CQA benchmark. While existing models show impressive results on CQA leaderboards, it remains unclear whether they are robust to shifts in setting (sometimes to more realistic ones), training data size (e.g. from large to small sets) and domain. In this work, we design and conduct the first large-scale robustness study of history modeling approaches for CQA. We find that high benchmark scores do not necessarily translate to strong robustness, and that various methods can perform extremely differently under different settings. Equipped with the insights from our study, we design a novel prompt-based history modeling approach, and demonstrate its strong robustness across various settings. Our approach is inspired by existing methods that highlight historic answers in the passage. However, instead of highlighting by modifying the passage token embeddings, we add textual prompts directly in the passage text. Our approach is simple, easy-to-plug into practically any model, and highly effective, thus we recommend it as a starting point for future model developers. We also hope that our study and insights will raise awareness to the importance of robustness-focused evaluation, in addition to obtaining high leaderboard scores, leading to better CQA systems.
    Applications of Reinforcement Learning in Finance -- Trading with a Double Deep Q-Network. (arXiv:2206.14267v1 [cs.LG])
    This paper presents a Double Deep Q-Network algorithm for trading single assets, namely the E-mini S&P 500 continuous futures contract. We use a proven setup as the foundation for our environment with multiple extensions. The features of our trading agent are constantly being expanded to include additional assets such as commodities, resulting in four models. We also respond to environmental conditions, including costs and crises. Our trading agent is first trained for a specific time period and tested on new data and compared with the long-and-hold strategy as a benchmark (market). We analyze the differences between the various models and the in-sample/out-of-sample performance with respect to the environment. The experimental results show that the trading agent follows an appropriate behavior. It can adjust its policy to different circumstances, such as more extensive use of the neutral position when trading costs are present. Furthermore, the net asset value exceeded that of the benchmark, and the agent outperformed the market in the test set. We provide initial insights into the behavior of an agent in a financial domain using a DDQN algorithm. The results of this study can be used for further development.
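    For readers unfamiliar with the Double Deep Q-Network, its distinguishing step is the bootstrap target: the online network selects the next action and the target network evaluates it. The sketch below shows that standard target computation on plain Python lists; the function name, argument layout, and default discount are assumptions for illustration and say nothing about the paper's trading environment or network architecture.

```python
def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Compute Double DQN targets for a batch of transitions.

    rewards:        list of floats, one per transition
    dones:          list of bools, True if the episode ended at this transition
    q_online_next:  per-transition lists of online-network Q-values for the next state
    q_target_next:  per-transition lists of target-network Q-values for the next state
    """
    targets = []
    for r, done, q_on, q_tg in zip(rewards, dones, q_online_next, q_target_next):
        # Action selection uses the online network...
        best_action = max(range(len(q_on)), key=q_on.__getitem__)
        # ...while action evaluation uses the target network.
        bootstrap = 0.0 if done else gamma * q_tg[best_action]
        targets.append(r + bootstrap)
    return targets
```

    Decoupling selection from evaluation in this way is the usual remedy for the overestimation bias of plain Q-learning targets.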
    Fair Machine Learning in Healthcare: A Review. (arXiv:2206.14397v1 [cs.LG])
    Benefiting from the digitization of healthcare data and the development of computing power, machine learning methods are increasingly used in the healthcare domain. Fairness problems have been identified in machine learning for healthcare, resulting in an unfair allocation of limited healthcare resources or excessive health risks for certain groups. Therefore, addressing the fairness problems has recently attracted increasing attention from the healthcare community. However, the intersection of machine learning for healthcare and fairness in machine learning remains understudied. In this review, we build the bridge by exposing fairness problems, summarizing possible biases, sorting out mitigation methods and pointing out challenges along with opportunities for the future.
    Predicting the Need for Blood Transfusion in Intensive Care Units with Reinforcement Learning. (arXiv:2206.14198v1 [cs.LG])
    As critically ill patients frequently develop anemia or coagulopathy, transfusion of blood products is a frequent intervention in the Intensive Care Units (ICU). However, inappropriate transfusion decisions made by physicians are often associated with increased risk of complications and higher hospital costs. In this work, we aim to develop a decision support tool that uses available patient information for transfusion decision-making on three common blood products (red blood cells, platelets, and fresh frozen plasma). To this end, we adopt an off-policy batch reinforcement learning (RL) algorithm, namely, discretized Batch Constrained Q-learning, to determine the best action (transfusion or not) given observed patient trajectories. Simultaneously, we consider different state representation approaches and reward design mechanisms to evaluate their impacts on policy learning. Experiments are conducted on two real-world critical care datasets: the MIMIC-III and the UCSF. Results demonstrate that policy recommendations on transfusion achieved comparable matching against true hospital policies via accuracy and weighted importance sampling evaluations on the MIMIC-III dataset. Furthermore, a combination of transfer learning (TL) and RL on the data-scarce UCSF dataset can provide up to 17.02% improvement in terms of accuracy, and up to 18.94% and 21.63% improvement in jump-start and asymptotic performance in terms of weighted importance sampling averaged over three transfusion tasks. Finally, simulations on transfusion decisions suggest that the transferred RL policy could reduce patients' estimated 28-day mortality rate by 2.74% and decreased acuity rate by 1.18% on the UCSF dataset.
    Beyond neural scaling laws: beating power law scaling via data pruning. (arXiv:2206.14486v1 [cs.LG])
    Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, these improvements through scaling alone require considerable costs in compute and energy. Here we focus on the scaling of error with dataset size and show how both in theory and practice we can break beyond power law scaling and reduce it to exponential scaling instead if we have access to a high-quality data pruning metric that ranks the order in which training examples should be discarded to achieve any pruned dataset size. We then test this new exponential scaling prediction with pruned dataset size empirically, and indeed observe better than power law scaling performance on ResNets trained on CIFAR-10, SVHN, and ImageNet. Given the importance of finding high-quality pruning metrics, we perform the first large-scale benchmarking study of ten different data pruning metrics on ImageNet. We find most existing high performing metrics scale poorly to ImageNet, while the best are computationally intensive and require labels for every image. We therefore developed a new simple, cheap and scalable self-supervised pruning metric that demonstrates comparable performance to the best supervised metrics. Overall, our work suggests that the discovery of good data-pruning metrics may provide a viable path forward to substantially improved neural scaling laws, thereby reducing the resource costs of modern deep learning.
    ECG Heartbeat classification using deep transfer learning with Convolutional Neural Network and STFT technique. (arXiv:2206.14200v1 [cs.LG])
    Electrocardiogram (ECG) is a simple non-invasive measure to identify heart-related issues such as irregular heartbeats known as arrhythmias. While artificial intelligence and machine learning are being utilized in a wide range of healthcare-related applications and datasets, many arrhythmia classifiers using deep learning methods have been proposed in recent years. However, the available datasets from which to build and assess machine learning models are often very small, and the lack of well-annotated public ECG datasets is evident. In this paper, we propose a deep transfer learning framework that aims to perform classification on a small training dataset. The proposed method is to fine-tune a general-purpose image classifier, ResNet-18, with the MIT-BIH arrhythmia dataset in accordance with the AAMI EC57 standard. This paper further investigates many existing deep learning models that have failed to avoid data leakage against AAMI recommendations. We compare how different data split methods impact the model performance. This comparison study implies that future work in arrhythmia classification should follow the AAMI EC57 standard when using any dataset, including the MIT-BIH arrhythmia dataset.
    Knowledge Graph Fusion for Language Model Fine-tuning. (arXiv:2206.14574v1 [cs.CL])
    Language Models such as BERT have grown in popularity due to their ability to be pre-trained and perform robustly on a wide range of Natural Language Processing tasks. Often seen as an evolution over traditional word embedding techniques, they can produce semantic representations of text, useful for tasks such as semantic similarity. However, state-of-the-art models often have high computational requirements and lack global context or domain knowledge which is required for complete language understanding. To address these limitations, we investigate the benefits of knowledge incorporation into the fine-tuning stages of BERT. An existing K-BERT model, which enriches sentences with triplets from a Knowledge Graph, is adapted for the English language and extended to inject contextually relevant information into sentences. As a side-effect, changes made to K-BERT for accommodating the English language also extend to other word-based languages. Experiments conducted indicate that injected knowledge introduces noise. We see statistically significant improvements for knowledge-driven tasks when this noise is minimised. We show evidence that, given the appropriate task, modest injection with relevant, high-quality knowledge is most performant.
    Generative Anomaly Detection for Time Series Datasets. (arXiv:2206.14597v1 [cs.LG])
    Traffic congestion anomaly detection is of paramount importance in intelligent traffic systems. The goals of transportation agencies are two-fold: to monitor the general traffic conditions in the area of interest and to locate road segments under abnormal congestion states. Modeling congestion patterns can achieve these goals for citywide roadways, which amounts to learning the distribution of multivariate time series (MTS). However, existing works are either not scalable or unable to capture the spatial-temporal information in MTS simultaneously. To this end, we propose a principled and comprehensive framework consisting of a data-driven generative approach that can perform tractable density estimation for detecting traffic anomalies. Our approach first clusters segments in the feature space and then uses conditional normalizing flow to identify anomalous temporal snapshots at the cluster level in an unsupervised setting. Then, we identify anomalies at the segment level by using a kernel density estimator on the anomalous cluster. Extensive experiments on synthetic datasets show that our approach significantly outperforms several state-of-the-art congestion anomaly detection and diagnosis methods in terms of Recall and F1-Score. We also use the generative model to sample labeled data, which can train classifiers in a supervised setting, alleviating the lack of labeled data for anomaly detection in sparse settings.
    Cooperative Retriever and Ranker in Deep Recommenders. (arXiv:2206.14649v1 [cs.IR])
    Deep recommender systems jointly leverage the retrieval and ranking operations to generate the recommendation result. The retriever targets selecting a small set of relevant candidates from the entire item set with high efficiency; while the ranker, usually more precise but time-consuming, is supposed to identify the best items out of the retrieved candidates with high precision. However, the retriever and ranker are usually trained in poorly-cooperative ways, leading to limited recommendation performance when working as a whole. In this work, we propose a novel DRS training framework CoRR (short for Cooperative Retriever and Ranker), where the retriever and ranker can be mutually reinforced. On one hand, the retriever is learned from recommendation data and the ranker via knowledge distillation; knowing that the ranker is more precise, the knowledge distillation may provide extra weak-supervision signals for the improvement of retrieval quality. On the other hand, the ranker is trained by learning to discriminate the true positive items from hard negative candidates sampled from the retriever. With the iteration going on, the ranker may become more precise, which in return gives rise to informative training signals for the retriever; meanwhile, with the improvement of the retriever, harder negative candidates can be sampled, which contributes to a higher discriminative capability of the ranker. To facilitate the effective conduct of CoRR, an asymptotic-unbiased approximation of KL divergence is introduced for the knowledge distillation over sampled items; besides, a scalable and adaptive strategy is developed to efficiently sample from the retriever. Comprehensive experimental studies are performed over four large-scale benchmark datasets, where CoRR improves the overall recommendation quality resulting from the cooperation between retriever and ranker.
    Semi-supervised Contrastive Outlier removal for Pseudo Expectation Maximization (SCOPE). (arXiv:2206.14261v1 [cs.LG])
    Semi-supervised learning is the problem of training an accurate predictive model by combining a small labeled dataset with a presumably much larger unlabeled dataset. Many methods for semi-supervised deep learning have been developed, including pseudolabeling, consistency regularization, and contrastive learning techniques. Pseudolabeling methods however are highly susceptible to confounding, in which erroneous pseudolabels are assumed to be true labels in early iterations, thereby causing the model to reinforce its prior biases and thereby fail to generalize to strong predictive performance. We present a new approach to suppress confounding errors through a method we describe as Semi-supervised Contrastive Outlier removal for Pseudo Expectation Maximization (SCOPE). Like basic pseudolabeling, SCOPE is related to Expectation Maximization (EM), a latent variable framework which can be extended toward understanding cluster-assumption deep semi-supervised algorithms. However, unlike basic pseudolabeling which fails to adequately take into account the probability of the unlabeled samples given the model, SCOPE introduces an outlier suppression term designed to improve the behavior of EM iteration given a discrimination DNN backbone in the presence of outliers. Our results show that SCOPE greatly improves semi-supervised classification accuracy over a baseline, and furthermore when combined with consistency regularization achieves the highest reported accuracy for the semi-supervised CIFAR-10 classification task using 250 and 4000 labeled samples. Moreover, we show that SCOPE reduces the prevalence of confounding errors during pseudolabeling iterations by pruning erroneous high-confidence pseudolabeled samples that would otherwise contaminate the labeled set in subsequent retraining iterations.
    SALO: An Efficient Spatial Accelerator Enabling Hybrid Sparse Attention Mechanisms for Long Sequences. (arXiv:2206.14550v1 [cs.AR])
    The attention mechanisms of transformers effectively extract pertinent information from the input sequence. However, the quadratic complexity of self-attention w.r.t the sequence length incurs heavy computational and memory burdens, especially for tasks with long sequences. Existing accelerators face performance degradation in these tasks. To this end, we propose SALO to enable hybrid sparse attention mechanisms for long sequences. SALO contains a data scheduler to map hybrid sparse attention patterns onto hardware and a spatial accelerator to perform the efficient attention computation. We show that SALO achieves 17.66x and 89.33x speedup on average compared to GPU and CPU implementations, respectively, on typical workloads, i.e., Longformer and ViL.
    Computer-aided diagnosis and prediction in brain disorders. (arXiv:2206.14683v1 [cs.LG])
    Computer-aided methods have shown added value for diagnosing and predicting brain disorders and can thus support decision making in clinical care and treatment planning. This chapter will provide insight into the type of methods, their working, their input data - such as cognitive tests, imaging and genetic data - and the types of output they provide. We will focus on specific use cases for diagnosis, i.e. estimating the current 'condition' of the patient, such as early detection and diagnosis of dementia, differential diagnosis of brain tumours, and decision making in stroke. Regarding prediction, i.e. estimation of the future 'condition' of the patient, we will zoom in on use cases such as predicting the disease course in multiple sclerosis and predicting patient outcomes after treatment in brain cancer. Furthermore, based on these use cases, we will assess the current state-of-the-art methodology and highlight current efforts on benchmarking of these methods and the importance of open science therein. Finally, we assess the current clinical impact of computer-aided methods and discuss the required next steps to increase clinical impact.
    Open Problem: Properly learning decision trees in polynomial time? (arXiv:2206.14431v1 [cs.DS])
    The authors recently gave an $n^{O(\log\log n)}$ time membership query algorithm for properly learning decision trees under the uniform distribution (Blanc et al., 2021). The previous fastest algorithm for this problem ran in $n^{O(\log n)}$ time, a consequence of Ehrenfeucht and Haussler (1989)'s classic algorithm for the distribution-free setting. In this article we highlight the natural open problem of obtaining a polynomial-time algorithm, discuss possible avenues towards obtaining it, and state intermediate milestones that we believe are of independent interest.
    Revisiting Label Smoothing and Knowledge Distillation Compatibility: What was Missing? (arXiv:2206.14532v1 [cs.LG])
    This work investigates the compatibility between label smoothing (LS) and knowledge distillation (KD). Contemporary findings addressing this thesis statement take dichotomous standpoints: Muller et al. (2019) and Shen et al. (2021b). Critically, there is no effort to understand and resolve these contradictory findings, leaving the primal question -- to smooth or not to smooth a teacher network? -- unanswered. The main contributions of our work are the discovery, analysis and validation of systematic diffusion as the missing concept which is instrumental in understanding and resolving these contradictory findings. This systematic diffusion essentially curtails the benefits of distilling from an LS-trained teacher, thereby rendering KD at increased temperatures ineffective. Our discovery is comprehensively supported by large-scale experiments, analyses and case studies including image classification, neural machine translation and compact student distillation tasks spanning multiple datasets and teacher-student architectures. Based on our analysis, we suggest that practitioners use an LS-trained teacher with a low-temperature transfer to achieve high-performance students. Code and models are available at https://keshik6.github.io/revisiting-ls-kd-compatibility/
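    The two ingredients discussed in the abstract above, label smoothing and a distillation temperature, can be written down in a few lines. This is a minimal sketch of the standard textbook definitions, not the authors' released code; the function names and the smoothing parameter `alpha` are chosen here for illustration.

```python
import math


def smooth_labels(true_class, num_classes, alpha):
    """Standard label smoothing: (1 - alpha) * one_hot + alpha / num_classes."""
    off_value = alpha / num_classes
    return [
        1.0 - alpha + off_value if k == true_class else off_value
        for k in range(num_classes)
    ]


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a temperature, as used in distillation."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # Subtract the maximum for numerical stability.
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

    Raising the temperature flattens the teacher's output distribution, so comparing `softmax_with_temperature(logits, 1.0)` with `softmax_with_temperature(logits, 4.0)` gives a quick feel for what a low-temperature versus high-temperature transfer passes to the student.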
    Can Push-forward Generative Models Fit Multimodal Distributions? (arXiv:2206.14476v1 [stat.ML])
    Many generative models synthesize data by transforming a standard Gaussian random variable using a deterministic neural network. Among these models are the Variational Autoencoders and the Generative Adversarial Networks. In this work, we call them "push-forward" models and study their expressivity. We show that the Lipschitz constant of these generative networks has to be large in order to fit multimodal distributions. More precisely, we show that the total variation distance and the Kullback-Leibler divergence between the generated and the data distribution are bounded from below by a constant depending on the mode separation and the Lipschitz constant. Since constraining the Lipschitz constants of neural networks is a common way to stabilize generative models, there is a provable trade-off between the ability of push-forward models to approximate multimodal distributions and the stability of their training. We validate our findings on one-dimensional and image datasets and empirically show that generative models consisting of stacked networks with stochastic input at each step, such as diffusion models, do not suffer from such limitations.
    Target alignment in truncated kernel ridge regression. (arXiv:2206.14255v1 [cs.LG])
    Kernel ridge regression (KRR) has recently attracted renewed interest due to its potential for explaining the transient effects, such as double descent, that emerge during neural network training. In this work, we study how the alignment between the target function and the kernel affects the performance of the KRR. We focus on the truncated KRR (TKRR) which utilizes an additional parameter that controls the spectral truncation of the kernel matrix. We show that for polynomial alignment, there is an over-aligned regime, in which TKRR can achieve a faster rate than what is achievable by full KRR. The rate of TKRR can improve all the way to the parametric rate, while that of full KRR is capped at a sub-optimal value. This shows that target alignment can be better leveraged by utilizing spectral truncation in kernel methods. We also consider the bandlimited alignment setting and show that the regularization surface of TKRR can exhibit transient effects including multiple descent and non-monotonic behavior. Our results show that there is a strong and quantifiable relation between the shape of the alignment spectrum and the generalization performance of kernel methods, both in terms of rates and in finite samples.
    Adversarial Ensemble Training by Jointly Learning Label Dependencies and Member Models. (arXiv:2206.14477v1 [cs.LG])
    Training an ensemble of different sub-models has empirically proven to be an effective strategy to improve deep neural networks' adversarial robustness. Current ensemble training methods for image recognition usually encode the image labels by one-hot vectors, which neglect dependency relationships between the labels. Here we propose a novel adversarial training approach that learns the conditional dependencies between labels and the model ensemble jointly. We test our approach on the widely used datasets MNIST, FashionMNIST and CIFAR-10. Results show that our approach is more robust against black-box attacks compared with state-of-the-art methods. Our code is available at https://github.com/ZJLAB-AMMI/LSD.
    Theoretical Perspectives on Deep Learning Methods in Inverse Problems. (arXiv:2206.14373v1 [stat.ML])
    In recent years, there have been significant advances in the use of deep learning methods in inverse problems such as denoising, compressive sensing, inpainting, and super-resolution. While this line of work has predominantly been driven by practical algorithms and experiments, it has also given rise to a variety of intriguing theoretical problems. In this paper, we survey some of the prominent theoretical developments in this line of work, focusing in particular on generative priors, untrained neural network priors, and unfolding algorithms. In addition to summarizing existing results in these topics, we highlight several ongoing challenges and open problems.
    Towards Traffic Scene Description: The Semantic Scene Graph. (arXiv:2111.10196v2 [cs.LG] UPDATED)
    For the classification of traffic scenes, a description model is necessary that can describe the scene in a uniform way, independent of its domain. This paper presents a model that describes a traffic scene semantically. The description model allows a traffic scene to be described independently of the road geometry and road topology. Here, the traffic participants are projected onto the road network and represented as nodes in a graph. Depending on the relative location between two traffic participants with respect to the road topology, semantically classified edges are created between the corresponding nodes. For concretization, the edge attributes are extended by relative distances and velocities between both traffic participants with regard to the course of the lane. An important aspect of the description is that it can be converted easily into a machine-readable format. The current description focuses on dynamic objects of a traffic scene and considers traffic participants, such as pedestrians or vehicles.
    Multiresolution Equivariant Graph Variational Autoencoder. (arXiv:2106.00967v3 [cs.LG] UPDATED)
    In this paper, we propose Multiresolution Equivariant Graph Variational Autoencoders (MGVAE), the first hierarchical generative model to learn and generate graphs in a multiresolution and equivariant manner. At each resolution level, MGVAE employs higher order message passing to encode the graph while learning to partition it into mutually exclusive clusters and coarsening into a lower resolution that eventually creates a hierarchy of latent distributions. MGVAE then constructs a hierarchical generative model to variationally decode into a hierarchy of coarsened graphs. Importantly, our proposed framework is end-to-end permutation equivariant with respect to node ordering. MGVAE achieves competitive results with several generative tasks including general graph generation, molecular generation, unsupervised molecular representation learning to predict molecular properties, link prediction on citation graphs, and graph-based image generation.
    Building Matters: Spatial Variability in Machine Learning Based Thermal Comfort Prediction in Winters. (arXiv:2206.14202v1 [cs.LG])
    Thermal comfort in indoor environments has an enormous impact on the health, well-being, and performance of occupants. Given the focus on energy efficiency and Internet-of-Things enabled smart buildings, machine learning (ML) is being increasingly used for data-driven thermal comfort (TC) prediction. Generally, ML-based solutions are proposed for air-conditioned or HVAC ventilated buildings and the models are primarily designed for adults. On the other hand, naturally ventilated (NV) buildings are the norm in most countries. They are also ideal for energy conservation and long-term sustainability goals. However, the indoor environment of NV buildings lacks thermal regulation and varies significantly across spatial contexts. These factors make TC prediction extremely challenging. Thus, determining the impact of the building environment on the performance of TC models is important. Further, the generalization capability of TC prediction models across different NV indoor spaces needs to be studied. This work addresses these problems. Data is gathered through month-long field experiments conducted in 5 naturally ventilated school buildings, involving 512 primary school students. The impact of spatial variability on student comfort is demonstrated through variation in prediction accuracy (by as much as 71%). The influence of building environment on TC prediction is also demonstrated through variation in feature importance. Further, a comparative analysis of spatial variability in model performance is done for children (our dataset) and adults (ASHRAE-II database). Finally, the generalization capability of thermal comfort models in NV classrooms is assessed and major challenges are highlighted.
    Massively Increasing the number of Antibody-Virus Interactions across Studies. (arXiv:2206.14566v1 [q-bio.QM])
    A central challenge in every field of biology is to use existing measurements to predict the outcomes of future experiments. In this work, we consider the wealth of antibody inhibition data against variants of the influenza virus. Due to this virus's genetic diversity and evolvability, the variants examined in one study will often have little-to-no overlap with other studies, making it difficult to discern common patterns or unify datasets for further analysis. To that end, we develop a computational framework that predicts how an antibody or serum would inhibit any variant from any other study. We use this framework to greatly expand 7 influenza datasets utilizing hemagglutination inhibition, validating our method upon 200,000 existing measurements and predicting more than 2,000,000 new values along with their prediction uncertainties. This data-driven approach does not require any information beyond each virus's name and measurements, and even datasets with as few as 5 viruses can be expanded, making this approach widely applicable. Future influenza studies using hemagglutination inhibition can directly utilize our curated datasets to predict newly measured antibody responses against ~80 H3N2 influenza viruses from 1968-2011, whereas immunological studies utilizing other viruses or a different assay only need to find a single partially-overlapping dataset to extend their work. In essence, this approach enables a shift in perspective when analyzing data from "what you see is what you get" into "what anyone sees is what everyone gets."
    Learning Time Delay Systems with Neural Ordinary Differential Equations. (arXiv:2206.14288v1 [cs.LG])
    A novel way of using neural networks to learn the dynamics of time delay systems from sequential data is proposed. A neural network with trainable delays is used to approximate the right hand side of a delay differential equation. We relate the delay differential equation to an ordinary differential equation by discretizing the time history and train the corresponding neural ordinary differential equation (NODE) to learn the dynamics. An example on learning the dynamics of the Mackey-Glass equation using data from chaotic behavior is given. After learning both the nonlinearity and the time delay, we demonstrate that the bifurcation diagram of the neural network matches that of the original system.
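    The idea of relating a delay differential equation to an ordinary one by discretizing the time history can be illustrated on the Mackey-Glass equation itself, dx/dt = beta * x(t - tau) / (1 + x(t - tau)^n) - gamma * x(t). The sketch below integrates the known right-hand side with forward Euler and a fixed-length history buffer; it does not include the neural network or the trainable delay from the paper, and the parameter values are commonly used defaults assumed here, not values quoted in the abstract.

```python
from collections import deque


def simulate_mackey_glass(steps, dt=0.1, tau=17.0, beta=0.2, gamma=0.1, n=10, x0=1.2):
    """Integrate the Mackey-Glass delay equation with forward Euler.

    The delayed state x(t - tau) is read from a buffer holding the last
    tau / dt values, which is the discretized time history.
    """
    delay_steps = int(round(tau / dt))
    # Constant initial history: x(t) = x0 for all t in [-tau, 0].
    history = deque([x0] * (delay_steps + 1), maxlen=delay_steps + 1)
    trajectory = [x0]
    for _ in range(steps):
        x_now = history[-1]      # x(t)
        x_delayed = history[0]   # x(t - tau)
        dx = beta * x_delayed / (1.0 + x_delayed ** n) - gamma * x_now
        history.append(x_now + dt * dx)  # The oldest value drops out automatically.
        trajectory.append(history[-1])
    return trajectory
```

    In a learned version of this setup, the hand-written expression for `dx` would be replaced by a network evaluated on the buffered history, with the delay treated as a parameter to be fitted.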
    Collecting high-quality adversarial data for machine reading comprehension tasks with humans and models in the loop. (arXiv:2206.14272v1 [cs.CL])
    We present our experience as annotators in the creation of high-quality, adversarial machine-reading-comprehension data for extractive QA for Task 1 of the First Workshop on Dynamic Adversarial Data Collection (DADC). DADC is an emergent data collection paradigm with both models and humans in the loop. We set up a quasi-experimental annotation design and perform quantitative analyses across groups with different numbers of annotators focusing on successful adversarial attacks, cost analysis, and annotator confidence correlation. We further perform a qualitative analysis of our perceived difficulty of the task given the different topics of the passages in our dataset and conclude with recommendations and suggestions that might be of value to people working on future DADC tasks and related annotation interfaces.
    A Perturbation Bound on the Subspace Estimator from Canonical Projections. (arXiv:2206.14278v1 [stat.ML])
    This paper derives a perturbation bound on the optimal subspace estimator obtained from a subset of its canonical projections contaminated by noise. This fundamental result has important implications in matrix completion, subspace clustering, and related problems.
    Framing Algorithmic Recourse for Anomaly Detection. (arXiv:2206.14384v1 [cs.LG])
    The problem of algorithmic recourse has been explored for supervised machine learning models, to provide more interpretable, transparent and robust outcomes from decision support systems. An unexplored area is that of algorithmic recourse for anomaly detection, specifically for tabular data with only discrete feature values. Here the problem is to present a set of counterfactuals that are deemed normal by the underlying anomaly detection model so that applications can utilize this information for explanation purposes or to recommend countermeasures. We present an approach -- Context preserving Algorithmic Recourse for Anomalies in Tabular data (CARAT), that is effective, scalable, and agnostic to the underlying anomaly detection model. CARAT uses a transformer based encoder-decoder model to explain an anomaly by finding features with low likelihood. Subsequently semantically coherent counterfactuals are generated by modifying the highlighted features, using the overall context of features in the anomalous instance(s). Extensive experiments help demonstrate the efficacy of CARAT.
    Extracting Weighted Finite Automata from Recurrent Neural Networks for Natural Languages. (arXiv:2206.14621v1 [cs.CL])
    Recurrent Neural Networks (RNNs) have achieved tremendous success in sequential data processing. However, it is quite challenging to interpret and verify RNNs' behaviors directly. To this end, many efforts have been made to extract finite automata from RNNs. Existing approaches such as exact learning are effective in extracting finite-state models to characterize the state dynamics of RNNs for formal languages, but do not scale well to natural languages. Compositional approaches that are scalable to natural languages fall short in extraction precision. In this paper, we identify the transition sparsity problem that heavily impacts the extraction precision. To address this problem, we propose a transition rule extraction approach, which is scalable to natural language processing models and effective in improving extraction precision. Specifically, we propose an empirical method to complement the missing rules in the transition diagram. In addition, we further adjust the transition matrices to enhance the context-aware ability of the extracted weighted finite automaton (WFA). Finally, we propose two data augmentation tactics to track more dynamic behaviors of the target RNN. Experiments on two popular natural language datasets show that our method can extract WFA from RNN for natural language processing with better precision than existing approaches.
    Cross-Silo Heterogeneous Model Federated Multitask Learning. (arXiv:2202.08603v3 [cs.LG] UPDATED)
    Federated learning (FL) is a machine learning technique that enables participants to collaboratively train high-quality models without exchanging their private data. Participants utilizing cross-silo FL (CS-FL) settings are independent organizations with different task needs, and they are concerned not only with data privacy but also with independently training their unique models due to intellectual property considerations. Most existing FL methods are incapable of satisfying the above scenarios. In this paper, we propose an FL method based on the pseudolabeling of unlabeled data via a process such as cotraining. To the best of our knowledge, this is the first FL method that is simultaneously compatible with heterogeneous tasks, heterogeneous models, and heterogeneous training algorithms. Experimental results show that the proposed method achieves better performance than competing ones. This is especially true for non-independent and identically distributed (IID) settings and heterogeneous models, where the proposed method achieves a 35% performance improvement.
    Forgetting Data from Pre-trained GANs. (arXiv:2206.14389v1 [cs.LG])
    Large pre-trained generative models are known to occasionally provide samples that may be undesirable for various reasons. The standard way to mitigate this is to re-train the models differently. In this work, we take a different, more compute-friendly approach and investigate how to post-edit a model after training so that it forgets certain kinds of samples. We provide three different algorithms for GANs that differ on how the samples to be forgotten are described. Extensive evaluations on real-world image datasets show that our algorithms are capable of forgetting data while retaining high generation quality at a fraction of the cost of full re-training.
    Active Exploration via Experiment Design in Markov Chains. (arXiv:2206.14332v1 [cs.LG])
    A key challenge in science and engineering is to design experiments to learn about some unknown quantity of interest. Classical experimental design optimally allocates the experimental budget to maximize a notion of utility (e.g., reduction in uncertainty about the unknown quantity). We consider a rich setting, where the experiments are associated with states in a Markov chain, and we can only choose them by selecting a policy controlling the state transitions. This problem captures important applications, from exploration in reinforcement learning to spatial monitoring tasks. We propose an algorithm -- markov-design -- that efficiently selects policies whose measurement allocation provably converges to the optimal one. The algorithm is sequential in nature, adapting its choice of policies (experiments) informed by past measurements. In addition to our theoretical analysis, we showcase our framework on applications in ecological surveillance and pharmacology.
    Towards Robust Waveform-Based Acoustic Models. (arXiv:2110.08634v2 [cs.SD] UPDATED)
    We study the problem of learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. First, we characterize data augmentation theoretically as an instance of vicinal risk minimization, which aims at improving risk estimates during training by replacing the delta functions that define the empirical density over the input space with an approximation of the marginal population density in the vicinity of the training samples. More specifically, we assume that local neighborhoods centered at training samples can be approximated using a mixture of Gaussians, and demonstrate theoretically that this can incorporate robust inductive bias into the learning process. We then specify the individual mixture components implicitly via data augmentation schemes, designed to address common sources of spurious correlations in acoustic models. To avoid potential confounding effects on robustness due to information loss, which has been associated with standard feature extraction techniques (e.g., FBANK and MFCC features), we focus on the waveform-based setting. Our empirical results show that the approach can generalize to unseen noise conditions, with 150% relative improvement in out-of-distribution generalization compared to training using the standard risk minimization principle. Moreover, the results demonstrate competitive performance relative to models learned using a training sample designed to match the acoustic conditions characteristic of test utterances.
    Cyclical Kernel Adaptive Metropolis. (arXiv:2206.14421v1 [cs.LG])
    We propose cKAM, cyclical Kernel Adaptive Metropolis, which incorporates a cyclical stepsize scheme to allow control over exploration and sampling. We show that on a crafted bimodal distribution, existing Adaptive Metropolis type algorithms would fail to converge to the true posterior distribution. We point out that this is because adaptive samplers estimate the local/global covariance structure using the past history of the chain, which leads adaptive algorithms to become trapped in a local mode. We demonstrate that cKAM encourages exploration of the posterior distribution and allows the sampler to escape from a local mode, while maintaining the high performance of adaptive methods.
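    The idea of a cyclical stepsize can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the schedule, so the cosine form, the function names, and all parameter values here are hypothetical, and the sampler is a plain random-walk Metropolis chain rather than the kernel adaptive method of the paper.

```python
import math
import random

def cyclical_stepsize(t, cycle_length, max_step, min_step=0.0):
    """Cosine schedule that restarts every `cycle_length` iterations (assumed form)."""
    phase = (t % cycle_length) / cycle_length  # position in the current cycle, in [0, 1)
    return min_step + 0.5 * (max_step - min_step) * (1.0 + math.cos(math.pi * phase))

def log_bimodal(x):
    """Log-density, up to a constant, of an equal mixture of N(-4, 1) and N(4, 1)."""
    a = -0.5 * (x + 4.0) ** 2
    b = -0.5 * (x - 4.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def cyclical_random_walk_metropolis(n_steps, cycle_length=200,
                                    max_step=8.0, min_step=0.5, seed=0):
    """Random-walk Metropolis whose proposal scale follows the cyclical schedule."""
    rng = random.Random(seed)
    x = -4.0  # start inside one mode
    samples = []
    for t in range(n_steps):
        step = cyclical_stepsize(t, cycle_length, max_step, min_step)
        proposal = x + rng.gauss(0.0, step)
        # The proposal is symmetric, so the acceptance ratio is the density ratio.
        # 1 - random() lies in (0, 1], which keeps the logarithm finite.
        if math.log(1.0 - rng.random()) < log_bimodal(proposal) - log_bimodal(x):
            x = proposal
        samples.append(x)
    return samples
```

    Large steps early in each cycle favour jumps between modes, while the small steps late in the cycle favour local sampling.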
    DDKtor: Automatic Diadochokinetic Speech Analysis. (arXiv:2206.14639v1 [eess.AS])
    Diadochokinetic speech tasks (DDK), in which participants repeatedly produce syllables, are commonly used as part of the assessment of speech motor impairments. These studies rely on manual analyses that are time-intensive, subjective, and provide only a coarse-grained picture of speech. This paper presents two deep neural network models that automatically segment consonants and vowels from unannotated, untranscribed speech. Both models work on the raw waveform and use convolutional layers for feature extraction. The first model is based on an LSTM classifier followed by fully connected layers, while the second model adds more convolutional layers followed by fully connected layers. The segmentations predicted by the models are used to obtain measures of speech rate and sound duration. Results on a dataset of young healthy individuals show that our LSTM model outperforms the current state-of-the-art systems and performs comparably to trained human annotators. Moreover, the LSTM model also achieves results comparable to trained human annotators when evaluated on an unseen dataset of older individuals with Parkinson's Disease.
    Approximate Data Deletion in Generative Models. (arXiv:2206.14439v1 [cs.LG])
    Users have the right to have their data deleted by third-party learned systems, as codified by recent legislation such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Such data deletion can be accomplished by full re-training, but this incurs a high computational cost for modern machine learning models. To avoid this cost, many approximate data deletion methods have been developed for supervised learning. Unsupervised learning, in contrast, remains largely an open problem when it comes to (approximate or exact) efficient data deletion. In this paper, we propose a density-ratio-based framework for generative models. Using this framework, we introduce a fast method for approximate data deletion and a statistical test for estimating whether or not training points have been deleted. We provide theoretical guarantees under various learner assumptions and empirically demonstrate our methods across a variety of generative methods.
    On the power of adaptivity in statistical adversaries. (arXiv:2111.10352v2 [cs.LG] UPDATED)
    We study a fundamental question concerning adversarial noise models in statistical problems where the algorithm receives i.i.d. draws from a distribution $\mathcal{D}$. The definitions of these adversaries specify the type of allowable corruptions (noise model) as well as when these corruptions can be made (adaptivity); the latter differentiates between oblivious adversaries that can only corrupt the distribution $\mathcal{D}$ and adaptive adversaries that can have their corruptions depend on the specific sample $S$ that is drawn from $\mathcal{D}$. In this work, we investigate whether oblivious adversaries are effectively equivalent to adaptive adversaries, across all noise models studied in the literature. Specifically, can the behavior of an algorithm $\mathcal{A}$ in the presence of oblivious adversaries always be well-approximated by that of an algorithm $\mathcal{A}'$ in the presence of adaptive adversaries? Our first result shows that this is indeed the case for the broad class of statistical query algorithms, under all reasonable noise models. We then show that in the specific case of additive noise, this equivalence holds for all algorithms. Finally, we map out an approach towards proving this statement in its fullest generality, for all algorithms and under all reasonable noise models.
    Modeling Teams Performance Using Deep Representational Learning on Graphs. (arXiv:2206.14741v1 [cs.SI])
    The large majority of human activities require collaborations within and across formal or informal teams. Our understanding of how the collaborative efforts spent by teams relate to their performance is still a matter of debate. Teamwork results in a highly interconnected ecosystem of potentially overlapping components where tasks are performed in interaction with team members and across other teams. To tackle this problem, we propose a graph neural network model designed to predict a team's performance while identifying the drivers that determine such an outcome. In particular, the model is based on three architectural channels: topological, centrality, and contextual which capture different factors potentially shaping teams' success. We endow the model with two attention mechanisms to boost model performance and allow interpretability. A first mechanism allows pinpointing key members inside the team. A second mechanism allows us to quantify the contributions of the three driver effects in determining the outcome performance. We test model performance on a wide range of domains outperforming most of the classical and neural baselines considered. Moreover, we include synthetic datasets specifically designed to validate how the model disentangles the intended properties on which our model vastly outperforms baselines.
    MurTree: Optimal Classification Trees via Dynamic Programming and Search. (arXiv:2007.12652v4 [cs.LG] UPDATED)
    Decision tree learning is a widely used approach in machine learning, favoured in applications that require concise and interpretable models. Heuristic methods are traditionally used to quickly produce models with reasonably high accuracy. A commonly criticised point, however, is that the resulting trees may not necessarily be the best representation of the data in terms of accuracy and size. In recent years, this motivated the development of optimal classification tree algorithms that globally optimise the decision tree in contrast to heuristic methods that perform a sequence of locally optimal decisions. We follow this line of work and provide a novel algorithm for learning optimal classification trees based on dynamic programming and search. Our algorithm supports constraints on the depth of the tree and number of nodes. The success of our approach is attributed to a series of specialised techniques that exploit properties unique to classification trees. Whereas algorithms for optimal classification trees have traditionally been plagued by high runtimes and limited scalability, we show in a detailed experimental study that our approach uses only a fraction of the time required by the state-of-the-art and can handle datasets with tens of thousands of instances, providing several orders of magnitude improvements and notably contributing towards the practical realisation of optimal decision trees.
    An extensible Benchmarking Graph-Mesh dataset for studying Steady-State Incompressible Navier-Stokes Equations. (arXiv:2206.14709v1 [cs.LG])
    Recent progress in \emph{Geometric Deep Learning} (GDL) has shown its potential to provide powerful data-driven models. This gives momentum to explore new methods for learning physical systems governed by \emph{Partial Differential Equations} (PDEs) from Graph-Mesh data. However, despite the efforts and recent achievements, several research directions remain unexplored and progress is still far from satisfying the physical requirements of real-world phenomena. One of the major impediments is the absence of benchmarking datasets and common physics evaluation protocols. In this paper, we propose a 2-D graph-mesh dataset to study the airflow over airfoils at high Reynolds regime (from $10^6$ and beyond). We also introduce metrics on the stress forces over the airfoil in order to evaluate GDL models on important physical quantities. Moreover, we provide extensive GDL baselines.
    Cut Inner Layers: A Structured Pruning Strategy for Efficient U-Net GANs. (arXiv:2206.14658v1 [cs.LG])
    Pruning effectively compresses overparameterized models. Despite the success of pruning methods for discriminative models, applying them to generative models has rarely been explored. This study conducts structured pruning on U-Net generators of conditional GANs. A per-layer sensitivity analysis confirms that many unnecessary filters exist in the innermost layers near the bottleneck and can be substantially pruned. Based on this observation, we prune these filters from multiple inner layers or suggest alternative architectures by completely eliminating the layers. We evaluate our approach with Pix2Pix for image-to-image translation and Wav2Lip for speech-driven talking face generation. Our method outperforms global pruning baselines, demonstrating the importance of properly considering where to prune for U-Net generators.
    IBP Regularization for Verified Adversarial Robustness via Branch-and-Bound. (arXiv:2206.14772v1 [cs.LG])
    Recent works have tried to increase the verifiability of adversarially trained networks by running the attacks over domains larger than the original perturbations and adding various regularization terms to the objective. However, these algorithms either underperform or require complex and expensive stage-wise training procedures, hindering their practical applicability. We present IBP-R, a novel verified training algorithm that is both simple and effective. IBP-R induces network verifiability by coupling adversarial attacks on enlarged domains with a regularization term, based on inexpensive interval bound propagation, that minimizes the gap between the non-convex verification problem and its approximations. By leveraging recent branch-and-bound frameworks, we show that IBP-R obtains state-of-the-art verified robustness-accuracy trade-offs for small perturbations on CIFAR-10 while training significantly faster than relevant previous work. Additionally, we present UPB, a novel branching strategy that, relying on a simple heuristic based on $\beta$-CROWN, reduces the cost of state-of-the-art branching algorithms while yielding splits of comparable quality.
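    Interval bound propagation itself is simple enough to sketch. The snippet below propagates an input box through one linear layer and a ReLU; it does not reproduce the IBP-R regularization term or the branching strategy, and the function names are hypothetical.

```python
def interval_linear(lower, upper, weights, bias):
    """Bounds on W x + b when each x[i] lies in [lower[i], upper[i]]."""
    out_lower, out_upper = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            # A non-negative weight maps lower to lower and upper to upper;
            # a negative weight swaps the two input bounds.
            if w >= 0:
                lo += w * l
                hi += w * u
            else:
                lo += w * u
                hi += w * l
        out_lower.append(lo)
        out_upper.append(hi)
    return out_lower, out_upper

def interval_relu(lower, upper):
    """ReLU is monotone, so it can be applied to each bound separately."""
    return [max(l, 0.0) for l in lower], [max(u, 0.0) for u in upper]
```

    Chaining these two functions layer by layer gives output bounds that are cheap to compute but usually loose, which is why the abstract describes them as inexpensive approximations.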
    On-device Synaptic Memory Consolidation using Fowler-Nordheim Quantum-tunneling. (arXiv:2206.14581v1 [cs.ET])
    Synaptic memory consolidation has been heralded as one of the key mechanisms for supporting continual learning in neuromorphic Artificial Intelligence (AI) systems. Here we report that a Fowler-Nordheim (FN) quantum-tunneling device can implement synaptic memory consolidation similar to what can be achieved by algorithmic consolidation models like the cascade and the elastic weight consolidation (EWC) models. The proposed FN-synapse not only stores the synaptic weight but also stores the synapse's historical usage statistic on the device itself. We also show that the operation of the FN-synapse is near-optimal in terms of the synaptic lifetime and we demonstrate that a network comprising FN-synapses outperforms a comparable EWC network for a small benchmark continual learning task. With an energy footprint of femtojoules per synaptic update, we believe that the proposed FN-synapse provides an ultra-energy-efficient approach for implementing both synaptic memory consolidation and persistent learning.
    From Kernel Methods to Neural Networks: A Unifying Variational Formulation. (arXiv:2206.14625v1 [cs.LG])
    The minimization of a data-fidelity term and an additive regularization functional gives rise to a powerful framework for supervised learning. In this paper, we present a unifying regularization functional that depends on an operator and on a generic Radon-domain norm. We establish the existence of a minimizer and give the parametric form of the solution(s) under very mild assumptions. When the norm is Hilbertian, the proposed formulation yields a solution that involves radial-basis functions and is compatible with the classical methods of machine learning. By contrast, for the total-variation norm, the solution takes the form of a two-layer neural network with an activation function that is determined by the regularization operator. In particular, we retrieve the popular ReLU networks by letting the operator be the Laplacian. We also characterize the solution for the intermediate regularization norms $\|\cdot\|=\|\cdot\|_{L_p}$ with $p\in(1,2]$. Our framework offers guarantees of universal approximation for a broad family of regularization operators or, equivalently, for a wide variety of shallow neural networks, including the cases (such as ReLU) where the activation function is increasing polynomially. It also explains the favorable role of bias and skip connections in neural architectures.
    Hardness and Algorithms for Robust and Sparse Optimization. (arXiv:2206.14354v1 [cs.LG])
    We explore algorithms and limitations for sparse optimization problems such as sparse linear regression and robust linear regression. The goal of the sparse linear regression problem is to identify a small number of key features, while the goal of the robust linear regression problem is to identify a small number of erroneous measurements. Specifically, the sparse linear regression problem seeks a $k$-sparse vector $x\in\mathbb{R}^d$ to minimize $\|Ax-b\|_2$, given an input matrix $A\in\mathbb{R}^{n\times d}$ and a target vector $b\in\mathbb{R}^n$, while the robust linear regression problem seeks a set $S$ that ignores at most $k$ rows and a vector $x$ to minimize $\|(Ax-b)_S\|_2$. We first show bicriteria, NP-hardness of approximation for robust regression building on the work of [OWZ15] which implies a similar result for sparse regression. We further show fine-grained hardness of robust regression through a reduction from the minimum-weight $k$-clique conjecture. On the positive side, we give an algorithm for robust regression that achieves arbitrarily accurate additive error and uses runtime that closely matches the lower bound from the fine-grained hardness result, as well as an algorithm for sparse regression with similar runtime. Both our upper and lower bounds rely on a general reduction from robust linear regression to sparse regression that we introduce. Our algorithms, inspired by the 3SUM problem, use approximate nearest neighbor data structures and may be of independent interest for solving sparse optimization problems. For instance, we demonstrate that our techniques can also be used for the well-studied sparse PCA problem.
    Deformable Graph Transformer. (arXiv:2206.14337v1 [cs.LG])
    Transformer-based models have been widely used and achieved state-of-the-art performance in various domains such as natural language processing and computer vision. Recent works show that Transformers can also be generalized to graph-structured data. However, the success is limited to small-scale graphs due to technical challenges such as the quadratic complexity with regard to the number of nodes and non-local aggregation that often leads to generalization performance inferior to that of conventional graph neural networks. In this paper, to address these issues, we propose Deformable Graph Transformer (DGT) that performs sparse attention with dynamically sampled key and value pairs. Specifically, our framework first constructs multiple node sequences with various criteria to consider both structural and semantic proximity. Then, the sparse attention is applied to the node sequences for learning node representations with a reduced computational cost. We also design simple and effective positional encodings to capture structural similarity and distance between nodes. Experiments demonstrate that our novel graph Transformer consistently outperforms existing Transformer-based models and shows competitive performance compared to state-of-the-art models on 8 graph benchmark datasets including large-scale graphs.
    Two-Stage Neural Contextual Bandits for Personalised News Recommendation. (arXiv:2206.14648v1 [cs.IR])
    We consider the problem of personalised news recommendation where each user consumes news in a sequential fashion. Existing personalised news recommendation methods focus on exploiting user interests and ignore exploration in recommendation, which leads to biased feedback loops and hurts recommendation quality in the long term. We build on contextual bandits recommendation strategies which naturally address the exploitation-exploration trade-off. The main challenges are the computational efficiency for exploring the large-scale item space and utilising the deep representations with uncertainty. We propose a two-stage hierarchical topic-news deep contextual bandits framework to efficiently learn user preferences when there are many news items. We use deep learning representations for users and news, and generalise the neural upper confidence bound (UCB) policies to generalised additive UCB and bilinear UCB. Empirical results on a large-scale news recommendation dataset show that our proposed policies are efficient and outperform the baseline bandit policies.
    No imputation without representation. (arXiv:2206.14254v1 [cs.LG])
    By filling in missing values in datasets, imputation allows these datasets to be used with algorithms that cannot handle missing values by themselves. However, missing values may in principle contribute useful information that is lost through imputation. The missing-indicator approach can be used in combination with imputation to instead represent this information as a part of the dataset. There are several theoretical considerations why missing-indicators may or may not be beneficial, but there has not been any large-scale practical experiment on real-life datasets to test this question for machine learning predictions. We perform this experiment for three imputation strategies and a range of different classification algorithms, on the basis of twenty real-life datasets. We find that on these datasets, missing-indicators generally increase classification performance. In addition, we find no evidence for most algorithms that nearest neighbour and iterative imputation lead to better performance than simple mean/mode imputation. Therefore, we recommend the use of missing-indicators with mean/mode imputation as a safe default, with the caveat that for decision trees, pruning is necessary to prevent overfitting. In a follow-up experiment, we determine attribute-specific missingness thresholds for each classifier above which missing-indicators are more likely than not to increase classification performance, and observe that these thresholds are much lower for categorical than for numerical attributes. Finally, we argue that mean imputation of numerical attributes may preserve some of the information from missing values, and we show that in the absence of missing-indicators, it can similarly be useful to apply mean imputation to one-hot encoded categorical attributes instead of mode imputation.
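    The recommended default, mean imputation combined with missing-indicators, can be sketched for numerical attributes as follows. This is a minimal illustration, not the authors' experimental code; the function name is hypothetical, missing values are assumed to be encoded as NaN, and a column with no observed values is filled with 0.0 as an arbitrary choice.

```python
import math

def mean_impute_with_indicators(rows):
    """Return rows with NaNs replaced by column means, followed by 0/1 missing flags."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        # Assumption: fall back to 0.0 when a column has no observed value at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    result = []
    for row in rows:
        missing = [math.isnan(v) for v in row]
        imputed = [means[j] if missing[j] else row[j] for j in range(n_cols)]
        # The indicator columns keep the information that imputation would erase.
        result.append(imputed + [1.0 if m else 0.0 for m in missing])
    return result
```

    For a table with d attributes the output has 2d columns: the imputed values first, then one indicator per attribute.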
    Variational Quantum Approximate Support Vector Machine With Inference Transfer. (arXiv:2206.14507v1 [quant-ph])
    A kernel-based quantum classifier is the most interesting and powerful quantum machine learning technique for hyperlinear classification of complex data, which can be easily realized in shallow-depth quantum circuits such as a SWAP test classifier. Surprisingly, a support vector machine can be realized inherently and explicitly on these circuits by introducing a variational scheme to map the quadratic optimization problem of the SVM theory to a quantum-classical variational optimization problem. This scheme is realized with parameterized quantum circuits (PQC) to create a nonuniform weight vector to index qubits that can evaluate training loss and classification score in linear time. We train the classical parameters of this Variational Quantum Approximate Support Vector Machine (VQASVM), which can be transferred to many copies of other VQASVM decision inference circuits for classification of new query data. Our VQASVM algorithm is tested on toy example data sets on cloud-based quantum machines for feasibility evaluation, and numerically investigated to evaluate its performance on a standard iris flower data set. The accuracy of iris data classification reached 98.8%.
    Multistep Automated Data Labelling Procedure (MADLaP) for Thyroid Nodules on Ultrasound: An Artificial Intelligence Approach for Automating Image Annotation. (arXiv:2206.14305v1 [eess.IV])
    Machine learning (ML) for diagnosis of thyroid nodules on ultrasound is an active area of research. However, ML tools require large, well-labelled datasets, the curation of which is time-consuming and labor-intensive. The purpose of our study was to develop and test a deep-learning-based tool to facilitate and automate the data annotation process for thyroid nodules; we named our tool Multistep Automated Data Labelling Procedure (MADLaP). MADLaP was designed to take multiple inputs including pathology reports, ultrasound images, and radiology reports. Using multiple step-wise modules including rule-based natural language processing, deep-learning-based imaging segmentation, and optical character recognition, MADLaP automatically identified images of a specific thyroid nodule and correctly assigned a pathology label. The model was developed using a training set of 378 patients across our health system and tested on a separate set of 93 patients. Ground truths for both sets were selected by an experienced radiologist. Performance metrics including yield (how many labeled images the model produced) and accuracy (percentage correct) were measured using the test set. MADLaP achieved a yield of 63% and an accuracy of 83%. The yield progressively increased as the input data moved through each module, while accuracy peaked part way through. Error analysis showed that inputs from certain examination sites had lower accuracy (40%) than the other sites (90%, 100%). MADLaP successfully created curated datasets of labeled ultrasound images of thyroid nodules. While accurate, the relatively suboptimal yield of MADLaP exposed some challenges when trying to automatically label radiology images from heterogeneous sources. The complex task of image curation and annotation could be automated, allowing for enrichment of larger datasets for use in machine learning development.
    Convolutional Neural Network Based Partial Face Detection. (arXiv:2206.14350v1 [cs.CV])
    Due to the massive expansion of artificial intelligence, machine learning technology is being used in various areas of our day-to-day life. There are many scenarios in which a simple crime could be prevented before it happens, or the person responsible for it could be found. A face is a distinctive feature by which we can easily tell apart many other species. It also plays a significant role in identifying individuals of our own species, humans. With this critical feature, one problem occurs most often nowadays: when a camera is pointed at a person, it cannot detect the face, and the result is a poor image. Similarly, when a robbery takes place where a security camera is installed, the robber's identity is almost indistinguishable due to the low-quality camera. An algorithm that detects faces well reduces the cost of hardware, and focusing on that area does not cost much. Facial recognition, widget control, and similar applications depend on detecting the face correctly. This study aims to create and enhance a machine learning model that correctly recognizes faces. A total of 627 samples have been collected from different Bangladeshi people's faces at four angles. In this work, five machine learning approaches, CNN, Haar Cascade, Cascaded CNN, Deep CNN, and MTCNN, are implemented to find the best accuracy on our dataset. After creating and running the models, the Multi-Task Convolutional Neural Network (MTCNN) achieved the best model accuracy, 96.2% on the training data, compared with the other machine learning models.
    GERNERMED++: Transfer Learning in German Medical NLP. (arXiv:2206.14504v1 [cs.CL])
    We present a statistical model for German medical natural language processing trained for named entity recognition (NER) as an open, publicly available model. The work serves as a refined successor to our first GERNERMED model, which is substantially outperformed by our work. We demonstrate the effectiveness of combining multiple techniques in order to achieve strong results in entity recognition performance by means of transfer learning on pretrained deep language models (LM), word alignment, and neural machine translation. Due to the scarcity of open, public medical entity recognition models for German texts, this work offers benefits to the German research community on medical NLP as a baseline model. Since our model is based on public English data, its weights are provided without legal restrictions on usage and distribution. The sample code and the statistical model are available at: https://github.com/frankkramer-lab/GERNERMED-pp
    Exploiting Semantic Role Contextualized Video Features for Multi-Instance Text-Video Retrieval EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022. (arXiv:2206.14381v1 [cs.CV])
    In this report, we present our approach for the EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022. We first parse sentences into semantic roles corresponding to verbs and nouns; then utilize self-attentions to exploit semantic role contextualized video features along with textual features via triplet losses in multiple embedding spaces. Our method surpasses the strong baseline in normalized Discounted Cumulative Gain (nDCG), which is more valuable for semantic similarity. Our submission is ranked 3rd for nDCG and ranked 4th for mAP.
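    For readers unfamiliar with the metric, the textbook form of nDCG is sketched below. The challenge defines its own relevance scores and averaging, which are not reproduced here; the function names and the logarithmic discount are the standard ones and are an assumption about what the evaluation uses.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevance scores listed in ranked order."""
    # Rank 1 gets discount log2(2) = 1, rank 2 gets log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0
```

    Unlike mAP, which treats each retrieved item as simply relevant or not, nDCG accepts graded relevance, which is why it suits semantic similarity.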
    Using Twitter Data to Understand Public Perceptions of Approved versus Off-label Use for COVID-19-related Medications. (arXiv:2206.14358v1 [cs.CY])
    Understanding public discourse on emergency use of unproven therapeutics is essential to monitor safe use and combat misinformation. We developed a natural language processing (NLP)-based pipeline to understand public perceptions of and stances on COVID-19-related drugs on Twitter across time. This retrospective study included 609,189 US-based tweets between January 29th, 2020 and November 30th, 2021 on four drugs that gained wide public attention during the COVID-19 pandemic: 1) Hydroxychloroquine and Ivermectin, drug therapies with anecdotal evidence; and 2) Molnupiravir and Remdesivir, FDA-approved treatment options for eligible patients. Time-trend analysis was used to understand the popularity and related events. Content and demographic analyses were conducted to explore potential rationales of people's stances on each drug. Time-trend analysis revealed that Hydroxychloroquine and Ivermectin received much more discussion than Molnupiravir and Remdesivir, particularly during COVID-19 surges. Hydroxychloroquine and Ivermectin were highly politicized, related to conspiracy theories, hearsay, celebrity effects, etc. The distribution of stance between the two major US political parties was significantly different (p<0.001); Republicans were much more likely to support Hydroxychloroquine (+55%) and Ivermectin (+30%) than Democrats. People with healthcare backgrounds tended to oppose Hydroxychloroquine (+7%) more than the general population; in contrast, the general population was more likely to support Ivermectin (+14%). We make all the data, code, and models available at https://github.com/ningkko/COVID-drug.
    Gaussian Latent Dirichlet Allocation for Discrete Human State Discovery. (arXiv:2206.14233v1 [cs.LG])
    In this article we propose and validate an unsupervised probabilistic model, Gaussian Latent Dirichlet Allocation (GLDA), for the problem of discrete state discovery from repeated, multivariate psychophysiological samples collected from multiple, inherently distinct, individuals. Psychology and medical research heavily involve measuring potentially related but individually inconclusive variables from a cohort of participants to derive diagnosis, necessitating clustering analysis. Traditional probabilistic clustering models such as Gaussian Mixture Model (GMM) assume a global mixture of component distributions, which may not be realistic for observations from different patients. The GLDA model borrows the individual-specific mixture structure from a popular topic model Latent Dirichlet Allocation (LDA) in Natural Language Processing and merges it with the Gaussian component distributions of GMM to suit continuous type data. We implemented GLDA using STAN (a probabilistic modeling language) and applied it on two datasets, one containing Ecological Momentary Assessments (EMA) and the other heart measures from electrocardiogram and impedance cardiograph. We found that in both datasets the GLDA-learned class weights achieved significantly higher correlations with clinically assessed depression, anxiety, and stress scores than those produced by the baseline GMM. Our findings demonstrate the advantage of GLDA over conventional finite mixture models for human state discovery from repeated multivariate data, likely due to better characterization of potential underlying between-participant differences. Future work is required to validate the utility of this model on a broader range of applications.
    RegMixup: Mixup as a Regularizer Can Surprisingly Improve Accuracy and Out Distribution Robustness. (arXiv:2206.14502v1 [cs.LG])
    We show that the effectiveness of the well-celebrated Mixup [Zhang et al., 2018] can be further improved if, instead of using it as the sole learning objective, it is utilized as an additional regularizer to the standard cross-entropy loss. This simple change not only provides much improved accuracy but also significantly improves the quality of the predictive uncertainty estimation of Mixup in most cases under various forms of covariate shifts and out-of-distribution detection experiments. In fact, we observe that Mixup yields much degraded performance on detecting out-of-distribution samples possibly, as we show empirically, because of its tendency to learn models that exhibit high entropy throughout, making it difficult to differentiate in-distribution samples from out-distribution ones. To show the efficacy of our approach (RegMixup), we provide thorough analyses and experiments on vision datasets (ImageNet & CIFAR-10/100) and compare it with a suite of recent approaches for reliable uncertainty estimation.
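    The change described above, Mixup as an extra term next to the usual cross-entropy, can be sketched as a loss function. This is a minimal illustration under assumptions: the weighting `eta`, the Beta parameter `alpha`, their default values, and the function names are hypothetical and are not taken from the abstract.

```python
import math
import random

def cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a (possibly soft) target vector."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - log_norm) for z, t in zip(logits, target))

def regmixup_loss(model, xs, ys, rng, alpha=10.0, eta=1.0):
    """Mean clean cross-entropy plus eta times the mean Mixup cross-entropy."""
    lam = rng.betavariate(alpha, alpha)  # one mixing coefficient for the batch
    partners = list(range(len(xs)))
    rng.shuffle(partners)                # each sample is mixed with a random partner
    clean = mixed = 0.0
    for i, j in enumerate(partners):
        x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(xs[i], xs[j])]
        y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(ys[i], ys[j])]
        clean += cross_entropy(model(xs[i]), ys[i])   # standard objective
        mixed += cross_entropy(model(x_mix), y_mix)   # Mixup used as a regularizer
    n = len(xs)
    return clean / n + eta * mixed / n
```

    Here `model` is any callable mapping an input vector to a list of logits and `ys` holds one-hot labels; setting `eta=0.0` recovers plain cross-entropy training.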
    Intrinsic Anomaly Detection for Multi-Variate Time Series. (arXiv:2206.14342v1 [cs.LG])
    We introduce a novel, practically relevant variation of the anomaly detection problem in multi-variate time series: intrinsic anomaly detection. It appears in diverse practical scenarios ranging from DevOps to IoT, where we want to recognize failures of a system that operates under the influence of a surrounding environment. Intrinsic anomalies are changes in the functional dependency structure between time series that represent an environment and time series that represent the internal state of a system that is placed in said environment. We formalize this problem, provide under-studied public and new purpose-built data sets for it, and present methods that handle intrinsic anomaly detection. These address the shortcoming of existing anomaly detection methods that cannot differentiate between expected changes in the system's state and unexpected ones, i.e., changes in the system that deviate from the environment's influence. Our most promising approach is fully unsupervised and combines adversarial learning and time series representation learning, thereby addressing problems such as label sparsity and subjectivity, while making it possible to navigate and improve notoriously problematic anomaly detection data sets.
    Comparative Study of Inference Methods for Interpolative Decomposition. (arXiv:2206.14542v1 [cs.LG])
    In this paper, we propose a probabilistic model with automatic relevance determination (ARD) for learning interpolative decomposition (ID), which is commonly used for low-rank approximation, feature selection, and identifying hidden patterns in data, where the matrix factors are latent variables associated with each data dimension. Prior densities with support on the specified subspace are used to address the constraint for the magnitude of the factored component of the observed matrix. Bayesian inference procedure based on Gibbs sampling is employed. We evaluate the model on a variety of real-world datasets including CCLE $EC50$, CCLE $IC50$, Gene Body Methylation, and Promoter Methylation datasets with different sizes, and dimensions, and show that the proposed Bayesian ID algorithms with automatic relevance determination lead to smaller reconstructive errors even compared to vanilla Bayesian ID algorithms with fixed latent dimension set to matrix rank.
    GAN-based Intrinsic Exploration For Sample Efficient Reinforcement Learning. (arXiv:2206.14256v1 [cs.LG])
    In this study, we address the problem of efficient exploration in reinforcement learning. Most common exploration approaches depend on random action selection; however, these approaches do not work well in environments with sparse or no rewards. We propose a Generative Adversarial Network-based Intrinsic Reward Module that learns the distribution of the observed states and sends an intrinsic reward that is computed as high for states that are out of distribution, in order to lead the agent to unexplored states. We evaluate our approach in Super Mario Bros for a no-reward setting and in Montezuma's Revenge for a sparse-reward setting and show that our approach is indeed capable of exploring efficiently. We discuss a few weaknesses and conclude by discussing future work.  ( 2 min )
    An Empirical Study of Challenges in Converting Deep Learning Models. (arXiv:2206.14322v1 [cs.LG])
    There is an increase in deploying Deep Learning (DL)-based software systems in real-world applications. Usually DL models are developed and trained using DL frameworks that have their own internal mechanisms/formats to represent and train DL models, and usually those formats cannot be recognized by other frameworks. Moreover, trained models are usually deployed in environments different from where they were developed. To solve the interoperability issue and make DL models compatible with different frameworks/environments, some exchange formats have been introduced for DL models, like ONNX and CoreML. However, ONNX and CoreML were never empirically evaluated by the community to reveal their prediction accuracy, performance, and robustness after conversion. Poor accuracy or non-robust behavior of converted models may lead to poor quality of deployed DL-based software systems. We conduct, in this paper, the first empirical study to assess ONNX and CoreML for converting trained DL models. In our systematic approach, two popular DL frameworks, Keras and PyTorch, are used to train five widely used DL models on three popular datasets. The trained models are then converted to ONNX and CoreML and transferred to two runtime environments designated for such formats, to be evaluated. We investigate the prediction accuracy before and after conversion. Our results reveal that the prediction accuracy of converted models is at the same level as that of the originals. The performance (time cost and memory consumption) of converted models is studied as well. The size of the models is reduced after conversion, which can result in optimized DL-based software deployment. Converted models are generally assessed as robust at the same level as the originals. However, the obtained results show that CoreML models are more vulnerable to adversarial attacks compared to ONNX.
    Diagnosis and Prognosis of COVID-19 Disease Using Routine Blood Values and LogNNet Neural Network. (arXiv:2205.09974v2 [cs.LG] UPDATED)
    Since February 2020, the world has been engaged in an intense struggle with the COVID-19 disease, and health systems have come under tragic pressure as the disease turned into a pandemic. The aim of this study is to obtain the most effective routine blood values (RBV) in the diagnosis and prognosis of COVID-19 using a backward feature elimination algorithm for the LogNNet reservoir neural network. The first dataset in the study consists of a total of 5296 patients with the same number of negative and positive COVID-19 tests. The LogNNet-model achieved the accuracy rate of 99.5% in the diagnosis of the disease with 46 features and the accuracy of 99.17% with only mean corpuscular hemoglobin concentration, mean corpuscular hemoglobin, and activated partial prothrombin time. The second dataset consists of a total of 3899 patients with a diagnosis of COVID-19 who were treated in hospital, of which 203 were severe patients and 3696 were mild patients. The model reached the accuracy rate of 94.4% in determining the prognosis of the disease with 48 features and the accuracy of 82.7% with only erythrocyte sedimentation rate, neutrophil count, and C reactive protein features. Our method will reduce the negative pressures on the health sector and help doctors to understand the pathogenesis of COVID-19 using the key features. The method is promising to create mobile health monitoring systems in the Internet of Things.  ( 3 min )
    TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s. (arXiv:2206.14286v1 [cs.PF])
    This paper presents a novel nearest neighbor search algorithm achieving TPU (Google Tensor Processing Unit) peak performance, outperforming state-of-the-art GPU algorithms with a similar level of recall. The design of the proposed algorithm is motivated by an accurate accelerator performance model that takes into account both the memory and instruction bottlenecks. Our algorithm comes with an analytical guarantee of recall in expectation and does not require maintaining a sophisticated index data structure or tuning, making it suitable for applications with frequent updates. Our work is available in the open-source package of JAX and TensorFlow on TPU.
    NumS: Scalable Array Programming for the Cloud. (arXiv:2206.14276v1 [cs.DC])
    Scientists increasingly rely on Python tools to perform scalable distributed memory array operations using rich, NumPy-like expressions. However, many of these tools rely on dynamic schedulers optimized for abstract task graphs, which often encounter memory and network bandwidth-related bottlenecks due to sub-optimal data and operator placement decisions. Tools built on the message passing interface (MPI), such as ScaLAPACK and SLATE, have better scaling properties, but these solutions require specialized knowledge to use. In this work, we present NumS, an array programming library which optimizes NumPy-like expressions on task-based distributed systems. This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS). LSHS is a local search method which optimizes operator placement by minimizing maximum memory and network load on any given node within a distributed system. Coupled with a heuristic for load balanced data layouts, our approach is capable of attaining communication lower bounds on some common numerical operations, and our empirical study shows that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem. On terabyte-scale data, NumS achieves competitive performance to SLATE on DGEMM, up to 20x speedup over Dask on a key operation for tensor factorization, and a 2x speedup on logistic regression compared to Dask ML and Spark's MLlib.  ( 3 min )
    Optimal Estimation of Generic Dynamics by Path-Dependent Neural Jump ODEs. (arXiv:2206.14284v1 [stat.ML])
    This paper studies the problem of forecasting general stochastic processes using an extension of the Neural Jump ODE (NJ-ODE) framework. While NJ-ODE was the first framework to establish convergence guarantees for the prediction of irregularly observed time-series, these results were limited to data stemming from Itô diffusions with complete observations, in particular Markov processes where all coordinates are observed simultaneously. In this work, we generalise these results to generic, possibly non-Markovian or discontinuous, stochastic processes with incomplete observations, by utilising the reconstruction properties of the signature transform. These theoretical results are supported by empirical studies, where it is shown that the path-dependent NJ-ODE outperforms the original NJ-ODE framework in the case of non-Markovian data.
    Can Interpretable Reinforcement Learning Manage Prosperity Your Way?. (arXiv:2202.09064v2 [cs.LG] UPDATED)
    Personalisation of products and services is fast becoming the driver of success in banking and commerce. Machine learning holds the promise of gaining a deeper understanding of and tailoring to customers' needs and preferences. Whereas traditional solutions to financial decision problems frequently rely on model assumptions, reinforcement learning is able to exploit large amounts of data to improve customer modelling and decision-making in complex financial environments with fewer assumptions. Model explainability and interpretability present challenges from a regulatory perspective which demands transparency for acceptance; they also offer the opportunity for improved insight into and understanding of customers. Post-hoc approaches are typically used for explaining pretrained reinforcement learning models. Based on our previous modeling of customer spending behaviour, we adapt our recent reinforcement learning algorithm that intrinsically characterizes desirable behaviours and we transition to the problem of asset management. We train inherently interpretable reinforcement learning agents to give investment advice that is aligned with prototype financial personality traits which are combined to make a final recommendation. We observe that the trained agents' advice adheres to their intended characteristics, they learn the value of compound growth, and, without any explicit reference, the notion of risk as well as improved policy convergence.  ( 3 min )
    A Temporal-Difference Approach to Policy Gradient Estimation. (arXiv:2202.02396v3 [cs.LG] UPDATED)
    The policy gradient theorem (Sutton et al., 2000) prescribes the usage of a cumulative discounted state distribution under the target policy to approximate the gradient. Most algorithms based on this theorem, in practice, break this assumption, introducing a distribution shift that can cause the convergence to poor solutions. In this paper, we propose a new approach of reconstructing the policy gradient from the start state without requiring a particular sampling strategy. The policy gradient calculation in this form can be simplified in terms of a gradient critic, which can be recursively estimated due to a new Bellman equation of gradients. By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that sidesteps the distribution shift issue in a model-free way. We prove that, under certain realizability conditions, our estimator is unbiased regardless of the sampling strategy. We empirically show that our technique achieves a superior bias-variance trade-off and performance in presence of off-policy samples.  ( 2 min )
    Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead. (arXiv:2105.09121v3 [cs.LG] UPDATED)
    Deploying deep learning models in time-critical applications with limited computational resources, for instance in edge computing systems and IoT networks, is a challenging task that often relies on dynamic inference methods such as early exiting. In this paper, we introduce a novel architecture for early exiting based on the vision transformer architecture, as well as a fine-tuning strategy that significantly increase the accuracy of early exit branches compared to conventional approaches while introducing less overhead. Through extensive experiments on image and audio classification as well as audiovisual crowd counting, we show that our method works for both classification and regression problems, and in both single- and multi-modal settings. Additionally, we introduce a novel method for integrating audio and visual modalities within early exits in audiovisual data analysis, that can lead to a more fine-grained dynamic inference.  ( 2 min )
    Linear Model Against Malicious Adversaries with Local Differential Privacy. (arXiv:2202.02448v2 [cs.CR] UPDATED)
    Scientific collaborations benefit from collaborative learning of distributed sources, but remain difficult to achieve when data are sensitive. In recent years, privacy preserving techniques have been widely studied to analyze distributed data across different agencies while protecting sensitive information. Most existing privacy preserving techniques are designed to resist semi-honest adversaries and require intense computation to perform data analysis. Secure collaborative learning is significantly more difficult in the presence of malicious adversaries who may deviate from the secure protocol. Another challenge is to maintain high computation efficiency with privacy protection. In this paper, matrix encryption is applied to encrypt data such that the secure schemes are resistant to malicious adversaries, including chosen plaintext attack, known plaintext attack, and collusion attack. The encryption scheme also achieves local differential privacy. Moreover, cross validation is studied to prevent overfitting without additional communication cost. Empirical experiments on real-world datasets demonstrate that the proposed schemes are computationally efficient compared to existing techniques against malicious adversary and semi-honest model.  ( 2 min )
    Sparse Centroid-Encoder: A Nonlinear Model for Feature Selection. (arXiv:2201.12910v2 [cs.LG] UPDATED)
    Autoencoders have been widely used as a nonlinear tool for data dimensionality reduction. While autoencoders don't utilize the label information, Centroid-Encoders (CE) \cite{ghosh2022supervised} use the class label in their learning process. In this study, we propose a sparse optimization using the Centroid-Encoder architecture to determine a minimal set of features that discriminate between two or more classes. The resulting algorithm, Sparse Centroid-Encoder (SCE), extracts discriminatory features in groups using a sparsity inducing $\ell_1$-norm while mapping a point to its class centroid. One key attribute of SCE is that it can extract informative features from a multi-modal data set, i.e., data sets whose classes appear to have multiple clusters. The algorithm is applied to a wide variety of real world data sets, including single-cell data, high dimensional biological data, image data, speech data, and accelerometer sensor data. We compared our method to various state-of-the-art feature selection techniques, including supervised Concrete Autoencoders (SCAE), Feature Selection Network (FsNet), deep feature selection (DFS), Stochastic Gate (STG), and LassoNet. We empirically showed that SCE features often produced better classification accuracy than other methods on a sequestered test set.  ( 3 min )
    Data augmentation for learning predictive models on EEG: a systematic comparison. (arXiv:2206.14483v1 [cs.LG])
    The use of deep learning for electroencephalography (EEG) classification tasks has been rapidly growing in the last years, yet its application has been limited by the relatively small size of EEG datasets. Data augmentation, which consists in artificially increasing the size of the dataset during training, has been a key ingredient to obtain state-of-the-art performances across applications such as computer vision or speech. While a few augmentation transformations for EEG data have been proposed in the literature, their positive impact on performance across tasks remains elusive. In this work, we propose a unified and exhaustive analysis of the main existing EEG augmentations, which are compared in a common experimental setting. Our results highlight the best data augmentations to consider for sleep stage classification and motor imagery brain computer interfaces, showing predictive power improvements greater than 10% in some cases.  ( 2 min )
    Masked World Models for Visual Control. (arXiv:2206.14244v1 [cs.RO])
    Visual model-based reinforcement learning (RL) has the potential to enable sample-efficient robot learning from visual observations. Yet the current approaches typically train a single model end-to-end for learning both visual representations and dynamics, making it difficult to accurately model the interaction between robots and small objects. In this work, we introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning. Specifically, we train an autoencoder with convolutional layers and vision transformers (ViT) to reconstruct pixels given masked convolutional features, and learn a latent dynamics model that operates on the representations from the autoencoder. Moreover, to encode task-relevant information, we introduce an auxiliary reward prediction objective for the autoencoder. We continually update both autoencoder and dynamics model using online samples collected from environment interaction. We demonstrate that our decoupling approach achieves state-of-the-art performance on a variety of visual robotic tasks from Meta-world and RLBench, e.g., we achieve 81.7% success rate on 50 visual robotic manipulation tasks from Meta-world, while the baseline achieves 67.9%. Code is available on the project website: https://sites.google.com/view/mwm-rl.  ( 2 min )
    Optimization-Induced Graph Implicit Nonlinear Diffusion. (arXiv:2206.14418v1 [cs.LG])
    Due to the over-smoothing issue, most existing graph neural networks can only capture limited dependencies with their inherently finite aggregation layers. To overcome this limitation, we propose a new kind of graph convolution, called Graph Implicit Nonlinear Diffusion (GIND), which implicitly has access to infinite hops of neighbors while adaptively aggregating features with nonlinear diffusion to prevent over-smoothing. Notably, we show that the learned representation can be formalized as the minimizer of an explicit convex optimization objective. With this property, we can theoretically characterize the equilibrium of our GIND from an optimization perspective. More interestingly, we can induce new structural variants by modifying the corresponding optimization objective. To be specific, we can embed prior properties to the equilibrium, as well as introducing skip connections to promote training stability. Extensive experiments show that GIND is good at capturing long-range dependencies, and performs well on both homophilic and heterophilic graphs with nonlinear diffusion. Moreover, we show that the optimization-induced variants of our models can boost the performance and improve training stability and efficiency as well. As a result, our GIND obtains significant improvements on both node-level and graph-level tasks.  ( 2 min )
    Signature Methods in Machine Learning. (arXiv:2206.14674v1 [stat.ML])
    Signature-based techniques give mathematical insight into the interactions between complex streams of evolving data. These insights can be quite naturally translated into numerical approaches to understanding streamed data, and perhaps because of their mathematical precision, have proved useful in analysing streamed data in situations where the data is irregular, and not stationary, and the dimension of the data and the sample sizes are both moderate. Understanding streamed multi-modal data is exponential: a word in $n$ letters from an alphabet of size $d$ can be any one of $d^n$ messages. Signatures remove the exponential amount of noise that arises from sampling irregularity, but an exponential amount of information still remains. This survey aims to stay in the domain where that exponential scaling can be managed directly. Scalability issues are an important challenge in many problems but would require another survey article and further ideas. This survey describes a range of contexts where the data sets are small enough to remove the possibility of massive machine learning, and the existence of small sets of context free and principled features can be used effectively. The mathematical nature of the tools can make their use intimidating to non-mathematicians. The examples presented in this article are intended to bridge this communication gap and provide tractable working examples drawn from the machine learning context. Notebooks are available online for several of these examples. This survey builds on the earlier paper of Ilya Chevyrev and Andrey Kormilitzin which had broadly similar aims at an earlier point in the development of this machinery. This article illustrates how the theoretical insights offered by signatures are simply realised in the analysis of application data in a way that is largely agnostic to the data type.  ( 3 min )
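    To make the $d^n$ scaling concrete, the sketch below computes the first two signature levels of a piecewise-linear path in NumPy and counts the terms up to a given depth. It is an illustrative sketch, not code from the survey or its notebooks; the function names `signature_depth2` and `signature_size` are hypothetical.

```python
import numpy as np

def signature_depth2(path):
    """Levels 1 and 2 of the signature of a piecewise-linear path.

    path: array of shape (T, d) holding T samples of a d-dimensional stream.
    Returns (level1, level2) with shapes (d,) and (d, d).
    """
    increments = np.diff(path, axis=0)
    d = path.shape[1]
    level1 = increments.sum(axis=0)
    level2 = np.zeros((d, d))
    running = np.zeros(d)  # total increment of the path seen so far
    for delta in increments:
        # Appending one straight segment (Chen's identity): a cross term with
        # the path so far, plus the segment's own second-level term.
        level2 += np.outer(running, delta) + 0.5 * np.outer(delta, delta)
        running += delta
    return level1, level2

def signature_size(d, depth):
    """Number of signature terms in levels 1..depth: d + d**2 + ... + d**depth."""
    return sum(d ** k for k in range(1, depth + 1))
```

    Level $n$ contributes $d^n$ terms, one per word of length $n$, which is why truncated signatures are practical mainly when $d$ and the depth are moderate.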
    Model-Based Policy Search Using Monte Carlo Gradient Estimation with Real Systems Application. (arXiv:2101.12115v3 [cs.LG] UPDATED)
    In this paper, we present a Model-Based Reinforcement Learning (MBRL) algorithm named Monte Carlo Probabilistic Inference for Learning COntrol (MC-PILCO). The algorithm relies on Gaussian Processes (GPs) to model the system dynamics and on a Monte Carlo approach to estimate the policy gradient. This defines a framework in which we ablate the choice of the following components: (i) the selection of the cost function, (ii) the optimization of policies using dropout, (iii) an improved data efficiency through the use of structured kernels in the GP models. The combination of the aforementioned aspects dramatically affects the performance of MC-PILCO. Numerical comparisons in a simulated cart-pole environment show that MC-PILCO exhibits better data efficiency and control performance w.r.t. state-of-the-art GP-based MBRL algorithms. Finally, we apply MC-PILCO to real systems, considering in particular systems with partially measurable states. We discuss the importance of modeling both the measurement system and the state estimators during policy optimization. The effectiveness of the proposed solutions has been tested in simulation and on two real systems, a Furuta pendulum and a ball-and-plate rig.  ( 3 min )
    Enabling Visual Action Planning for Object Manipulation through Latent Space Roadmap. (arXiv:2103.02554v3 [cs.RO] UPDATED)
    We present a framework for visual action planning of complex manipulation tasks with high-dimensional state spaces, focusing on manipulation of deformable objects. We propose a Latent Space Roadmap (LSR) for task planning which is a graph-based structure globally capturing the system dynamics in a low-dimensional latent space. Our framework consists of three parts: (1) a Mapping Module (MM) that maps observations given in the form of images into a structured latent space extracting the respective states as well as generates observations from the latent states, (2) the LSR which builds and connects clusters containing similar states in order to find the latent plans between start and goal states extracted by MM, and (3) the Action Proposal Module that complements the latent plan found by the LSR with the corresponding actions. We present a thorough investigation of our framework on simulated box stacking and rope/box manipulation tasks, and a folding task executed on a real robot.  ( 2 min )
    Understanding Generalization via Leave-One-Out Conditional Mutual Information. (arXiv:2206.14800v1 [cs.LG])
    We study the mutual information between (certain summaries of) the output of a learning algorithm and its $n$ training data, conditional on a supersample of $n+1$ i.i.d. data from which the training data is chosen at random without replacement. These leave-one-out variants of the conditional mutual information (CMI) of an algorithm (Steinke and Zakynthinou, 2020) are also seen to control the mean generalization error of learning algorithms with bounded loss functions. For learning algorithms achieving zero empirical risk under 0-1 loss (i.e., interpolating algorithms), we provide an explicit connection between leave-one-out CMI and the classical leave-one-out error estimate of the risk. Using this connection, we obtain upper and lower bounds on risk in terms of the (evaluated) leave-one-out CMI. When the limiting risk is constant or decays polynomially, the bounds converge to within a constant factor of two. As an application, we analyze the population risk of the one-inclusion graph algorithm, a general-purpose transductive learning algorithm for VC classes in the realizable setting. Using leave-one-out CMI, we match the optimal bound for learning VC classes in the realizable setting, answering an open challenge raised by Steinke and Zakynthinou (2020). Finally, in order to understand the role of leave-one-out CMI in studying generalization, we place leave-one-out CMI in a hierarchy of measures, with a novel unconditional mutual information at the root. For 0-1 loss and interpolating learning algorithms, this mutual information is observed to be precisely the risk.  ( 3 min )
    Meta-Learning over Time for Destination Prediction Tasks. (arXiv:2206.14801v1 [cs.LG])
    A need to understand and predict vehicles' behavior underlies both public and private goals in the transportation domain, including urban planning and management, ride-sharing services, and intelligent transportation systems. Individuals' preferences and intended destinations vary throughout the day, week, and year: for example, bars are most popular in the evenings, and beaches are most popular in the summer. Despite this principle, we note that recent studies on a popular benchmark dataset from Porto, Portugal have found, at best, only marginal improvements in predictive performance from incorporating temporal information. We propose an approach based on hypernetworks, a variant of meta-learning ("learning to learn") in which a neural network learns to change its own weights in response to an input. In our case, the weights responsible for destination prediction vary with the metadata, in particular the time, of the input trajectory. The time-conditioned weights notably improve the model's error relative to ablation studies and comparable prior work, and we confirm our hypothesis that knowledge of time should improve prediction of a vehicle's intended destination.  ( 2 min )
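    The hypernetwork idea, in which the weights used for prediction are themselves a function of the input's time metadata, can be sketched as follows. This is a toy NumPy illustration under stated assumptions and not the authors' architecture: the class `TimeHyperLinear`, the sine/cosine hour encoding, and the linear hypernetwork are all hypothetical choices made for the example.

```python
import numpy as np

def time_features(hour):
    """Cyclic encoding of the hour of day, so that 0h and 24h coincide."""
    angle = 2.0 * np.pi * hour / 24.0
    return np.array([np.sin(angle), np.cos(angle)])

class TimeHyperLinear:
    """A linear layer whose weights are generated from the time of day."""

    def __init__(self, in_dim, out_dim, rng):
        self.in_dim, self.out_dim = in_dim, out_dim
        n_weights = out_dim * in_dim + out_dim
        # Hypernetwork parameters: a linear map from time features to weights.
        self.H = rng.normal(scale=0.1, size=(n_weights, 2))
        self.h0 = rng.normal(scale=0.1, size=n_weights)

    def weights(self, hour):
        flat = self.H @ time_features(hour) + self.h0
        split = self.out_dim * self.in_dim
        W = flat[:split].reshape(self.out_dim, self.in_dim)
        b = flat[split:]
        return W, b

    def __call__(self, x, hour):
        W, b = self.weights(hour)  # weights depend on the trajectory's time
        return W @ x + b

# Usage: the same trajectory embedding gives different outputs at 8h and 20h.
layer = TimeHyperLinear(in_dim=4, out_dim=2, rng=np.random.default_rng(0))
embedding = np.ones(4)
morning, evening = layer(embedding, hour=8), layer(embedding, hour=20)
```

    In a trained model the parameters `H` and `h0` would be learned jointly with the rest of the network; here they are random, so only the mechanism is shown.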
    ENS-10: A Dataset For Post-Processing Ensemble Weather Forecast. (arXiv:2206.14786v1 [cs.LG])
    Post-processing ensemble prediction systems can improve weather forecasting, especially for extreme event prediction. In recent years, different machine learning models have been developed to improve the quality of the post-processing step. However, these models heavily rely on the data and generating such ensemble members requires multiple runs of numerical weather prediction models, at high computational cost. This paper introduces the ENS-10 dataset, consisting of ten ensemble members spread over 20 years (1998-2017). The ensemble members are generated by perturbing numerical weather simulations to capture the chaotic behavior of the Earth. To represent the three-dimensional state of the atmosphere, ENS-10 provides the most relevant atmospheric variables in 11 distinct pressure levels as well as the surface at 0.5-degree resolution. The dataset targets the prediction correction task at 48-hour lead time, which is essentially improving the forecast quality by removing the biases of the ensemble members. To this end, ENS-10 provides the weather variables for forecast lead times T=0, 24, and 48 hours (two data points per week). We provide a set of baselines for this task on ENS-10 and compare their performance in correcting the prediction of different weather variables. We also assess our baselines for predicting extreme events using our dataset. The ENS-10 dataset is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) licence.  ( 3 min )
    Multi-scale Physical Representations for Approximating PDE Solutions with Graph Neural Operators. (arXiv:2206.14687v1 [cs.LG])
    Representing physical signals at different scales is among the most challenging problems in engineering. Several multi-scale modeling tools have been developed to describe physical systems governed by Partial Differential Equations (PDEs). These tools are at the crossroads of principled physical models and numerical schemes. Recently, data-driven models have been introduced to speed up the approximation of PDE solutions compared to numerical solvers. Among these recent data-driven methods, neural integral operators are a class that learns a mapping between function spaces. These functions are discretized on graphs (meshes) which are appropriate for modeling interactions in physical phenomena. In this work, we study three multi-resolution schemes with integral kernel operators that can be approximated with Message Passing Graph Neural Networks (MPGNNs). To validate our study, we conduct extensive MPGNN experiments with well-chosen metrics considering steady and unsteady PDEs.  ( 2 min )
    Matryoshka: Stealing Functionality of Private ML Data by Hiding Models in Model. (arXiv:2206.14371v1 [stat.ML])
    In this paper, we present a novel insider attack called Matryoshka, which employs an irrelevant scheduled-to-publish DNN model as a carrier model for covert transmission of multiple secret models which memorize the functionality of private ML data stored in local data centers. Instead of treating the parameters of the carrier model as bit strings and applying conventional steganography, we devise a novel parameter sharing approach which exploits the learning capacity of the carrier model for information hiding. Matryoshka simultaneously achieves: (i) High Capacity -- With almost no utility loss of the carrier model, Matryoshka can hide a 26x larger secret model or 8 secret models of diverse architectures spanning different application domains in the carrier model, neither of which can be done with existing steganography techniques; (ii) Decoding Efficiency -- After downloading the published carrier model, an outside colluder can exclusively decode the hidden models from the carrier model with only several integer secrets and the knowledge of the hidden model architecture; (iii) Effectiveness -- Almost all the recovered models have similar performance as if they were trained independently on the private data; (iv) Robustness -- Information redundancy is naturally implemented to achieve resilience against common post-processing techniques on the carrier before its publishing; (v) Covertness -- A model inspector with different levels of prior knowledge could hardly differentiate a carrier model from a normal model.
    SPI-GAN: Distilling Score-based Generative Models with Straight-Path Interpolations. (arXiv:2206.14464v1 [cs.LG])
    Score-based generative models (SGMs) are a recently proposed paradigm for deep generative tasks and now show the state-of-the-art sampling performance. It is known that the original SGM design solves the two problems of the generative trilemma: i) sampling quality, and ii) sampling diversity. However, the last problem of the trilemma was not solved, i.e., their training/sampling complexity is notoriously high. To this end, distilling SGMs into simpler models, e.g., generative adversarial networks (GANs), is gathering much attention currently. We present an enhanced distillation method, called straight-path interpolation GAN (SPI-GAN), which can be compared to the state-of-the-art shortcut-based distillation method, called denoising diffusion GAN (DD-GAN). However, our method corresponds to an extreme method that does not use any intermediate shortcut information of the reverse SDE path, in which case DD-GAN fails to obtain good results. Nevertheless, our straight-path interpolation method greatly stabilizes the overall training process. As a result, SPI-GAN is one of the best models in terms of the sampling quality/diversity/time for CIFAR-10, CelebA-HQ-256, and LSUN-Church-256.
  • Open

    Functional Classification of Bitcoin Addresses. (arXiv:2202.12019v2 [stat.AP] UPDATED)
    This paper proposes a classification model for predicting the main activity of bitcoin addresses based on their balances. Since the balances are functions of time, we apply methods from functional data analysis; more specifically, the features of the proposed classification model are the functional principal components of the data. Classifying bitcoin addresses is a relevant problem for two main reasons: to understand the composition of the bitcoin market, and to identify addresses used for illicit activities. Although other bitcoin classifiers have been proposed, they focus primarily on network analysis rather than curve behavior. Our approach, on the other hand, does not require any network information for prediction. Furthermore, functional features have the advantage of being straightforward to build, unlike expert-built features. Results show improvement when combining functional features with scalar features, and similar accuracy for the models using those features separately, which points to the functional model being a good alternative when domain-specific knowledge is not available.  ( 2 min )
    The split Gibbs sampler revisited: improvements to its algorithmic structure and augmented target distribution. (arXiv:2206.13894v1 [stat.CO] CROSS LISTED)
    This paper proposes a new accelerated proximal Markov chain Monte Carlo (MCMC) methodology to perform Bayesian computation efficiently in imaging inverse problems. The proposed methodology is derived from the Langevin diffusion process and stems from tightly integrating two state-of-the-art proximal Langevin MCMC samplers, SK-ROCK and split Gibbs sampling (SGS), which employ distinctively different strategies to improve convergence speed. More precisely, we show how to integrate, at the level of the Langevin diffusion process, the proximal SK-ROCK sampler which is based on a stochastic Runge-Kutta-Chebyshev approximation of the diffusion, with the model augmentation and relaxation strategy that SGS exploits to speed up Bayesian computation at the expense of asymptotic bias. This leads to a new and faster proximal SK-ROCK sampler that combines the accelerated quality of the original SK-ROCK sampler with the computational benefits of augmentation and relaxation. Moreover, rather than viewing the augmented and relaxed model as an approximation of the target model, positioning relaxation in a bias-variance trade-off, we propose to regard the augmented and relaxed model as a generalisation of the target model. This then allows us to carefully calibrate the amount of relaxation in order to simultaneously improve the accuracy of the model (as measured by the model evidence) and the sampler's convergence speed. To achieve this, we derive an empirical Bayesian method to automatically estimate the optimal amount of relaxation by maximum marginal likelihood estimation. The proposed methodology is demonstrated with a range of numerical experiments related to image deblurring and inpainting, as well as with comparisons with alternative approaches from the state of the art.
    Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error. (arXiv:2201.12417v2 [cs.LG] UPDATED)
    In this work, we study the use of the Bellman equation as a surrogate objective for value prediction accuracy. While the Bellman equation is uniquely solved by the true value function over all state-action pairs, we find that the Bellman error (the difference between both sides of the equation) is a poor proxy for the accuracy of the value function. In particular, we show that (1) due to cancellations from both sides of the Bellman equation, the magnitude of the Bellman error is only weakly related to the distance to the true value function, even when considering all state-action pairs, and (2) in the finite data regime, the Bellman equation can be satisfied exactly by infinitely many suboptimal solutions. This means that the Bellman error can be minimized without improving the accuracy of the value function. We demonstrate these phenomena through a series of propositions, illustrative toy examples, and empirical analysis in standard benchmark domains.  ( 2 min )
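    The gap between Bellman error and value error can be seen in a very small example. The sketch below is not taken from the paper: it assumes a hypothetical two-state deterministic chain and uses the textbook fact that shifting the true value function by a constant c changes the Bellman residual by only (1 - gamma) * c. The names GAMMA, NEXT and REWARD are illustrative.

```python
GAMMA = 0.99
# Hypothetical 2-state chain (an assumption for illustration):
# state 0 -> state 1 -> state 0, with rewards 1.0 and 0.0.
NEXT = [1, 0]
REWARD = [1.0, 0.0]


def true_values(gamma=GAMMA):
    # Closed-form solution of V0 = 1 + gamma * V1 and V1 = gamma * V0.
    v0 = 1.0 / (1.0 - gamma ** 2)
    return [v0, gamma * v0]


def bellman_error(v, gamma=GAMMA):
    # Largest absolute difference between the two sides of the Bellman equation.
    return max(abs(v[s] - (REWARD[s] + gamma * v[NEXT[s]])) for s in range(2))


def value_error(v, gamma=GAMMA):
    # Largest absolute distance to the true value function.
    v_star = true_values(gamma)
    return max(abs(a - b) for a, b in zip(v, v_star))


# Shift every true value by a constant offset of 10.
shifted = [x + 10.0 for x in true_values()]
print(bellman_error(shifted), value_error(shifted))
```

    With gamma = 0.99, the shifted estimate is off by 10 in every state while its Bellman residual is only about 0.1, so a small Bellman error alone says little about accuracy in this toy setting.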
    Depth-2 Neural Networks Under a Data-Poisoning Attack. (arXiv:2005.01699v3 [cs.LG] UPDATED)
    In this work, we study the possibility of defending against data-poisoning attacks while training a shallow neural network in a regression setup. We focus on doing supervised learning for a class of depth-2 finite-width neural networks, which includes single-filter convolutional networks. In this class of networks, we attempt to learn the network weights in the presence of a malicious oracle doing stochastic, bounded and additive adversarial distortions on the true output during training. For the non-gradient stochastic algorithm that we construct, we prove worst-case near-optimal trade-offs among the magnitude of the adversarial attack, the weight approximation accuracy, and the confidence achieved by the proposed algorithm. As our algorithm uses mini-batching, we analyze how the mini-batch size affects convergence. We also show how to utilize the scaling of the outer layer weights to counter output-poisoning attacks depending on the probability of attack. Lastly, we give experimental evidence demonstrating how our algorithm outperforms stochastic gradient descent under different input data distributions, including instances of heavy-tailed distributions.  ( 2 min )
    Non-Parametric Manifold Learning. (arXiv:2107.08089v2 [math.ST] UPDATED)
    We introduce an estimator for distances in a compact Riemannian manifold M based on graph Laplacian estimates of the Laplace-Beltrami operator. We upper bound the l2-loss for the ratio of the estimator over the true manifold distance, or more precisely an approximation of manifold distance in non-commutative geometry (cf. [Connes and van Suijlekom, 2020]), in terms of spectral errors in the graph Laplacian estimates and, implicitly, several geometric properties of the manifold. We consequently obtain a consistency result for the estimator for samples equidistributed from a strictly positive density on M and graph Laplacians which spectrally converge, in a suitable sense, to the Laplace-Beltrami operator. The estimator resembles, and in fact its convergence properties are derived from, a special case of the Kantorovich dual reformulation of Wasserstein distance known as Connes' Distance Formula.  ( 2 min )
    Forgetting Data from Pre-trained GANs. (arXiv:2206.14389v1 [cs.LG])
    Large pre-trained generative models are known to occasionally provide samples that may be undesirable for various reasons. The standard way to mitigate this is to re-train the models differently. In this work, we take a different, more compute-friendly approach and investigate how to post-edit a model after training so that it forgets certain kinds of samples. We provide three different algorithms for GANs that differ on how the samples to be forgotten are described. Extensive evaluations on real-world image datasets show that our algorithms are capable of forgetting data while retaining high generation quality at a fraction of the cost of full re-training.  ( 2 min )
    IBP Regularization for Verified Adversarial Robustness via Branch-and-Bound. (arXiv:2206.14772v1 [cs.LG])
    Recent works have tried to increase the verifiability of adversarially trained networks by running the attacks over domains larger than the original perturbations and adding various regularization terms to the objective. However, these algorithms either underperform or require complex and expensive stage-wise training procedures, hindering their practical applicability. We present IBP-R, a novel verified training algorithm that is both simple and effective. IBP-R induces network verifiability by coupling adversarial attacks on enlarged domains with a regularization term, based on inexpensive interval bound propagation, that minimizes the gap between the non-convex verification problem and its approximations. By leveraging recent branch-and-bound frameworks, we show that IBP-R obtains state-of-the-art verified robustness-accuracy trade-offs for small perturbations on CIFAR-10 while training significantly faster than relevant previous work. Additionally, we present UPB, a novel branching strategy that, relying on a simple heuristic based on $\beta$-CROWN, reduces the cost of state-of-the-art branching algorithms while yielding splits of comparable quality.  ( 2 min )
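    For readers unfamiliar with interval bound propagation, the following is a minimal sketch of the generic technique only, not of IBP-R or its regularizer. It pushes an axis-aligned input box through one affine layer and a ReLU; the function names and the example weights are assumptions made for illustration.

```python
def affine_bounds(W, b, lower, upper):
    """Propagate elementwise bounds lower <= x <= upper through W x + b."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        lo = hi = bias
        for w, l, u in zip(row, lower, upper):
            # A non-negative weight maps lower to lower; a negative one swaps them.
            lo += w * l if w >= 0 else w * u
            hi += w * u if w >= 0 else w * l
        out_lo.append(lo)
        out_hi.append(hi)
    return out_lo, out_hi


def relu_bounds(lower, upper):
    # ReLU is monotone, so it can be applied to each bound separately.
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]


# Hypothetical 2x2 layer and an input box of radius 1 around the origin.
W = [[1.0, -2.0], [0.5, 0.5]]
b = [0.0, -1.0]
pre_lo, pre_hi = affine_bounds(W, b, [-1.0, -1.0], [1.0, 1.0])
post_lo, post_hi = relu_bounds(pre_lo, pre_hi)
print(pre_lo, pre_hi, post_lo, post_hi)
```

    The bounds are cheap to compute but generally loose, which is why they are usually combined with tighter relaxations or branch-and-bound at verification time.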
    Bayesian Structure Learning with Generative Flow Networks. (arXiv:2202.13903v2 [cs.LG] UPDATED)
    In Bayesian structure learning, we are interested in inferring a distribution over the directed acyclic graph (DAG) structure of Bayesian networks, from data. Defining such a distribution is very challenging, due to the combinatorially large sample space, and approximations based on MCMC are often required. Recently, a novel class of probabilistic models, called Generative Flow Networks (GFlowNets), have been introduced as a general framework for generative modeling of discrete and composite objects, such as graphs. In this work, we propose to use a GFlowNet as an alternative to MCMC for approximating the posterior distribution over the structure of Bayesian networks, given a dataset of observations. Generating a sample DAG from this approximate distribution is viewed as a sequential decision problem, where the graph is constructed one edge at a time, based on learned transition probabilities. Through evaluation on both simulated and real data, we show that our approach, called DAG-GFlowNet, provides an accurate approximation of the posterior over DAGs, and it compares favorably against other methods based on MCMC or variational inference.  ( 2 min )
    Towards Robust Waveform-Based Acoustic Models. (arXiv:2110.08634v2 [cs.SD] UPDATED)
    We study the problem of learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. First, we characterize data augmentation theoretically as an instance of vicinal risk minimization, which aims at improving risk estimates during training by replacing the delta functions that define the empirical density over the input space with an approximation of the marginal population density in the vicinity of the training samples. More specifically, we assume that local neighborhoods centered at training samples can be approximated using a mixture of Gaussians, and demonstrate theoretically that this can incorporate robust inductive bias into the learning process. We then specify the individual mixture components implicitly via data augmentation schemes, designed to address common sources of spurious correlations in acoustic models. To avoid potential confounding effects on robustness due to information loss, which has been associated with standard feature extraction techniques (e.g., FBANK and MFCC features), we focus on the waveform-based setting. Our empirical results show that the approach can generalize to unseen noise conditions, with 150% relative improvement in out-of-distribution generalization compared to training using the standard risk minimization principle. Moreover, the results demonstrate competitive performance relative to models learned using a training sample designed to match the acoustic conditions characteristic of test utterances.  ( 3 min )
    Signature Methods in Machine Learning. (arXiv:2206.14674v1 [stat.ML])
    Signature-based techniques give mathematical insight into the interactions between complex streams of evolving data. These insights can be quite naturally translated into numerical approaches to understanding streamed data, and perhaps because of their mathematical precision, have proved useful in analysing streamed data in situations where the data is irregular, and not stationary, and the dimension of the data and the sample sizes are both moderate. Understanding streamed multi-modal data is exponential: a word in $n$ letters from an alphabet of size $d$ can be any one of $d^n$ messages. Signatures remove the exponential amount of noise that arises from sampling irregularity, but an exponential amount of information still remains. This survey aims to stay in the domain where that exponential scaling can be managed directly. Scalability issues are an important challenge in many problems but would require another survey article and further ideas. This survey describes a range of contexts where the data sets are small enough to remove the possibility of massive machine learning, and the existence of small sets of context free and principled features can be used effectively. The mathematical nature of the tools can make their use intimidating to non-mathematicians. The examples presented in this article are intended to bridge this communication gap and provide tractable working examples drawn from the machine learning context. Notebooks are available online for several of these examples. This survey builds on the earlier paper of Ilya Chevyrev and Andrey Kormilitzin which had broadly similar aims at an earlier point in the development of this machinery. This article illustrates how the theoretical insights offered by signatures are simply realised in the analysis of application data in a way that is largely agnostic to the data type.  ( 3 min )
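    As a rough illustration of the $d^n$ scaling, the sketch below computes the first two signature levels of a piecewise-linear path in plain Python. It is not one of the survey's notebooks; the function name signature_depth2 is an assumption, and the update rule is Chen's identity applied one linear segment at a time. Level 1 has d entries and level 2 has d^2 entries.

```python
def signature_depth2(path):
    """Levels 1 and 2 of the signature of a piecewise-linear path.

    path is a list of points, each a tuple of d coordinates.
    """
    d = len(path[0])
    s1 = [0.0] * d
    s2 = [[0.0] * d for _ in range(d)]
    for p, q in zip(path, path[1:]):
        dx = [b - a for a, b in zip(p, q)]
        for i in range(d):
            for j in range(d):
                # Chen's identity: the cross term with the path so far,
                # plus the level-2 term of one straight segment.
                s2[i][j] += s1[i] * dx[j] + 0.5 * dx[i] * dx[j]
        for i in range(d):
            s1[i] += dx[i]
    return s1, s2


# An L-shaped path in the plane: right by 1, then up by 1.
s1, s2 = signature_depth2([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])
levy_area = 0.5 * (s2[0][1] - s2[1][0])
print(s1, s2, levy_area)
```

    The antisymmetric part of level 2 is the Lévy area, which records the order in which the two coordinates moved; level 1 alone cannot distinguish this path from the straight diagonal.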
    Treatment Effect Estimation from Observational Network Data using Augmented Inverse Probability Weighting and Machine Learning. (arXiv:2206.14591v1 [stat.ME])
    Causal inference methods for treatment effect estimation usually assume independent experimental units. However, this assumption is often questionable because experimental units may interact. We develop augmented inverse probability weighting (AIPW) for estimation and inference of causal treatment effects on dependent observational data. Our framework covers very general cases of spillover effects induced by units interacting in networks. We use plugin machine learning to estimate infinite-dimensional nuisance components leading to a consistent treatment effect estimator that converges at the parametric rate and asymptotically follows a Gaussian distribution.  ( 2 min )
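    For orientation, the classical AIPW point estimate for independent units can be sketched as follows. This is the standard i.i.d. form, not the network-aware estimator developed in the paper, and the argument names (y, w, e, m1, m0) are assumptions: outcomes, binary treatment indicators, estimated propensity scores, and estimated outcome regressions under treatment and control.

```python
def aipw_ate(y, w, e, m1, m0):
    """Classical AIPW estimate of the average treatment effect (i.i.d. units)."""
    total = 0.0
    for yi, wi, ei, m1i, m0i in zip(y, w, e, m1, m0):
        # Outcome-model difference plus inverse-probability-weighted residuals.
        total += (
            m1i - m0i
            + wi * (yi - m1i) / ei
            - (1 - wi) * (yi - m0i) / (1.0 - ei)
        )
    return total / len(y)


# Hypothetical data with outcome models that fit perfectly.
y = [3.0, 1.0, 4.0, 2.0]
w = [1, 0, 1, 0]
e = [0.5, 0.5, 0.5, 0.5]
print(aipw_ate(y, w, e, m1=[3.0, 2.0, 4.0, 3.0], m0=[2.0, 1.0, 3.0, 2.0]))
```

    When the outcome models are replaced by zeros, the same function reduces to plain inverse probability weighting, which is one way to see the estimator as an augmentation of IPW.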
    When Do Extended Physics-Informed Neural Networks (XPINNs) Improve Generalization?. (arXiv:2109.09444v5 [cs.LG] UPDATED)
    Physics-informed neural networks (PINNs) have become a popular choice for solving high-dimensional partial differential equations (PDEs) due to their excellent approximation power and generalization ability. Recently, Extended PINNs (XPINNs) based on domain decomposition methods have attracted considerable attention due to their effectiveness in modeling multiscale and multiphysics problems and their parallelization. However, theoretical understanding of their convergence and generalization properties remains unexplored. In this study, we take an initial step towards understanding how and when XPINNs outperform PINNs. Specifically, for general multi-layer PINNs and XPINNs, we first provide a prior generalization bound via the complexity of the target functions in the PDE problem, and a posterior generalization bound via the posterior matrix norms of the networks after optimization. Moreover, based on our bounds, we analyze the conditions under which XPINNs improve generalization. Concretely, our theory shows that the key building block of XPINN, namely the domain decomposition, introduces a tradeoff for generalization. On the one hand, XPINNs decompose the complex PDE solution into several simple parts, which decreases the complexity needed to learn each part and boosts generalization. On the other hand, decomposition leads to less training data being available in each subdomain, and hence such a model is typically prone to overfitting and may become less generalizable. Empirically, we choose five PDEs to show when XPINNs perform better than, similar to, or worse than PINNs, hence demonstrating and justifying our new theory.  ( 3 min )
    MurTree: Optimal Classification Trees via Dynamic Programming and Search. (arXiv:2007.12652v4 [cs.LG] UPDATED)
    Decision tree learning is a widely used approach in machine learning, favoured in applications that require concise and interpretable models. Heuristic methods are traditionally used to quickly produce models with reasonably high accuracy. A commonly criticised point, however, is that the resulting trees may not necessarily be the best representation of the data in terms of accuracy and size. In recent years, this motivated the development of optimal classification tree algorithms that globally optimise the decision tree in contrast to heuristic methods that perform a sequence of locally optimal decisions. We follow this line of work and provide a novel algorithm for learning optimal classification trees based on dynamic programming and search. Our algorithm supports constraints on the depth of the tree and number of nodes. The success of our approach is attributed to a series of specialised techniques that exploit properties unique to classification trees. Whereas algorithms for optimal classification trees have traditionally been plagued by high runtimes and limited scalability, we show in a detailed experimental study that our approach uses only a fraction of the time required by the state-of-the-art and can handle datasets with tens of thousands of instances, providing several orders of magnitude improvements and notably contributing towards the practical realisation of optimal decision trees.  ( 3 min )
    Beyond neural scaling laws: beating power law scaling via data pruning. (arXiv:2206.14486v1 [cs.LG])
    Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, these improvements through scaling alone require considerable costs in compute and energy. Here we focus on the scaling of error with dataset size and show how both in theory and practice we can break beyond power law scaling and reduce it to exponential scaling instead if we have access to a high-quality data pruning metric that ranks the order in which training examples should be discarded to achieve any pruned dataset size. We then test this new exponential scaling prediction with pruned dataset size empirically, and indeed observe better than power law scaling performance on ResNets trained on CIFAR-10, SVHN, and ImageNet. Given the importance of finding high-quality pruning metrics, we perform the first large-scale benchmarking study of ten different data pruning metrics on ImageNet. We find most existing high performing metrics scale poorly to ImageNet, while the best are computationally intensive and require labels for every image. We therefore developed a new simple, cheap and scalable self-supervised pruning metric that demonstrates comparable performance to the best supervised metrics. Overall, our work suggests that the discovery of good data-pruning metrics may provide a viable path forward to substantially improved neural scaling laws, thereby reducing the resource costs of modern deep learning.  ( 3 min )
    Prediction Errors for Penalized Regressions based on Generalized Approximate Message Passing. (arXiv:2206.12832v2 [stat.ML] UPDATED)
    We discuss the prediction accuracy of assumed statistical models in terms of prediction errors for the generalized linear model and penalized maximum likelihood methods. We derive the forms of estimators for the prediction errors: $C_p$ criterion, information criteria, and leave-one-out cross validation (LOOCV) error, using the generalized approximate message passing (GAMP) algorithm and replica method. These estimators coincide with each other when the number of model parameters is sufficiently small; however, there is a discrepancy between them in particular in the overparametrized region where the number of model parameters is larger than the data dimension. In this paper, we review the prediction errors and corresponding estimators, and discuss their differences. In the framework of GAMP, we show that the information criteria can be expressed by using the variance of the estimates. Further, we demonstrate how to approach LOOCV error from the information criteria by utilizing the expression provided by GAMP.  ( 2 min )
    Linear Model Against Malicious Adversaries with Local Differential Privacy. (arXiv:2202.02448v2 [cs.CR] UPDATED)
    Scientific collaborations benefit from collaborative learning of distributed sources, but remain difficult to achieve when data are sensitive. In recent years, privacy preserving techniques have been widely studied to analyze distributed data across different agencies while protecting sensitive information. Most existing privacy preserving techniques are designed to resist semi-honest adversaries and require intense computation to perform data analysis. Secure collaborative learning is significantly more difficult in the presence of malicious adversaries who may deviate from the secure protocol. Another challenge is to maintain high computation efficiency with privacy protection. In this paper, matrix encryption is applied to encrypt data such that the schemes are secure against malicious adversaries, including chosen plaintext attack, known plaintext attack, and collusion attack. The encryption scheme also achieves local differential privacy. Moreover, cross validation is studied to prevent overfitting without additional communication cost. Empirical experiments on real-world datasets demonstrate that the proposed schemes are computationally efficient compared to existing techniques against malicious adversaries and the semi-honest model.  ( 2 min )
    Score Matching for Truncated Density Estimation on a Manifold. (arXiv:2206.14668v1 [stat.ME])
    When observations are truncated, we are limited to an incomplete picture of our dataset. Recent methods deal with truncated density estimation problems by turning to score matching, where the access to the intractable normalising constant is not required. We present a novel extension to truncated score matching for a Riemannian manifold. Applications are presented for the von Mises-Fisher and Kent distributions on a two-dimensional sphere in $\mathbb{R}^3$, as well as a real-world application of extreme storm observations in the USA. In simulated data experiments, our score matching estimator is able to approximate the true parameter values with a low estimation error and shows improvements over a maximum likelihood estimator.  ( 2 min )
    Cyclical Kernel Adaptive Metropolis. (arXiv:2206.14421v1 [cs.LG])
    We propose cKAM, cyclical Kernel Adaptive Metropolis, which incorporates a cyclical stepsize scheme to allow control for exploration and sampling. We show that on a crafted bimodal distribution, existing Adaptive Metropolis type algorithms would fail to converge to the true posterior distribution. We point out that this is because adaptive samplers estimate the local/global covariance structure using the past history of the chain, which leads adaptive algorithms to become trapped in a local mode. We demonstrate that cKAM encourages exploration of the posterior distribution and allows the sampler to escape from a local mode, while maintaining the high performance of adaptive methods.  ( 2 min )
    When Does Group Invariant Learning Survive Spurious Correlations?. (arXiv:2206.14534v1 [cs.LG])
    By inferring latent groups in the training data, recent works introduce invariant learning to the case where environment annotations are unavailable. Typically, learning group invariance under a majority/minority split is empirically shown to be effective in improving out-of-distribution generalization on many datasets. However, theoretical guarantees that these methods learn invariant mechanisms are lacking. In this paper, we reveal the insufficiency of existing group invariant learning methods in preventing classifiers from depending on spurious correlations in the training set. Specifically, we propose two criteria for judging such sufficiency. Theoretically and empirically, we show that existing methods can violate both criteria and thus fail in generalizing to spurious correlation shifts. Motivated by this, we design a new group invariant learning method, which constructs groups with statistical independence tests, and reweights samples by group label proportion to meet the criteria. Experiments on both synthetic and real data demonstrate that the new method significantly outperforms existing group invariant learning methods in generalizing to spurious correlation shifts.  ( 2 min )
    Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution. (arXiv:2009.14108v2 [cs.LG] UPDATED)
    Reinforcement learning algorithms require many samples when solving complex hierarchical tasks with sparse and delayed rewards. For such complex tasks, the recently proposed RUDDER uses reward redistribution to leverage steps in the Q-function that are associated with accomplishing sub-tasks. However, often only few episodes with high rewards are available as demonstrations since current exploration strategies cannot discover them in reasonable time. In this work, we introduce Align-RUDDER, which utilizes a profile model for reward redistribution that is obtained from multiple sequence alignment of demonstrations. Consequently, Align-RUDDER employs reward redistribution effectively and, thereby, drastically improves learning on few demonstrations. Align-RUDDER outperforms competitors on complex artificial tasks with delayed rewards and few demonstrations. On the Minecraft ObtainDiamond task, Align-RUDDER is able to mine a diamond, though not frequently. Code is available at https://github.com/ml-jku/align-rudder. YouTube: https://youtu.be/HO-_8ZUl-UY  ( 2 min )
    Can Push-forward Generative Models Fit Multimodal Distributions?. (arXiv:2206.14476v1 [stat.ML])
    Many generative models synthesize data by transforming a standard Gaussian random variable using a deterministic neural network. Among these models are the Variational Autoencoders and the Generative Adversarial Networks. In this work, we call them "push-forward" models and study their expressivity. We show that the Lipschitz constant of these generative networks has to be large in order to fit multimodal distributions. More precisely, we show that the total variation distance and the Kullback-Leibler divergence between the generated and the data distribution are bounded from below by a constant depending on the mode separation and the Lipschitz constant. Since constraining the Lipschitz constants of neural networks is a common way to stabilize generative models, there is a provable trade-off between the ability of push-forward models to approximate multimodal distributions and the stability of their training. We validate our findings on one-dimensional and image datasets and empirically show that generative models consisting of stacked networks with stochastic input at each step, such as diffusion models, do not suffer from such limitations.  ( 2 min )
    An Auto-Regressive Formulation for Smoothing and Moving Mean with Exponentially Tapered Windows. (arXiv:2206.14749v1 [cs.LG])
    We investigate an auto-regressive formulation for the problem of smoothing time-series by manipulating the inherent objective function of the traditional moving mean smoothers. Not only do the auto-regressive smoothers enforce a higher degree of smoothing, but they are also just as efficient as the traditional moving means and can be optimized accordingly with respect to the input dataset. Interestingly, the auto-regressive models result in moving means with exponentially tapered windows.  ( 2 min )
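    The link between an auto-regressive recursion and an exponentially tapered window can be checked numerically. The sketch below uses the textbook first-order recursion y_t = (1 - alpha) * y_{t-1} + alpha * x_t with zero initial state, which is an assumption here and not necessarily the formulation derived in the paper; the function names are illustrative.

```python
def ar_smooth(x, alpha):
    """First-order auto-regressive smoother with zero initial state."""
    y, prev = [], 0.0
    for v in x:
        prev = (1.0 - alpha) * prev + alpha * v
        y.append(prev)
    return y


def tapered_mean(x, alpha, t):
    """Weighted sum over the past with weights alpha * (1 - alpha) ** k."""
    return sum(alpha * (1.0 - alpha) ** k * x[t - k] for k in range(t + 1))


x = [1.0, 2.0, 3.0, 4.0]
smoothed = ar_smooth(x, alpha=0.5)
windowed = [tapered_mean(x, 0.5, t) for t in range(len(x))]
print(smoothed, windowed)
```

    Both lists agree, so the recursion costs one multiply-add per sample while behaving like a moving mean whose window weights decay geometrically with lag.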
    Adjoint-aided inference of Gaussian process driven differential equations. (arXiv:2202.04589v2 [stat.ML] UPDATED)
    Linear systems occur throughout engineering and the sciences, most notably as differential equations. In many cases the forcing function for the system is unknown, and interest lies in using noisy observations of the system to infer the forcing, as well as other unknown parameters. In differential equations, the forcing function is an unknown function of the independent variables (typically time and space), and can be modelled as a Gaussian process (GP). In this paper we show how the adjoint of a linear system can be used to efficiently infer forcing functions modelled as GPs, using a truncated basis expansion of the GP kernel. We show how exact conjugate Bayesian inference for the truncated GP can be achieved, in many cases with substantially lower computation than would be required using MCMC methods. We demonstrate the approach on systems of both ordinary and partial differential equations, and show that the basis expansion approach approximates well the true forcing with a modest number of basis vectors. Finally, we show how to infer point estimates for the non-linear model parameters, such as the kernel length-scales, using Bayesian optimisation.  ( 2 min )
    Approximate Data Deletion in Generative Models. (arXiv:2206.14439v1 [cs.LG])
    Users have the right to have their data deleted by third-party learned systems, as codified by recent legislation such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Such data deletion can be accomplished by full re-training, but this incurs a high computational cost for modern machine learning models. To avoid this cost, many approximate data deletion methods have been developed for supervised learning. Unsupervised learning, in contrast, remains largely an open problem when it comes to (approximate or exact) efficient data deletion. In this paper, we propose a density-ratio-based framework for generative models. Using this framework, we introduce a fast method for approximate data deletion and a statistical test for estimating whether or not training points have been deleted. We provide theoretical guarantees under various learner assumptions and empirically demonstrate our methods across a variety of generative methods.  ( 2 min )
    Open Problem: Properly learning decision trees in polynomial time?. (arXiv:2206.14431v1 [cs.DS])
    The authors recently gave an $n^{O(\log\log n)}$ time membership query algorithm for properly learning decision trees under the uniform distribution (Blanc et al., 2021). The previous fastest algorithm for this problem ran in $n^{O(\log n)}$ time, a consequence of Ehrenfeucht and Haussler (1989)'s classic algorithm for the distribution-free setting. In this article we highlight the natural open problem of obtaining a polynomial-time algorithm, discuss possible avenues towards obtaining it, and state intermediate milestones that we believe are of independent interest.  ( 2 min )
    A Perturbation Bound on the Subspace Estimator from Canonical Projections. (arXiv:2206.14278v1 [stat.ML])
    This paper derives a perturbation bound on the optimal subspace estimator obtained from a subset of its canonical projections contaminated by noise. This fundamental result has important implications in matrix completion, subspace clustering, and related problems.  ( 2 min )
    Active Exploration via Experiment Design in Markov Chains. (arXiv:2206.14332v1 [cs.LG])
    A key challenge in science and engineering is to design experiments to learn about some unknown quantity of interest. Classical experimental design optimally allocates the experimental budget to maximize a notion of utility (e.g., reduction in uncertainty about the unknown quantity). We consider a rich setting, where the experiments are associated with states in a Markov chain, and we can only choose them by selecting a policy controlling the state transitions. This problem captures important applications, from exploration in reinforcement learning to spatial monitoring tasks. We propose an algorithm -- markov-design -- that efficiently selects policies whose measurement allocation provably converges to the optimal one. The algorithm is sequential in nature, adapting its choice of policies (experiments) informed by past measurements. In addition to our theoretical analysis, we showcase our framework on applications in ecological surveillance and pharmacology.  ( 2 min )
    Theoretical Perspectives on Deep Learning Methods in Inverse Problems. (arXiv:2206.14373v1 [stat.ML])
    In recent years, there have been significant advances in the use of deep learning methods in inverse problems such as denoising, compressive sensing, inpainting, and super-resolution. While this line of work has predominantly been driven by practical algorithms and experiments, it has also given rise to a variety of intriguing theoretical problems. In this paper, we survey some of the prominent theoretical developments in this line of work, focusing in particular on generative priors, untrained neural network priors, and unfolding algorithms. In addition to summarizing existing results in these topics, we highlight several ongoing challenges and open problems.  ( 2 min )
    Matryoshka: Stealing Functionality of Private ML Data by Hiding Models in Model. (arXiv:2206.14371v1 [stat.ML])
    In this paper, we present a novel insider attack called Matryoshka, which employs an irrelevant scheduled-to-publish DNN model as a carrier model for covert transmission of multiple secret models which memorize the functionality of private ML data stored in local data centers. Instead of treating the parameters of the carrier model as bit strings and applying conventional steganography, we devise a novel parameter sharing approach which exploits the learning capacity of the carrier model for information hiding. Matryoshka simultaneously achieves: (i) High Capacity -- With almost no utility loss of the carrier model, Matryoshka can hide a 26x larger secret model or 8 secret models of diverse architectures spanning different application domains in the carrier model, neither of which can be done with existing steganography techniques; (ii) Decoding Efficiency -- after downloading the published carrier model, an outside colluder can exclusively decode the hidden models from the carrier model with only several integer secrets and the knowledge of the hidden model architecture; (iii) Effectiveness -- almost all the recovered models perform as well as if they had been trained independently on the private data; (iv) Robustness -- Information redundancy is naturally implemented to achieve resilience against common post-processing techniques on the carrier before its publishing; (v) Covertness -- A model inspector with different levels of prior knowledge could hardly differentiate a carrier model from a normal model.  ( 3 min )
    Optimal Estimation of Generic Dynamics by Path-Dependent Neural Jump ODEs. (arXiv:2206.14284v1 [stat.ML])
    This paper studies the problem of forecasting general stochastic processes using an extension of the Neural Jump ODE (NJ-ODE) framework. While NJ-ODE was the first framework to establish convergence guarantees for the prediction of irregularly observed time-series, these results were limited to data stemming from It\^o-diffusions with complete observations, in particular Markov processes where all coordinates are observed simultaneously. In this work, we generalise these results to generic, possibly non-Markovian or discontinuous, stochastic processes with incomplete observations, by utilising the reconstruction properties of the signature transform. These theoretical results are supported by empirical studies, where it is shown that the path-dependent NJ-ODE outperforms the original NJ-ODE framework in the case of non-Markovian data.  ( 2 min )
    Target alignment in truncated kernel ridge regression. (arXiv:2206.14255v1 [cs.LG])
    Kernel ridge regression (KRR) has recently attracted renewed interest due to its potential for explaining the transient effects, such as double descent, that emerge during neural network training. In this work, we study how the alignment between the target function and the kernel affects the performance of the KRR. We focus on the truncated KRR (TKRR) which utilizes an additional parameter that controls the spectral truncation of the kernel matrix. We show that for polynomial alignment, there is an \emph{over-aligned} regime, in which TKRR can achieve a faster rate than what is achievable by full KRR. The rate of TKRR can improve all the way to the parametric rate, while that of full KRR is capped at a sub-optimal value. This shows that target alignment can be better leveraged by utilizing spectral truncation in kernel methods. We also consider the bandlimited alignment setting and show that the regularization surface of TKRR can exhibit transient effects including multiple descent and non-monotonic behavior. Our results show that there is a strong and quantifiable relation between the shape of the \emph{alignment spectrum} and the generalization performance of kernel methods, both in terms of rates and in finite samples.  ( 2 min )
    Intrinsic Anomaly Detection for Multi-Variate Time Series. (arXiv:2206.14342v1 [cs.LG])
    We introduce a novel, practically relevant variation of the anomaly detection problem in multi-variate time series: intrinsic anomaly detection. It appears in diverse practical scenarios ranging from DevOps to IoT, where we want to recognize failures of a system that operates under the influence of a surrounding environment. Intrinsic anomalies are changes in the functional dependency structure between time series that represent an environment and time series that represent the internal state of a system that is placed in said environment. We formalize this problem, provide under-studied public and new purpose-built data sets for it, and present methods that handle intrinsic anomaly detection. These address the shortcoming of existing anomaly detection methods that cannot differentiate between expected changes in the system's state and unexpected ones, i.e., changes in the system that deviate from the environment's influence. Our most promising approach is fully unsupervised and combines adversarial learning and time series representation learning, thereby addressing problems such as label sparsity and subjectivity, while making it possible to navigate and improve notoriously problematic anomaly detection data sets.  ( 2 min )
    No imputation without representation. (arXiv:2206.14254v1 [cs.LG])
    By filling in missing values in datasets, imputation allows these datasets to be used with algorithms that cannot handle missing values by themselves. However, missing values may in principle contribute useful information that is lost through imputation. The missing-indicator approach can be used in combination with imputation to instead represent this information as a part of the dataset. There are several theoretical considerations why missing-indicators may or may not be beneficial, but there has not been any large-scale practical experiment on real-life datasets to test this question for machine learning predictions. We perform this experiment for three imputation strategies and a range of different classification algorithms, on the basis of twenty real-life datasets. We find that on these datasets, missing-indicators generally increase classification performance. In addition, we find no evidence for most algorithms that nearest neighbour and iterative imputation lead to better performance than simple mean/mode imputation. Therefore, we recommend the use of missing-indicators with mean/mode imputation as a safe default, with the caveat that for decision trees, pruning is necessary to prevent overfitting. In a follow-up experiment, we determine attribute-specific missingness thresholds for each classifier above which missing-indicators are more likely than not to increase classification performance, and observe that these thresholds are much lower for categorical than for numerical attributes. Finally, we argue that mean imputation of numerical attributes may preserve some of the information from missing values, and we show that in the absence of missing-indicators, it can similarly be useful to apply mean imputation to one-hot encoded categorical attributes instead of mode imputation.  ( 3 min )

  • Open

    [P] Some best-practice questions about my first project; predicting how much I will enjoy backpacking different trails
    I like to go backpacking (multi-day hikes) and I want to build a model to predict how much I will enjoy the trails on my watchlist. I understand it is silly to predict how I will subjectively experience something, but it seems fun to see what gets spat out. I just have some questions on best practice. This was the best dataset I could find. It isn't perfect. I mainly hike in Canada and I only care about backpacking trails, whereas this dataset is about trails in the USA and only ~800/3000 are backpacking trails. From it I can get the following features: latitude, longitude, length, elevation gain, route type, waterfall (boolean), lake (boolean), river (boolean), forest (boolean), cave (boolean), backpacking (boolean), and rating (according to AllTrails.com; this is what I will be predicting for my watchlist trails). Another problem with this dataset is with the 5 boolean traits (waterfall, lake, river, forest, cave). If it is unknown whether a trail qualifies for any of these traits, the trait will be set to false. Also, the rating values have been rounded to the nearest 0.5 (on a scale from 0-5). I just have to make the best of it, I couldn't find a better dataset. The plan is to personalize the model to me. I'm going to add my completed trails to the dataset and give each a personal rating. Then I'll add a new feature, called something like "isMe", which will be 1 for my trails and 0 otherwise. Now, time for questions: (1) Does it make sense to use latitude and longitude when I don't hike in the area covered by the dataset? (2) Should I cut the ~2200/3000 rows from the dataset that aren't backpacking trails, since I only want to predict the rating for backpacking trails? (3) Since the rating values have been binned, would that mean I am predicting a category or a numerical value? These are only the questions I can think to ask. Feel free to hit me with any other pointers you have to make this silly model as accurate as I can! submitted by /u/JamesonLKJ [link] [comments]  ( 86 min )
    [P] Neural Network Steganography (implementation) - Hiding secrets and malicious software in any neural network
    I saw a paper called EvilModel on how to hide malicious code in a neural network, since we have thousands or millions of parameters that we can alter. This basic technique is based on modifying the float32 values (but can be adapted to float16), where we modify the fraction bits or part of the fraction. Links: post/tutorial on the process, GitHub repo for the project, EvilModel paper. As I saw in my experiments, we could easily hide megabytes of code in a simple ResNet50 and get away with it. A well-trained (and generalized) network should not degrade in performance significantly. The testing of that is planned for a future post. Also, this method could be used for watermarking neural network weights, which could help with copyright claims (e.g., someone is using your open-sourced (and appropriately licensed) weights out of the box in a commercial product). submitted by /u/gabegabe6 [link] [comments]  ( 89 min )
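The fraction-bit idea described in the post can be sketched with the standard library alone. This is a minimal illustration, not the EvilModel or repository implementation: the choice of the 8 least-significant fraction bits, and the helper names `embed_byte`, `extract_byte`, `hide`, and `reveal`, are assumptions made for the example.

```python
import struct


def embed_byte(weight: float, secret: int) -> float:
    """Overwrite the 8 least-significant fraction bits of a float32 with `secret`."""
    if not 0 <= secret <= 0xFF:
        raise ValueError("secret must fit in one byte")
    # Reinterpret the float32 encoding as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", weight))
    bits = (bits & 0xFFFFFF00) | secret
    (stego,) = struct.unpack("<f", struct.pack("<I", bits))
    return stego


def extract_byte(weight: float) -> int:
    """Read back the 8 least-significant fraction bits of a float32."""
    (bits,) = struct.unpack("<I", struct.pack("<f", weight))
    return bits & 0xFF


def hide(weights, payload: bytes):
    """Hide one payload byte per weight (hypothetical layout: first len(payload) weights)."""
    if len(payload) > len(weights):
        raise ValueError("payload does not fit in the given weights")
    out = list(weights)
    for i, byte in enumerate(payload):
        out[i] = embed_byte(out[i], byte)
    return out


def reveal(weights, n_bytes: int) -> bytes:
    """Recover the first n_bytes hidden by `hide`."""
    return bytes(extract_byte(w) for w in weights[:n_bytes])


weights = [0.5, -1.25, 0.003, 3.14159]
stego = hide(weights, b"hi!")
print(reveal(stego, 3))
```

Because only the low 8 of the 23 fraction bits change, each normal (non-zero, non-denormal) weight moves by a relative amount below 2^-15, which is the intuition behind the post's claim that accuracy should not degrade much. Whether that holds for a given network is something the post itself leaves for future testing.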
    [D] Training GANs with non-square images
    I am planning to train StyleGAN2-ADA with rectangular images (aspect ratio = 16:9). Is it better to use (zero) padding, resizing, or train a rectangular GAN? Thank you very much! submitted by /u/antarfrica [link] [comments]  ( 84 min )
    [D] Mixed Precision Training: Difference between BF16 and FP16
    What differences in model performance, speed, memory, etc. can I expect between choosing BF16 or FP16 for mixed precision training? Is BF16 faster, or does it consume less memory? I have seen people say it is "more suitable for Deep Learning". Why is that the case? submitted by /u/optimized-adam [link] [comments]  ( 87 min )
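As background for the question above, the two formats differ in how they split their 16 bits: FP16 uses 1 sign, 5 exponent, and 10 fraction bits, while BF16 uses 1 sign, 8 exponent, and 7 fraction bits. Both occupy the same memory. The sketch below, using only the standard library, illustrates the resulting trade-off between range and precision. The helper names are hypothetical, and BF16 conversion is done by truncation for brevity, whereas real hardware typically rounds.

```python
import struct


def fp16_round(x: float) -> float:
    """Round x to the nearest IEEE half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def bf16_truncate(x: float) -> float:
    """Keep the top 16 bits of the float32 encoding (BF16 by truncation, an assumption)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


# Largest finite values and spacing of values just above 1.0, from the bit layouts.
FP16_MAX = (2 - 2**-10) * 2.0**15   # 65504.0
BF16_MAX = (2 - 2**-7) * 2.0**127   # roughly 3.39e38, same order as float32
FP16_EPS = 2.0**-10
BF16_EPS = 2.0**-7

# Precision: FP16 keeps more digits near 1.0 than BF16.
print(fp16_round(1.001), bf16_truncate(1.001))

# Range: a small gradient-sized value underflows to zero in FP16 but survives in BF16.
print(fp16_round(1e-8), bf16_truncate(1e-8))
```

The wider exponent range is the usual reason BF16 is described as better suited to deep learning: small gradients that would underflow in FP16 (which is why FP16 training commonly relies on loss scaling) remain representable. Any speed difference depends on the hardware, so this sketch makes no claim about it.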
    [D] AI & Big Data Expo; worth it?
    Interested in AI/Machine learning research, hoping to check out their NA expo to learn more. Has anyone here ever been to one of their conventions? What were your experiences like? submitted by /u/nyxrat [link] [comments]  ( 84 min )
    [R] Use pretrained GANs and image classifier to generate images of the class
    Pretrained GANs and CLIP embeddings have been used to create images from arbitrary captions, by backpropagating the CLIP similarity between the caption and the generated image down to the generator's input noise. I am thinking of something simpler, where I would take a pretrained GAN and backpropagate through some pretrained classifier (e.g. an ImageNet classifier) down to the generator's input noise to generate images of that class. Is there any reference that does that? And more generally, I want to understand why this approach works: simply backpropagating the classifier loss to the image (and not through the generator) typically results in deep-dream-style weird images. Why does this not happen when using a generator? Is it simply because the output of the generator lives in the manifold of "real" images? Is there more to it? Thanks in advance submitted by /u/ml_rl_questions [link] [comments]  ( 86 min )
    [D] What are the lessons learned in preparing the dataset you will use to train a GAN?
    Hello friends, what are the key points to pay attention to when preparing a dataset for GANs? Do you have any suggestions? For example: what the distribution of the dataset should look like, whether the images should all be the same size, whether it is important to resize all the images to that size, and anything else that I have not thought of yet. What are your recommendations? submitted by /u/metover [link] [comments]  ( 84 min )
    [P] Unofficial Gato in TensorFlow
    https://github.com/OrigamiDream/gato I am building an imitation of DeepMind's Gato in TensorFlow. All necessary layers have been completely implemented. However, I have no idea how to map out the training strategy, and I do not have enough datasets for this. The model seems impossible to train end-to-end because of its conditional and selective tokenizer and embeddings, and differentiable programming. If you are interested in this project, add a star and notification to this repository for further updates. Anyone who wants to contribute to this project, please create a relevant issue or pull request. Thank you. submitted by /u/AvisStudio [link] [comments]  ( 85 min )
    What is the essence of Diffusion models? [D]
    Coming from a math/stats background the point of much of the machine learning literature can take time to understand fully, in particular I have a couple of quick (interconnected) questions regarding the essence of Diffusion models that I hope somebody may answer (of the many blog posts I have read I can't seem to find a clear answer). As a reference let me take the seminal paper of Ho et al. https://arxiv.org/abs/2006.11239 When fixing the coefficients $\beta_1, \dots, \beta_T$ that govern the forward diffusion process (treating them as hyperparameters) can't we, at least in simple cases, already recover the reverse diffusion process in closed form? If yes why do we even need to find the reverse diffusion process through an optimization procedure when we already have it in closed form? I have read that diffusion models should perform a dimensionality reduction on the data but, even understanding the mathematics, I can't understand how the dimensionality reduction is being achieved by learning the reverse process. What is the usefulness behind the whole procedure? If the forward process converges to an isotropic Gaussian (it destroys all the structure in the data) how can we hope to learn anything significant from it if it becomes simply a bunch of noise. (I suspect that the answer to this question is that we always stop the forward process before it becomes its limit) Thanks to anyone that can clear up these doubts of mine. submitted by /u/Mon0o0 [link] [comments]  ( 85 min )
  • Open

    The Persistence Problem: Lessons learned from illustrating a children's book with GPT-3 and crAIyon.
    submitted by /u/laul_pogan [link] [comments]  ( 83 min )
    "A magical forest full of colourful mushrooms" 🍄 Created on Pixelz.ai
    submitted by /u/pixelz_ai [link] [comments]  ( 82 min )
    Brain Power Level AI Supercomputer With 174 Trillion Parameters | AI Robot Arm Learns By Vision | New System To Train Autonomous Vehicles | Brain Tumor Detection AI Outperforms Humans
    submitted by /u/getrich_or_diemining [link] [comments]  ( 83 min )
    Generating Children's Stories Using GPT-3 and DALL·E
    submitted by /u/BB4evaTB12 [link] [comments]  ( 82 min )
    AI benchmark MLPerf: Nvidia dominates, but Graphcore establishes itself
    submitted by /u/much_successes [link] [comments]  ( 82 min )
    Who needs midjourney invites
    Recently got more added, hmu if you need one submitted by /u/Chemical-Exchange466 [link] [comments]  ( 83 min )
    A Step-by-Step Walkthrough Neural Networks for Time-series Forecasting
    submitted by /u/lucapiccinelli [link] [comments]  ( 82 min )
    Generating "Levels" from data and rules using Artificial Intelligence?
    What is the best approach to creating a video game level (for simplicity's sake, just a list of positions/vectors) based on a database of already existing levels, and a set of constraints? My biggest problem is creating an AI that has no input layer, and also a variable length output. If you have any ideas, please let me know (: submitted by /u/iLoveNintend0 [link] [comments]  ( 83 min )
    45 worked examples in machine learning (energy, medicine, banking, retail, physics, finance...)
    submitted by /u/datapablo [link] [comments]  ( 82 min )
    Tutorial Warp/Flow
    Just a basic tutorial on using a starting video and a demo of the warp/flow. Working on upscaling some videos soon too that will use both warp flow and 3d animation that are looking cool so far. https://www.youtube.com/watch?v=VN6dgVjzOq0 https://preview.redd.it/eul1u8v33j891.jpg?width=1920&format=pjpg&auto=webp&s=0810a191105c52bfa0d04923c9e1ba0b366940a8 submitted by /u/prfitofthesngularity [link] [comments]  ( 83 min )
    Advanced Endpoint Intelligence
    submitted by /u/Peter909098 [link] [comments]  ( 82 min )
    [P] Open source that takes as input a deep learning model and outputs a version that runs faster in inference. Now faster and easier to use (New release)
    nebullvm is an open-source library that takes an AI model as input and outputs an optimized version that runs much faster on your hardware, usually achieving 2 to 5 times faster inference without losing accuracy (benchmarks below for Option A), or even more if you specify that you are willing to sacrifice some accuracy for a lighter model with even lower latency, using compression techniques (Option B, leveraging multiple quantization methods [1], soon also pruning [2] and more) https://github.com/nebuly-ai/nebullvm nebullvm now supports also PyTorch and TensorFlow backends that, together with the already supported deep learning compilers (including ONNX runtime [3], TensorRT [4], OpenVINO [5], Apache TVM [6]), will optimize how your model is mapped to your hardware. Together these techniques will allow nebullvm to explore more paths and find the best way to make the most of your hardware's computing capabilities, making inference as fast as it can run. You can run nebullvm in just a few lines of code, and after many requests from users, I simplified the installation of these deep learning compilers. In addition to the option of installing all compilers with a single command, it is now possible to skip the installation to pull Docker images with compilers already preinstalled. Discover more here. Many more releases are on the way. And if you have questions, ideas and product suggestions, I'm more than happy to discuss them here! And don't forget to leave a small star for all the open-source work to make DL optimization techniques more accessible :) https://preview.redd.it/h9rshzajhh891.png?width=1480&format=png&auto=webp&s=e4d213434a6b1f949751c4b423fe3bc581a1977d [1] Quantization. Techniques and Concept Map. [2] Pruning. Techniques and Concept Map. [3] ONNX Runtime [4] Nvidia TensorRT [5] Intel OpenVINO [6] Apache TVM submitted by /u/emilec___ [link] [comments]  ( 84 min )
    online furniture buying idea: is there a way to guess estimate the length width depth of a room just by taking a photo of it?
    submitted by /u/wilsonckao [link] [comments]  ( 83 min )
    Are there any really good story AI's?
    I have tried a few but most seem very random and unable to really make an OK story. I'm looking for an AI that could maybe be used to get a story started? Maybe AI has just not reached the point where it can do this yet? submitted by /u/ryan7251 [link] [comments]  ( 83 min )
    Disco Diffusion Warp
    I am going to be posting a video later looking at Disco Diffusion's Warp/Flow along with a basic tutorial on using init videos. Here are a couple of stills from 2 of the videos and some weekly images, all created with Disco Diffusion. https://preview.redd.it/eu3esyv9hg891.png?width=2560&format=png&auto=webp&s=36556998464bbf57694b1be9782632902856fd51 https://preview.redd.it/e82ynuv9hg891.png?width=2560&format=png&auto=webp&s=6dbbf93362a81d4ce897e61e316c8eb68abc7d95 https://preview.redd.it/b9fehwv9hg891.png?width=1920&format=png&auto=webp&s=fc70a0fa607dd4125dd47cce40b4fd820acb9640 submitted by /u/prfitofthesngularity [link] [comments]  ( 82 min )
  • Open

    Use a custom image to bring your own development environment to RStudio on Amazon SageMaker
    RStudio on Amazon SageMaker is the industry’s first fully managed RStudio Workbench in cloud. You can quickly launch the familiar RStudio integrated development environment (IDE), and dial up and down the underlying compute resources without interrupting your work, making it easy to build machine learning (ML) and analytics solutions in R at scale. RStudio on […]  ( 11 min )
    Text classification for online conversations with machine learning on AWS
    Online conversations are ubiquitous in modern life, spanning industries from video games to telecommunications. This has led to an exponential growth in the amount of online conversation data, which has helped in the development of state-of-the-art natural language processing (NLP) systems like chatbots and natural language generation (NLG) models. Over time, various NLP techniques for […]  ( 11 min )
    Hyperparameter optimization for fine-tuning pre-trained transformer models from Hugging Face
    Large attention-based transformer models have obtained massive gains on natural language processing (NLP). However, training these gigantic networks from scratch requires a tremendous amount of data and compute. For smaller NLP datasets, a simple yet effective strategy is to use a pre-trained transformer, usually trained in an unsupervised fashion on very large datasets, and fine-tune […]  ( 7 min )
    Diagnose model performance before deployment for Amazon Fraud Detector
    With the growth in adoption of online applications and the rising number of internet users, digital fraud is on the rise year over year. Amazon Fraud Detector provides a fully managed service to help you better identify potentially fraudulent online activities using advanced machine learning (ML) techniques, and more than 20 years of fraud detection […]  ( 17 min )
  • Open

    What are the top journals for reinforcement learning?
    Hello, I was searching for journals dedicated only to reinforcement learning. To my disappointment, I found none. I expected there would be one or two, as there are some journals that focus on neural networks. May I ask for some recommendations on journals for RL? I want to read about the state of the art and get some ideas for my research. I plan to publish my research at the end of the year. It is a study on state representations and models for optimizing an n-step process. So far I have found that a similar approach has been published in the IEEE. Would you be kind enough to recommend me some journals for RL? Is there any ranking that shows the difficulty of publishing in each of the journals? I am new to publishing. Thanks in advance. submitted by /u/ElvishChampion [link] [comments]  ( 83 min )
    Any academic source about Q-table sizes
    Can anyone point me to a source that talks about table sizes reasonable for Q-table learning? I see comments from the experience of people implementing it but I want to cite an academic source that talks about it. I used a Q-table in my work and the size is reasonable but I need to cite a source to support my argument. And all the papers I see only talk about curse of dimensionality and move on to deep neural nets discussion. submitted by /u/Simple-Soil-230 [link] [comments]  ( 85 min )
    Inverted pendulum: How to weight the features?
    The game state of the inverted pendulum problem consists of four variables: cart position, cart velocity, pole angle and pole velocity. To determine the cost of the current state, the variables have to be aggregated into a single evaluation function. The problem is that each feature can be weighted differently. So the question is whether the cart's position is more important than the pole's angle. submitted by /u/ManuelRodriguez331 [link] [comments]  ( 85 min )
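One common way to aggregate the four state variables into a single cost is a weighted sum of squares. The sketch below only illustrates that shape: the `PendulumState` class, the `cost` function, and especially the weight values are hypothetical placeholders, not tuned or recommended settings.

```python
from dataclasses import dataclass


@dataclass
class PendulumState:
    cart_pos: float    # metres from the track centre
    cart_vel: float    # metres per second
    pole_angle: float  # radians from upright
    pole_vel: float    # radians per second


# Hypothetical weights: here the pole angle is penalised more than the cart position.
WEIGHTS = {"cart_pos": 1.0, "cart_vel": 0.1, "pole_angle": 10.0, "pole_vel": 0.1}


def cost(state: PendulumState, weights=WEIGHTS) -> float:
    """Weighted sum of squared state variables; zero only at the upright, centred rest state."""
    return (
        weights["cart_pos"] * state.cart_pos ** 2
        + weights["cart_vel"] * state.cart_vel ** 2
        + weights["pole_angle"] * state.pole_angle ** 2
        + weights["pole_vel"] * state.pole_vel ** 2
    )


print(cost(PendulumState(0.1, 0.0, 0.0, 0.0)))  # cart displaced, pole upright
print(cost(PendulumState(0.0, 0.0, 0.1, 0.0)))  # cart centred, pole tilted
```

With these placeholder weights, a 0.1 rad tilt costs ten times as much as a 0.1 m cart offset. Which ratio is appropriate depends on the units and on what the controller is meant to prioritise, so the weights are a design choice rather than something the problem fixes.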
    Continuous action probability calculation in policy gradient
    Hi, I wonder how we can treat the y value of a Gaussian distribution at an action (x) as a probability. I understand that the y value of a pdf is not a probability, and that only its integral over an interval is a probability. Can someone explain this? Thanks

        import math

        import torch
        from torch.distributions import Normal

        mean, std = 0, 1
        dist = Normal(mean, std)
        sample = torch.tensor(0)
        logprob = dist.log_prob(sample)
        print(logprob.exp())  # tensor(0.3989)

        def normpdf(x, mean, sd):
            # https://stackoverflow.com/questions/12412895/how-to-calculate-probability-in-a-normal-distribution-given-mean-standard-devi
            var = float(sd) ** 2
            denom = (2 * math.pi * var) ** 0.5
            num = math.exp(-(float(x) - float(mean)) ** 2 / (2 * var))
            return num / denom

        print(normpdf(sample, mean, std))  # approximately 0.3989

    submitted by /u/Spiritual_Fig3632 [link] [comments]  ( 83 min )
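The distinction raised in the post above can be checked numerically. The following standard-library sketch (function names are chosen for this example) compares the density at a point with the probability of a small interval around it, and shows that a density can exceed 1, which a probability cannot.

```python
import math


def normal_pdf(x: float, mean: float = 0.0, std: float = 1.0) -> float:
    """Density of N(mean, std^2) at x. This is a density, not a probability."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def normal_cdf(x: float, mean: float = 0.0, std: float = 1.0) -> float:
    """Cumulative distribution function of N(mean, std^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def interval_prob(a: float, b: float, mean: float = 0.0, std: float = 1.0) -> float:
    """Probability that a sample falls in [a, b]."""
    return normal_cdf(b, mean, std) - normal_cdf(a, mean, std)


density = normal_pdf(0.0)                      # the value the post computes
width = 0.02
exact = interval_prob(-width / 2, width / 2)   # true probability of a small interval
approx = density * width                       # density times interval width
narrow = normal_pdf(0.0, std=0.1)              # density of a narrow Gaussian at its mean

print(density, exact, approx, narrow)
```

For a small interval, the probability is approximately the density times the width. In a policy-gradient loss only the gradient of the log-density with respect to the policy parameters is needed, and since log(density × width) = log(density) + log(width) with the width independent of those parameters, the width term contributes nothing to the gradient. That is one way to see why `log_prob` of a continuous distribution can stand in for a log-probability there.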
  • Open

    The Metaverse Goes Industrial: Siemens, NVIDIA Extend Partnership to Bring Digital Twins Within Easy Reach
    Silicon Valley magic met Wednesday with 175 years of industrial technology leadership as Siemens CEO Roland Busch and NVIDIA Founder and CEO Jensen Huang shared their vision for an “industrial metaverse” at the launch of the Siemens Xcelerator business platform in Munich. “When we combine the real and digital worlds we can achieve new levels […] The post The Metaverse Goes Industrial: Siemens, NVIDIA Extend Partnership to Bring Digital Twins Within Easy Reach appeared first on NVIDIA Blog.  ( 8 min )
    NVIDIA, Partners Show Leading AI Performance and Versatility in MLPerf
    NVIDIA and its partners continued to provide the best overall AI training performance and the most submissions across all benchmarks with 90% of all entries coming from the ecosystem, according to MLPerf benchmarks released today. The NVIDIA AI platform covered all eight benchmarks in the MLPerf Training 2.0 round, highlighting its leading versatility. No other […] The post NVIDIA, Partners Show Leading AI Performance and Versatility in MLPerf appeared first on NVIDIA Blog.  ( 7 min )
    NVIDIA Studio Driver Elevates Creative Workflows in Blender 3.2, BorisFX Sapphire and Topaz Denoise AI
    The June NVIDIA Studio Driver is available for download today, optimizing the latest creative app updates, all with the stability and reliability that users count on. Creators with NVIDIA RTX GPUs will benefit from faster performance and new features within Blender version 3.2, BorisFX Sapphire release 2022.5 and Topaz Denoise AI 3.7.0. The post NVIDIA Studio Driver Elevates Creative Workflows in Blender 3.2, BorisFX Sapphire and Topaz Denoise AI appeared first on NVIDIA Blog.  ( 7 min )
  • Open

    Brain Power Level AI Supercomputer Has 174 Trillion Parameters | AI Robot Arm Learns With Vision | Vista 2.0 For Autonomous Vehicles | Brain Tumor Detection AI Better Than Humans
    submitted by /u/tohelpyou88 [link] [comments]  ( 83 min )
    A Step-by-Step Walkthrough Neural Networks for Time-series Forecasting
    submitted by /u/lucapiccinelli [link] [comments]  ( 83 min )
  • Open

    Top 12 Logistics Technological Trends to Watch Out in 2022
    Over the past two decades, fast-evolving technology, growing customer expectations, and implementation of new business models have… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 17 min )
  • Open

    Introducing the Microsoft Climate Research Initiative
    Addressing and mitigating the effects of climate change requires a collective effort, bringing our strengths to bear across industry, government, academia, and civil society. The post Introducing the Microsoft Climate Research Initiative appeared first on Microsoft Research.  ( 10 min )
  • Open

    Definitive Guide: An Insight Look At PHP Workers
    Have you ever browsed through your favorite online ecommerce site and, as you were checking out, ended up with a 504 error after a delay? Or perhaps you were browsing your favorite sports site, and as you attempt to load another page, it takes a while to load back with a timeout error? These situations… The post Definitive Guide: An Insight Look At PHP Workers appeared first on Data Science Central.  ( 20 min )
    DSC Weekly 28 June 2022: Strokes, AI and Cognition
    Regular readers may have noticed that DSC Weekly didn’t come out last week. The reason was personal – a close relative of mine had a series of strokes over the last couple of weeks, and I needed to take some time away to deal with the consequences. In addition, we migrated over to a new… The post DSC Weekly 28 June 2022: Strokes, AI and Cognition appeared first on Data Science Central.  ( 20 min )
  • Open

    Long Range Language Modeling via Gated State Spaces. (arXiv:2206.13947v1 [cs.LG])
    State space models have been shown to be effective at modeling long range dependencies, especially on sequence classification tasks. In this work we focus on autoregressive sequence modeling over English books, Github source code and ArXiv mathematics articles. Based on recent developments around the effectiveness of gated activation functions, we propose a new layer named Gated State Space (GSS) and show that it trains significantly faster than the diagonal version of S4 (i.e. DSS) on TPUs, is fairly competitive with several well-tuned Transformer-based baselines and exhibits zero-shot generalization to longer inputs while being straightforward to implement. Finally, we show that leveraging self-attention to model local dependencies improves the performance of GSS even further.
    Empirical Study of Quality Image Assessment for Synthesis of Fetal Head Ultrasound Imaging with DCGANs. (arXiv:2206.01731v2 [eess.IV] UPDATED)
    In this work, we present an empirical study of DCGANs, including hyperparameter heuristics and image quality assessment, as a way to address the scarcity of datasets to investigate fetal head ultrasound. We present experiments to show the impact of different image resolutions, epochs, dataset size input, and learning rates for quality image assessment on four metrics: mutual information (MI), Fr\'echet inception distance (FID), peak-signal-to-noise ratio (PSNR), and local binary pattern vector (LBPv). The results show that FID and LBPv have a stronger relationship with clinical image quality scores. The resources to reproduce this work are available at \url{https://github.com/budai4medtech/miua2022}.
    A View Independent Classification Framework for Yoga Postures. (arXiv:2206.13577v1 [cs.CV])
    Yoga is a globally acclaimed and widely recommended practice for healthy living. Maintaining correct posture while performing a Yogasana is of utmost importance. In this work, we employ transfer learning from Human Pose Estimation models for extracting 136 key-points spread all over the body to train a Random Forest classifier which is used for estimation of the Yogasanas. The results are evaluated on an in-house collected extensive yoga video database of 51 subjects recorded from 4 different camera angles. We propose a 3-step scheme for evaluating the generalizability of a Yoga classifier by testing it on 1) unseen frames, 2) unseen subjects, and 3) unseen camera angles. We argue that for most applications, validation accuracies on unseen subjects and unseen camera angles would be most important. We empirically analyze, over three public datasets, the advantage of transfer learning and the possibilities of target leakage. We further demonstrate that the classification accuracies critically depend on the cross validation method employed and can often be misleading. To promote further research, we have made the key-points dataset and code publicly available.
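    The difference between testing on unseen frames and testing on unseen subjects or camera angles can be illustrated with a group-wise split. The following is a minimal sketch, not the authors' code: the helper `leave_one_group_out`, the field names, and the toy frame index are all hypothetical.

```python
from collections import defaultdict


def leave_one_group_out(samples, key):
    """Yield (held_out_value, train, test) splits where `test` holds every
    sample whose `key` field equals the held-out value."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample)
    for held_out in sorted(groups):
        test = groups[held_out]
        train = [s for s in samples if s[key] != held_out]
        yield held_out, train, test


# Hypothetical toy index: 3 subjects x 2 camera angles x 2 frames each.
frames = [
    {"subject": s, "camera": c, "frame": f}
    for s in range(3) for c in range(2) for f in range(2)
]

# Unseen-subject and unseen-camera evaluation: no group value is shared
# between the training and test portions of any split.
subject_splits = list(leave_one_group_out(frames, "subject"))
camera_splits = list(leave_one_group_out(frames, "camera"))
```

    A plain random split over frames would place frames of the same subject on both sides, which is one way the target leakage mentioned in the abstract can arise.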
    Accurate and fast identification of minimally prepared bacteria phenotypes using Raman spectroscopy assisted by machine learning. (arXiv:2206.13933v1 [cs.LG])
    The worldwide increase of antimicrobial resistance (AMR) is a serious threat to human health. To avert the spread of AMR, fast, reliable diagnostic tools that facilitate optimal antibiotic stewardship are an unmet need. In this regard, Raman spectroscopy promises rapid label- and culture-free identification and antimicrobial susceptibility testing (AST) in a single step. However, even though many Raman-based bacteria-identification and AST studies have demonstrated impressive results, some shortcomings must be addressed. To bridge the gap between proof-of-concept studies and clinical application, we have developed machine learning techniques in combination with a novel data-augmentation algorithm, for fast identification of minimally prepared bacteria phenotypes and the distinction of methicillin-resistant (MR) from methicillin-susceptible (MS) bacteria. For this we have implemented a spectral transformer model for hyper-spectral Raman images of bacteria. We show that our model outperforms the standard convolutional neural network models on a multitude of classification problems, both in terms of accuracy and in terms of training time. We attain more than 96% classification accuracy on a dataset consisting of 15 different classes and 95.6% classification accuracy for six MR-MS bacteria species. More importantly, our results are obtained using only fast and easy-to-produce training and test data
    Secure Distributed Training at Scale. (arXiv:2106.11257v3 [cs.LG] UPDATED)
    Many areas of deep learning benefit from using increasingly larger neural networks trained on public data, as is the case for pre-trained models for NLP and computer vision. Training such models requires a lot of computational resources (e.g., HPC clusters) that are not available to small research groups and independent researchers. One way to address this is for several smaller groups to pool their computational resources together and train a model that benefits all participants. Unfortunately, in this case, any participant can jeopardize the entire training run by sending incorrect updates, deliberately or by mistake. Training in the presence of such peers requires specialized distributed training algorithms with Byzantine tolerance. These algorithms often sacrifice efficiency by introducing redundant communication or passing all updates through a trusted server, making it infeasible to apply them to large-scale deep learning, where models can have billions of parameters. In this work, we propose a novel protocol for secure (Byzantine-tolerant) decentralized training that emphasizes communication efficiency.
    SkipNode: On Alleviating Over-smoothing for Deep Graph Convolutional Networks. (arXiv:2112.11628v2 [cs.LG] UPDATED)
    Over-smoothing is a challenging problem, which degrades the performance of deep graph convolutional networks (GCNs). However, existing studies for alleviating the over-smoothing problem lack either generality or effectiveness. In this paper, we analyze the underlying issues behind the over-smoothing problem, i.e., feature-diversity degeneration, gradient vanishing, and model weights over-decaying. Inspired by this, we propose a simple yet effective plug-and-play module, SkipNode, to alleviate over-smoothing. Specifically, for each middle layer of a GCN model, SkipNode randomly (or based on node degree) selects nodes to skip the convolutional operation by directly feeding their input features to the nonlinear function. Analytically, 1) skipping the convolutional operation prevents the features from losing diversity; and 2) the "skipped" nodes enable gradients to be directly passed back, thus mitigating the gradient vanishing and model weights over-decaying issues. To demonstrate the superiority of SkipNode, we conduct extensive experiments on nine popular datasets, including both homophilic and heterophilic graphs, with different graph sizes on two typical tasks: node classification and link prediction. Specifically, 1) SkipNode has strong generalizability of being applied to various GCN-based models on different datasets and tasks; and 2) SkipNode outperforms recent state-of-the-art anti-over-smoothing plug-and-play modules, i.e., DropEdge and DropNode, in different settings. Code will be made publicly available on GitHub.
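    The skipping mechanism described above can be sketched for a single layer as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the uniform sampling with a `skip_rate` parameter, the ReLU nonlinearity, and the requirement that the layer keeps the feature width unchanged are all choices made here for the example.

```python
import random


def relu(row):
    return [max(0.0, v) for v in row]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer_with_skipnode(adj_norm, features, weight, skip_rate, rng):
    """One GCN layer in which a random subset of nodes skips the convolution.

    Skipped nodes feed their input features straight to the nonlinearity,
    so `weight` must be square (same input and output width).
    Returns the layer output and the per-node skip mask.
    """
    convolved = matmul(matmul(adj_norm, features), weight)
    out, skipped = [], []
    for i, row in enumerate(features):
        skip = rng.random() < skip_rate  # uniform sampling (assumption)
        skipped.append(skip)
        out.append(relu(row if skip else convolved[i]))
    return out, skipped


rng = random.Random(0)
adj_norm = [[0.5, 0.5], [0.5, 0.5]]   # toy normalized adjacency
features = [[1.0, -2.0], [3.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]   # toy weight matrix

all_skipped, _ = gcn_layer_with_skipnode(adj_norm, features, identity, 1.0, rng)
none_skipped, _ = gcn_layer_with_skipnode(adj_norm, features, identity, 0.0, rng)
```

    With every node skipped the rows stay distinct, while the fully convolved output on this toy graph collapses both nodes to the same vector, which is the loss of feature diversity the abstract refers to.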
    Critical Investigation of Failure Modes in Physics-informed Neural Networks. (arXiv:2206.09961v2 [cs.LG] UPDATED)
    Several recent works in scientific machine learning have revived interest in the application of neural networks to partial differential equations (PDEs). A popular approach is to aggregate the residual form of the governing PDE and its boundary conditions as soft penalties into a composite objective/loss function for training neural networks, which is commonly referred to as physics-informed neural networks (PINNs). In the present study, we visualize the loss landscapes and distributions of learned parameters and explain the ways this particular formulation of the objective function may hinder or even prevent convergence when dealing with challenging target solutions. We construct a purely data-driven loss function composed of both the boundary loss and the domain loss. Using this data-driven loss function and, separately, a physics-informed loss function, we then train two neural network models with the same architecture. We show that incomparable scales between boundary and domain loss terms are the culprit behind the poor performance. Additionally, we assess the performance of both approaches on two elliptic problems with increasingly complex target solutions. Based on our analysis of their loss landscapes and learned parameter distributions, we observe that a physics-informed neural network with a composite objective function formulation produces highly non-convex loss surfaces that are difficult to optimize and are more prone to the problem of vanishing gradients.
    Personalized Keyword Spotting through Multi-task Learning. (arXiv:2206.13708v1 [cs.SD])
    Keyword spotting (KWS) plays an essential role in enabling speech-based user interaction on smart devices, and conventional KWS (C-KWS) approaches have concentrated on detecting user-agnostic pre-defined keywords. However, in practice, most user interactions come from target users enrolled in the device, which motivates constructing personalized keyword spotting. We design two personalized KWS tasks: (1) Target user Biased KWS (TB-KWS) and (2) Target user Only KWS (TO-KWS). To solve the tasks, we propose personalized keyword spotting through multi-task learning (PK-MTL) that consists of multi-task learning and task-adaptation. First, we introduce applying multi-task learning on keyword spotting and speaker verification to leverage user information in the keyword spotting system. Next, we design task-specific scoring functions to adapt to the personalized KWS tasks thoroughly. We evaluate our framework on conventional and personalized scenarios, and the results show that PK-MTL can dramatically reduce the false alarm rate, especially in various practical scenarios.
    Increasing Confidence in Adversarial Robustness Evaluations. (arXiv:2206.13991v1 [cs.LG])
    Hundreds of defenses have been proposed to make deep neural networks robust against minimal (adversarial) input perturbations. However, only a handful of these defenses held up their claims because correctly evaluating robustness is extremely challenging: weak attacks often fail to find adversarial examples even when they exist, thereby making a vulnerable network look robust. In this paper, we propose a test to identify weak attacks, and thus weak defense evaluations. Our test slightly modifies a neural network to guarantee the existence of an adversarial example for every sample. Consequently, any correct attack must succeed in breaking this modified network. For eleven out of thirteen previously-published defenses, the original evaluation of the defense fails our test, while stronger attacks that break these defenses pass it. We hope that attack unit tests such as ours will be a major component in future robustness evaluations and increase confidence in an empirical field that is currently riddled with skepticism.
    Learning the Solution Operator of Boundary Value Problems using Graph Neural Networks. (arXiv:2206.14092v1 [cs.LG])
    As an alternative to classical numerical solvers for partial differential equations (PDEs) subject to boundary value constraints, there has been a surge of interest in investigating neural networks that can solve such problems efficiently. In this work, we design a general solution operator for two different time-independent PDEs using graph neural networks (GNNs) and spectral graph convolutions. We train the networks on simulated data from a finite element solver on a variety of shapes and inhomogeneities. In contrast to previous works, we focus on the ability of the trained operator to generalize to previously unseen scenarios. Specifically, we test generalization to meshes with different shapes and superposition of solutions for a different number of inhomogeneities. We find that training on a diverse dataset with lots of variation in the finite element meshes is a key ingredient for achieving good generalization results in all cases. With this, we believe that GNNs can be used to learn solution operators that generalize over a range of properties and produce solutions much faster than a generic solver. Our dataset, which we make publicly available, can be used and extended to verify the robustness of these models under varying conditions.
    How to Steer Your Adversary: Targeted and Efficient Model Stealing Defenses with Gradient Redirection. (arXiv:2206.14157v1 [cs.LG])
    Model stealing attacks present a dilemma for public machine learning APIs. To protect financial investments, companies may be forced to withhold important information about their models that could facilitate theft, including uncertainty estimates and prediction explanations. This compromise is harmful not only to users but also to external transparency. Model stealing defenses seek to resolve this dilemma by making models harder to steal while preserving utility for benign users. However, existing defenses have poor performance in practice, either requiring enormous computational overheads or severe utility trade-offs. To meet these challenges, we present a new approach to model stealing defenses called gradient redirection. At the core of our approach is a provably optimal, efficient algorithm for steering an adversary's training updates in a targeted manner. Combined with improvements to surrogate networks and a novel coordinated defense strategy, our gradient redirection defense, called GRAD$^2$, achieves small utility trade-offs and low computational overhead, outperforming the best prior defenses. Moreover, we demonstrate how gradient redirection enables reprogramming the adversary with arbitrary behavior, which we hope will foster work on new avenues of defense.
    Cost-Efficient Distributed Learning via Combinatorial Multi-Armed Bandits. (arXiv:2202.08302v2 [cs.IT] UPDATED)
    We consider the distributed SGD problem, where a main node distributes gradient calculations among $n$ workers. By assigning tasks to all the workers and waiting only for the $k$ fastest ones, the main node can trade off the algorithm's error against its runtime by gradually increasing $k$ as the algorithm evolves. However, this strategy, referred to as adaptive $k$-sync, neglects the cost of unused computations and of communicating models to workers that exhibit straggling behavior. We propose a cost-efficient scheme that assigns tasks only to $k$ workers, and gradually increases $k$. We introduce the use of a combinatorial multi-armed bandit model to learn which workers are the fastest while assigning gradient calculations. Assuming workers with exponentially distributed response times parameterized by different means, we give empirical and theoretical guarantees on the regret of our strategy, i.e., the extra time spent to learn the mean response times of the workers. Furthermore, we propose and analyze a strategy applicable to a large class of response time distributions. Compared to adaptive $k$-sync, our scheme achieves significantly lower errors with the same computational efforts and less downlink communication while being inferior in terms of speed.
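    The runtime side of the trade-off in $k$-sync can be made concrete with a small calculation. The sketch below simplifies the setting of the abstract by assuming identically distributed workers (the paper considers different means); under that assumption the expected wait for the $k$ fastest of $n$ exponential response times is the standard order-statistic result, mean times the sum of $1/i$ for $i = n-k+1, \dots, n$. The function name is hypothetical.

```python
def expected_kth_fastest(n, k, mean):
    """Expected time until the k fastest of n workers have responded,
    for i.i.d. exponential response times with the given mean
    (a simplifying assumption; the paper allows different means)."""
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    return mean * sum(1.0 / i for i in range(n - k + 1, n + 1))


# Expected per-iteration wait for n = 10 workers as k grows from 1 to 10.
waits = [expected_kth_fastest(10, k, 1.0) for k in range(1, 11)]
```

    The wait grows with $k$, slowly at first and sharply as $k$ approaches $n$, which is why waiting for only the fastest workers early on and increasing $k$ later can save time.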
    Continual Learning with Transformers for Image Classification. (arXiv:2206.14085v1 [cs.LG])
    In many real-world scenarios, data to train machine learning models become available over time. However, neural network models struggle to continually learn new concepts without forgetting what has been learnt in the past. This phenomenon is known as catastrophic forgetting and it is often difficult to prevent due to practical constraints, such as the amount of data that can be stored or the limited computation resources that can be used. Moreover, training large neural networks, such as Transformers, from scratch is very costly and requires a vast amount of training data, which might not be available in the application domain of interest. A recent trend indicates that dynamic architectures based on an expansion of the parameters can reduce catastrophic forgetting efficiently in continual learning, but this needs complex tuning to balance the growing number of parameters and barely shares any information across tasks. As a result, they struggle to scale to a large number of tasks without significant overhead. In this paper, we validate in the computer vision domain a recent solution called Adaptive Distillation of Adapters (ADA), which was developed to perform continual learning using pre-trained Transformers and Adapters on text classification tasks. We empirically demonstrate on different classification tasks that this method maintains a good predictive performance without retraining the model or increasing the number of model parameters over time. Besides, it is significantly faster at inference time compared to the state-of-the-art methods.
    Constrained Learning with Non-Convex Losses. (arXiv:2103.05134v4 [cs.LG] UPDATED)
    Though learning has become a core component of modern information processing, there is now ample evidence that it can lead to biased, unsafe, and prejudiced systems. The need to impose requirements on learning is therefore paramount, especially as it reaches critical applications in social, industrial, and medical domains. However, the non-convexity of most modern statistical problems is only exacerbated by the introduction of constraints. Whereas good unconstrained solutions can often be learned using empirical risk minimization, even obtaining a model that satisfies statistical constraints can be challenging, let alone a good one. In this paper, we overcome this issue by learning in the empirical dual domain, where constrained statistical learning problems become unconstrained and deterministic. We analyze the generalization properties of this approach by bounding the empirical duality gap, i.e., the difference between our approximate, tractable solution and the solution of the original (non-convex) statistical problem, and provide a practical constrained learning algorithm. These results establish a constrained counterpart to classical learning theory, enabling the explicit use of constraints in learning. We illustrate this theory and algorithm in rate-constrained learning applications arising in fairness and adversarial robustness.
    Let Users Decide: Navigating the Trade-offs between Costs and Robustness in Algorithmic Recourse. (arXiv:2203.06768v2 [cs.LG] UPDATED)
    As machine learning (ML) models are increasingly being employed to make consequential decisions, there has been a growing interest in developing techniques which can provide recourse to affected individuals. The majority of these techniques provide recourse under the assumption that the affected individuals will implement the prescribed recourses exactly. However, recourses often get implemented in a noisy and inconsistent manner for a variety of reasons; e.g., an individual who was asked to increase their salary by \$500 may get a promotion which comes with a raise of \$505. Motivated by this, we study the problem of recourse invalidation in the face of noisy human responses. More specifically, we theoretically and empirically analyze the behavior of state-of-the-art algorithms, and demonstrate that the recourses generated by these algorithms are very likely to be invalidated (i.e., result in negative outcomes) if small changes are made to them. We further propose a novel framework, EXPECTing noisy responses (EXPECT), which addresses the aforementioned problem by explicitly minimizing the probability of recourse invalidation in the face of noisy responses. Our framework can ensure that the resulting recourses are invalidated at most $r \%$ of the time, where $r$ is provided as input by the end user requesting recourse. By doing so, our framework provides end users with greater control in navigating the trade-offs between recourse costs and robustness to noisy responses. Experimental evaluation with multiple real world datasets demonstrates the efficacy of the proposed framework, and validates our theoretical findings.
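    The notion of a recourse being invalidated at most $r$% of the time can be illustrated in a setting where the probability has a closed form. The sketch below is not the EXPECT algorithm; it assumes a linear classifier that accepts when $w \cdot x + b \ge 0$ and isotropic Gaussian noise on the implemented recourse, both of which are assumptions made here for the example, as are the function names.

```python
import math


def invalidation_probability(weights, bias, recourse, sigma):
    """P(w.(x + eps) + b < 0) for eps ~ N(0, sigma^2 I): the chance that a
    noisy implementation of `recourse` lands on the negative side of an
    assumed linear decision boundary."""
    score = sum(w * x for w, x in zip(weights, recourse)) + bias
    norm = math.sqrt(sum(w * w for w in weights))
    z = -score / (sigma * norm)
    # Standard normal CDF evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def meets_target(weights, bias, recourse, sigma, r_percent):
    """True if the recourse is invalidated at most r_percent of the time."""
    return invalidation_probability(weights, bias, recourse, sigma) <= r_percent / 100.0
```

    A recourse that sits exactly on the boundary is invalidated half the time under this noise model, so meeting a small $r$ requires moving further past the boundary, which is the cost versus robustness trade-off the abstract describes.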
    Detecting Arbitrary Order Beneficial Feature Interactions for Recommender Systems. (arXiv:2206.13764v1 [cs.IR])
    Detecting beneficial feature interactions is essential in recommender systems, and existing approaches achieve this by examining all the possible feature interactions. However, the cost of examining all the possible higher-order feature interactions is prohibitive (it grows exponentially with the order). Hence, existing approaches only detect beneficial feature interactions of limited order (e.g., combinations of up to four features), which may miss beneficial feature interactions with orders higher than the limit. In this paper, we propose a hypergraph neural network based model named HIRS. HIRS is the first work that directly generates beneficial feature interactions of arbitrary orders and makes recommendation predictions accordingly. The number of generated feature interactions can be specified to be much smaller than the number of all the possible interactions and hence, our model admits a much lower running time. To achieve an effective algorithm, we exploit three properties of beneficial feature interactions, and propose deep-infomax-based methods to guide the interaction generation. Our experimental results show that HIRS outperforms state-of-the-art algorithms by up to 5% in terms of recommendation accuracy.
    Learning Variable Impedance Control for Aerial Sliding on Uneven Heterogeneous Surfaces by Proprioceptive and Tactile Sensing. (arXiv:2206.14122v1 [cs.RO])
    The recent development of novel aerial vehicles capable of physically interacting with the environment leads to new applications such as contact-based inspection. These tasks require the robotic system to exchange forces with partially-known environments, which may contain uncertainties including unknown spatially-varying friction properties and discontinuous variations of the surface geometry. Finding a control strategy that is robust against these environmental uncertainties remains an open challenge. This paper presents a learning-based adaptive control strategy for aerial sliding tasks. In particular, the gains of a standard impedance controller are adjusted in real-time by a policy based on the current control signals, proprioceptive measurements, and tactile sensing. This policy is trained in simulation with simplified actuator dynamics in a student-teacher learning setup. The real-world performance of the proposed approach is verified using a tilt-arm omnidirectional flying vehicle. The proposed controller structure combines data-driven and model-based control methods, enabling our approach to successfully transfer directly and without adaptation from simulation to the real platform. Compared to fine-tuned state-of-the-art interaction control methods, we achieve reduced tracking error and improved disturbance rejection.
    MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation. (arXiv:2111.12707v4 [cs.CV] UPDATED)
    Estimating 3D human poses from monocular videos is a challenging task due to depth ambiguity and self-occlusion. Most existing works attempt to solve both issues by exploiting spatial and temporal relationships. However, those works ignore the fact that it is an inverse problem where multiple feasible solutions (i.e., hypotheses) exist. To relieve this limitation, we propose a Multi-Hypothesis Transformer (MHFormer) that learns spatio-temporal representations of multiple plausible pose hypotheses. In order to effectively model multi-hypothesis dependencies and build strong relationships across hypothesis features, the task is decomposed into three stages: (i) Generate multiple initial hypothesis representations; (ii) Model self-hypothesis communication, merge multiple hypotheses into a single converged representation and then partition it into several diverged hypotheses; (iii) Learn cross-hypothesis communication and aggregate the multi-hypothesis features to synthesize the final 3D pose. Through the above processes, the final representation is enhanced and the synthesized pose is much more accurate. Extensive experiments show that MHFormer achieves state-of-the-art results on two challenging datasets: Human3.6M and MPI-INF-3DHP. Without bells and whistles, its performance surpasses the previous best result by a large margin of 3% on Human3.6M. Code and models are available at https://github.com/Vegetebird/MHFormer.
    Deep Learning-Based Defect Classification and Detection in SEM Images. (arXiv:2206.13505v1 [eess.IV])
    This paper proposes a novel ensemble deep learning-based model to accurately classify, detect and localize different defect categories for aggressive pitches and thin resists (High NA applications). In particular, we train RetinaNet models using different ResNet and VGGNet architectures as backbone and present a comparison of the accuracies of these models and their performance analysis on SEM images with different types of defect patterns such as bridge, break and line collapses. Finally, we propose a preference-based ensemble strategy to combine the output predictions from different models in order to achieve better performance on classification and detection of defects. As CDSEM images inherently contain a significant level of noise, detailed feature information is often shadowed by noise. For certain resist profiles, the challenge is also to differentiate between a microbridge, footing, break, and zones of probable breaks. Therefore, we have applied an unsupervised machine learning model to denoise the SEM images to remove the false-positive defects and optimize the effect of stochastic noise on structured pixels for better metrology and enhanced defect inspection. We repeated the defect inspection step with the same trained model and performed a comparative analysis of the "robustness" and "accuracy" metrics against the conventional approach for both noisy and denoised image pairs. The proposed ensemble method demonstrates improvement of the mean average precision (mAP) metric on the most difficult defect classes. In this work we have developed a novel robust supervised deep learning training scheme to accurately classify as well as localize different defect types in SEM images with a high degree of accuracy. Our proposed approach demonstrates its effectiveness both quantitatively and qualitatively.
    Graph Condensation via Receptive Field Distribution Matching. (arXiv:2206.13697v1 [cs.LG])
    Graph neural networks (GNNs) enable the analysis of graphs using deep learning, with promising results in capturing structured information in graphs. This paper focuses on creating a small graph to represent the original graph, so that GNNs trained on the size-reduced graph can make accurate predictions. We view the original graph as a distribution of receptive fields and aim to synthesize a small graph whose receptive fields share a similar distribution. Thus, we propose Graph Condensation via Receptive Field Distribution Matching (GCDM), which is accomplished by optimizing the synthetic graph through the use of a distribution matching loss quantified by maximum mean discrepancy (MMD). Additionally, we demonstrate that the synthetic graph generated by GCDM is highly generalizable to a variety of models in the evaluation phase and that the condensing speed is significantly improved using this framework.
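    Maximum mean discrepancy, the quantity used above as the distribution matching loss, can be estimated directly from two sets of vectors. The sketch below is a generic biased estimator with an RBF kernel, not the GCDM loss itself; the kernel choice, the `gamma` parameter, and the toy vectors standing in for receptive-field embeddings are assumptions.

```python
import math


def rbf(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two samples with an RBF kernel."""
    def mean_kernel(a, b):
        return sum(rbf(p, q, gamma) for p in a for q in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy stand-ins for embeddings from an original and a synthetic graph.
original = [[0.0, 0.0], [1.0, 0.0]]
synthetic_far = [[5.0, 5.0], [6.0, 5.0]]
```

    The estimate is zero when the two samples coincide and grows as they move apart, so minimizing it over the synthetic graph pulls the two distributions together.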
    Risk Perspective Exploration in Distributional Reinforcement Learning. (arXiv:2206.14170v1 [cs.LG])
    Distributional reinforcement learning demonstrates state-of-the-art performance in continuous and discrete control settings with the features of variance and risk, which can be used to explore. However, exploration methods that employ the risk property are hard to find, although numerous exploration methods in Distributional RL employ the variance of the return distribution per action. In this paper, we present risk scheduling approaches that explore risk levels and optimistic behaviors from a risk perspective. We demonstrate the performance enhancement of the DMIX algorithm using risk scheduling in a multi-agent setting with comprehensive experiments.
    Molecular Geometry Pretraining with SE(3)-Invariant Denoising Distance Matching. (arXiv:2206.13602v1 [cs.LG])
    Pretraining molecular representations is critical in a variety of applications in drug and material discovery due to the limited number of labeled molecules, yet most existing work focuses on pretraining on 2D molecular graphs. The power of pretraining on 3D geometric structures, however, has been less explored, owing to the difficulty of finding a sufficient proxy task to empower the pretraining to effectively extract essential features from the geometric structures. Motivated by the dynamic nature of 3D molecules, where the continuous motion of a molecule in the 3D Euclidean space forms a smooth potential energy surface, we propose a 3D coordinate denoising pretraining framework to model such an energy landscape. Leveraging an SE(3)-invariant score matching method, we propose SE(3)-DDM where the coordinate denoising proxy task is effectively boiled down to the denoising of the pairwise atomic distances in a molecule. Our comprehensive experiments confirm the effectiveness and robustness of our proposed method.
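    Working with pairwise atomic distances is attractive because they do not change under rotations and translations of the whole molecule. The sketch below checks this numerically for one rigid motion (a rotation about the z axis followed by a translation); it illustrates the invariance only and is not part of SE(3)-DDM. The helper names and the three-atom coordinates are hypothetical.

```python
import math


def pairwise_distances(coords):
    """Upper-triangular list of interatomic distances."""
    return [
        math.dist(coords[i], coords[j])
        for i in range(len(coords)) for j in range(i + 1, len(coords))
    ]


def rotate_z_and_translate(coords, angle, shift):
    """Apply a rotation about the z axis followed by a translation."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in coords
    ]


# Hypothetical three-atom geometry (arbitrary units).
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.5, 0.5)]
moved = rotate_z_and_translate(atoms, 0.7, (2.0, -1.0, 3.0))

before = pairwise_distances(atoms)
after = pairwise_distances(moved)
```

    The two distance lists agree up to floating-point error, whereas perturbing the coordinates of a single atom would change them, which is what makes distances a natural target for a denoising task.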
    Safe Exploration Incurs Nearly No Additional Sample Complexity for Reward-free RL. (arXiv:2206.14057v1 [cs.LG])
    While the primary goal of the exploration phase in reward-free reinforcement learning (RF-RL) is to reduce the uncertainty in the estimated model with a minimum number of trajectories, in practice, the agent often needs to abide by certain safety constraints at the same time. It remains unclear how such a safe exploration requirement would affect the corresponding sample complexity to achieve the desired optimality of the obtained policy in planning. In this work, we make a first attempt to answer this question. In particular, we consider the scenario where a safe baseline policy is known beforehand, and propose a unified Safe reWard-frEe ExploraTion (SWEET) framework. We then particularize the SWEET framework to the tabular and the low-rank MDP settings, and develop algorithms coined Tabular-SWEET and Low-rank-SWEET, respectively. Both algorithms leverage the concavity and continuity of the newly introduced truncated value functions, and are guaranteed to achieve zero constraint violation during exploration with high probability. Furthermore, both algorithms can provably find a near-optimal policy subject to any constraint in the planning phase. Remarkably, the sample complexities under both algorithms match or even outperform the state of the art in their constraint-free counterparts up to some constant factors, proving that the safety constraint hardly increases the sample complexity for RF-RL.
    Studying Generalization Through Data Averaging. (arXiv:2206.13669v1 [stat.ML])
    The generalization of machine learning models has a complex dependence on the data, model and learning algorithm. We study train and test performance, as well as the generalization gap given by the mean of their difference over different data set samples to understand their "typical" behavior. We derive an expression for the gap as a function of the covariance between the model parameter distribution and the train loss, and another expression for the average test performance, showing test generalization only depends on the data-averaged parameter distribution and the data-averaged loss. We show that for a large class of model parameter distributions a modified generalization gap is always non-negative. By specializing further to parameter distributions produced by stochastic gradient descent (SGD), along with a few approximations and modeling considerations, we are able to predict some aspects of how the generalization gap and model train and test performance vary as a function of SGD noise. We evaluate these predictions empirically on the CIFAR-10 classification task based on a ResNet architecture.
    Adaptive Multi-view Rule Discovery for Weakly-Supervised Compatible Products Prediction. (arXiv:2206.13749v1 [cs.LG])
    On e-commerce platforms, predicting if two products are compatible with each other is an important functionality to achieve trustworthy product recommendation and search experience for consumers. However, accurately predicting product compatibility is difficult due to the heterogeneous product data and the lack of manually curated training data. We study the problem of discovering effective labeling rules that can enable weakly-supervised product compatibility prediction. We develop AMRule, a multi-view rule discovery framework that can (1) adaptively and iteratively discover novel rules that can complement the current weakly-supervised model to improve compatibility prediction; (2) discover interpretable rules from both structured attribute tables and unstructured product descriptions. AMRule adaptively discovers labeling rules from large-error instances via a boosting-style strategy; the high-quality rules can remedy the current model's weak spots and refine the model iteratively. For rule discovery from structured product attributes, we generate composable high-order rules from decision trees; and for rule discovery from unstructured product descriptions, we generate prompt-based rules from a pre-trained language model. Experiments on 4 real-world datasets show that AMRule outperforms the baselines by 5.98% on average and improves rule quality and rule proposal efficiency.
    Supervised Learning with General Risk Functionals. (arXiv:2206.13648v1 [stat.ML])
    Standard uniform convergence results bound the generalization gap of the expected loss over a hypothesis class. The emergence of risk-sensitive learning requires generalization guarantees for functionals of the loss distribution beyond the expectation. While prior works specialize in uniform convergence of particular functionals, our work provides uniform convergence for a general class of Hölder risk functionals for which the closeness in the Cumulative Distribution Function (CDF) entails closeness in risk. We establish the first uniform convergence results for estimating the CDF of the loss distribution, yielding guarantees that hold simultaneously both over all Hölder risk functionals and over all hypotheses. Thus licensed to perform empirical risk minimization, we develop practical gradient-based methods for minimizing distortion risks (a widely studied subset of Hölder risks that subsumes the spectral risks, including the mean, conditional value at risk, cumulative prospect theory risks, and others) and provide convergence guarantees. In experiments, we demonstrate the efficacy of our learning procedure, both in settings where uniform convergence results hold and in high-dimensional settings with deep networks.
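To make the idea of a risk functional beyond the expectation concrete, the sketch below computes a plug-in estimate of conditional value at risk (CVaR) from a sample of losses. It is only an illustration, not the paper's gradient-based method: the function name `empirical_cvar`, the rounding rule for the tail size, and the sample losses are all assumptions made here.

```python
def empirical_cvar(losses, alpha):
    """Plug-in CVaR: the mean of the worst (1 - alpha) fraction of losses.

    alpha = 0 recovers the ordinary mean; larger alpha focuses on the tail.
    The tail-size rounding below is an illustrative choice (assumption).
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(losses)
    n = len(ordered)
    # Number of samples kept in the upper tail, never fewer than one.
    k = max(1, round(n * (1 - alpha)))
    tail = ordered[n - k:]
    return sum(tail) / k


# Hypothetical per-example losses, used only for demonstration.
losses = [0.1, 0.2, 0.3, 0.4, 1.0, 2.0, 0.5, 0.6, 0.7, 0.8]
mean_risk = empirical_cvar(losses, 0.0)  # average over all ten losses
cvar_80 = empirical_cvar(losses, 0.8)    # average over the two largest losses
```

Both quantities depend on the losses only through their empirical distribution, which is why closeness of the estimated CDF to the true CDF is the natural quantity to control.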
    An Expert System for Redesigning Software for Cloud Applications. (arXiv:2109.14569v3 [cs.LG] UPDATED)
    Cloud-based software has many advantages. When services are divided into many independent components, they are easier to update. Also, during peak demand, it is easier to scale cloud services (just hire more CPUs). Hence, many organizations are partitioning their monolithic enterprise applications into cloud-based microservices. Recently there has been much work using machine learning to simplify this partitioning task. Despite much research, no single partitioning method can be recommended as generally useful. More specifically, those prior solutions are "brittle"; i.e. if they work well for one kind of goal in one dataset, then they can be sub-optimal if applied to many datasets and multiple goals. In order to find a generally useful partitioning method, we propose DEEPLY. This new algorithm extends the CO-GCN deep learning partition generator with (a) a novel loss function and (b) some hyper-parameter optimization. As shown by our experiments, DEEPLY generally outperforms prior work (including CO-GCN, and others) across multiple datasets and goals. To the best of our knowledge, this is the first report in SE of such stable hyper-parameter optimization. To aid reuse of this work, DEEPLY is available on-line at https://bit.ly/2WhfFlB.
    Measuring and Clustering Network Attackers using Medium-Interaction Honeypots. (arXiv:2206.13614v1 [cs.CR])
    Network honeypots are often used by information security teams to measure the threat landscape in order to secure their networks. With the advancement of honeypot development, today's medium-interaction honeypots provide a way for security teams and researchers to deploy these active defense tools that require little maintenance on a variety of protocols. In this work, we deploy such honeypots on five different protocols on the public Internet and study the intent and sophistication of the attacks we observe. We then use the information gained to develop a clustering approach that identifies correlations in attacker behavior to discover IPs that are highly likely to be controlled by a single operator, illustrating the advantage of using these honeypots for data collection.
    Hamiltonian Monte Carlo Particle Swarm Optimizer. (arXiv:2206.14134v1 [cs.LG])
    We introduce the Hamiltonian Monte Carlo Particle Swarm Optimizer (HMC-PSO), an optimization algorithm that reaps the benefits of both Exponentially Averaged Momentum PSO and HMC sampling. The coupling of the position and velocity of each particle with Hamiltonian dynamics in the simulation allows for extensive freedom for exploration and exploitation of the search space. It also provides an excellent technique to explore highly non-convex functions while ensuring efficient sampling. We extend the method to approximate error gradients in closed form for Deep Neural Network (DNN) settings. We discuss possible methods of coupling and compare its performance to that of state-of-the-art optimizers on the Golomb's Ruler problem and Classification tasks.
    Benchopt: Reproducible, efficient and collaborative optimization benchmarks. (arXiv:2206.13424v2 [cs.LG] UPDATED)
    Numerical validation is at the core of machine learning research as it makes it possible to assess the actual impact of new methods, and to confirm the agreement between theory and practice. Yet, the rapid development of the field poses several challenges: researchers are confronted with a profusion of methods to compare, limited transparency and consensus on best practices, as well as tedious re-implementation work. As a result, validation is often very partial, which can lead to wrong conclusions that slow down the progress of research. We propose Benchopt, a collaborative framework to automate, reproduce and publish optimization benchmarks in machine learning across programming languages and hardware architectures. Benchopt simplifies benchmarking for the community by providing an off-the-shelf tool for running, sharing and extending experiments. To demonstrate its broad usability, we showcase benchmarks on three standard learning tasks: $\ell_2$-regularized logistic regression, Lasso, and ResNet18 training for image classification. These benchmarks highlight key practical findings that give a more nuanced view of the state-of-the-art for these problems, showing that for practical evaluation, the devil is in the details. We hope that Benchopt will foster collaborative work in the community, hence improving the reproducibility of research findings.
    Dummy Prototypical Networks for Few-Shot Open-Set Keyword Spotting. (arXiv:2206.13691v1 [cs.SD])
    Keyword spotting is the task of detecting a keyword in streaming audio. Conventional keyword spotting targets classification of predefined keywords, but there is growing attention to few-shot (query-by-example) keyword spotting, e.g., N-way classification given M-shot support samples. Moreover, in real-world scenarios, there can be utterances from unexpected categories (open-set) which need to be rejected rather than classified as one of the N classes. Combining the two needs, we tackle few-shot open-set keyword spotting with a new benchmark setting, named splitGSC. We propose episode-known dummy prototypes based on metric learning to detect an open-set better and introduce a simple and powerful approach, Dummy Prototypical Networks (D-ProtoNets). Our D-ProtoNets shows clear margins compared to recent few-shot open-set recognition (FSOSR) approaches in the suggested splitGSC. We also verify our method on a standard benchmark, miniImageNet, and D-ProtoNets shows the state-of-the-art open-set detection rate in FSOSR.
    A Proposed Bi-LSTM Method to Fake News Detection. (arXiv:2206.13982v1 [cs.CL])
    Recent years have seen an explosion in social media usage, allowing people to connect with others. Since their appearance, platforms such as Facebook and Twitter have influenced how we speak, think, and behave. The existence of fake news on these platforms undermines confidence in content. For instance, false news was a determining factor in influencing the outcome of the U.S. presidential election and other sites. Because this information is so harmful, it is essential to make sure we have the necessary tools to detect and resist it. We applied Bidirectional Long Short-Term Memory (Bi-LSTM) to determine whether news is false or real. A number of foreign websites and newspapers were used for data collection. After creating and running the model, the work achieved 84% model accuracy and a macro-F1 score of 62.0 with training data.
    Tensor Recovery Based on A Novel Non-convex Function Minimax Logarithmic Concave Penalty Function. (arXiv:2206.13506v1 [eess.IV])
    Non-convex relaxation methods have been widely used in tensor recovery problems and, compared with convex relaxation methods, can achieve better recovery results. In this paper, a new non-convex function, the Minimax Logarithmic Concave Penalty (MLCP) function, is proposed, and some of its intrinsic properties are analyzed, among which it is interesting to find that the Logarithmic function is an upper bound of the MLCP function. The proposed function is generalized to tensor cases, yielding tensor MLCP and weighted tensor $L\gamma$-norm. Because its explicit solution cannot be obtained when it is applied directly to the tensor recovery problem, the corresponding equivalence theorems for solving such problems are given, namely, the tensor equivalent MLCP theorem and the equivalent weighted tensor $L\gamma$-norm theorem. In addition, we propose two EMLCP-based models for classic tensor recovery problems, namely low-rank tensor completion (LRTC) and tensor robust principal component analysis (TRPCA), and design proximal alternate linearization minimization (PALM) algorithms to solve them individually. Furthermore, based on the Kurdyka-Łojasiewicz property, it is proved that the solution sequence of the proposed algorithm has finite length and converges globally to a critical point. Finally, extensive experiments show that the proposed algorithms achieve good results, and it is confirmed that the MLCP function is indeed better than the Logarithmic function in the minimization problem, which is consistent with the analysis of theoretical properties.
    Stochastic linear optimization never overfits with quadratically-bounded losses on general data. (arXiv:2202.06915v2 [cs.LG] UPDATED)
    This work provides test error bounds for iterative fixed point methods on linear predictors -- specifically, stochastic and batch mirror descent (MD), and stochastic temporal difference learning (TD) -- with two core contributions: (a) a single proof technique which gives high probability guarantees despite the absence of projections, regularization, or any equivalents, even when optima have large or infinite norm, for quadratically-bounded losses (e.g., providing unified treatment of squared and logistic losses); (b) locally-adapted rates which depend not on global problem structure (such as condition numbers and maximum margins), but rather on properties of low norm predictors which may suffer some small excess test error. The proof technique is an elementary and versatile coupling argument, and is demonstrated here in the following settings: stochastic MD under realizability; stochastic MD for general Markov data; batch MD for general IID data; stochastic MD on heavy-tailed data (still without projections); stochastic TD on Markov chains (all prior stochastic TD bounds are in expectation).
    Haul Road Mapping from GPS Traces. (arXiv:2206.13936v1 [cs.LG])
    Automation in mining requires accurate maps of road networks on site. Because roads on open-cut mines are dynamic in nature and continuously changing, manually updating road maps is tedious and error-prone. This paper investigates the possibility of automatically deriving an accurate representation of the road network using GPS data available from haul trucks operating on site. We present an overview of approaches proposed in literature and test the performance of publicly available methods on GPS data collected from trucks operating on site. Based on shortcomings seen in all tested algorithms, a post-processing step is developed which geometrically analyses the created road map for artefacts typical of free-drive areas on mine sites and significantly improves the quality of the final road network graph.
    Functional Optimization Reinforcement Learning for Real-Time Bidding. (arXiv:2206.13939v1 [cs.AI])
    Real-time bidding (RTB) is the new paradigm of programmatic advertising. An advertiser wants to make the intelligent choice of utilizing a Demand-Side Platform to improve the performance of their ad campaigns. Existing approaches are struggling to provide a satisfactory solution for bidding optimization due to stochastic bidding behavior. In this paper, we propose a multi-agent reinforcement learning architecture for RTB with functional optimization. We design a four-agent bidding environment: three Lagrange-multiplier based functional optimization agents and one baseline agent (without any attribute of functional optimization). First, numerous attributes are assigned to each agent, including biased or unbiased win probability, Lagrange multiplier, and click-through rate. In order to evaluate the proposed RTB strategy's performance, we demonstrate the results on ten sequential simulated auction campaigns. The results show that agents with functional actions and rewards had the most significant average winning rate and winning surplus, given biased and unbiased winning information respectively. The experimental evaluations show that our approach significantly improves the campaign's efficacy and profitability.
    Distributed Bayesian Online Learning for Cooperative Manipulation. (arXiv:2104.04342v2 [cs.RO] UPDATED)
    For tasks where the dynamics of multiple agents are physically coupled, e.g., in cooperative manipulation, the coordination between the individual agents becomes crucial, which requires exact knowledge of the interaction dynamics. This problem is typically addressed using centralized estimators, which can negatively impact the flexibility and robustness of the overall system. To overcome this shortcoming, we propose a novel distributed learning framework for the exemplary task of cooperative manipulation using Bayesian principles. Using only local state information, each agent obtains an estimate of the object dynamics and grasp kinematics. These local estimates are combined using dynamic average consensus. Due to the strong probabilistic foundation of the method, each estimate of the object dynamics and grasp kinematics is accompanied by a measure of uncertainty, which makes it possible to guarantee a bounded prediction error with high probability. Moreover, the Bayesian principles directly allow iterative learning with constant complexity, such that the proposed learning method can be used online in real-time applications. The effectiveness of the approach is demonstrated in a simulated cooperative manipulation task.
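As a rough illustration of how agents can merge local estimates without a central node, the sketch below runs a plain (static) average consensus iteration on scalar values. This is a simplified stand-in, not the dynamic average consensus over Bayesian estimates used in the paper; the ring topology, the step size, and the numbers are assumptions chosen for the example.

```python
def consensus_step(values, neighbors, step):
    """One synchronous update: each agent moves toward its neighbors' values."""
    return [
        v + step * sum(values[j] - v for j in neighbors[i])
        for i, v in enumerate(values)
    ]


def run_consensus(values, neighbors, step, iterations):
    """Repeat the local update; only neighbor-to-neighbor exchange is used."""
    for _ in range(iterations):
        values = consensus_step(values, neighbors, step)
    return values


# Hypothetical setup: four agents on a ring, each holding a scalar estimate.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
local_estimates = [1.0, 2.0, 3.0, 6.0]
agreed = run_consensus(local_estimates, ring, step=0.25, iterations=50)
```

Because the exchange is symmetric, the sum of the values is preserved at every step, so all agents approach the average of the initial estimates.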
    TACTiS: Transformer-Attentional Copulas for Time Series. (arXiv:2202.03528v2 [cs.LG] UPDATED)
    The estimation of time-varying quantities is a fundamental component of decision making in fields such as healthcare and finance. However, the practical utility of such estimates is limited by how accurately they quantify predictive uncertainty. In this work, we address the problem of estimating the joint predictive distribution of high-dimensional multivariate time series. We propose a versatile method, based on the transformer architecture, that estimates joint distributions using an attention-based decoder that provably learns to mimic the properties of non-parametric copulas. The resulting model has several desirable properties: it can scale to hundreds of time series, supports both forecasting and interpolation, can handle unaligned and non-uniformly sampled data, and can seamlessly adapt to missing data during training. We demonstrate these properties empirically and show that our model produces state-of-the-art predictions on multiple real-world datasets.
    Data Augmentation techniques in time series domain: A survey and taxonomy. (arXiv:2206.13508v1 [cs.LG])
    With the latest advances in deep learning generative models, it has not taken long to take advantage of their remarkable performance in the area of time series. Deep neural networks used to work with time series depend heavily on the breadth and consistency of the datasets used in training. Such characteristics are not usually abundant in the real world, where datasets are usually limited and often subject to privacy constraints that must be guaranteed. Therefore, an effective approach is to increase the amount of data using data augmentation (DA) techniques, either by adding noise or permutations or by generating new synthetic data. This work systematically reviews the current state-of-the-art in the area to provide an overview of all available algorithms and proposes a taxonomy of the most relevant research. The efficiency of the different variants will be evaluated; as a vital part of the process, the different metrics to evaluate the performance and the main problems concerning each model will be analysed. The ultimate goal of this study is to provide a summary of the evolution and performance of areas that produce better results to guide future researchers in this field.
    Disentangling Embedding Spaces with Minimal Distributional Assumptions. (arXiv:2206.13872v1 [stat.ML])
    Interest in understanding and factorizing learned embedding spaces is growing. For instance, recent concept-based explanation techniques analyze a machine learning model in terms of interpretable latent components. Such components have to be discovered in the model's embedding space, e.g., through independent component analysis (ICA) or modern disentanglement learning techniques. While these unsupervised approaches offer a sound formal framework, they either require access to a data generating function or impose rigid assumptions on the data distribution, such as independence of components, that are often violated in practice. In this work, we link conceptual explainability for vision models with disentanglement learning and ICA. This enables us to provide first theoretical results on how components can be identified without requiring any distributional assumptions. From these insights, we derive the disjoint attributions (DA) concept discovery method, which is applicable to a broader class of problems than current approaches yet possesses a formal identifiability guarantee. In an extensive comparison against component analysis and over 300 state-of-the-art disentanglement models, DA stably maintains superior performance, even under varying distributions and correlation strengths.
    RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network. (arXiv:2206.14098v1 [cs.LG])
    This work introduces the RevSilo, the first reversible module for bidirectional multi-scale feature fusion. Like other reversible methods, RevSilo eliminates the need to store hidden activations by recomputing them. Existing reversible methods, however, do not apply to multi-scale feature fusion and are therefore not applicable to a large class of networks. Bidirectional multi-scale feature fusion promotes local and global coherence and has become a de facto design principle for networks targeting spatially sensitive tasks, e.g., HRNet and EfficientDet. When paired with high-resolution inputs, these networks achieve state-of-the-art results across various computer vision tasks, but training them requires substantial accelerator memory for saving large, multi-resolution activations. These memory requirements cap network size and limit progress. Using reversible recomputation, the RevSilo alleviates memory issues while still operating across resolution scales. Stacking RevSilos, we create RevBiFPN, a fully reversible bidirectional feature pyramid network. For classification, RevBiFPN is competitive with networks such as EfficientNet while using up to 19.8x less training memory. When fine-tuned on COCO, RevBiFPN provides up to a 2.5% boost in AP over HRNet using fewer MACs and a 2.4x reduction in training-time memory.
    Improved Text Classification via Test-Time Augmentation. (arXiv:2206.13607v1 [cs.LG])
    Test-time augmentation (TTA) -- the aggregation of predictions across transformed examples of test inputs -- is an established technique to improve the performance of image classification models. Importantly, TTA can be used to improve model performance post-hoc, without additional training. Although TTA can be applied to any data modality, it has seen limited adoption in NLP due in part to the difficulty of identifying label-preserving transformations. In this paper, we present augmentation policies that yield significant accuracy improvements with language models. A key finding is that augmentation policy design -- for instance, the number of samples generated from a single, non-deterministic augmentation -- has a considerable impact on the benefit of TTA. Experiments across a binary classification task and dataset show that test-time augmentation can deliver consistent improvements over current state-of-the-art approaches.
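The aggregation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's augmentation policy: the averaging rule, the decision to include the unmodified input, and the names `tta_predict`, `toy_model`, `lowercase` and `drop_last_word` are all hypothetical.

```python
from typing import Callable, List, Sequence


def tta_predict(
    model: Callable[[str], Sequence[float]],
    augmentations: Sequence[Callable[[str], str]],
    text: str,
) -> List[float]:
    """Average class probabilities over the input and its augmented copies."""
    # Including the original text alongside its transforms is a design choice.
    views = [text] + [augment(text) for augment in augmentations]
    probs = [model(view) for view in views]
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]


def toy_model(text: str) -> List[float]:
    """Stand-in classifier: returns [P(negative), P(positive)]."""
    positive = 0.9 if "good" in text.lower().split() else 0.3
    return [1.0 - positive, positive]


def lowercase(text: str) -> str:
    return text.lower()


def drop_last_word(text: str) -> str:
    return " ".join(text.split()[:-1])


averaged = tta_predict(toy_model, [lowercase, drop_last_word], "The movie was good")
```

No retraining is involved: the model is called once per view and only the predictions are combined. The `drop_last_word` transform also shows why label preservation matters, since removing the word "good" changes what the toy classifier sees.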
    Integral Transforms in a Physics-Informed (Quantum) Neural Network setting: Applications & Use-Cases. (arXiv:2206.14184v1 [quant-ph])
    In many computational problems in engineering and science, function or model differentiation is essential, but integration is also needed. An important class of computational problems includes so-called integro-differential equations, which include both integrals and derivatives of a function. In another example, stochastic differential equations can be written in terms of a partial differential equation of a probability density function of the stochastic variable. To learn characteristics of the stochastic variable based on the density function, specific integral transforms, namely moments, of the density function need to be calculated. Recently, the machine learning paradigm of Physics-Informed Neural Networks emerged with increasing popularity as a method to solve differential equations by leveraging automatic differentiation. In this work, we propose to augment the paradigm of Physics-Informed Neural Networks with automatic integration in order to compute complex integral transforms on trained solutions, and to solve integro-differential equations where integrals are computed on-the-fly during training. Furthermore, we showcase the techniques in various application settings, numerically simulating quantum computer-based neural networks as well as classical neural networks.
    Koopman Q-learning: Offline Reinforcement Learning via Symmetries of Dynamics. (arXiv:2111.01365v2 [cs.LG] UPDATED)
    Offline reinforcement learning leverages large datasets to train policies without interactions with the environment. The learned policies may then be deployed in real-world settings where interactions are costly or dangerous. Current algorithms over-fit to the training dataset and as a consequence perform poorly when deployed to out-of-distribution generalizations of the environment. We aim to address these limitations by learning a Koopman latent representation which allows us to infer symmetries of the system's underlying dynamics. The latter is then utilized to extend the otherwise static offline dataset during training; this constitutes a novel data augmentation framework which reflects the system's dynamics and is thus to be interpreted as an exploration of the environment's phase space. To obtain the symmetries we employ Koopman theory, in which nonlinear dynamics are represented in terms of a linear operator acting on the space of measurement functions of the system, and thus symmetries of the dynamics may be inferred directly. We provide novel theoretical results on the existence and nature of symmetries relevant for control systems such as reinforcement learning settings. Moreover, we empirically evaluate our method on several benchmark offline reinforcement learning tasks and datasets including D4RL, Metaworld and Robosuite and find that by using our framework we consistently improve the state-of-the-art of model-free Q-learning methods.
    SHELS: Exclusive Feature Sets for Novelty Detection and Continual Learning Without Class Boundaries. (arXiv:2206.13720v1 [cs.LG])
    While deep neural networks (DNNs) have achieved impressive classification performance in closed-world learning scenarios, they typically fail to generalize to unseen categories in dynamic open-world environments, in which the number of concepts is unbounded. In contrast, human and animal learners have the ability to incrementally update their knowledge by recognizing and adapting to novel observations. In particular, humans characterize concepts via exclusive (unique) sets of essential features, which are used for both recognizing known classes and identifying novelty. Inspired by natural learners, we introduce a Sparse High-level-Exclusive, Low-level-Shared feature representation (SHELS) that simultaneously encourages learning exclusive sets of high-level features and essential, shared low-level features. The exclusivity of the high-level features enables the DNN to automatically detect out-of-distribution (OOD) data, while the efficient use of capacity via sparse low-level features permits accommodating new knowledge. The resulting approach uses OOD detection to perform class-incremental continual learning without known class boundaries. We show that using SHELS for novelty detection results in statistically significant improvements over state-of-the-art OOD detection approaches over a variety of benchmark datasets. Further, we demonstrate that the SHELS model mitigates catastrophic forgetting in a class-incremental learning setting, enabling a combined novelty detection and accommodation framework that supports learning in open-world settings.
    RAW-GNN: RAndom Walk Aggregation based Graph Neural Network. (arXiv:2206.13953v1 [cs.LG])
    Graph-Convolution-based methods have been successfully applied to representation learning on homophily graphs where nodes with the same label or similar attributes tend to connect with one another. Due to the homophily assumption of Graph Convolutional Networks (GCNs) that these methods use, they are not suitable for heterophily graphs where nodes with different labels or dissimilar attributes tend to be adjacent. Several methods have attempted to address this heterophily problem, but they do not change the fundamental aggregation mechanism of GCNs because they rely on summation operators to aggregate information from neighboring nodes, which is implicitly subject to the homophily assumption. Here, we introduce a novel aggregation mechanism and develop a RAndom Walk Aggregation-based Graph Neural Network (called RAW-GNN) method. The proposed approach integrates the random walk strategy with graph neural networks. The new method utilizes breadth-first random walk search to capture homophily information and depth-first search to collect heterophily information. It replaces the conventional neighborhoods with path-based neighborhoods and introduces a new path-based aggregator based on Recurrent Neural Networks. These designs make RAW-GNN suitable for both homophily and heterophily graphs. Extensive experimental results showed that the new method achieved state-of-the-art performance on a variety of homophily and heterophily graphs.
    On the universality of the volatility formation process: when machine learning and rough volatility agree. (arXiv:2206.14114v1 [q-fin.ST])
    We train an LSTM network based on a pooled dataset made of hundreds of liquid stocks aiming to forecast the next daily realized volatility for all stocks. Showing the consistent outperformance of this universal LSTM relative to other asset-specific parametric models, we uncover nonparametric evidence of a universal volatility formation mechanism across assets relating past market realizations, including daily returns and volatilities, to current volatilities. A parsimonious parametric forecasting device combining the rough fractional stochastic volatility and quadratic rough Heston models with fixed parameters results in the same level of performance as the universal LSTM, which confirms the universality of the volatility formation process from a parametric perspective.
    Stochastic first-order methods for average-reward Markov decision processes. (arXiv:2205.05800v4 [cs.LG] UPDATED)
    We study the problem of average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy evaluation and optimization. Existing on-policy evaluation methods suffer from sub-optimal convergence rates as well as failure in handling insufficiently random policies, e.g., deterministic policies, for lack of exploration. To remedy these issues, we develop a novel variance-reduced temporal difference (VRTD) method with linear function approximation for randomized policies along with optimal convergence guarantees, and an exploratory variance-reduced temporal difference (EVRTD) method for insufficiently random policies with comparable convergence guarantees. We further establish linear convergence rate on the bias of policy evaluation, which is essential for improving the overall sample complexity of policy optimization. On the other hand, compared with intensive research interest in finite sample analysis of policy gradient methods for discounted MDPs, existing studies on policy gradient methods for AMDPs mostly focus on regret bounds under restrictive assumptions on the underlying Markov processes (see, e.g., Abbasi-Yadkori et al., 2019), and they often lack guarantees on the overall sample complexities. Towards this end, we develop an average-reward variant of the stochastic policy mirror descent (SPMD) (Lan, 2022). We establish the first $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for solving AMDPs with policy gradient method under both the generative model (with unichain assumption) and Markovian noise model (with ergodic assumption). This bound can be further improved to $\widetilde{\mathcal{O}}(\epsilon^{-1})$ for solving regularized AMDPs. Our theoretical advantages are corroborated by numerical experiments.
    AI-based computer-aided diagnostic system of chest digital tomography synthesis: Demonstrating comparative advantage with X-ray-based AI systems. (arXiv:2206.13504v1 [eess.IV])
    Compared with chest X-ray (CXR) imaging, which is a single image projected from the front of the patient, chest digital tomosynthesis (CDTS) imaging can be more advantageous for lung lesion detection because it acquires multiple images projected from multiple angles of the patient. Various clinical comparative analysis and verification studies have been reported to demonstrate this, but there were no artificial intelligence (AI)-based comparative analysis studies. Existing AI-based computer-aided detection (CAD) systems for lung lesion diagnosis have been developed mainly based on CXR images; however, CAD based on CDTS, which uses multi-angle images of patients in various directions, has not been proposed and verified for its usefulness compared to CXR-based counterparts. This study develops and tests a CDTS-based AI CAD system to detect lung lesions to demonstrate performance improvements compared to CXR-based AI CAD. We used multiple projection images as input for the CDTS-based AI model and a single-projection image as input for the CXR-based AI model to fairly compare and evaluate the performance between models. The proposed CDTS-based AI CAD system yielded sensitivities of 0.782 and 0.785 and accuracies of 0.895 and 0.837 for the performance of detecting tuberculosis and pneumonia, respectively, against normal subjects. These results show higher performance than sensitivities of 0.728 and 0.698 and accuracies of 0.874 and 0.826 for detecting tuberculosis and pneumonia through the CXR-based AI CAD, which only uses a single projection image in the frontal direction. We found that CDTS-based AI CAD improved the sensitivity for tuberculosis and pneumonia by 5.4 and 8.7 percentage points, respectively, compared to CXR-based AI CAD without loss of accuracy. Therefore, we comparatively prove that CDTS-based AI CAD technology can improve performance more than CXR, enhancing the clinical applicability of CDTS.
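The reported sensitivity gains of 5.4 and 8.7 match the absolute differences between the CDTS-based and CXR-based sensitivities quoted in the abstract. The short check below recomputes them and, for comparison, also derives the relative gains; the relative figures are computed here and are not stated in the abstract, and the names `cdts`, `cxr` and `sensitivity_gains` are illustrative.

```python
# Sensitivities quoted in the abstract.
cdts = {"tuberculosis": 0.782, "pneumonia": 0.785}
cxr = {"tuberculosis": 0.728, "pneumonia": 0.698}


def sensitivity_gains(new, old):
    """Return (absolute gain in percentage points, relative gain in %) per task."""
    gains = {}
    for task in new:
        diff = new[task] - old[task]
        absolute_points = round(diff * 100, 1)
        relative_percent = round(diff / old[task] * 100, 1)
        gains[task] = (absolute_points, relative_percent)
    return gains


gains = sensitivity_gains(cdts, cxr)
for task, (points, relative) in gains.items():
    print(f"{task}: +{points} percentage points ({relative}% relative)")
```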
    Improving Clinical Efficiency and Reducing Medical Errors through NLP-enabled diagnosis of Health Conditions from Transcription Reports. (arXiv:2206.13516v1 [cs.LG])
    Misdiagnosis is one of the leading causes of medical errors in hospitals, affecting over 12 million adults across the US. To address the high rate of misdiagnosis, this study utilizes 4 NLP-based algorithms to determine the appropriate health condition based on an unstructured transcription report. Of the Logistic Regression, Random Forest, LSTM, and CNN-LSTM models, the CNN-LSTM model performed the best with an accuracy of 97.89%. We packaged this model into an authenticated web platform for accessible assistance to clinicians. Overall, by standardizing health care diagnosis and structuring transcription reports, our NLP platform drastically improves the clinical efficiency and accuracy of hospitals worldwide.
    POEM: Out-of-Distribution Detection with Posterior Sampling. (arXiv:2206.13687v1 [cs.LG])
    Out-of-distribution (OOD) detection is indispensable for machine learning models deployed in the open world. Recently, the use of an auxiliary outlier dataset during training (also known as outlier exposure) has shown promising performance. As the sample space for potential OOD data can be prohibitively large, sampling informative outliers is essential. In this work, we propose a novel posterior sampling-based outlier mining framework, POEM, which facilitates efficient use of outlier data and promotes learning a compact decision boundary between ID and OOD data for improved detection. We show that POEM establishes state-of-the-art performance on common benchmarks. Compared to the current best method that uses a greedy sampling strategy, POEM improves the relative performance by 42.0% and 24.2% (FPR95) on CIFAR-10 and CIFAR-100, respectively. We further provide theoretical insights on the effectiveness of POEM for OOD detection.
    Online Bootstrap Inference For Policy Evaluation in Reinforcement Learning. (arXiv:2108.03706v3 [stat.ML] UPDATED)
    The recent emergence of reinforcement learning has created a demand for robust statistical inference methods for the parameter estimates computed using these algorithms. Existing methods for statistical inference in online learning are restricted to settings involving independently sampled observations, while existing statistical inference methods in reinforcement learning (RL) are limited to the batch setting. The online bootstrap is a flexible and efficient approach for statistical inference in linear stochastic approximation algorithms, but its efficacy in settings involving Markov noise, such as RL, has yet to be explored. In this paper, we study the use of the online bootstrap method for statistical inference in RL. In particular, we focus on the temporal difference (TD) learning and Gradient TD (GTD) learning algorithms, which are themselves special instances of linear stochastic approximation under Markov noise. The method is shown to be distributionally consistent for statistical inference in policy evaluation, and numerical experiments are included to demonstrate the effectiveness of this algorithm at statistical inference tasks across a range of real RL environments.
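    The idea of perturbing an online TD update can be illustrated with a randomly reweighted (multiplier) TD(0) update. This is only a sketch under assumptions that the abstract does not state: a tabular value function, i.i.d. exponential multipliers with mean 1 and variance 1, a polynomially decaying step size, and a toy two-state chain. The function names are hypothetical and this is not claimed to be the paper's exact procedure.

```python
import random


def simulate_chain(n_steps, seed=1):
    """Toy two-state Markov chain (hypothetical): reward 1 in state 0, else 0."""
    rng = random.Random(seed)
    state, transitions = 0, []
    for _ in range(n_steps):
        next_state = rng.choice([0, 1])
        reward = 1.0 if state == 0 else 0.0
        transitions.append((state, reward, next_state))
        state = next_state
    return transitions


def online_bootstrap_td0(transitions, n_states, n_replicates=50, gamma=0.9, seed=0):
    """Run TD(0) plus n_replicates randomly reweighted copies in a single pass."""
    rng = random.Random(seed)
    values = [0.0] * n_states
    replicas = [[0.0] * n_states for _ in range(n_replicates)]
    for t, (s, r, s_next) in enumerate(transitions, start=1):
        step = 1.0 / t ** 0.75  # assumed step-size schedule
        # Unperturbed TD(0) update (the point estimate).
        values[s] += step * (r + gamma * values[s_next] - values[s])
        # Each replicate scales its own update by a fresh random multiplier.
        for rep in replicas:
            w = rng.expovariate(1.0)  # mean 1, variance 1 (assumed choice)
            rep[s] += step * w * (r + gamma * rep[s_next] - rep[s])
    return values, replicas


def percentile_interval(samples, lower=0.025, upper=0.975):
    """Empirical percentile interval of the replicate estimates."""
    ordered = sorted(samples)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]


values, replicas = online_bootstrap_td0(simulate_chain(2000), n_states=2)
low, high = percentile_interval([rep[0] for rep in replicas])
print(f"V(0) estimate {values[0]:.3f}, replicate spread [{low:.3f}, {high:.3f}]")
```

    All replicates are updated from the same stream of transitions, so no data needs to be stored or resampled, which is what makes this style of bootstrap suitable for the online setting.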
    Utilizing Class Separation Distance for the Evaluation of Corruption Robustness of Machine Learning Classifiers. (arXiv:2206.13405v1 [cs.LG] CROSS LISTED)
    Robustness is a fundamental pillar of Machine Learning (ML) classifiers, substantially determining their reliability. Methods for assessing classifier robustness are therefore essential. In this work, we address the challenge of evaluating corruption robustness in a way that allows comparability and interpretability on a given dataset. We propose a test data augmentation method that uses a robustness distance $\epsilon$ derived from the dataset's minimal class separation distance. The resulting MSCR (mean statistical corruption robustness) metric allows a dataset-specific comparison of different classifiers with respect to their corruption robustness. The MSCR value is interpretable, as it represents the classifier's avoidable loss of accuracy due to statistical corruptions. On 2D and image data, we show that the metric reflects different levels of classifier robustness. Furthermore, we observe unexpected optima in classifiers' robust accuracy through training and testing classifiers with different levels of noise. While researchers have frequently reported a significant tradeoff in accuracy when training robust models, we strengthen the view that a tradeoff between accuracy and corruption robustness is not inherent. Our results indicate that robustness training through simple data augmentation can already slightly improve accuracy.
    Quantum Neural Architecture Search with Quantum Circuits Metric and Bayesian Optimization. (arXiv:2206.14115v1 [quant-ph])
    Quantum neural networks are promising for a wide range of applications in the Noisy Intermediate-Scale Quantum era. As such, there is an increasing demand for automatic quantum neural architecture search. We tackle this challenge by designing a quantum circuits metric for Bayesian optimization with a Gaussian process. To this end, we propose a new quantum gates distance that characterizes the gates' action over every quantum state and provide a theoretical perspective on its geometrical properties. Our approach significantly outperforms the benchmark on three empirical quantum machine learning problems: training a quantum generative adversarial network, solving combinatorial optimization in the MaxCut problem, and simulating the quantum Fourier transform. Our method can be extended to characterize behaviors of various quantum machine learning models.
    Memory Safe Computations with XLA Compiler. (arXiv:2206.14148v1 [cs.LG])
    Software packages like TensorFlow and PyTorch are designed to support linear algebra operations, and their speed and usability determine their success. However, by prioritising speed, they often neglect memory requirements. As a consequence, the implementations of memory-intensive algorithms that are convenient in terms of software design can often not be run for large problems due to memory overflows. Memory-efficient solutions require complex programming approaches with significant logic outside the computational framework. This impairs the adoption and use of such algorithms. To address this, we developed an XLA compiler extension that adjusts the computational data-flow representation of an algorithm according to a user-specified memory limit. We show that k-nearest neighbour and sparse Gaussian process regression methods can be run at a much larger scale on a single device, where standard implementations would have failed. Our approach leads to better use of hardware resources. We believe that further focus on removing memory constraints at a compiler level will widen the range of machine learning methods that can be developed in the future.
    Transparent Single-Cell Set Classification with Kernel Mean Embeddings. (arXiv:2201.07322v5 [cs.LG] UPDATED)
    Modern single-cell flow and mass cytometry technologies measure the expression of several proteins of the individual cells within a blood or tissue sample. Each profiled biological sample is thus represented by a set of hundreds of thousands of multidimensional cell feature vectors, which incurs a high computational cost to predict each biological sample's associated phenotype with machine learning models. Such a large set cardinality also limits the interpretability of machine learning models due to the difficulty in tracking how each individual cell influences the ultimate prediction. We propose using Kernel Mean Embedding to encode the cellular landscape of each profiled biological sample. Although our foremost goal is to make a more transparent model, we find that our method achieves comparable or better accuracies than the state-of-the-art gating-free methods through a simple linear classifier. As a result, our model contains few parameters but still performs similarly to deep learning models with millions of parameters. In contrast with deep learning approaches, the linearity and sub-selection step of our model makes it easy to interpret classification results. Analysis further shows that our method admits rich biological interpretability for linking cellular heterogeneity to clinical phenotype.
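    A kernel mean embedding summarizes a whole set of cells as the average of their feature-space images, so that a linear classifier can operate on one fixed-length vector per sample. The sketch below assumes a Gaussian kernel approximated with random Fourier features; the bandwidth, the number of features, and the function names are hypothetical, and the abstract does not say which kernel approximation the authors use.

```python
import math
import random


def make_random_fourier_features(dim, n_features, bandwidth=1.0, seed=0):
    """Return a feature map whose inner products approximate a Gaussian kernel."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
               for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return feature_map


def kernel_mean_embedding(cells, feature_map):
    """Average the feature vectors of all cells in one biological sample."""
    features = [feature_map(cell) for cell in cells]
    return [sum(column) / len(cells) for column in zip(*features)]


# Two toy "samples", each a set of 3-marker cell measurements (made-up numbers).
phi = make_random_fourier_features(dim=3, n_features=8)
sample_a = [(0.1, 0.2, 0.3), (0.0, 0.4, 0.1), (0.2, 0.1, 0.5)]
sample_b = [(1.1, 0.9, 1.3), (1.0, 1.2, 0.8)]
embedding_a = kernel_mean_embedding(sample_a, phi)
embedding_b = kernel_mean_embedding(sample_b, phi)
print(len(embedding_a), len(embedding_b))  # both have n_features entries
```

    Samples with different numbers of cells map to vectors of the same length, which is what allows a simple linear model to be trained on top.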
    Quantifying and Learning Linear Symmetry-Based Disentanglement. (arXiv:2011.06070v4 [cs.LG] UPDATED)
    The definition of Linear Symmetry-Based Disentanglement (LSBD) formalizes the notion of linearly disentangled representations, but there is currently no metric to quantify LSBD. Such a metric is crucial to evaluate LSBD methods and to compare to previous understandings of disentanglement. We propose $\mathcal{D}_\mathrm{LSBD}$, a mathematically sound metric to quantify LSBD, and provide a practical implementation for $\mathrm{SO}(2)$ groups. Furthermore, from this metric we derive LSBD-VAE, a semi-supervised method to learn LSBD representations. We demonstrate the utility of our metric by showing that (1) common VAE-based disentanglement methods don't learn LSBD representations, (2) LSBD-VAE as well as other recent methods can learn LSBD representations, needing only limited supervision on transformations, and (3) various desirable properties expressed by existing disentanglement metrics are also achieved by LSBD representations.
    Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation. (arXiv:2203.13339v2 [cs.CL] UPDATED)
    End-to-end speech-to-speech translation (S2ST) without relying on intermediate text representations is a rapidly emerging frontier of research. Recent works have demonstrated that the performance of such direct S2ST systems is approaching that of conventional cascade S2ST when trained on comparable datasets. However, in practice, the performance of direct S2ST is bounded by the availability of paired S2ST training data. In this work, we explore multiple approaches for leveraging much more widely available unsupervised and weakly-supervised speech and text data to improve the performance of direct S2ST based on Translatotron 2. With our most effective approaches, the average translation quality of direct S2ST on 21 language pairs on the CVSS-C corpus is improved by +13.6 BLEU (or +113% relatively), as compared to the previous state-of-the-art trained without additional data. The improvements on low-resource languages are even more significant (+398% relatively on average). Our comparative studies suggest future research directions for S2ST and speech representation learning.
    Learning Controllable 3D Level Generators. (arXiv:2206.13623v1 [cs.AI])
    Procedural Content Generation via Reinforcement Learning (PCGRL) foregoes the need for large human-authored data-sets and allows agents to train explicitly on functional constraints, using computable, user-defined measures of quality instead of target output. We explore the application of PCGRL to 3D domains, in which content-generation tasks naturally have greater complexity and potential pertinence to real-world applications. Here, we introduce several PCGRL tasks for the 3D domain, Minecraft (Mojang Studios, 2009). These tasks will challenge RL-based generators using affordances often found in 3D environments, such as jumping, multiple dimensional movement, and gravity. We train an agent to optimize each of these tasks to explore the capabilities of previous research in PCGRL. This agent is able to generate relatively complex and diverse levels, and generalize to random initial states and control targets. Controllability tests in the presented tasks demonstrate their utility to analyze success and failure for 3D generators.
    UMBRELLA: Uncertainty-Aware Model-Based Offline Reinforcement Learning Leveraging Planning. (arXiv:2111.11097v3 [cs.RO] UPDATED)
    Offline reinforcement learning (RL) provides a framework for learning decision-making from offline data and therefore constitutes a promising approach for real-world applications such as automated driving. Self-driving vehicles (SDV) learn a policy, which potentially even outperforms the behavior in the sub-optimal data set. Especially in safety-critical applications such as automated driving, explainability and transferability are key to success. This motivates the use of model-based offline RL approaches, which leverage planning. However, current state-of-the-art methods often neglect the influence of aleatoric uncertainty arising from the stochastic behavior of multi-agent systems. This work proposes a novel approach for Uncertainty-aware Model-Based Offline REinforcement Learning Leveraging plAnning (UMBRELLA), which solves the prediction, planning, and control problem of the SDV jointly in an interpretable learning-based fashion. A trained action-conditioned stochastic dynamics model captures distinctively different future evolutions of the traffic scene. The analysis provides empirical evidence for the effectiveness of our approach in challenging automated driving simulations and on a real-world public dataset.
    On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods. (arXiv:2206.13503v2 [cs.LG] UPDATED)
    Machine Learning (ML) models now inform a wide range of human decisions, but using ``black box'' models carries risks such as relying on spurious correlations or errant data. To address this, researchers have proposed methods for supplementing models with explanations of their predictions. However, robust evaluations of these methods' usefulness in real-world contexts have remained elusive, with experiments tending to rely on simplified settings or proxy tasks. We present an experimental study extending a prior explainable ML evaluation experiment and bringing the setup closer to the deployment setting by relaxing its simplifying assumptions. Our empirical study draws dramatically different conclusions than the prior work, highlighting how seemingly trivial experimental design choices can yield misleading results. Beyond the present experiment, we believe this work holds lessons about the necessity of situating the evaluation of any ML method and choosing appropriate tasks, data, users, and metrics to match the intended deployment contexts.
    Materials Transformers Language Models for Generative Materials Design: a benchmark study. (arXiv:2206.13578v1 [cond-mat.mtrl-sci])
    Transformer language models pre-trained on large unlabeled corpora have produced state-of-the-art results in natural language processing, organic molecule design, and protein sequence generation. However, no such models have been applied to learn the composition patterns of inorganic materials. Here we train a series of seven modern transformer language models (GPT, GPT-2, GPT-Neo, GPT-J, BLMM, BART, and RoBERTa) using the expanded formulas from materials deposited in the ICSD, OQMD, and Materials Project databases. Six different datasets, with or without non-charge-neutral or balanced electronegativity samples, are used to benchmark the performance and uncover the generation biases of modern transformer models for the generative design of materials compositions. Our extensive experiments show that materials transformers based on causal language models can generate chemically valid materials compositions, of which up to 97.54% are charge neutral and 91.40% are electronegativity balanced, which is more than 6 times higher enrichment compared to a baseline pseudo-random sampling algorithm. These models also demonstrate high novelty, and their potential for new materials discovery is supported by their capability to recover held-out materials. We also find that the properties of the generated samples can be tailored by training the models with selected training sets such as high-bandgap materials. Our experiments also show that different models each have their own preferences in terms of the properties of the generated samples, and that their running time varies considerably. We have applied our materials transformer models to discover a set of new materials, as validated using DFT calculations.
    Deployment of ML Models using Kubeflow on Different Cloud Providers. (arXiv:2206.13655v1 [cs.LG])
    This project explores the process of deploying machine learning models on Kubernetes using an open-source tool called Kubeflow [1], an end-to-end ML stack orchestration toolkit. We create end-to-end machine learning models on Kubeflow in the form of pipelines and analyze various points including the ease of setup, deployment models, performance, limitations and features of the tool. We hope that our project acts almost like a seminar/introductory report that can help vanilla cloud/Kubernetes users with zero knowledge of Kubeflow use Kubeflow to deploy ML models. From setup on different clouds to serving our trained model over the internet, we give details and metrics on the performance of Kubeflow.
    Learning to learn online with neuromodulated synaptic plasticity in spiking neural networks. (arXiv:2206.12520v2 [cs.NE] UPDATED)
    We propose that in order to harness our understanding of neuroscience toward machine learning, we must first have powerful tools for training brain-like models of learning. Although substantial progress has been made toward understanding the dynamics of learning in the brain, neuroscience-derived models of learning have yet to demonstrate the same performance capabilities as methods in deep learning such as gradient descent. Inspired by the successes of machine learning using gradient descent, we demonstrate that models of neuromodulated synaptic plasticity from neuroscience can be trained in Spiking Neural Networks (SNNs) with a framework of learning to learn through gradient descent to address challenging online learning problems. This framework opens a new path toward developing neuroscience inspired online learning algorithms.
    Revisiting the Updates of a Pre-trained Model for Few-shot Learning. (arXiv:2205.07874v2 [cs.LG] UPDATED)
    Most of the recent few-shot learning algorithms are based on transfer learning, where a model is pre-trained using a large amount of source data, and the pre-trained model is updated using a small amount of target data afterward. In transfer-based few-shot learning, sophisticated pre-training methods have been widely studied for universal and improved representation. However, there is little study on updating pre-trained models for few-shot learning. In this paper, we compare the two popular updating methods, fine-tuning (i.e., updating the entire network) and linear probing (i.e., updating only the linear classifier), considering the distribution shift between the source and target data. We find that fine-tuning is better than linear probing as the number of samples increases, regardless of distribution shift. Next, we investigate the effectiveness and ineffectiveness of data augmentation when pre-trained models are fine-tuned. Our fundamental analyses demonstrate that careful considerations of the details about updating pre-trained models are required for better few-shot performance.
    Reduced Optimal Power Flow Using Graph Neural Network. (arXiv:2206.13591v1 [eess.SY])
    Optimal power flow (OPF) problems are formulated and solved for power system operations, especially for determining generation dispatch points in real time. For large and complex power system networks with large numbers of variables and constraints, finding the optimal solution for real-time OPF in a timely manner requires a massive amount of computing power. This paper presents a new method to reduce the number of constraints in the original OPF problem using a graph neural network (GNN). A GNN is a machine learning model that utilizes features from nodes, edges, and network topology to maximize its performance. In this paper, we propose a GNN model to predict which lines would be heavily loaded or congested with given load profiles and generation capacities. Only these critical lines are monitored in the OPF problem, creating a reduced OPF (ROPF) problem. Significant savings in computing time are expected from the proposed ROPF model. A comprehensive analysis of predictions from the GNN model was also made. It is concluded that the application of GNN for ROPF is able to reduce computing time while retaining solution quality.
    Explaining Any ML Model? -- On Goals and Capabilities of XAI. (arXiv:2206.13888v1 [cs.LG])
    An increasing ubiquity of machine learning (ML) motivates research on algorithms to explain ML models and their predictions -- so-called eXplainable Artificial Intelligence (XAI). Despite many survey papers and discussions, the goals and capabilities of XAI algorithms are far from being well understood. We argue that this is because of a problematic reasoning scheme in XAI literature: XAI algorithms are said to complement ML models with desired properties, such as "interpretability", or "explainability". These properties are in turn assumed to contribute to a goal, like "trust" in an ML system. But most properties lack precise definitions and their relationship to such goals is far from obvious. The result is a reasoning scheme that obfuscates research results and leaves an important question unanswered: What can one expect from XAI algorithms? In this article, we clarify the goals and capabilities of XAI algorithms from a concrete perspective: that of their users. Explaining ML models is only necessary if users have questions about them. We show that users can ask diverse questions, but that only one of them can be answered by current XAI algorithms. Answering this core question can be trivial, difficult or even impossible, depending on the ML application. Based on these insights, we outline which capabilities policymakers, researchers and society can reasonably expect from XAI algorithms.
    Parallel Instance Filtering for Malware Detection. (arXiv:2206.13889v1 [cs.CR])
    Machine learning algorithms are widely used in the area of malware detection. As the number of samples grows, training classification algorithms becomes more and more expensive. In addition, training data sets may contain redundant or noisy instances. The problem to be solved is how to select representative instances from large training data sets without reducing the accuracy. This work presents a new parallel instance selection algorithm called Parallel Instance Filtering (PIF). The main idea of the algorithm is to split the data set into non-overlapping subsets of instances covering the whole data set and apply a filtering process to each subset. Each subset consists of instances that have the same nearest enemy. As a result, the PIF algorithm is fast since subsets are processed independently of each other using parallel computation. We compare the PIF algorithm with several state-of-the-art instance selection algorithms on a large data set of 500,000 malicious and benign samples. The feature set was extracted using static analysis, and it includes metadata from the portable executable file format. Our experimental results demonstrate that the proposed instance selection algorithm reduces the size of a training data set significantly with only slightly decreased accuracy. The PIF algorithm outperforms existing instance selection methods used in the experiments in terms of the ratio between average classification accuracy and storage percentage.
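    The partitioning step described above (grouping instances by their nearest enemy, i.e. the closest instance with a different label) can be sketched as follows. The brute-force search, the squared Euclidean distance, and all function names are assumptions made for illustration; the abstract does not describe the per-subset filtering rule, so it is left out.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_enemy(index, points, labels):
    """Index of the closest instance whose label differs from points[index].

    Raises ValueError if the data set contains only one class.
    """
    enemies = [j for j in range(len(points)) if labels[j] != labels[index]]
    return min(enemies, key=lambda j: squared_distance(points[index], points[j]))


def partition_by_nearest_enemy(points, labels):
    """Map each nearest-enemy index to the list of instances that share it."""
    groups = defaultdict(list)
    for i in range(len(points)):
        groups[nearest_enemy(i, points, labels)].append(i)
    return dict(groups)


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
labels = ["benign", "benign", "malware", "malware"]
print(partition_by_nearest_enemy(points, labels))  # {2: [0, 1], 1: [2, 3]}
```

    Every instance has exactly one nearest enemy, so the subsets are disjoint and cover the whole data set, and each subset could then be handed to a separate worker for filtering.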
    Measure Estimation in the Barycentric Coding Model. (arXiv:2201.12195v2 [stat.ML] UPDATED)
    This paper considers the problem of measure estimation under the barycentric coding model (BCM), in which an unknown measure is assumed to belong to the set of Wasserstein-2 barycenters of a finite set of known measures. Estimating a measure under this model is equivalent to estimating the unknown barycentric coordinates. We provide novel geometrical, statistical, and computational insights for measure estimation under the BCM, consisting of three main results. Our first main result leverages the Riemannian geometry of Wasserstein-2 space to provide a procedure for recovering the barycentric coordinates as the solution to a quadratic optimization problem assuming access to the true reference measures. The essential geometric insight is that the parameters of this quadratic problem are determined by inner products between the optimal displacement maps from the given measure to the reference measures defining the BCM. Our second main result then establishes an algorithm for solving for the coordinates in the BCM when all the measures are observed empirically via i.i.d. samples. We prove precise rates of convergence for this algorithm -- determined by the smoothness of the underlying measures and their dimensionality -- thereby guaranteeing its statistical consistency. Finally, we demonstrate the utility of the BCM and associated estimation procedures in three application areas: (i) covariance estimation for Gaussian measures; (ii) image processing; and (iii) natural language processing.
    LiteCON: An All-Photonic Neuromorphic Accelerator for Energy-efficient Deep Learning (Preprint). (arXiv:2206.13861v1 [cs.ET])
    Deep learning is highly pervasive in today's data-intensive era. In particular, convolutional neural networks (CNNs) are being widely adopted in a variety of fields for superior accuracy. However, computing deep CNNs on traditional CPUs and GPUs brings several performance and energy pitfalls. Several novel approaches based on ASIC, FPGA, and resistive-memory devices have been recently demonstrated with promising results. Most of them target only the inference (testing) phase of deep learning. There have been very limited attempts to design a full-fledged deep learning accelerator capable of both training and inference. This is due to the highly compute- and memory-intensive nature of the training phase. In this paper, we propose LiteCON, a novel analog photonics CNN accelerator. LiteCON uses silicon microdisk-based convolution, memristor-based memory, and dense-wavelength-division-multiplexing for energy-efficient and ultrafast deep learning. We evaluate LiteCON using a commercial CAD framework (IPKISS) on deep learning benchmark models including LeNet and VGG-Net. Compared to the state-of-the-art, LiteCON improves the CNN throughput, energy efficiency, and computational efficiency by up to 32x, 37x, and 5x respectively with trivial accuracy degradation.
    Feature Learning for Dimensionality Reduction toward Maximal Extraction of Hidden Patterns. (arXiv:2206.13891v1 [cs.LG])
    Dimensionality reduction (DR) plays a vital role in the visual analysis of high-dimensional data. One main aim of DR is to reveal hidden patterns that lie on intrinsic low-dimensional manifolds. However, DR often overlooks important patterns when the manifolds are strongly distorted or hidden by certain influential data attributes. This paper presents a feature learning framework, FEALM, designed to generate an optimized set of data projections for nonlinear DR in order to capture important patterns in the hidden manifolds. These projections produce maximally different nearest-neighbor graphs so that resultant DR outcomes are significantly different. To achieve such a capability, we design an optimization algorithm as well as introduce a new graph dissimilarity measure, called neighbor-shape dissimilarity. Additionally, we develop interactive visualizations to assist comparison of obtained DR results and interpretation of each DR result. We demonstrate FEALM's effectiveness through experiments using synthetic datasets and multiple case studies on real-world datasets.
    Learning Symmetric Rules with SATNet. (arXiv:2206.13998v1 [cs.AI])
    SATNet is a differentiable constraint solver with a custom backpropagation algorithm, which can be used as a layer in a deep-learning system. It is a promising proposal for bridging deep learning and logical reasoning. In fact, SATNet has been successfully applied to learn, among others, the rules of a complex logical puzzle, such as Sudoku, just from input and output pairs where inputs are given as images. In this paper, we show how to improve the learning of SATNet by exploiting symmetries in the target rules of a given but unknown logical puzzle or more generally a logical formula. We present SymSATNet, a variant of SATNet that translates the given symmetries of the target rules to a condition on the parameters of SATNet and requires that the parameters should have a particular parametric form that guarantees the condition. The requirement dramatically reduces the number of parameters to learn for the rules with enough symmetries, and makes the parameter learning of SymSATNet much easier than that of SATNet. We also describe a technique for automatically discovering symmetries of the target rules from examples. Our experiments with Sudoku and Rubik's cube show the substantial improvement of SymSATNet over the baseline SATNet.
    Value Function Decomposition for Iterative Design of Reinforcement Learning Agents. (arXiv:2206.13901v1 [cs.LG])
    Designing reinforcement learning (RL) agents is typically a difficult process that requires numerous design iterations. Learning can fail for a multitude of reasons, and standard RL methods provide too few tools to provide insight into the exact cause. In this paper, we show how to integrate value decomposition into a broad class of actor-critic algorithms and use it to assist in the iterative agent-design process. Value decomposition separates a reward function into distinct components and learns value estimates for each. These value estimates provide insight into an agent's learning and decision-making process and enable new training methods to mitigate common problems. As a demonstration, we introduce SAC-D, a variant of soft actor-critic (SAC) adapted for value decomposition. SAC-D maintains similar performance to SAC, while learning a larger set of value predictions. We also introduce decomposition-based tools that exploit this information, including a new reward influence metric, which measures each reward component's effect on agent decision-making. Using these tools, we provide several demonstrations of decomposition's use in identifying and addressing problems in the design of both environments and agents. Value decomposition is broadly applicable and easy to incorporate into existing algorithms and workflows, making it a powerful tool in an RL practitioner's toolbox.
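    The core idea of value decomposition, learning one value estimate per reward component, can be illustrated with tabular TD(0) policy evaluation. This is a sketch under stated assumptions (a fixed stream of transitions, tabular values, hypothetical component names "progress" and "energy"); it is not SAC-D, which applies the idea inside an actor-critic algorithm.

```python
def td0_update(values, state, reward, next_state, alpha, gamma):
    """One tabular TD(0) update, applied in place."""
    target = reward + gamma * values[next_state]
    values[state] += alpha * (target - values[state])


def train_decomposed(transitions, n_states, components, alpha=0.1, gamma=0.9):
    """Learn a separate value table for each reward component."""
    component_values = {name: [0.0] * n_states for name in components}
    for state, rewards, next_state in transitions:
        for name in components:
            td0_update(component_values[name], state, rewards[name],
                       next_state, alpha, gamma)
    return component_values


def total_value(component_values, state):
    """The overall value estimate is the sum of the component estimates."""
    return sum(table[state] for table in component_values.values())


# A three-state cycle with two reward components (made-up numbers).
transitions = [
    (0, {"progress": 1.0, "energy": -0.2}, 1),
    (1, {"progress": 0.0, "energy": -0.1}, 2),
    (2, {"progress": 2.0, "energy": -0.5}, 0),
] * 200
tables = train_decomposed(transitions, n_states=3, components=["progress", "energy"])
for s in range(3):
    print(s, round(tables["progress"][s], 3), round(tables["energy"][s], 3),
          round(total_value(tables, s), 3))
```

    Because the TD(0) update is linear in the reward, the component tables sum to the value that would be learned from the combined reward, while each table on its own shows how much a given reward term contributes in each state.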
    Conditional Contrastive Learning for Improving Fairness in Self-Supervised Learning. (arXiv:2106.02866v2 [cs.LG] UPDATED)
    Contrastive self-supervised learning (SSL) learns an embedding space that maps similar data pairs closer and dissimilar data pairs farther apart. Despite its success, one issue has been overlooked: the fairness aspect of representations learned using contrastive SSL. Without mitigation, contrastive SSL techniques can incorporate sensitive information such as gender or race and cause potentially unfair predictions on downstream tasks. In this paper, we propose a Conditional Contrastive Learning (CCL) approach to improve the fairness of contrastive SSL methods. Our approach samples positive and negative pairs from distributions conditioning on the sensitive attribute, or empirically speaking, sampling positive and negative pairs from the same gender or the same race. We show that our approach provably maximizes the conditional mutual information between the learned representations of the positive pairs, and reduces the effect of the sensitive attribute by taking it as the conditional variable. On seven fairness and vision datasets, we empirically demonstrate that the proposed approach achieves state-of-the-art downstream performances compared to unsupervised baselines and significantly improves the fairness of contrastive SSL models on multiple fairness metrics.
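    The conditional sampling described above can be sketched for the negative pairs: negatives for an anchor are drawn only from instances that share the anchor's sensitive attribute value, so the attribute cannot help tell the anchor from its negatives. The function names and the toy attribute values are hypothetical. Treating the positive as an augmented view of the anchor (which shares its attribute by construction) is an assumption, and the contrastive loss itself is omitted.

```python
import random
from collections import defaultdict


def group_by_attribute(sensitive):
    """Map each sensitive attribute value to the indices that carry it."""
    groups = defaultdict(list)
    for index, value in enumerate(sensitive):
        groups[value].append(index)
    return groups


def sample_conditional_negatives(anchor, sensitive, k, rng=None):
    """Sample up to k negatives sharing the anchor's sensitive attribute value."""
    rng = rng or random.Random()
    groups = group_by_attribute(sensitive)
    candidates = [i for i in groups[sensitive[anchor]] if i != anchor]
    return rng.sample(candidates, min(k, len(candidates)))


sensitive = ["f", "m", "f", "f", "m", "f"]
negatives = sample_conditional_negatives(0, sensitive, k=2, rng=random.Random(0))
print(negatives, [sensitive[i] for i in negatives])  # every negative has value "f"
```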
    EMVLight: A Decentralized Reinforcement Learning Framework for Efficient Passage of Emergency Vehicles. (arXiv:2109.05429v3 [cs.LG] UPDATED)
    Emergency vehicles (EMVs) play a crucial role in responding to time-critical events such as medical emergencies and fire outbreaks in an urban area. The less time EMVs spend traveling through traffic, the more likely they are to help save lives and reduce property loss. To reduce the travel time of EMVs, prior work has used route optimization based on historical traffic-flow data and traffic signal pre-emption based on the optimal route. However, traffic signal pre-emption dynamically changes the traffic flow which, in turn, modifies the optimal route of an EMV. In addition, traffic signal pre-emption practices usually lead to significant disturbances in traffic flow and subsequently increase the travel time for non-EMVs. In this paper, we propose EMVLight, a decentralized reinforcement learning (RL) framework for simultaneous dynamic routing and traffic signal control. EMVLight extends Dijkstra's algorithm to efficiently update the optimal route for an EMV in real time as it travels through the traffic network. The decentralized RL agents learn network-level cooperative traffic signal phase strategies that not only reduce EMV travel time but also reduce the average travel time of non-EMVs in the network. This benefit has been demonstrated through comprehensive experiments with synthetic and real-world maps. These experiments show that EMVLight outperforms benchmark transportation engineering techniques and existing RL-based signal control methods.
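    For readers unfamiliar with the routing primitive named in the EMVLight abstract, the following is a minimal sketch of textbook Dijkstra shortest paths. It is not EMVLight's real-time extension, which the abstract does not detail. The `network` graph, its node names, and its edge weights (standing in for link travel times) are hypothetical.

```python
import heapq


def dijkstra(graph, source):
    """Return the shortest travel time from source to every reachable node.

    graph maps each node to a dict of {neighbor: non-negative edge weight}.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        # Skip stale heap entries that were superseded by a shorter path.
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Hypothetical four-intersection network; weights are travel times.
network = {"A": {"B": 4.0, "C": 1.0}, "C": {"B": 2.0}, "B": {"D": 5.0}}
print(dijkstra(network, "A"))
```

    In a dynamic setting the edge weights would change as signals are pre-empted, so the distances would need to be recomputed or updated incrementally; how EMVLight does this is beyond what the abstract states.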
    Activation Functions in Deep Learning: A Comprehensive Survey and Benchmark. (arXiv:2109.14545v3 [cs.LG] UPDATED)
    Neural networks have shown tremendous growth in recent years to solve numerous problems. Various types of neural networks have been introduced to deal with different types of problems. However, the main goal of any neural network is to transform the non-linearly separable input data into more linearly separable abstract features using a hierarchy of layers. These layers are combinations of linear and nonlinear functions. The most popular and common non-linearity layers are activation functions (AFs), such as Logistic Sigmoid, Tanh, ReLU, ELU, Swish and Mish. In this paper, a comprehensive overview and survey is presented for AFs in neural networks for deep learning. Different classes of AFs such as Logistic Sigmoid and Tanh based, ReLU based, ELU based, and Learning based are covered. Several characteristics of AFs such as output range, monotonicity, and smoothness are also pointed out. A performance comparison is also performed among 18 state-of-the-art AFs with different networks on different types of data. Insights into AFs are presented to help researchers pursue further work and practitioners choose among the available options. The code used for experimental comparison is released at: https://github.com/shivram1987/ActivationFunctions.
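    The activation functions named in the abstract above have short closed forms. The sketch below writes them as scalar Python functions using only the standard library. The default values `alpha=1.0` and `beta=1.0` are common conventions assumed here, not settings taken from the survey, and the functions are not hardened against overflow for very large inputs.

```python
import math


def sigmoid(x):
    # Logistic sigmoid, output range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    # Hyperbolic tangent, output range (-1, 1).
    return math.tanh(x)


def relu(x):
    # Rectified linear unit: zero for negative inputs, identity otherwise.
    return max(0.0, x)


def elu(x, alpha=1.0):
    # Exponential linear unit: smooth saturation towards -alpha for x < 0.
    return x if x > 0 else alpha * math.expm1(x)


def softplus(x):
    # Smooth approximation of ReLU, used inside Mish.
    return math.log1p(math.exp(x))


def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x).
    return x * sigmoid(beta * x)


def mish(x):
    # Mish: x * tanh(softplus(x)).
    return x * math.tanh(softplus(x))


for name, fn in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu),
                 ("elu", elu), ("swish", swish), ("mish", mish)]:
    print(name, [round(fn(x), 4) for x in (-2.0, 0.0, 2.0)])
```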
    Learning from human perception to improve automatic speaker verification in style-mismatched conditions. (arXiv:2206.13684v1 [eess.AS])
    Our prior experiments show that humans and machines seem to employ different approaches to speaker discrimination, especially in the presence of speaking style variability. The experiments examined read versus conversational speech. Listeners focused on speaker-specific idiosyncrasies while "telling speakers together", and on relative distances in a shared acoustic space when "telling speakers apart". However, automatic speaker verification (ASV) systems use the same loss function irrespective of target or non-target trials. To improve ASV performance in the presence of style variability, insights learnt from human perception are used to design a new training loss function that we refer to as "CllrCE loss". CllrCE loss uses both speaker-specific idiosyncrasies and relative acoustic distances between speakers to train the ASV system. When using the UCLA speaker variability database, in the x-vector and conditioning setups, CllrCE loss results in significant relative improvements in EER by 1-66%, and minDCF by 1-31% and 1-56%, respectively, when compared to the x-vector baseline. Using the SITW evaluation tasks, which involve different conversational speech tasks, the proposed loss combined with self-attention conditioning results in significant relative improvements in EER by 2-5% and minDCF by 6-12% over baseline. In the SITW case, performance improvements were consistent only with conditioning.
    Persistent homology-based descriptor for machine-learning potential. (arXiv:2206.13727v1 [cs.LG])
    Constructing efficient descriptors that represent atomic configurations is crucial for developing a superior machine-learning potential. Widely used conventional descriptors are based on two- or three-body correlations of atomic distribution. Recently, several limitations of these many-body descriptors in classifying different configurations were revealed, which have detrimental effects on the prediction of physical properties. We proposed a new class of descriptors based on persistent homology. We focused on the two-dimensional visualization of persistent homology, that is, a persistence diagram, as a descriptor of atomic configurations in the form of an image. We demonstrated that convolutional neural network models based on this descriptor provide sufficient accuracy in predicting the mean energies per atom of amorphous graphene and amorphous carbon. Our results provide an avenue for improving machine-learning potential using descriptors that depict both topological and geometric information.
    Improved Certified Defenses against Data Poisoning with (Deterministic) Finite Aggregation. (arXiv:2202.02628v2 [cs.LG] UPDATED)
    Data poisoning attacks aim at manipulating model behaviors through distorting training data. Previously, an aggregation-based certified defense, Deep Partition Aggregation (DPA), was proposed to mitigate this threat. DPA predicts through an aggregation of base classifiers trained on disjoint subsets of data, thus restricting its sensitivity to dataset distortions. In this work, we propose an improved certified defense against general poisoning attacks, namely Finite Aggregation. In contrast to DPA, which directly splits the training set into disjoint subsets, our method first splits the training set into smaller disjoint subsets and then combines duplicates of them to build larger (but not disjoint) subsets for training base classifiers. This reduces the worst-case impacts of poison samples and thus improves certified robustness bounds. In addition, we offer an alternative view of our method, bridging the designs of deterministic and stochastic aggregation-based certified defenses. Empirically, our proposed Finite Aggregation consistently improves certificates on MNIST, CIFAR-10, and GTSRB, boosting certified fractions by up to 3.05%, 3.87% and 4.77%, respectively, while keeping the same clean accuracies as DPA's, effectively establishing a new state of the art in (pointwise) certified robustness against data poisoning.
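    The subset construction described in the Finite Aggregation abstract can be illustrated with a small sketch: split the data into small disjoint parts, then build larger overlapping training subsets from several parts each. The round-robin split and the cyclic assignment of parts to subsets below are illustrative assumptions, not the paper's actual (hash-based) scheme, and `majority_vote` with its tie-breaking rule is likewise a hypothetical stand-in for the aggregation step.

```python
from collections import Counter


def split_disjoint(samples, n_parts):
    # Deterministic round-robin split; a stand-in for a hash-based split.
    parts = [[] for _ in range(n_parts)]
    for idx, sample in enumerate(samples):
        parts[idx % n_parts].append(sample)
    return parts


def overlapping_subsets(parts, d):
    # Subset i is the union of d consecutive small parts (cyclically),
    # so every small part is reused by exactly d subsets.
    n = len(parts)
    return [[s for j in range(d) for s in parts[(i + j) % n]]
            for i in range(n)]


def majority_vote(predictions):
    # Aggregate base-classifier predictions; ties go to the smallest label.
    counts = Counter(predictions)
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)


samples = list(range(12))
parts = split_disjoint(samples, 6)
subsets = overlapping_subsets(parts, 2)
print(subsets)
print(majority_vote([1, 0, 1]))
```

    Under this construction a single poisoned sample lands in exactly one small part and can therefore influence at most `d` of the base classifiers, which is the kind of bounded sensitivity that aggregation-based certificates rely on.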
    Efficient Deep Learning Using Non-Volatile Memory Technology. (arXiv:2206.13601v1 [cs.AR])
    Embedded machine learning (ML) systems have now become the dominant platform for deploying ML serving tasks and are projected to become of equal importance for training ML models. With this comes the challenge of overall efficient deployment, in particular low power and high throughput implementations, under stringent memory constraints. In this context, non-volatile memory (NVM) technologies such as STT-MRAM and SOT-MRAM have significant advantages compared to conventional SRAM due to their non-volatility, higher cell density, and scalability features. While prior work has investigated several architectural implications of NVM for generic applications, in this work we present DeepNVM++, a comprehensive framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL) applications by combining technology-specific circuit-level models and the actual memory behavior of various DL workloads. DeepNVM++ relies on iso-capacity and iso-area performance and energy models for last-level caches implemented using conventional SRAM and emerging STT-MRAM and SOT-MRAM technologies. In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to 3.8x and 4.7x energy-delay product (EDP) reduction and 2.4x and 2.8x area reduction compared to conventional SRAM, respectively. Under iso-area assumptions, STT-MRAM and SOT-MRAM provide up to 2.2x and 2.4x EDP reduction and accommodate 2.3x and 3.3x cache capacity when compared to SRAM, respectively. We also perform a scalability analysis and show that STT-MRAM and SOT-MRAM achieve orders of magnitude EDP reduction when compared to SRAM for large cache capacities. DeepNVM++ is demonstrated on STT-/SOT-MRAM technologies and can be used for the characterization, modeling, and analysis of any NVM technology for last-level caches in GPUs for DL applications.
    Harnessing the Power of Ego Network Layers for Link Prediction in Online Social Networks. (arXiv:2109.09190v2 [cs.SI] UPDATED)
    Being able to recommend links between users in online social networks is important for users to connect with like-minded individuals as well as for the platforms themselves and third parties leveraging social media information to grow their business. Predictions are typically based on unsupervised or supervised learning, often leveraging simple yet effective graph topological information, such as the number of common neighbors. However, we argue that richer information about the personal social structure of individuals might lead to better predictions. In this paper, we propose to leverage well-established social cognitive theories to improve link prediction performance. According to these theories, individuals arrange their social relationships along, on average, five concentric circles of decreasing intimacy. We postulate that relationships in different circles have different importance in predicting new links. In order to validate this claim, we focus on popular feature-extraction prediction algorithms (both unsupervised and supervised) and we extend them to include social-circles awareness. We validate the prediction performance of these circle-aware algorithms against several benchmarks (including their baseline versions as well as node-embedding- and GNN-based link prediction), leveraging two Twitter datasets comprising a community of video gamers and generic users. We show that social-awareness generally provides significant improvements in the prediction performance, also beating state-of-the-art solutions like node2vec and SEAL, and without increasing the computational complexity. Finally, we show that social-awareness can be used in place of a classifier (which may be costly or impractical) for targeting a specific category of users.
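    The abstract above mentions the number of common neighbors as a simple topological feature for link prediction. The sketch below shows that baseline score only; it does not include the circle-aware extension the paper proposes. The `follows` graph and its user names are hypothetical.

```python
def common_neighbors(adj, u, v):
    # Number of nodes adjacent to both u and v.
    return len(adj.get(u, set()) & adj.get(v, set()))


def rank_candidates(adj, u):
    # Score every node that is not u and not already linked to u,
    # keeping only candidates with at least one common neighbor.
    scores = {}
    for v in adj:
        if v == u or v in adj.get(u, set()):
            continue
        score = common_neighbors(adj, u, v)
        if score > 0:
            scores[v] = score
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical undirected social graph as adjacency sets.
follows = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat", "eve"},
    "eve": {"dan"},
}
print(rank_candidates(follows, "ann"))
```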
    Survey on the Convergence of Machine Learning and Blockchain. (arXiv:2201.00976v2 [cs.LG] UPDATED)
    Machine learning (ML) has been pervasively researched nowadays and it has been applied in many aspects of real life. Nevertheless, issues of model and data still accompany the development of ML. For instance, training of traditional ML models is limited by access to data sets, which are generally proprietary; published ML models may soon be out of date without an update of new data and continuous training; malicious data contributors may upload wrongly labeled data that leads to undesirable training results; and the abuse of private data and data leakage also exist. With the utilization of blockchain, an emerging and swiftly developing technology, these problems can be efficiently solved. In this paper, we survey the convergence of collaborative ML and blockchain. Different ways of combining these two technologies are investigated and their fields of application are examined. A discussion of the limitations of current research and its future directions is also included.
    Topology-aware Generalization of Decentralized SGD. (arXiv:2206.12680v2 [cs.LG] UPDATED)
    This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is $\mathcal{O}{(m/N+1/m+\lambda^2)}$-stable in expectation in the non-convex non-smooth setting, where $N$ is the total sample size of the whole system, $m$ is the worker number, and $1-\lambda$ is the spectral gap that measures the connectivity of the communication topology. These results then deliver an $\mathcal{O}{(1/N+{({(m^{-1}\lambda^2)}^{\frac{\alpha}{2}}+ m^{-\alpha})}/{N^{1-\frac{\alpha}{2}}})}$ in-average generalization bound, which is non-vacuous even when $\lambda$ is close to $1$, in contrast to the vacuous bounds suggested by existing literature on the projected version of D-SGD. Our theory indicates that the generalizability of D-SGD has a positive correlation with the spectral gap, and can explain why consensus control in the initial training phase can ensure better generalization. Experiments with VGG-11 and ResNet-18 on CIFAR-10, CIFAR-100 and Tiny-ImageNet justify our theory. To the best of our knowledge, this is the first work on the topology-aware generalization of vanilla D-SGD. Code is available at https://github.com/Raiden-Zhu/Generalization-of-DSGD.
    Graph-Based Machine Learning Improves Just-in-Time Defect Prediction. (arXiv:2110.05371v2 [cs.SE] UPDATED)
    The increasing complexity of today's software requires the contribution of thousands of developers. This complex collaboration structure makes developers more likely to introduce defect-prone changes that lead to software faults. Determining when these defect-prone changes are introduced has proven challenging, and using traditional machine learning (ML) methods to make these determinations seems to have reached a plateau. In this work, we build contribution graphs consisting of developers and source files to capture the nuanced complexity of changes required to build software. By leveraging these contribution graphs, our research shows the potential of using graph-based ML to improve Just-In-Time (JIT) defect prediction. We hypothesize that features extracted from the contribution graphs may be better predictors of defect-prone changes than intrinsic features derived from software characteristics. We corroborate our hypothesis using graph-based ML for classifying edges that represent defect-prone changes. This new framing of the JIT defect prediction problem leads to remarkably better results. We test our approach on 14 open-source projects and show that our best model can predict whether or not a code change will lead to a defect with an F1 score as high as 77.55%. This represents an increase of as much as 46.72% over the state-of-the-art in JIT defect prediction. We describe limitations, open challenges, and how this method can be used for operational JIT defect prediction.
    Offline Reinforcement Learning with Realizability and Single-policy Concentrability. (arXiv:2202.04634v3 [cs.LG] UPDATED)
    Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability). Despite the recent efforts on relaxing these assumptions, existing works are only able to relax one of the two factors, leaving the strong assumption on the other factor intact. As an important open problem, can we achieve sample-efficient offline RL with weak assumptions on both factors? In this paper we answer the question in the positive. We analyze a simple algorithm based on the primal-dual formulation of MDPs, where the dual variables (discounted occupancy) are modeled using a density-ratio function against offline data. With proper regularization, we show that the algorithm enjoys polynomial sample complexity, under only realizability and single-policy concentrability. We also provide alternative analyses based on different assumptions to shed light on the nature of primal-dual algorithms for offline RL.
    Zero-Shot Building Control. (arXiv:2206.14191v1 [eess.SY])
    Heating and cooling systems in buildings account for 31% of global energy use, much of which is regulated by Rule Based Controllers (RBCs) that neither maximise energy efficiency nor minimise emissions by interacting optimally with the grid. Control via Reinforcement Learning (RL) has been shown to significantly improve building energy efficiency, but existing solutions require pre-training in simulators that are prohibitively expensive to obtain for every building in the world. In response, we show it is possible to perform safe, zero-shot control of buildings by combining ideas from system identification and model-based RL. We call this combination PEARL (Probabilistic Emission-Abating Reinforcement Learning) and show it reduces emissions without pre-training, needing only a three-hour commissioning period. In experiments across three varied building energy simulations, we show PEARL outperforms an existing RBC once, and popular RL baselines in all cases, reducing building emissions by as much as 31% whilst maintaining thermal comfort.
    Differentially Private Algorithms for Statistical Verification of Cyber-Physical Systems. (arXiv:2004.00275v2 [cs.LG] UPDATED)
    Statistical model checking is a class of sequential algorithms that can verify specifications of interest on an ensemble of cyber-physical systems (e.g., whether 99% of cars from a batch meet a requirement on their energy efficiency). These algorithms infer the probability that given specifications are satisfied by the systems with provable statistical guarantees by drawing sufficient numbers of independent and identically distributed samples. During the process of statistical model checking, the values of the samples (e.g., a user's car energy efficiency) may be inferred by intruders, causing privacy concerns in consumer-level applications (e.g., automobiles and medical devices). This paper addresses the privacy of statistical model checking algorithms from the point of view of differential privacy. These algorithms are sequential, drawing samples until a condition on their values is met. We show that revealing the number of the samples drawn can violate privacy. We also show that the standard exponential mechanism that randomizes the output of an algorithm to achieve differential privacy fails to do so in the context of sequential algorithms. Instead, we relax the conservative requirement in differential privacy that the sensitivity of the output of the algorithm should be bounded to any perturbation for any data set. We propose a new notion of differential privacy which we call expected differential privacy. Then, we propose a novel expected sensitivity analysis for the sequential algorithm and propose a corresponding exponential mechanism that randomizes the termination time to achieve the expected differential privacy. We apply the proposed mechanism to statistical model checking algorithms to preserve the privacy of the samples they draw. The utility of the proposed algorithm is demonstrated in a case study.
    Visual Adversarial Imitation Learning using Variational Models. (arXiv:2107.08829v2 [cs.LG] UPDATED)
    Reward function specification, which requires considerable human effort and iteration, remains a major impediment for learning behaviors through deep reinforcement learning. In contrast, providing visual demonstrations of desired behaviors often presents an easier and more natural way to teach agents. We consider a setting where an agent is provided a fixed dataset of visual demonstrations illustrating how to perform a task, and must learn to solve the task using the provided demonstrations and unsupervised environment interactions. This setting presents a number of challenges including representation learning for visual observations, sample complexity due to high dimensional spaces, and learning instability due to the lack of a fixed reward or learning signal. Towards addressing these challenges, we develop a variational model-based adversarial imitation learning (V-MAIL) algorithm. The model-based approach provides a strong signal for representation learning, enables sample efficiency, and improves the stability of adversarial training by enabling on-policy learning. Through experiments involving several vision-based locomotion and manipulation tasks, we find that V-MAIL learns successful visuomotor policies in a sample-efficient manner, has better stability compared to prior work, and also achieves higher asymptotic performance. We further find that by transferring the learned models, V-MAIL can learn new tasks from visual demonstrations without any additional environment interactions. All results including videos can be found online at https://sites.google.com/view/variational-mail.
    Hybrid Ensemble for Fake News Detection: An attempt. (arXiv:2206.13981v1 [cs.CL])
    Fake News Detection has been a challenging problem in the field of Machine Learning. Researchers have approached it via several techniques using old Statistical Classification models and modern Deep Learning. Today, with the growing amount of data, developments in the field of NLP and ML, and an increase in the computation power at our disposal, there are countless permutations and combinations with which to approach this problem from a different perspective. In this paper, we try different methods to tackle Fake News, and we try to build, and propose the possibilities of, a Hybrid Ensemble combining classical Machine Learning techniques with modern Deep Learning approaches.
    An Artificial Neural Network-Based Model Predictive Control for Three-phase Flying Capacitor Multi-Level Inverter. (arXiv:2110.08101v3 [eess.SY] UPDATED)
    Model predictive control (MPC) has been used widely in power electronics due to its simple concept, fast dynamic response, and good reference tracking. However, it suffers from parametric uncertainties, since it directly relies on the mathematical model of the system to predict the optimal switching states to be used at the next sampling time. As a result, uncertain parameters lead to an ill-designed MPC. Thus, this paper offers a model-free control strategy on the basis of artificial neural networks (ANNs), for mitigating the effects of parameter mismatch while having little negative impact on the inverter's performance. This method includes two related stages. First, MPC is used as an expert to control the studied converter in order to provide a dataset, while, in the second stage, the obtained dataset is utilized to train the proposed ANN. The case study herein is based on a four-level three-cell flying capacitor inverter. In this study, MATLAB/Simulink is used to simulate the performance of the proposed method, taking into account various operating conditions. Afterward, the simulation results are reported in comparison with the conventional MPC scheme, demonstrating the superior performance of the proposed control strategy in terms of robustness against parameter mismatch and low total harmonic distortion (THD), especially when changes occur in the system parameters. Furthermore, the experimental validation of the proposed method is provided based on the Hardware-in-the-Loop (HIL) simulation using the C2000TM-microcontroller-LaunchPadXL TMS320F28379D kit, demonstrating the applicability of the ANN-based control strategy to be implemented on a DSP controller.
    QTI Submission to DCASE 2021: residual normalization for device-imbalanced acoustic scene classification with efficient design. (arXiv:2206.13909v1 [cs.SD])
    This technical report describes the details of our TASK1A submission of the DCASE2021 challenge. The goal of the task is to design an audio scene classification system for device-imbalanced datasets under the constraints of model complexity. This report introduces four methods to achieve the goal. First, we propose Residual Normalization, a novel feature normalization method that uses instance normalization with a shortcut path to discard unnecessary device-specific information without losing useful information for classification. Second, we design an efficient architecture, BC-ResNet-Mod, a modified version of the baseline architecture with a limited receptive field. Third, we exploit spectrogram-to-spectrogram translation from one to multiple devices to augment training data. Finally, we utilize three model compression schemes: pruning, quantization, and knowledge distillation to reduce model complexity. The proposed system achieves an average test accuracy of 76.3% in TAU Urban Acoustic Scenes 2020 Mobile, development dataset with 315k parameters, and average test accuracy of 75.3% after compression to 61.0KB of non-zero parameters.
    BAGEL: A Benchmark for Assessing Graph Neural Network Explanations. (arXiv:2206.13983v1 [cs.LG])
    The problem of interpreting the decisions of machine learning models is well-researched and important. We are interested in a specific type of machine learning model that deals with graph data called graph neural networks. Evaluating interpretability approaches for graph neural networks (GNNs) specifically is known to be challenging due to the lack of a commonly accepted benchmark. Given a GNN model, several interpretability approaches exist to explain GNN models with diverse (sometimes conflicting) evaluation methodologies. In this paper, we propose a benchmark for evaluating the explainability approaches for GNNs called Bagel. In Bagel, we first propose four diverse GNN explanation evaluation regimes -- 1) faithfulness, 2) sparsity, 3) correctness, and 4) plausibility. We reconcile multiple evaluation metrics in the existing literature and cover diverse notions for a holistic evaluation. Our graph datasets range from citation networks and document graphs to graphs from molecules and proteins. We conduct an extensive empirical study on four GNN models and nine post-hoc explanation approaches for node and graph classification tasks. We open both the benchmarks and reference implementations and make them available at https://github.com/Mandeep-Rathee/Bagel-benchmark.
    Modeling Extraneous Activity Delays in Business Process Simulation. (arXiv:2206.14051v1 [cs.SE])
    Business Process Simulation (BPS) is a common approach to estimate the impact of changes to a business process on its performance measures. For example, BPS allows us to estimate what would be the cycle time of a process if we automated one of its activities. The starting point of BPS is a business process model annotated with simulation parameters (a BPS model). Several studies have proposed methods to automatically discover BPS models from event logs via process mining. However, current techniques in this space discover BPS models that only capture waiting times caused by resource contention or resource unavailability. Oftentimes, a considerable portion of the waiting time in a business process is caused by extraneous delays, e.g. a resource waits for the customer to return a phone call. This paper proposes a method that discovers extraneous delays from input data, and injects timer events into a BPS model to capture the discovered delays. An empirical evaluation involving synthetic and real-life logs shows that the approach produces BPS models that better reflect the temporal dynamics of the process, relative to BPS models that do not capture extraneous delays.
    Exact Spectral Norm Regularization for Neural Networks. (arXiv:2206.13581v1 [stat.ML])
    We pursue a line of research that seeks to regularize the spectral norm of the Jacobian of the input-output mapping for deep neural networks. While previous work relies on upper bounding techniques, we provide a scheme that targets the exact spectral norm. We showcase that our algorithm achieves an improved generalization performance compared to previous spectral regularization techniques while simultaneously maintaining a strong safeguard against natural and adversarial noise. Moreover, we further explore some previous reasoning concerning the strong adversarial protection that Jacobian regularization provides and show that it can be misleading.
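    As background for the quantity named in the abstract above: the spectral norm of a matrix is its largest singular value, and a standard way to estimate it is power iteration on $A^\top A$. The sketch below shows that generic technique on an explicit small matrix. It is not the paper's regularization scheme, which the abstract does not describe, and the helper names and the fixed iteration count are assumptions made for illustration.

```python
import math


def matvec(a, x):
    # Multiply matrix a (list of rows) by vector x.
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def spectral_norm(a, iters=200):
    """Estimate the largest singular value of a by power iteration on A^T A.

    Starts from the all-ones vector, which must not be orthogonal to the
    top right singular vector for the iteration to converge to it.
    """
    at = transpose(a)
    v = [1.0] * len(a[0])
    for _ in range(iters):
        w = matvec(at, matvec(a, v))
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm == 0.0:
            return 0.0
        v = [wi / norm for wi in w]
    # With v a unit vector near the top singular direction, ||A v|| is the norm.
    av = matvec(a, v)
    return math.sqrt(sum(x * x for x in av))


jacobian = [[3.0, 0.0], [0.0, 4.0]]  # hypothetical 2x2 Jacobian
print(spectral_norm(jacobian))
```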
    Towards a Grounded Theory of Causation for Embodied AI. (arXiv:2206.13973v1 [cs.AI])
    There exist well-developed frameworks for causal modelling, but these require rather a lot of human domain expertise to define causal variables and perform interventions. In order to enable autonomous agents to learn abstract causal models through interactive experience, the existing theoretical foundations need to be extended and clarified. Existing frameworks give no guidance regarding variable choice / representation, and more importantly, give no indication as to which behaviour policies or physical transformations of state space shall count as interventions. The framework sketched in this paper describes actions as transformations of state space, for instance induced by an agent running a policy. This makes it possible to describe in a uniform way both transformations of the micro-state space and abstract models thereof, and say when the latter is veridical / grounded / natural. We then introduce (causal) variables, define a mechanism as an invariant predictor, and say when an action can be viewed as a "surgical intervention", thus bringing the objective of causal representation & intervention skill learning into clearer focus.
    Deep Structured Prediction for Facial Landmark Detection. (arXiv:2010.09035v1 [cs.CV] CROSS LISTED)
    Existing deep learning based facial landmark detection methods have achieved excellent performance. These methods, however, do not explicitly embed the structural dependencies among landmark points. They hence cannot preserve the geometric relationships between landmark points or generalize well to challenging conditions or unseen data. This paper proposes a method for deep structured facial landmark detection based on combining a deep Convolutional Network with a Conditional Random Field. We demonstrate its superior performance to existing state-of-the-art techniques in facial landmark detection, especially a better generalization ability on challenging datasets that include large pose and occlusion.
    Utility Theory for Sequential Decision Making. (arXiv:2206.13637v1 [cs.AI])
    The von Neumann-Morgenstern (VNM) utility theorem shows that under certain axioms of rationality, decision-making is reduced to maximizing the expectation of some utility function. We extend these axioms to increasingly structured sequential decision making settings and identify the structure of the corresponding utility functions. In particular, we show that memoryless preferences lead to a utility in the form of a per-transition reward and a multiplicative factor on the future return. This result motivates a generalization of Markov Decision Processes (MDPs) with this structure on the agent's returns, which we call Affine-Reward MDPs. A stronger constraint on preferences is needed to recover the commonly used cumulative sum of scalar rewards in MDPs. A yet stronger constraint simplifies the utility function for goal-seeking agents in the form of a difference in some function of states that we call potential functions. Our necessary and sufficient conditions demystify the reward hypothesis that underlies the design of rational agents in reinforcement learning by adding an axiom to the VNM rationality axioms and motivate new directions for AI research involving sequential decision making.
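    The return structure described in the abstract above (a per-transition reward plus a multiplicative factor on the future return) can be written as a backward recursion. The sketch below is one reading of that sentence, not the paper's formal definition of Affine-Reward MDPs; the function name and the example trajectory are hypothetical.

```python
def affine_return(transitions):
    """Compute the return of a finite trajectory.

    transitions is a list of (reward, factor) pairs, one per transition.
    The return satisfies G_t = reward_t + factor_t * G_{t+1}, with the
    return after the final transition taken to be zero.
    """
    g = 0.0
    for reward, factor in reversed(transitions):
        g = reward + factor * g
    return g


# Hypothetical three-step trajectory with transition-dependent factors.
trajectory = [(1.0, 0.5), (2.0, 0.5), (4.0, 0.0)]
print(affine_return(trajectory))

# With every factor equal to 1 the recursion reduces to a plain sum.
print(affine_return([(1.0, 1.0), (2.0, 1.0), (4.0, 1.0)]))
```

    Fixing every factor to the same constant gives the familiar discounted sum, which makes it easier to see why an additional constraint on preferences is needed to recover the usual cumulative reward.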
    Value Function Approximations via Kernel Embeddings for No-Regret Reinforcement Learning. (arXiv:2011.07881v3 [cs.LG] UPDATED)
    We consider the regret minimization problem in reinforcement learning (RL) in the episodic setting. In many real-world RL environments, the state and action spaces are continuous or very large. Existing approaches establish regret guarantees by either a low-dimensional representation of the stochastic transition model or an approximation of the $Q$-functions. However, the understanding of function approximation schemes for state-value functions largely remains missing. In this paper, we propose an online model-based RL algorithm, namely CME-RL, that learns representations of transition distributions as embeddings in a reproducing kernel Hilbert space while carefully balancing the exploitation-exploration tradeoff. We demonstrate the efficiency of our algorithm by proving a frequentist (worst-case) regret bound that is of order $\tilde{O}\big(H\gamma_N\sqrt{N}\big)$, where $\tilde{O}(\cdot)$ hides only absolute constants and poly-logarithmic factors, $H$ is the episode length, $N$ is the total number of time steps and $\gamma_N$ is an information theoretic quantity related to the effective dimension of the state-action feature space. Our method bypasses the need for estimating transition probabilities and applies to any domain on which kernels can be defined. It also brings new insights into the general theory of kernel methods for approximate inference and RL regret minimization.
    Equivariant Priors for Compressed Sensing with Unknown Orientation. (arXiv:2206.14069v1 [cs.LG])
    In compressed sensing, the goal is to reconstruct the signal from an underdetermined system of linear measurements. Thus, prior knowledge about the signal of interest and its structure is required. Additionally, in many scenarios, the signal has an unknown orientation prior to measurements. To address such recovery problems, we propose using equivariant generative models as a prior, which encapsulate orientation information in their latent space. Thereby, we show that signals with unknown orientations can be recovered with iterative gradient descent on the latent space of these models and provide additional theoretical recovery guarantees. We construct an equivariant variational autoencoder and use the decoder as generative prior for compressed sensing. We discuss additional potential gains of the proposed approach in terms of convergence and latency.
    Neural Tangent Kernel Analysis of Deep Narrow Neural Networks. (arXiv:2202.02981v2 [cs.LG] UPDATED)
    The tremendous recent progress in analyzing the training dynamics of overparameterized neural networks has primarily focused on wide networks and therefore does not sufficiently address the role of depth in deep learning. In this work, we present the first trainability guarantee of infinitely deep but narrow neural networks. We study the infinite-depth limit of a multilayer perceptron (MLP) with a specific initialization and establish a trainability guarantee using the NTK theory. We then extend the analysis to an infinitely deep convolutional neural network (CNN) and perform brief experiments.
    Label-enhanced Prototypical Network with Contrastive Learning for Multi-label Few-shot Aspect Category Detection. (arXiv:2206.13980v1 [cs.CL])
    Multi-label aspect category detection allows a given review sentence to contain multiple aspect categories, which has been shown to be more practical in sentiment analysis and is attracting increasing attention. As annotating large amounts of data is time-consuming and labor-intensive, data scarcity occurs frequently in real-world scenarios, which motivates multi-label few-shot aspect category detection. However, research on this problem is still in its infancy and few methods are available. In this paper, we propose a novel label-enhanced prototypical network (LPN) for multi-label few-shot aspect category detection. The highlights of LPN can be summarized as follows. First, it leverages label descriptions as auxiliary knowledge to learn more discriminative prototypes, which can retain aspect-relevant information while eliminating the harmful effect caused by irrelevant aspects. Second, it integrates contrastive learning, which pulls sentences with the same aspect label together in the embedding space while pushing apart sentences with different aspect labels. In addition, it introduces an adaptive multi-label inference module to predict the aspect count in the sentence, which is simple yet effective. Extensive experimental results on three datasets demonstrate that our proposed model LPN can consistently achieve state-of-the-art performance.
    Detecting potentially harmful and protective suicide-related content on twitter: A machine learning approach. (arXiv:2112.04796v3 [cs.CL] UPDATED)
    Research shows that exposure to suicide-related news media content is associated with suicide rates, with some content characteristics likely having harmful and others potentially protective effects. Although good evidence exists for a few selected characteristics, systematic large-scale investigations are missing in general, and in particular for social media data. We apply machine learning methods to classify large quantities of Twitter data according to a novel annotation scheme that distinguishes 12 categories of suicide-related tweets. We then trained a benchmark of machine learning models including a majority classifier, an approach based on word frequency (TF-IDF with a linear SVM) and two state-of-the-art deep learning models (BERT, XLNet). The two deep learning models achieved the best performance in two classification tasks. In the first task, we classified six main content categories, including personal stories about either suicidal ideation and attempts or coping, calls for action intending to spread either problem awareness or prevention-related information, reporting of suicide cases, and other tweets irrelevant to these categories. The deep learning models reached accuracy scores above 73% on average across the six categories, and F1-scores between 0.70 and 0.85 for all but the suicidal ideation and attempts category (0.51-0.55). In the second task, separating tweets referring to actual suicide from off-topic tweets, they correctly labeled around 88% of tweets, with BERT achieving F1-scores of 0.93 and 0.74 for the two categories, respectively. These classification performances are comparable to the state-of-the-art on similar tasks. By making data labeling more efficient, this work has enabled large-scale investigations on harmful and protective associations of social media content with suicide rates and help-seeking behavior.
    Continuous Treatment Recommendation with Deep Survival Dose Response Function. (arXiv:2108.10453v4 [stat.ML] UPDATED)
    We propose a general formulation for continuous treatment recommendation problems in settings with clinical survival data, which we call the Deep Survival Dose Response Function (DeepSDRF). That is, we consider the problem of learning the conditional average dose response (CADR) function solely from historical data in which observed factors (confounders) affect both observed treatment and time-to-event outcomes. The estimated treatment effect from DeepSDRF enables us to develop recommender algorithms with the correction for selection bias. We compared two recommender approaches based on random search and reinforcement learning and found similar performance in terms of patient outcome. We tested the DeepSDRF and the corresponding recommender on extensive simulation studies and the eICU Research Institute (eRI) database. To the best of our knowledge, this is the first time that causal models are used to address the continuous treatment effect with observational data in a medical context.
    Learning by Transference: Training Graph Neural Networks on Growing Graphs. (arXiv:2106.03693v3 [cs.LG] UPDATED)
    Graph neural networks (GNNs) use graph convolutions to exploit network invariances and learn meaningful feature representations from network data. However, on large-scale graphs convolutions incur high computational cost, leading to scalability limitations. Leveraging the graphon -- the limit object of a graph -- in this paper we consider the problem of learning a graphon neural network (WNN) -- the limit object of a GNN -- by training GNNs on graphs sampled from the graphon. Under smoothness conditions, we show that: (i) the expected distance between the learning steps on the GNN and on the WNN decreases asymptotically with the size of the graph, and (ii) when training on a sequence of growing graphs, gradient descent follows the learning direction of the WNN. Inspired by these results, we propose a novel algorithm to learn GNNs on large-scale graphs that, starting from a moderate number of nodes, successively increases the size of the graph during training. This algorithm is further benchmarked on a decentralized control problem, where it retains comparable performance to its large-scale counterpart at a reduced computational cost.
    Compressive Clustering with an Optical Processing Unit. (arXiv:2206.05928v2 [cs.LG] UPDATED)
    We explore the use of Optical Processing Units (OPU) to compute random Fourier features for sketching, and adapt the overall compressive clustering pipeline to this setting. We also propose some tools to help tune a critical hyper-parameter of compressive clustering.
    BeamsNet: A data-driven Approach Enhancing Doppler Velocity Log Measurements for Autonomous Underwater Vehicle Navigation. (arXiv:2206.13603v1 [cs.RO])
    Autonomous underwater vehicles (AUV) perform various applications such as seafloor mapping and underwater structure health monitoring. Commonly, an inertial navigation system aided by a Doppler velocity log (DVL) is used to provide the vehicle's navigation solution. In such fusion, the DVL provides the velocity vector of the AUV, which determines the navigation solution's accuracy and helps estimate the navigation states. This paper proposes BeamsNet, an end-to-end deep learning framework to regress the estimated DVL velocity vector that improves the accuracy of the velocity vector estimate, and could replace the model-based approach. Two versions of BeamsNet, differing in their input to the network, are suggested. The first uses the current DVL beam measurements and inertial sensors data, while the other utilizes only DVL data, taking the current and past DVL measurements for the regression process. Both simulation and sea experiments were made to validate the proposed learning approach relative to the model-based approach. Sea experiments were made with the Snapir AUV in the Mediterranean Sea, collecting approximately four hours of DVL and inertial sensor data. Our results show that the proposed approach achieved an improvement of more than 60% in estimating the DVL velocity vector.
    Deep Neural Networks pruning via the Structured Perspective Regularization. (arXiv:2206.14056v1 [cs.LG])
    In Machine Learning, Artificial Neural Networks (ANNs) are a very powerful tool, broadly used in many applications. Often, the selected (deep) architectures include many layers, and therefore a large number of parameters, which makes training, storage and inference expensive. This motivated a stream of research about compressing the original networks into smaller ones without excessively sacrificing performance. Among the many proposed compression approaches, one of the most popular is pruning, whereby entire elements of the ANN (links, nodes, channels, ...) and the corresponding weights are deleted. Since the nature of the problem is inherently combinatorial (which elements to prune and which not), we propose a new pruning method based on Operational Research tools. We start from a natural Mixed-Integer-Programming model for the problem, and we use the Perspective Reformulation technique to strengthen its continuous relaxation. Projecting away the indicator variables from this reformulation yields a new regularization term, which we call the Structured Perspective Regularization, that leads to structured pruning of the initial architecture. We test our method on some ResNet architectures applied to CIFAR-10, CIFAR-100 and ImageNet datasets, obtaining competitive performance w.r.t. the state of the art for structured pruning.
    Learning the Evolutionary and Multi-scale Graph Structure for Multivariate Time Series Forecasting. (arXiv:2206.13816v1 [cs.LG])
    Recent studies have shown great promise in applying graph neural networks for multivariate time series forecasting, where the interactions of time series are described as a graph structure and the variables are represented as the graph nodes. Along this line, existing methods usually assume that the graph structure (or the adjacency matrix), which determines the aggregation manner of the graph neural network, is fixed either by definition or self-learning. However, the interactions of variables can be dynamic and evolutionary in real-world scenarios. Furthermore, the interactions of time series are quite different if they are observed at different time scales. To equip the graph neural network with a flexible and practical graph structure, in this paper, we investigate how to model the evolutionary and multi-scale interactions of time series. In particular, we first provide a hierarchical graph structure combined with dilated convolution to capture the scale-specific correlations among time series. Then, a series of adjacency matrices are constructed in a recurrent manner to represent the evolving correlations at each layer. Moreover, a unified neural network is provided to integrate the components above to get the final prediction. In this way, we can capture the pair-wise correlations and temporal dependency simultaneously. Finally, experiments on both single-step and multi-step forecasting tasks demonstrate the superiority of our method over the state-of-the-art approaches.
    Perceived Overlap: A Prerequisite for VAE Disentanglement. (arXiv:2202.13341v2 [cs.LG] UPDATED)
    Learning disentangled representations with variational autoencoders (VAEs) is often attributed to the regularisation component of the loss. In this work, we highlight the interaction between data and the reconstruction term of the loss as the main contributor to disentanglement in VAEs. We note that standardised benchmark datasets are constructed in ways that are conducive to learning what appear to be disentangled representations. We design an intuitive adversarial dataset that exploits this mechanism to break existing state-of-the-art disentanglement frameworks. Finally, we supply a solution that enables disentanglement by modifying the reconstruction loss, affecting how VAEs perceive distances between data points.
    Structural Entropy Guided Graph Hierarchical Pooling. (arXiv:2206.13510v1 [cs.LG])
    Following the success of convolution on non-Euclidean space, the corresponding pooling approaches have also been validated on various tasks regarding graphs. However, because of the fixed compression quota and stepwise pooling design, these hierarchical pooling methods still suffer from local structure damage and suboptimality. In this work, inspired by structural entropy, we propose a hierarchical pooling approach, SEP, to tackle the two issues. Specifically, without assigning the layer-specific compression quota, a global optimization algorithm is designed to generate the cluster assignment matrices for pooling at once. Then, we present an illustration of the local structure damage from previous methods in the reconstruction of ring and grid synthetic graphs. In addition to SEP, we further design two classification models, SEP-G and SEP-N, for graph classification and node classification, respectively. The results show that SEP outperforms state-of-the-art graph pooling methods on graph classification benchmarks and obtains superior performance on node classification.
    Fast Simulation of Particulate Suspensions Enabled by Graph Neural Network. (arXiv:2206.13905v1 [cs.LG])
    Predicting the dynamic behaviors of particles in suspension subject to hydrodynamic interaction (HI) and external drive can be critical for many applications. By harvesting advanced deep learning techniques, the present work introduces a new framework, hydrodynamic interaction graph neural network (HIGNN), for inferring and predicting the particles' dynamics in Stokes suspensions. It overcomes the limitations of traditional approaches in computational efficiency, accuracy, and/or transferability. In particular, by uniting the data structure represented by a graph and the neural networks with learnable parameters, the HIGNN constructs a surrogate model for the mobility tensor of particles, which is the key to predicting the dynamics of particles subject to HI and external forces. To account for the many-body nature of HI, we generalize the state-of-the-art GNN by introducing higher-order connectivity into the graph and the corresponding convolutional operation. For training the HIGNN, we only need the data for a small number of particles in the domain of interest, and hence the training cost can be kept low. Once constructed, the HIGNN permits fast predictions of the particles' velocities and is transferable to suspensions of different numbers/concentrations of particles in the same domain and to any external forcing. It has the ability to accurately capture both the long-range HI and short-range lubrication effects. We demonstrate the accuracy, efficiency, and transferability of the proposed HIGNN framework in a variety of systems. The computing resource requirements are minimal: most simulations only require a desktop with one GPU; the simulations for a large suspension of 100,000 particles call for up to 6 GPUs.
    Information Entropy Initialized Concrete Autoencoder for Optimal Sensor Placement and Reconstruction of Geophysical Fields. (arXiv:2206.13968v1 [cs.LG])
    We propose a new approach to the optimal placement of sensors for the problem of reconstructing geophysical fields from sparse measurements. Our method consists of two stages. In the first stage, we estimate the variability of the physical field as a function of spatial coordinates by approximating its information entropy through the Conditional PixelCNN network. To calculate the entropy, a new ordering of a two-dimensional data array (spiral ordering) is proposed, which makes it possible to obtain the entropy of a physical field simultaneously for several spatial scales. In the second stage, the entropy of the physical field is used to initialize the distribution of optimal sensor locations. This distribution is further optimized with the Concrete Autoencoder architecture with the straight-through gradient estimator and adversarial loss to simultaneously minimize the number of sensors and maximize reconstruction accuracy. Our method scales linearly with data size, unlike commonly used Principal Component Analysis. We demonstrate our method on two examples: (a) temperature and (b) salinity fields around the Barents Sea and the Svalbard group of islands. For these examples, we compute the reconstruction error of our method and a few baselines. We test our approach against two baselines: (1) PCA with QR factorization and (2) climatology. We find that the obtained optimal sensor locations have a clear physical interpretation and correspond to the boundaries between sea currents.
    Learning to Iteratively Solve Routing Problems with Dual-Aspect Collaborative Transformer. (arXiv:2110.02544v2 [cs.LG] UPDATED)
    Recently, Transformer has become a prevailing deep architecture for solving vehicle routing problems (VRPs). However, it is less effective in learning improvement models for VRP because its positional encoding (PE) method is not suitable in representing VRP solutions. This paper presents a novel Dual-Aspect Collaborative Transformer (DACT) to learn embeddings for the node and positional features separately, instead of fusing them together as done in existing ones, so as to avoid potential noises and incompatible correlations. Moreover, the positional features are embedded through a novel cyclic positional encoding (CPE) method to allow Transformer to effectively capture the circularity and symmetry of VRP solutions (i.e., cyclic sequences). We train DACT using Proximal Policy Optimization and design a curriculum learning strategy for better sample efficiency. We apply DACT to solve the traveling salesman problem (TSP) and capacitated vehicle routing problem (CVRP). Results show that our DACT outperforms existing Transformer based improvement models, and exhibits much better generalization performance across different problem sizes on synthetic and benchmark instances, respectively.
    DayDreamer: World Models for Physical Robot Learning. (arXiv:2206.14176v1 [cs.RO])
    To solve tasks in complex environments, robots need to learn from experience. Deep reinforcement learning is a common approach to robot learning but requires a large amount of trial and error to learn, limiting its deployment in the physical world. As a consequence, many advances in robot learning rely on simulators. On the other hand, learning inside of simulators fails to capture the complexity of the real world, is prone to simulator inaccuracies, and the resulting behaviors do not adapt to changes in the world. The Dreamer algorithm has recently shown great promise for learning from small amounts of interaction by planning within a learned world model, outperforming pure reinforcement learning in video games. Learning a world model to predict the outcomes of potential actions enables planning in imagination, reducing the amount of trial and error needed in the real environment. However, it is unknown whether Dreamer can facilitate faster learning on physical robots. In this paper, we apply Dreamer to 4 robots to learn online and directly in the real world, without simulators. Dreamer trains a quadruped robot to roll off its back, stand up, and walk from scratch and without resets in only 1 hour. We then push the robot and find that Dreamer adapts within 10 minutes to withstand perturbations or quickly roll over and stand back up. On two different robotic arms, Dreamer learns to pick and place multiple objects directly from camera images and sparse rewards, approaching human performance. On a wheeled robot, Dreamer learns to navigate to a goal position purely from camera images, automatically resolving ambiguity about the robot orientation. Using the same hyperparameters across all experiments, we find that Dreamer is capable of online learning in the real world, establishing a strong baseline. We release our infrastructure for future applications of world models to robot learning.
    Deep Symbolic Regression for Recurrent Sequences. (arXiv:2201.04600v2 [cs.LG] UPDATED)
    Symbolic regression, i.e. predicting a function from the observation of its values, is well-known to be a challenging task. In this paper, we train Transformers to infer the function or recurrence relation underlying sequences of integers or floats, a typical task in human IQ tests which has hardly been tackled in the machine learning literature. We evaluate our integer model on a subset of OEIS sequences, and show that it outperforms built-in Mathematica functions for recurrence prediction. We also demonstrate that our float model is able to yield informative approximations of out-of-vocabulary functions and constants, e.g. $\operatorname{bessel0}(x)\approx \frac{\sin(x)+\cos(x)}{\sqrt{\pi x}}$ and $1.644934\approx \pi^2/6$. An interactive demonstration of our models is provided at https://symbolicregression.metademolab.com.
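    The two approximations quoted in this abstract can be checked numerically with the standard library alone. The sketch below is only a sanity check, not part of the paper's method: it assumes that `bessel0` denotes the Bessel function of the first kind $J_0$, evaluates it through its power series, and picks $x = 10$ as an arbitrary test point. The helper names are hypothetical.

```python
import math


def bessel_j0(x: float, terms: int = 60) -> float:
    """J0(x) from its power series: sum_k (-1)^k (x/2)^(2k) / (k!)^2."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        # Ratio between consecutive series terms.
        term *= -((x / 2.0) ** 2) / ((k + 1) ** 2)
    return total


def j0_approx(x: float) -> float:
    """Closed form quoted in the abstract: (sin x + cos x) / sqrt(pi x)."""
    return (math.sin(x) + math.cos(x)) / math.sqrt(math.pi * x)


# Gap between the quoted decimal constant and pi^2 / 6.
zeta2_gap = abs(math.pi ** 2 / 6 - 1.644934)

# Gap between the series value and the closed form at x = 10 (arbitrary choice).
j0_gap = abs(bessel_j0(10.0) - j0_approx(10.0))

print(zeta2_gap, j0_gap)
```

    The closed form is the leading term of the large-argument expansion of $J_0$, so the gap shrinks as $x$ grows and the formula is poor near $x = 0$.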
    Fire Together Wire Together: A Dynamic Pruning Approach with Self-Supervised Mask Prediction. (arXiv:2110.08232v3 [cs.CV] UPDATED)
    Dynamic model pruning is a recent direction that allows for the inference of a different sub-network for each input sample during deployment. However, current dynamic methods rely on learning a continuous channel gating through regularization by inducing sparsity loss. This formulation introduces complexity in balancing different losses (e.g., task loss, regularization loss). In addition, regularization-based methods lack transparent tradeoff hyperparameter selection to realize a computational budget. Our contribution is two-fold: 1) decoupled task and pruning training, and 2) simple hyperparameter selection that enables FLOPs reduction estimation before training. Inspired by the Hebbian theory in Neuroscience: "neurons that fire together wire together", we propose to predict a mask to process k filters in a layer based on the activation of its previous layer. We pose the problem as a self-supervised binary classification problem. Each mask predictor module is trained to predict if the log-likelihood for each filter in the current layer belongs to the top-k activated filters. The value k is dynamically estimated for each input based on a novel criterion using the mass of heatmaps. We show experiments on several neural architectures, such as VGG, ResNet and MobileNet on CIFAR and ImageNet datasets. On CIFAR, we reach similar accuracy to SOTA methods with 15% and 24% higher FLOPs reduction. Similarly in ImageNet, we achieve lower drop in accuracy with up to 13% improvement in FLOPs reduction.
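    The binary targets mentioned in this abstract (whether a filter is among the top-k activated filters of a layer) can be illustrated with a small helper. This is a sketch under stated assumptions: the function name `top_k_mask`, the per-filter scores, and the tie-breaking rule are hypothetical, and the paper's heatmap-mass criterion for choosing k is not reproduced here.

```python
from typing import List


def top_k_mask(activations: List[float], k: int) -> List[int]:
    """Binary mask with 1 for the k largest activations (ties broken by index)."""
    if not 0 <= k <= len(activations):
        raise ValueError("k must lie between 0 and the number of filters")
    # Rank filter indices by decreasing activation, then by index for ties.
    ranked = sorted(range(len(activations)), key=lambda i: (-activations[i], i))
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(activations))]


# Hypothetical per-filter activation scores for one layer and one input.
scores = [0.1, 2.3, 0.7, 1.5]
mask = top_k_mask(scores, k=2)
print(mask)  # [0, 1, 0, 1]
```

    A mask of this kind could serve as the self-supervised label for a predictor that sees only the previous layer's activations. Filters with a 0 entry would be skipped at inference.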
    Solving the Real Robot Challenge using Deep Reinforcement Learning. (arXiv:2109.15233v3 [cs.RO] UPDATED)
    This paper details our winning submission to Phase 1 of the 2021 Real Robot Challenge, a challenge in which a three-fingered robot must carry a cube along specified goal trajectories. To solve Phase 1, we use a pure reinforcement learning approach which requires minimal expert knowledge of the robotic system, or of robotic grasping in general. A sparse, goal-based reward is employed in conjunction with Hindsight Experience Replay to teach the control policy to move the cube to the desired x and y coordinates of the goal. Simultaneously, a dense distance-based reward is employed to teach the policy to lift the cube to the z coordinate (the height component) of the goal. The policy is trained in simulation with domain randomisation before being transferred to the real robot for evaluation. Although performance tends to worsen after this transfer, our best policy can successfully lift the real cube along goal trajectories via an effective pinching grasp. Our approach outperforms all other submissions, including those leveraging more traditional robotic control techniques, and is the first pure learning-based method to solve this challenge.
    On bounds for norms of reparameterized ReLU artificial neural network parameters: sums of fractional powers of the Lipschitz norm control the network parameter vector. (arXiv:2206.13646v1 [cs.LG])
    It is an elementary fact in the scientific literature that the Lipschitz norm of the realization function of a feedforward fully-connected rectified linear unit (ReLU) artificial neural network (ANN) can, up to a multiplicative constant, be bounded from above by sums of powers of the norm of the ANN parameter vector. Roughly speaking, in this work we reveal in the case of shallow ANNs that the converse inequality is also true. More formally, we prove that the norm of the equivalence class of ANN parameter vectors with the same realization function is, up to a multiplicative constant, bounded from above by the sum of powers of the Lipschitz norm of the ANN realization function (with the exponents $ 1/2 $ and $ 1 $). Moreover, we prove that this upper bound only holds when employing the Lipschitz norm but holds neither for Hölder norms nor for Sobolev-Slobodeckij norms. Furthermore, we prove that this upper bound only holds for sums of powers of the Lipschitz norm with the exponents $ 1/2 $ and $ 1 $ but does not hold for the Lipschitz norm alone.
    Discrete Morse Sandwich: Fast Computation of Persistence Diagrams for Scalar Data -- An Algorithm and A Benchmark. (arXiv:2206.13932v1 [cs.LG])
    This paper introduces an efficient algorithm for persistence diagram computation, given an input piecewise linear scalar field f defined on a d-dimensional simplicial complex K, with $d \leq 3$. Our method extends the seminal "PairCells" algorithm by introducing three main accelerations. First, we express this algorithm within the setting of discrete Morse theory, which considerably reduces the number of input simplices to consider. Second, we introduce a stratification approach to the problem, that we call "sandwiching". Specifically, minima-saddle persistence pairs ($D_0(f)$) and saddle-maximum persistence pairs ($D_{d-1}(f)$) are efficiently computed by respectively processing with a Union-Find the unstable sets of 1-saddles and the stable sets of (d-1)-saddles. This fast processing of the dimensions 0 and (d-1) further reduces, and drastically, the number of critical simplices to consider for the computation of $D_1(f)$, the intermediate layer of the sandwich. Third, we document several performance improvements via shared-memory parallelism. We provide an open-source implementation of our algorithm for reproducibility purposes. We also contribute a reproducible benchmark package, which exploits three-dimensional data from a public repository and compares our algorithm to a variety of publicly available implementations. Extensive experiments indicate that our algorithm improves by two orders of magnitude the time performance of the seminal "PairCells" algorithm it extends. Moreover, it also improves memory footprint and time performance over a selection of 14 competing approaches, with a substantial gain over the fastest available approaches, while producing a strictly identical output. We illustrate the utility of our contributions with an application to the fast and robust extraction of persistent 1-dimensional generators on surfaces, volume data and high-dimensional point clouds.
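    The role of Union-Find in computing minimum-saddle pairs can be illustrated on a much simpler input than the one the paper targets. The sketch below is not the Discrete Morse Sandwich algorithm: it assumes a scalar function sampled on a 1D chain of vertices, sweeps the vertices by increasing value, and applies the standard elder rule when two components merge. The function name and the toy data are hypothetical.

```python
from typing import List, Tuple


def persistence_pairs_0d(values: List[float]) -> List[Tuple[float, float]]:
    """Finite (birth, death) pairs of connected components on a 1D chain."""
    n = len(values)
    parent = list(range(n))
    lowest = list(range(n))  # lowest[root] = index of that component's minimum
    added = [False] * n

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    pairs = []
    # Sweep vertices by increasing value (sublevel-set filtration).
    for v in sorted(range(n), key=lambda i: (values[i], i)):
        added[v] = True
        for u in (v - 1, v + 1):
            if 0 <= u < n and added[u]:
                ru, rv = find(u), find(v)
                if ru == rv:
                    continue
                # Elder rule: the component with the higher minimum dies here.
                if values[lowest[ru]] <= values[lowest[rv]]:
                    old, young = ru, rv
                else:
                    old, young = rv, ru
                # Skip zero-persistence pairs created by the new vertex itself.
                if values[lowest[young]] < values[v]:
                    pairs.append((values[lowest[young]], values[v]))
                parent[young] = old
    return sorted(pairs)


pairs = persistence_pairs_0d([1, 3, 0, 4, 2])
print(pairs)  # [(1, 3), (2, 4)]
```

    The component born at the global minimum (value 0 here) never dies, so it does not appear among the finite pairs.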
    Verifiable Goal Recognition for Autonomous Driving with Occlusions. (arXiv:2206.14163v1 [cs.RO])
    When used in autonomous driving, goal recognition allows the future behaviour of other vehicles to be more accurately predicted. A recent goal recognition method for autonomous vehicles, GRIT, has been shown to be fast, accurate, interpretable and verifiable. In autonomous driving, vehicles can encounter novel scenarios that were unseen during training, and the environment is partially observable due to occlusions. However, GRIT can only operate in fixed frame scenarios, with full observability. We present a novel goal recognition method named Goal Recognition with Interpretable Trees under Occlusion (OGRIT), which solves these shortcomings of GRIT. We demonstrate that OGRIT can generalise between different scenarios and handle missing data due to occlusions, while still being fast, accurate, interpretable and verifiable.
    Efficient Algorithms For Fair Clustering with a New Fairness Notion. (arXiv:2109.00708v3 [cs.LG] UPDATED)
    We revisit the problem of fair clustering, first introduced by Chierichetti et al., that requires each protected attribute to have approximately equal representation in every cluster; i.e., a balance property. Existing solutions to fair clustering are either not scalable or do not achieve an optimal trade-off between clustering objective and fairness. In this paper, we propose a new notion of fairness, which we call $\tau$-fair fairness, that strictly generalizes the balance property and enables a fine-grained efficiency vs. fairness trade-off. Furthermore, we show that simple greedy round-robin based algorithms achieve this trade-off efficiently. Under a more general setting of multi-valued protected attributes, we rigorously analyze the theoretical properties of our algorithms. Our experimental results suggest that the proposed solution outperforms all the state-of-the-art algorithms and works exceptionally well even for a large number of clusters.
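    The balance property mentioned in this abstract can be made concrete with a short computation. The sketch below covers only the balance measure, not the paper's new fairness notion or its round-robin algorithms. For two groups it returns the smaller of the two count ratios in the worst cluster. Extending this to more groups via the minimum-to-maximum count ratio, and skipping empty clusters, are assumptions made here, and all names are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def clustering_balance(
    clusters: Sequence[Sequence[Hashable]], groups: List[Hashable]
) -> float:
    """Worst-case ratio of smallest to largest group count over all clusters."""
    worst = 1.0
    for members in clusters:
        counts = Counter(members)
        low = min(counts.get(g, 0) for g in groups)
        high = max(counts.get(g, 0) for g in groups)
        if high == 0:
            continue  # cluster holds no member of any listed group
        worst = min(worst, low / high)
    return worst


# Each cluster is listed by the protected-attribute value of its points.
toy_clusters = [["r", "b", "r", "b"], ["r", "r", "r", "b"]]
balance = clustering_balance(toy_clusters, groups=["r", "b"])
print(balance)  # 1/3, set by the second cluster
```

    A balance of 1 means every cluster has equal counts of each group, while 0 means some cluster is missing a group entirely.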
    Finite-sample analysis of identification of switched linear systems with arbitrary or restricted switching. (arXiv:2203.09862v2 [eess.SY] UPDATED)
    For the identification of switched systems with a measured switching signal, this work aims to analyze the effect of switching strategies on the estimation error. The data for identification is assumed to be collected from globally asymptotically or marginally stable switched systems under switches that are arbitrary or subject to an average dwell time constraint. Then the switched system is estimated by the least-squares (LS) estimator. To capture the effect of the parameters of the switching strategies on the LS estimation error, finite-sample error bounds are developed in this work. The obtained error bounds show that the estimation error is logarithmic in the switching parameters when there are only stable modes; however, when there are unstable modes, the estimation error bound can increase linearly as the switching parameter changes. This suggests that in the presence of unstable modes, the switching strategy should be properly designed to avoid a significant increase in the estimation error.
    Bellman Residual Orthogonalization for Offline Reinforcement Learning. (arXiv:2203.12786v2 [cs.LG] UPDATED)
    We propose and analyze a reinforcement learning principle that approximates the Bellman equations by enforcing their validity only along a user-defined space of test functions. Focusing on applications to model-free offline RL with function approximation, we exploit this principle to derive confidence intervals for off-policy evaluation, as well as to optimize over policies within a prescribed policy class. We prove an oracle inequality on our policy optimization procedure in terms of a trade-off between the value and uncertainty of an arbitrary comparator policy. Different choices of test function spaces allow us to tackle different problems within a common framework. We characterize the loss of efficiency in moving from on-policy to off-policy data using our procedures, and establish connections to concentrability coefficients studied in past work. We examine in depth the implementation of our methods with linear function approximation, and provide theoretical guarantees with polynomial-time implementations even when Bellman closure does not hold.
    Epidemic Control Modeling using Parsimonious Models and Markov Decision Processes. (arXiv:2206.13910v1 [q-bio.PE])
    Many countries have experienced at least two waves of the COVID-19 pandemic. The second wave is far more dangerous as distinct strains appear more harmful to human health, but it stems from the complacency about the first wave. This paper introduces a parsimonious yet representative stochastic epidemic model that simulates the uncertain spread of the disease regardless of the latency and recovery time distributions. We also propose a Markov decision process to seek an optimal trade-off between the usage of the healthcare system and the economic costs of an epidemic. We apply the model to COVID-19 data from New Delhi, India and simulate the epidemic spread with different policy review times. The results show that the optimal policy acts swiftly to curb the epidemic in the first wave, thus avoiding the collapse of the healthcare system and the future costs of posterior outbreaks. An analysis of the recent collapse of the healthcare system of India during the second COVID-19 wave suggests that many lives could have been preserved if swift mitigation was promoted after the first wave.
    HyperNTF: A Hypergraph Regularized Nonnegative Tensor Factorization for Dimensionality Reduction. (arXiv:2101.06827v3 [cs.LG] UPDATED)
    Tensor decomposition is an effective tool for learning multi-way structures and heterogeneous features from high-dimensional data, such as multi-view images and multichannel electroencephalography (EEG) signals, which are often represented by tensors. However, most tensor decomposition methods are linear feature extraction techniques, which are unable to reveal the nonlinear structure within high-dimensional data. To address this problem, many algorithms have been proposed that simultaneously perform linear and non-linear feature extraction. A representative algorithm is the Graph Regularized Non-negative Matrix Factorization (GNMF) for image clustering. However, the normal 2-order graph can only model the pairwise similarity of objects, which cannot sufficiently exploit the complex structures of samples. Thus, we propose a novel method, named Hypergraph Regularized Non-negative Tensor Factorization (HyperNTF), which utilizes a hypergraph to encode the complex connections among samples and employs the factor matrix corresponding to the last mode of the Canonical Polyadic (CP) decomposition as the low-dimensional representation. Extensive experiments on synthetic manifolds, real-world image datasets, and EEG signals demonstrate that HyperNTF outperforms the state-of-the-art methods in terms of dimensionality reduction, clustering, and classification.
    SurvTRACE: Transformers for Survival Analysis with Competing Events. (arXiv:2110.00855v2 [cs.LG] UPDATED)
    In medicine, survival analysis studies the time duration to events of interest such as mortality. One major challenge is how to deal with multiple competing events (e.g., multiple disease diagnoses). In this work, we propose a transformer-based model, namely SurvTRACE, that makes no assumption about the underlying survival distribution and is capable of handling competing events. We account for the implicit confounders in the observational setting in multi-event scenarios, which cause selection bias as the predicted survival probability is influenced by irrelevant factors. To sufficiently utilize the survival data to train transformers from scratch, multiple auxiliary tasks are designed for multi-task learning. The model hence learns a strong shared representation from all these tasks, which in turn supports better survival analysis. We further demonstrate how to inspect the covariate relevance and importance through interpretable attention mechanisms of SurvTRACE, which shows great potential for enhancing clinical trial design and new treatment development. Experiments on METABRIC, SUPPORT, and SEER data with 470k patients validate the all-around superiority of our method.
    Attack Agnostic Dataset: Towards Generalization and Stabilization of Audio DeepFake Detection. (arXiv:2206.13979v1 [cs.SD])
    Audio DeepFakes allow the creation of high-quality, convincing utterances and therefore pose a threat due to their potential applications such as impersonation or fake news. Methods for detecting these manipulations should be characterized by good generalization and stability leading to robustness against attacks conducted with techniques that are not explicitly included in the training. In this work, we introduce Attack Agnostic Dataset - a combination of two audio DeepFake datasets and one anti-spoofing dataset that, thanks to the disjoint use of attacks, can lead to better generalization of detection methods. We present a thorough analysis of current DeepFake detection methods and consider different audio features (front-ends). In addition, we propose a model based on LCNN with LFCC and mel-spectrogram front-end, which not only is characterized by good generalization and stability results but also shows improvement over the LFCC-based model - we decrease standard deviation on all folds and EER in two folds by up to 5%.
    mcBERT: Momentum Contrastive Learning with BERT for Zero-Shot Slot Filling. (arXiv:2203.12940v2 [cs.CL] UPDATED)
    Zero-shot slot filling has received considerable attention to cope with the problem of limited available data for the target domain. One of the important factors in zero-shot learning is to make the model learn generalized and reliable representations. For this purpose, we present mcBERT, which stands for momentum contrastive learning with BERT, to develop a robust zero-shot slot filling model. mcBERT uses BERT to initialize the two encoders, the query encoder and key encoder, and is trained by applying momentum contrastive learning. Our experimental results on the SNIPS benchmark show that mcBERT substantially outperforms the previous models, recording a new state-of-the-art. Besides, we also show that each component composing mcBERT contributes to the performance improvement.
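    The abstract above says only that the query and key encoders are trained with momentum contrastive learning. As a rough illustration, the sketch below shows the momentum (exponential moving average) update commonly used for the key encoder in MoCo-style training. That mcBERT uses exactly this rule, the flat parameter lists, and the momentum value are all assumptions made for the example.

```python
def momentum_update(key_params, query_params, m=0.999):
    """MoCo-style update: key <- m * key + (1 - m) * query.

    key_params and query_params are flat lists of floats standing in for
    encoder weights (a simplification; real encoders hold tensors).
    The default m is an illustrative value, not taken from the paper.
    """
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


# Toy example: the key encoder drifts slowly toward the query encoder.
key = [1.0, 0.0]
query = [0.0, 1.0]
updated_key = momentum_update(key, query, m=0.9)
```

    In this scheme only the query encoder would receive gradients; the key encoder changes through the averaging step alone, which keeps its outputs slowly varying between iterations.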
    Understanding Gradient Descent on Edge of Stability in Deep Learning. (arXiv:2205.09745v2 [cs.LG] UPDATED)
    Deep learning experiments by Cohen et al. [2021] using deterministic Gradient Descent (GD) revealed an Edge of Stability (EoS) phase when learning rate (LR) and sharpness (i.e., the largest eigenvalue of Hessian) no longer behave as in traditional optimization. Sharpness stabilizes around $2/$LR and loss goes up and down across iterations, yet still with an overall downward trend. The current paper mathematically analyzes a new mechanism of implicit regularization in the EoS phase, whereby GD updates due to non-smooth loss landscape turn out to evolve along some deterministic flow on the manifold of minimum loss. This is in contrast to many previous results about implicit bias either relying on infinitesimal updates or noise in gradient. Formally, for any smooth function $L$ with certain regularity condition, this effect is demonstrated for (1) Normalized GD, i.e., GD with a varying LR $\eta_t =\frac{\eta}{|| \nabla L(x(t)) ||}$ and loss $L$; (2) GD with constant LR and loss $\sqrt{L- \min_x L(x)}$. Both provably enter the Edge of Stability, with the associated flow on the manifold minimizing $\lambda_{1}(\nabla^2 L)$. The above theoretical results have been corroborated by an experimental study.
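    To make case (1) above concrete, here is a minimal sketch of a single Normalized GD step, $x \leftarrow x - \eta \, \nabla L(x) / \|\nabla L(x)\|$. The quadratic test loss and the helper names are hypothetical choices for illustration and do not come from the paper.

```python
import math


def normalized_gd_step(x, grad_fn, eta):
    """One Normalized GD step: x <- x - eta * g / ||g||, where g = grad L(x)."""
    g = grad_fn(x)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(x)  # stationary point: the normalized direction is undefined
    return [xi - eta * gi / norm for xi, gi in zip(x, g)]


def grad_quadratic(x):
    """Gradient of the illustrative loss L(x) = 0.5 * (x0^2 + 10 * x1^2)."""
    return [x[0], 10.0 * x[1]]


x0 = [3.0, 4.0]
x1 = normalized_gd_step(x0, grad_quadratic, eta=0.1)
# Every step has Euclidean length eta, whatever the gradient magnitude.
step_length = math.sqrt(sum((a - b) ** 2 for a, b in zip(x0, x1)))
```

    Because the step length never shrinks as the gradient shrinks, the iterates cannot settle at a minimizer with a fixed $\eta$, which is consistent with the oscillating loss described in the abstract.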
    Extracting Targeted Training Data from ASR Models, and How to Mitigate It. (arXiv:2204.08345v2 [cs.SD] UPDATED)
    Recent work has designed methods to demonstrate that model updates in ASR training can leak potentially sensitive attributes of the utterances used in computing the updates. In this work, we design the first method to demonstrate information leakage about training data from trained ASR models. We design Noise Masking, a fill-in-the-blank style method for extracting targeted parts of training data from trained ASR models. We demonstrate the success of Noise Masking by using it in four settings for extracting names from the LibriSpeech dataset used for training a state-of-the-art Conformer model. In particular, we show that we are able to extract the correct names from masked training utterances with 11.8% accuracy, while the model outputs some name from the train set 55.2% of the time. Further, we show that even in a setting that uses synthetic audio and partial transcripts from the test set, our method achieves 2.5% correct name accuracy (47.7% any name success rate). Lastly, we design Word Dropout, a data augmentation method that, when used in training along with Multistyle TRaining (MTR), provides utility comparable to the baseline while significantly mitigating extraction via Noise Masking across the four evaluated settings.
    AutoInit: Automatic Initialization via Jacobian Tuning. (arXiv:2206.13568v1 [stat.ML])
    Good initialization is essential for training Deep Neural Networks (DNNs). Oftentimes such initialization is found through a trial and error approach, which has to be applied anew every time an architecture is substantially modified, or inherited from smaller size networks leading to sub-optimal initialization. In this work we introduce a new and cheap algorithm, that allows one to find a good initialization automatically, for general feed-forward DNNs. The algorithm utilizes the Jacobian between adjacent network blocks to tune the network hyperparameters to criticality. We solve the dynamics of the algorithm for fully connected networks with ReLU and derive conditions for its convergence. We then extend the discussion to more general architectures with BatchNorm and residual connections. Finally, we apply our method to ResMLP and VGG architectures, where the automatic one-shot initialization found by our method shows good performance on vision tasks.
    Patch Selection for Melanoma Classification. (arXiv:2206.13626v1 [cs.CV])
    In medical image processing, the most important information is often located on small parts of the image. Patch-based approaches aim at using only the most relevant parts of the image. Finding ways to automatically select the patches is a challenge. In this paper, we investigate two criteria to choose patches: entropy and a spectral similarity criterion. We perform experiments at different levels of patch size. We train a Convolutional Neural Network on the subsets of patches and analyze the training time. We find that, in addition to requiring less preprocessing time, the classifiers trained on the datasets of patches selected based on entropy converge faster than on those selected based on the spectral similarity criterion and, furthermore, lead to higher accuracy. Moreover, patches of high entropy lead to faster convergence and better accuracy than patches of low entropy.
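    The entropy criterion mentioned above can be sketched as follows, assuming it means the Shannon entropy of each patch's pixel-intensity histogram (the abstract does not give the exact definition). Patches are represented here as flat lists of integer intensities, and the function names are hypothetical.

```python
import math
from collections import Counter


def patch_entropy(pixels):
    """Shannon entropy (in bits) of the intensity histogram of one patch."""
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def select_patches(patches, k):
    """Return the k patches with the highest entropy, highest first."""
    return sorted(patches, key=patch_entropy, reverse=True)[:k]


flat = [0, 0, 0, 0]      # uniform patch: entropy 0 bits
two_tone = [0, 0, 1, 1]  # two equally likely values: entropy 1 bit
varied = [0, 1, 2, 3]    # four equally likely values: entropy 2 bits
top_two = select_patches([flat, two_tone, varied], k=2)
```

    Ranking by entropy discards flat, low-information regions first, which matches the abstract's finding that high-entropy patches give faster convergence and better accuracy than low-entropy ones.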
    Exploring linguistic feature and model combination for speech recognition based automatic AD detection. (arXiv:2206.13758v1 [cs.LG])
    Early diagnosis of Alzheimer's disease (AD) is crucial in facilitating preventive care and delaying progression. Speech based automatic AD screening systems provide a non-intrusive and more scalable alternative to other clinical screening techniques. Scarcity of such specialist data leads to uncertainty in both model selection and feature learning when developing such systems. To this end, this paper investigates the use of feature and model combination approaches to improve the robustness of domain fine-tuning of BERT and Roberta pre-trained text encoders on limited data, before the resulting embedding features are fed into an ensemble of backend classifiers to produce the final AD detection decision via majority voting. Experiments conducted on the ADReSS20 Challenge dataset suggest consistent performance improvements were obtained using model and feature combination in system development. State-of-the-art AD detection accuracies of 91.67 percent and 93.75 percent were obtained using manual and ASR speech transcripts respectively on the ADReSS20 test set consisting of 48 elderly speakers.
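    The final decision step described above, majority voting over an ensemble of backend classifiers, can be illustrated with a short sketch. The label strings and the tie-breaking rule (the label seen first wins) are assumptions for the example, since the abstract does not specify them.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label among the classifier predictions.

    Ties are broken in favour of the label that appears first in the list,
    which is an illustrative choice rather than the paper's rule.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical outputs of three backend classifiers for one speaker.
decision = majority_vote(["AD", "non-AD", "AD"])
```

    Using an odd number of classifiers for a binary decision avoids ties altogether.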
    Fundamental Limits of Communication Efficiency for Model Aggregation in Distributed Learning: A Rate-Distortion Approach. (arXiv:2206.13984v1 [cs.IT])
    One of the main focuses in distributed learning is communication efficiency, since model aggregation at each round of training can consist of millions to billions of parameters. Several model compression methods, such as gradient quantization and sparsification, have been proposed to improve the communication efficiency of model aggregation. However, the information-theoretic minimum communication cost for a given distortion of gradient estimators is still unknown. In this paper, we study the fundamental limit of communication cost of model aggregation in distributed learning from a rate-distortion perspective. By formulating the model aggregation as a vector Gaussian CEO problem, we derive the rate region bound and sum-rate-distortion function for the model aggregation problem, which reveals the minimum communication rate at a particular gradient distortion upper bound. We also analyze the communication cost at each iteration and total communication cost based on the sum-rate-distortion function with the gradient statistics of real-world datasets. It is found that the communication gain by exploiting the correlation between worker nodes is significant for SignSGD, and a high distortion of gradient estimator can achieve low total communication cost in gradient compression.
    Statistical inference with implicit SGD: proximal Robbins-Monro vs. Polyak-Ruppert. (arXiv:2206.12663v2 [stat.ML] UPDATED)
    The implicit stochastic gradient descent (ISGD), a proximal version of SGD, is gaining interest in the literature due to its stability over (explicit) SGD. In this paper, we conduct an in-depth analysis of the two modes of ISGD for smooth convex functions, namely proximal Robbins-Monro (proxRM) and proximal Polyak-Ruppert (proxPR) procedures, for their use in statistical inference on model parameters. Specifically, we derive non-asymptotic point estimation error bounds of both proxRM and proxPR iterates and their limiting distributions, and propose on-line estimators of their asymptotic covariance matrices that require only a single run of ISGD. The latter estimators are used to construct valid confidence intervals for the model parameters. Our analysis is free of the generalized linear model assumption that has limited the preceding analyses, and employs feasible procedures. Our on-line covariance matrix estimators appear to be the first of this kind in the ISGD literature.
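    As a rough illustration of what "implicit" means here, the sketch below specializes the update to the squared loss $\frac{1}{2}(y - x^\top\theta)^2$, where the implicit equation $\theta' = \theta + \gamma (y - x^\top\theta')x$ has the closed form $\theta' = \theta + \frac{\gamma}{1 + \gamma\|x\|^2}(y - x^\top\theta)x$. The squared loss is chosen for the example only (the paper treats general smooth convex functions), and the averaging helper is a simplified stand-in for Polyak-Ruppert averaging of the iterates.

```python
def isgd_step_least_squares(theta, x, y, gamma):
    """Implicit SGD step for the squared loss 0.5 * (y - x.theta)^2.

    Solves theta_new = theta + gamma * (y - x.theta_new) * x in closed form.
    """
    residual = y - sum(xi * ti for xi, ti in zip(x, theta))
    sq_norm = sum(xi * xi for xi in x)
    scale = gamma * residual / (1.0 + gamma * sq_norm)
    return [ti + scale * xi for ti, xi in zip(theta, x)]


def running_average(iterates):
    """Coordinate-wise average of a list of iterates (Polyak-Ruppert style)."""
    n = len(iterates)
    return [sum(col) / n for col in zip(*iterates)]


theta0 = [0.0, 0.0]
x, y, gamma = [1.0, 2.0], 3.0, 0.5
theta1 = isgd_step_least_squares(theta0, x, y, gamma)
# Residual evaluated at the NEW iterate, as the implicit equation requires.
new_residual = y - sum(xi * ti for xi, ti in zip(x, theta1))
```

    The factor $1/(1 + \gamma\|x\|^2)$ shrinks the step automatically when $\gamma$ is large, which is one way to see the stability advantage over explicit SGD that the abstract refers to.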
    Multi-Agent Reinforcement Learning is a Sequence Modeling Problem. (arXiv:2205.14953v2 [cs.MA] UPDATED)
    Large sequence models (SMs) such as the GPT series and BERT have displayed outstanding performance and generalization capabilities on vision, language, and recently reinforcement learning tasks. A natural follow-up question is how to abstract multi-agent decision making into an SM problem and benefit from the prosperous development of SMs. In this paper, we introduce a novel architecture named Multi-Agent Transformer (MAT) that effectively casts cooperative multi-agent reinforcement learning (MARL) into SM problems wherein the task is to map agents' observation sequence to agents' optimal action sequence. Our goal is to build the bridge between MARL and SMs so that the modeling power of modern sequence models can be unleashed for MARL. Central to our MAT is an encoder-decoder architecture which leverages the multi-agent advantage decomposition theorem to transform the joint policy search problem into a sequential decision making process; this renders only linear time complexity for multi-agent problems and, most importantly, endows MAT with a monotonic performance improvement guarantee. Unlike prior art such as Decision Transformer, which fits only pre-collected offline data, MAT is trained by online trial and error from the environment in an on-policy fashion. To validate MAT, we conduct extensive experiments on StarCraftII, Multi-Agent MuJoCo, Dexterous Hands Manipulation, and Google Research Football benchmarks. Results demonstrate that MAT achieves superior performance and data efficiency compared to strong baselines including MAPPO and HAPPO. Furthermore, we demonstrate that MAT is an excellent few-shot learner on unseen tasks regardless of changes in the number of agents. See our project page at https://sites.google.com/view/multi-agent-transformer.
    Domain Agnostic Few-shot Learning for Speaker Verification. (arXiv:2206.13700v1 [cs.SD])
    Deep learning models for verification systems often fail to generalize to new users and new environments, even though they learn highly discriminative features. To address this problem, we propose a few-shot domain generalization framework that learns to tackle distribution shift for new users and new domains. Our framework consists of domain-specific and domain-aggregation networks, which are the experts on specific and combined domains, respectively. By using these networks, we generate episodes that mimic the presence of both novel users and novel domains in the training phase to eventually produce better generalization. To save memory, we reduce the number of domain-specific networks by clustering similar domains together. Upon extensive evaluation on artificially generated noise domains, we can explicitly show generalization ability of our framework. In addition, we apply our proposed methods to the existing competitive architecture on the standard benchmark, which shows further performance improvements.
    On the amplification of security and privacy risks by post-hoc explanations in machine learning models. (arXiv:2206.14004v1 [cs.LG])
    A variety of explanation methods have been proposed in recent years to help users gain insights into the results returned by neural networks, which are otherwise complex and opaque black-boxes. However, explanations give rise to potential side-channels that can be leveraged by an adversary for mounting attacks on the system. In particular, post-hoc explanation methods that highlight input dimensions according to their importance or relevance to the result also leak information that weakens security and privacy. In this work, we perform the first systematic characterization of the privacy and security risks arising from various popular explanation techniques. First, we propose novel explanation-guided black-box evasion attacks that lead to a 10-fold reduction in query count for the same success rate. We show that the adversarial advantage from explanations can be quantified as a reduction in the total variance of the estimated gradient. Second, we revisit the membership information leaked by common explanations. Contrary to observations in prior studies, via our modified attacks we show significant leakage of membership information (above 100% improvement over prior results), even in a much stricter black-box setting. Finally, we study explanation-guided model extraction attacks and demonstrate adversarial gains through a large reduction in query count.
    Dynamic Memory for Interpretable Sequential Optimisation. (arXiv:2206.13960v1 [cs.LG])
    Real-world applications of reinforcement learning for recommendation and experimentation face a practical challenge: the relative reward of different bandit arms can evolve over the lifetime of the learning agent. To deal with these non-stationary cases, the agent must forget some historical knowledge, as it may no longer be relevant to minimise regret. We present a solution to handling non-stationarity that is suitable for deployment at scale, to provide business operators with automated adaptive optimisation. Our solution aims to provide interpretable learning that can be trusted by humans, whilst responding to non-stationarity to minimise regret. To this end, we develop an adaptive Bayesian learning agent that employs a novel form of dynamic memory. It enables interpretability through statistical hypothesis testing, by targeting a set point of statistical power when comparing rewards and adjusting its memory dynamically to achieve this power. By design, the agent is agnostic to different kinds of non-stationarity. Using numerical simulations, we compare its performance against an existing proposal and show that, under multiple non-stationary scenarios, our agent correctly adapts to real changes in the true rewards. In all bandit solutions, there is an explicit trade-off between learning and achieving maximal performance. Our solution sits on a different point on this trade-off when compared to another similarly robust approach: we prioritise interpretability, which relies on more learning, at the cost of some regret. We describe the architecture of a large-scale deployment of automatic optimisation-as-a-service where our agent achieves interpretability whilst adapting to changing circumstances.
    Toward an ImageNet Library of Functions for Global Optimization Benchmarking. (arXiv:2206.13630v1 [cs.AI])
    Knowledge of search-landscape features of BlackBox Optimization (BBO) problems offers valuable information in light of the Algorithm Selection and/or Configuration problems. Exploratory Landscape Analysis (ELA) models have gained success in identifying predefined human-derived features and in facilitating portfolio selectors to address those challenges. Unlike ELA approaches, the current study proposes to transform the identification problem into an image recognition problem, with a potential to detect conception-free, machine-driven landscape features. To this end, we introduce the notion of Landscape Images, which enables us to generate imagery instances per a benchmark function, and then target the classification challenge over a diverse generalized dataset of functions. We address it as a supervised multi-class image recognition problem and apply basic artificial neural network models to solve it. The efficacy of our approach is numerically validated on the noise free BBOB and IOHprofiler benchmarking suites. This evident successful learning is another step toward automated feature extraction and local structure deduction of BBO problems. By using this definition of landscape images, and by capitalizing on existing capabilities of image recognition algorithms, we foresee the construction of an ImageNet-like library of functions for training generalized detectors that rely on machine-driven features.
    Envelope imbalanced ensemble model with deep sample learning and local-global structure consistency. (arXiv:2206.13507v1 [cs.LG])
    The class imbalance problem is important and challenging. Ensemble approaches are widely used to tackle this problem because of their effectiveness. However, existing ensemble methods are always applied to the original samples without considering the structure information among them. This limitation prevents imbalanced learning from improving further. Besides, research shows that the structure information among samples includes local and global structure information. Based on the analysis above, an imbalanced ensemble algorithm with the deep sample pre-envelope network (DSEN) and local-global structure consistency mechanism (LGSCM) is proposed here to solve the problem. This algorithm can guarantee high-quality deep envelope samples by considering the local manifold and global structure information, which is helpful for imbalanced learning. First, the deep sample envelope pre-network (DSEN) is designed to mine structure information among samples. Then, the local manifold structure metric (LMSM) and global structure distribution metric (GSDM) are designed to construct LGSCM to enhance distribution consistency of interlayer samples. Next, the DSEN and LGSCM are put together to form the final deep sample envelope network (DSEN-LG). After that, base classifiers are applied on the layers of deep samples respectively. Finally, the predictive results from base classifiers are fused through a bagging ensemble learning mechanism. To demonstrate the effectiveness of the proposed method, forty-four public datasets and more than ten representative relevant algorithms are chosen for verification. The experimental results show that the algorithm is significantly better than other imbalanced ensemble algorithms.
    SLOVA: Uncertainty Estimation Using Single Label One-Vs-All Classifier. (arXiv:2206.13923v1 [cs.LG])
    Deep neural networks present impressive performance, yet they cannot reliably estimate their predictive confidence, limiting their applicability in high-risk domains. We show that applying a multi-label one-vs-all loss reveals classification ambiguity and reduces model overconfidence. The introduced SLOVA (Single Label One-Vs-All) model redefines typical one-vs-all predictive probabilities to a single label situation, where only one class is the correct answer. The proposed classifier is confident only if a single class has a high probability and other probabilities are negligible. Unlike the typical softmax function, SLOVA naturally detects out-of-distribution samples if the probabilities of all other classes are small. The model is additionally fine-tuned with exponential calibration, which allows us to precisely align the confidence score with model accuracy. We verify our approach on three tasks. First, we demonstrate that SLOVA is competitive with the state-of-the-art on in-distribution calibration. Second, the performance of SLOVA is robust under dataset shifts. Finally, our approach performs extremely well in the detection of out-of-distribution samples. Consequently, SLOVA is a tool that can be used in various applications where uncertainty modeling is required.
    Survival Kernets: Scalable and Interpretable Deep Kernel Survival Analysis with an Accuracy Guarantee. (arXiv:2206.10477v2 [cs.LG] UPDATED)
    Kernel survival analysis models estimate individual survival distributions with the help of a kernel function, which measures the similarity between any two data points. Such a kernel function can be learned using deep kernel survival models. In this paper, we present a new deep kernel survival model called a survival kernet, which scales to large datasets in a manner that is amenable to model interpretation and also theoretical analysis. Specifically, the training data are partitioned into clusters based on a recently developed training set compression scheme for classification and regression called kernel netting that we extend to the survival analysis setting. At test-time, each data point is represented as a weighted combination of these clusters, and each such cluster can be visualized. For a special case of survival kernets, we establish a finite-sample error bound on predicted survival distributions that is, up to a log factor, optimal. Whereas scalability at test time is achieved using the aforementioned kernel netting compression strategy, scalability during training is achieved by a warm-start procedure based on tree ensembles such as XGBoost and a heuristic approach to accelerating neural architecture search. On three standard survival analysis datasets of varying sizes (up to roughly 3 million data points), we show that survival kernets are highly competitive with the best of baselines tested in terms of concordance index. Our code is available at: https://github.com/georgehc/survival-kernets
    Sublinear-Time Clustering Oracle for Signed Graphs. (arXiv:2206.13813v1 [cs.DS])
    Social networks are often modeled using signed graphs, where vertices correspond to users and edges have a sign that indicates whether an interaction between users was positive or negative. The arising signed graphs typically contain a clear community structure in the sense that the graph can be partitioned into a small number of polarized communities, each defining a sparse cut and indivisible into smaller polarized sub-communities. We provide a local clustering oracle for signed graphs with such a clear community structure, that can answer membership queries, i.e., "Given a vertex $v$, which community does $v$ belong to?", in sublinear time by reading only a small portion of the graph. Formally, when the graph has bounded maximum degree and the number of communities is at most $O(\log n)$, then with $\tilde{O}(\sqrt{n}\operatorname{poly}(1/\varepsilon))$ preprocessing time, our oracle can answer each membership query in $\tilde{O}(\sqrt{n}\operatorname{poly}(1/\varepsilon))$ time, and it correctly classifies a $(1-\varepsilon)$-fraction of vertices w.r.t. a set of hidden planted ground-truth communities. Our oracle is desirable in applications where the clustering information is needed for only a small number of vertices. Previously, such local clustering oracles were only known for unsigned graphs; our generalization to signed graphs requires a number of new ideas and gives a novel spectral analysis of the behavior of random walks with signs. We evaluate our algorithm for constructing such an oracle and answering membership queries on both synthetic and real-world datasets, validating its performance in practice.
    Short-Term Plasticity Neurons Learning to Learn and Forget. (arXiv:2206.14048v1 [cs.NE])
    Short-term plasticity (STP) is a mechanism that stores decaying memories in synapses of the cerebral cortex. In computing practice, STP has been used, but mostly in the niche of spiking neurons, even though theory predicts that it is the optimal solution to certain dynamic tasks. Here we present a new type of recurrent neural unit, the STP Neuron (STPN), which indeed turns out strikingly powerful. Its key mechanism is that synapses have a state, propagated through time by a self-recurrent connection-within-the-synapse. This formulation enables training the plasticity with backpropagation through time, resulting in a form of learning to learn and forget in the short term. The STPN outperforms all tested alternatives, i.e. RNNs, LSTMs, other models with fast weights, and differentiable plasticity. We confirm this in both supervised and reinforcement learning (RL), and in tasks such as Associative Retrieval, Maze Exploration, Atari video games, and MuJoCo robotics. Moreover, we calculate that, in neuromorphic or biological circuits, the STPN minimizes energy consumption across models, as it depresses individual synapses dynamically. Based on these, biological STP may have been a strong evolutionary attractor that maximizes both efficiency and computational power. The STPN now brings these neuromorphic advantages also to a broad spectrum of machine learning practice. Code is available at https://github.com/NeuromorphicComputing/stpn
    Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment. (arXiv:2206.13951v1 [cs.CV])
    Vision Transformer (ViT) is becoming more popular in image processing. Specifically, we investigate the effectiveness of test-time adaptation (TTA) on ViT, a technique that has emerged to correct its prediction during test-time by itself. First, we benchmark various test-time adaptation approaches on ViT-B16 and ViT-L16. It is shown that TTA is effective on ViT and that the prior convention (sensibly selecting modulation parameters) is not necessary when using a proper loss function. Based on this observation, we propose a new test-time adaptation method called class-conditional feature alignment (CFA), which minimizes both the class-conditional distribution differences and the whole distribution differences of the hidden representation between the source and target in an online manner. Experiments on image classification tasks under common corruption (CIFAR-10-C, CIFAR-100-C, and ImageNet-C) and domain adaptation (digits datasets and ImageNet-Sketch) show that CFA stably outperforms the existing baselines on various datasets. We also verify that CFA is model agnostic by experimenting on ResNet, MLP-Mixer, and several ViT variants (ViT-AugReg, DeiT, and BeiT). Using a BeiT backbone, CFA achieves a 19.8% top-1 error rate on ImageNet-C, outperforming the existing test-time adaptation baseline of 44.0%. This is a state-of-the-art result among TTA methods that do not need to alter the training phase.
    Learning Generalizable Dexterous Manipulation from Human Grasp Affordance. (arXiv:2204.02320v3 [cs.RO] UPDATED)
    Dexterous manipulation with a multi-finger hand is one of the most challenging problems in robotics. While recent progress in imitation learning has largely improved the sample efficiency compared to Reinforcement Learning, the learned policy can hardly generalize to manipulate novel objects, given limited expert demonstrations. In this paper, we propose to learn dexterous manipulation using large-scale demonstrations with diverse 3D objects in a category, which are generated from a human grasp affordance model. This generalizes the policy to novel object instances within the same category. To train the policy, we propose a novel imitation learning objective jointly with a geometric representation learning objective using our demonstrations. By experimenting with relocating diverse objects in simulation, we show that our approach outperforms baselines by a large margin when manipulating novel objects. We also ablate the importance of 3D object representation learning for manipulation. We include videos, code, and additional information on the project website - https://kristery.github.io/ILAD/ .
    Evaluating Understanding on Conceptual Abstraction Benchmarks. (arXiv:2206.14187v1 [cs.AI])
    A long-held objective in AI is to build systems that understand concepts in a humanlike way. Setting aside the difficulty of building such a system, even trying to evaluate one is a challenge, due to present-day AI's relative opacity and its proclivity for finding shortcut solutions. This is exacerbated by humans' tendency to anthropomorphize, assuming that a system that can recognize one instance of a concept must also understand other instances, as a human would. In this paper, we argue that understanding a concept requires the ability to use it in varied contexts. Accordingly, we propose systematic evaluations centered around concepts, by probing a system's ability to use a given concept in many different instantiations. We present case studies of such evaluations on two domains -- RAVEN (inspired by Raven's Progressive Matrices) and the Abstraction and Reasoning Corpus (ARC) -- that have been used to develop and assess abstraction abilities in AI systems. Our concept-based approach to evaluation reveals information about AI systems that conventional test sets would have left hidden.
    Improving Correlation Capture in Generating Imbalanced Data using Differentially Private Conditional GANs. (arXiv:2206.13787v1 [cs.LG])
    Despite the remarkable success of Generative Adversarial Networks (GANs) on text, images, and videos, generating high-quality tabular data is still under development owing to some unique challenges, such as capturing dependencies in imbalanced data and optimizing the quality of synthetic patient data while preserving privacy. In this paper, we propose DP-CGANS, a differentially private conditional GAN framework consisting of data transformation, sampling, conditioning, and network training to generate realistic and privacy-preserving tabular data. DP-CGANS distinguishes categorical and continuous variables and transforms them to latent space separately. Then, we structure a conditional vector as an additional input to not only present the minority class in the imbalanced data, but also capture the dependency between variables. We inject statistical noise into the gradients in the network training process of DP-CGANS to provide a differential privacy guarantee. We extensively evaluate our model against state-of-the-art generative models on three public datasets and two real-world personal health datasets in terms of statistical similarity, machine learning performance, and privacy measurement. We demonstrate that our model outperforms other comparable models, especially in capturing dependency between variables. Finally, we present the balance between data utility and privacy in synthetic data generation, considering the different data structures and characteristics of real-world datasets such as imbalanced variables, abnormal distributions, and sparsity of data.
    Nonparametric, Nonasymptotic Confidence Bands with Paley-Wiener Kernels for Band-Limited Functions. (arXiv:2206.13629v1 [stat.ML])
    The paper introduces a method to construct confidence bands for bounded, band-limited functions based on a finite sample of input-output pairs. The approach is distribution-free w.r.t. the observation noises and only the knowledge of the input distribution is assumed. It is nonparametric, that is, it does not require a parametric model of the regression function and the regions have non-asymptotic guarantees. The algorithm is based on the theory of Paley-Wiener reproducing kernel Hilbert spaces. The paper first studies the fully observable variant, when there are no noises on the observations and only the inputs are random; then it generalizes the ideas to the noisy case using gradient-perturbation methods. Finally, numerical experiments demonstrating both cases are presented.
    Detecting Unintended Memorization in Language-Model-Fused ASR. (arXiv:2204.09606v2 [cs.CL] UPDATED)
    End-to-end (E2E) models are often accompanied by language models (LMs) via shallow fusion to boost their overall quality as well as their recognition of rare words. At the same time, several prior works show that LMs are susceptible to unintentionally memorizing rare or unique sequences in the training data. In this work, we design a framework for detecting memorization of random textual sequences (which we call canaries) in the LM training data when one has only black-box (query) access to an LM-fused speech recognizer, as opposed to direct access to the LM. On a production-grade Conformer RNN-T E2E model fused with a Transformer LM, we show that detecting memorization of singly-occurring canaries from the LM training data of 300M examples is possible. Motivated to protect privacy, we also show that such memorization gets significantly reduced by per-example gradient-clipped LM training without compromising overall quality.
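    The per-example gradient clipping mentioned at the end of this abstract can be illustrated with a minimal sketch. It shows only the generic operation (rescale each example's gradient to a maximum L2 norm, then average); the function names, the threshold `max_norm`, and the toy gradients are assumptions for illustration and are not taken from the paper.

```python
import math

def clip_to_norm(grad, max_norm):
    """Rescale one example's gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in grad]

def clipped_mean_gradient(per_example_grads, max_norm):
    """Clip every example's gradient separately, then average them."""
    clipped = [clip_to_norm(g, max_norm) for g in per_example_grads]
    n = len(clipped)
    return [sum(column) / n for column in zip(*clipped)]

# Toy batch (hypothetical values): one ordinary example and one outlier
# whose raw gradient is 100 times larger.
grads = [[0.3, 0.4], [30.0, 40.0]]
avg = clipped_mean_gradient(grads, max_norm=1.0)
```

    Because the outlier is rescaled to unit norm before averaging, no single training example (such as a rare canary sequence) can dominate the update.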
    Generalized Policy Improvement Algorithms with Theoretically Supported Sample Reuse. (arXiv:2206.13714v1 [cs.LG])
    Real-world sequential decision making requires data-driven algorithms that provide practical guarantees on performance throughout training while also making efficient use of data. Model-free deep reinforcement learning represents a framework for such data-driven decision making, but existing algorithms typically only focus on one of these goals while sacrificing performance with respect to the other. On-policy algorithms guarantee policy improvement throughout training but suffer from high sample complexity, while off-policy algorithms make efficient use of data through sample reuse but lack theoretical guarantees. In order to balance these competing goals, we develop a class of Generalized Policy Improvement algorithms that combines the policy improvement guarantees of on-policy methods with the efficiency of theoretically supported sample reuse. We demonstrate the benefits of this new class of algorithms through extensive experimental analysis on a variety of continuous control tasks from the DeepMind Control Suite.
    H-GCN: A Graph Convolutional Network Accelerator on Versal ACAP Architecture. (arXiv:2206.13734v1 [cs.AR])
    Graph Neural Networks (GNNs) have drawn tremendous attention due to their unique capability to extend Machine Learning (ML) approaches to applications broadly defined as having unstructured data, especially graphs. Compared with other ML modalities, the acceleration of GNNs is more challenging due to the irregularity and heterogeneity derived from graph topologies. Existing efforts, however, have focused mainly on handling graphs' irregularity and have not studied their heterogeneity. To this end we propose H-GCN, a PL (Programmable Logic) and AIE (AI Engine) based hybrid accelerator that leverages the emerging heterogeneity of Xilinx Versal Adaptive Compute Acceleration Platforms (ACAPs) to achieve high-performance GNN inference. In particular, H-GCN partitions each graph into three subgraphs based on its inherent heterogeneity, and processes them using PL and AIE, respectively. To further improve performance, we explore the sparsity support of AIE and develop an efficient density-aware method to automatically map tiles of sparse matrix-matrix multiplication (SpMM) onto the systolic tensor array. Compared with state-of-the-art GCN accelerators, H-GCN achieves, on average, speedups of 1.1~2.3X.  ( 2 min )
    Classification of ADHD Patients Using Kernel Hierarchical Extreme Learning Machine. (arXiv:2206.13761v1 [cs.LG])
    Recently, the application of deep learning models to diagnose neuropsychiatric diseases from brain imaging data has received more and more attention. However, in practice, exploring interactions in brain functional connectivity based on operational magnetic resonance imaging data is critical for studying mental illness. Since Attention-Deficit and Hyperactivity Disorder (ADHD) is a type of chronic disease that is very difficult to diagnose in the early stages, it is necessary to improve the diagnosis accuracy of such illness using machine learning models treating patients before the critical condition. In this study, we utilize the dynamics of brain functional connectivity to model features from medical imaging data, which can extract the differences in brain function interactions between Normal Control (NC) and ADHD. To meet that requirement, we employ the Bayesian connectivity change-point model to detect brain dynamics using the local binary encoding approach and kernel hierarchical extreme learning machine for classifying features. To verify our model, we experimented with it on several real-world children's datasets, and our results achieved superior classification rates compared to the state-of-the-art models.  ( 2 min )
    Rankings from multimodal pairwise comparisons. (arXiv:2206.13580v1 [stat.ML])
    The task of ranking individuals or teams, based on a set of comparisons between pairs, arises in various contexts, including sporting competitions and the analysis of dominance hierarchies among animals and humans. Given data on which competitors beat which others, the challenge is to rank the competitors from best to worst. Here we study the problem of computing rankings when there are multiple, potentially conflicting modes of comparison, such as multiple types of dominance behaviors among animals. We assume that we do not know a priori what information each behavior conveys about the ranking, or even whether they convey any information at all. Nonetheless we show that it is possible to compute a ranking in this situation and present a fast method for doing so, based on a combination of an expectation-maximization algorithm and a modified Bradley-Terry model. We give a selection of example applications to both animal and human competition.  ( 2 min )
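    As background for the ranking model named in this abstract, here is a minimal sketch of the plain, single-mode Bradley-Terry model fitted with the classical minorization-maximization (Zermelo) iteration. It is not the paper's modified model or its expectation-maximization procedure; the function names and the toy win matrix are assumptions for illustration.

```python
def fit_bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths; wins[i][j] = times i beat j."""
    n = len(wins)
    strength = [1.0] * n
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pairing contributes (games played) / (sum of strengths).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strength[i])
        # Strengths are only defined up to scale, so normalise each sweep.
        scale = n / sum(updated)
        strength = [value * scale for value in updated]
    return strength

def ranking(strength):
    """Indices of competitors from strongest to weakest."""
    return sorted(range(len(strength)), key=lambda i: -strength[i])

# Toy data (hypothetical): competitor 0 usually beats 1 and 2.
wins = [
    [0, 3, 4],
    [1, 0, 3],
    [1, 2, 0],
]
strengths = fit_bradley_terry(wins)
order = ranking(strengths)
```

    Under this model the probability that i beats j is strength[i] / (strength[i] + strength[j]), so a single list of strengths induces the ranking.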
    Heterogeneous mixtures of dictionary functions to approximate subspace invariance in Koopman operators. (arXiv:2206.13585v1 [eess.SY])
    Koopman operators model nonlinear dynamics as a linear dynamic system acting on a nonlinear function as the state. This nonstandard state is often called a Koopman observable and is usually approximated numerically by a superposition of functions drawn from a \textit{dictionary}. A widely used algorithm is \textit{Extended Dynamic Mode Decomposition} (EDMD), where the dictionary functions are drawn from a fixed, homogeneous class of functions. Recently, deep learning combined with EDMD has been used to learn novel dictionary functions in an algorithm called deep dynamic mode decomposition (deepDMD). The learned representation both (1) accurately models and (2) scales well with the dimension of the original nonlinear system. In this paper we analyze the learned dictionaries from deepDMD and explore the theoretical basis for their strong performance. We discover a novel class of dictionary functions to approximate Koopman observables. Error analysis of these dictionary functions shows they satisfy a property of subspace approximation, which we define as uniform finite approximate closure. We discover that structured mixing of heterogeneous dictionary functions drawn from different classes of nonlinear functions achieves the same accuracy and dimensional scaling as deepDMD. This mixed dictionary does so with an order of magnitude reduction in parameters, while maintaining geometric interpretability. Our results provide a hypothesis to explain the success of deep neural networks in learning numerical approximations to Koopman operators.  ( 3 min )
    DistSPECTRL: Distributing Specifications in Multi-Agent Reinforcement Learning Systems. (arXiv:2206.13754v1 [cs.MA])
    While notable progress has been made in specifying and learning objectives for general cyber-physical systems, applying these methods to distributed multi-agent systems still poses significant challenges. Among these are the need to (a) craft specification primitives that allow expression and interplay of both local and global objectives, (b) tame explosion in the state and action spaces to enable effective learning, and (c) minimize coordination frequency and the set of engaged participants for global objectives. To address these challenges, we propose a novel specification framework that allows natural composition of local and global objectives used to guide training of a multi-agent system. Our technique enables learning expressive policies that allow agents to operate in a coordination-free manner for local objectives, while using a decentralized communication protocol for enforcing global ones. Experimental results support our claim that sophisticated multi-agent distributed planning problems can be effectively realized using specification-guided learning.  ( 2 min )
    Attention-based conditioning methods using variable frame rate for style-robust speaker verification. (arXiv:2206.13680v1 [eess.AS])
    We propose an approach to extract speaker embeddings that are robust to speaking style variations in text-independent speaker verification. Typically, speaker embedding extraction includes training a DNN for speaker classification and using the bottleneck features as speaker representations. Such a network has a pooling layer to transform frame-level to utterance-level features by calculating statistics over all utterance frames, with equal weighting. However, self-attentive embeddings perform weighted pooling such that the weights correspond to the importance of the frames in a speaker classification task. Entropy can capture acoustic variability due to speaking style variations. Hence, an entropy-based variable frame rate vector is proposed as an external conditioning vector for the self-attention layer to provide the network with information that can address style effects. This work explores five different approaches to conditioning. The best conditioning approach, concatenation with gating, provided statistically significant improvements over the x-vector baseline in 12/23 tasks and was the same as the baseline in 11/23 tasks when using the UCLA speaker variability database. It also significantly outperformed self-attention without conditioning in 9/23 tasks and was worse in 1/23. The method also showed significant improvements in multi-speaker scenarios of SITW.  ( 2 min )
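    The contrast this abstract draws between equal-weight pooling and self-attentive pooling can be sketched as follows. In a real system the per-frame scores come from a learned attention layer (and, in the paper, from a conditioning vector as well); here they are plain inputs, and all names and values are assumptions for illustration.

```python
import math

def mean_pool(frames):
    """Equal-weight pooling: average each feature over all frames."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]

def attentive_pool(frames, scores):
    """Weighted pooling with softmax weights derived from per-frame scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * x for w, x in zip(weights, column)) for column in zip(*frames)
    ]
    return pooled, weights

# Toy utterance (hypothetical): three frames with two features each.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform, uniform_weights = attentive_pool(frames, [0.0, 0.0, 0.0])
focused, focused_weights = attentive_pool(frames, [0.0, 0.0, 5.0])
```

    With identical scores the attentive version reduces to mean pooling; a high score on one frame pulls the utterance-level vector toward that frame.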
    ProGen2: Exploring the Boundaries of Protein Language Models. (arXiv:2206.13517v1 [cs.LG])
    Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional finetuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. We release the ProGen2 models and code at https://github.com/salesforce/progen.  ( 2 min )
    TTS-CGAN: A Transformer Time-Series Conditional GAN for Biosignal Data Augmentation. (arXiv:2206.13676v1 [cs.LG])
    Signal measurement appearing in the form of time series is one of the most common types of data used in medical machine learning applications. Such datasets are often small in size, expensive to collect and annotate, and might involve privacy issues, which hinders our ability to train large, state-of-the-art deep learning models for biomedical applications. For time-series data, the suite of data augmentation strategies we can use to expand the size of the dataset is limited by the need to maintain the basic properties of the signal. Generative Adversarial Networks (GANs) can be utilized as another data augmentation tool. In this paper, we present TTS-CGAN, a transformer-based conditional GAN model that can be trained on existing multi-class datasets and generate class-specific synthetic time-series sequences of arbitrary length. We elaborate on the model architecture and design strategies. Synthetic sequences generated by our model are indistinguishable from real ones, and can be used to complement or replace real signals of the same type, thus achieving the goal of data augmentation. To evaluate the quality of the generated data, we modify the wavelet coherence metric to be able to compare the similarity between two sets of signals, and also conduct a case study where a mix of synthetic and real data are used to train a deep learning model for sequence classification. Together with other visualization techniques and qualitative evaluation approaches, we demonstrate that TTS-CGAN generated synthetic data are similar to real data, and that our model performs better than the other state-of-the-art GAN models built for time-series data generation.  ( 3 min )
    Online Resource Allocation under Horizon Uncertainty. (arXiv:2206.13606v1 [cs.DS])
    We study stochastic online resource allocation: a decision maker needs to allocate limited resources to stochastically-generated sequentially-arriving requests in order to maximize reward. Motivated by practice, we consider a data-driven setting in which requests are drawn independently from a distribution that is unknown to the decision maker. Online resource allocation and its special cases have been studied extensively in the past, but these previous results crucially and universally rely on a practically-untenable assumption: the total number of requests (the horizon) is known to the decision maker in advance. In many applications, such as revenue management and online advertising, the number of requests can vary widely because of fluctuations in demand or user traffic intensity. In this work, we develop online algorithms that are robust to horizon uncertainty. In sharp contrast to the known-horizon setting, we show that no algorithm can achieve a constant asymptotic competitive ratio that is independent of the horizon uncertainty. We then introduce a novel algorithm that combines dual mirror descent with a carefully-chosen target consumption sequence and prove that it achieves a bounded competitive ratio. Our algorithm is near-optimal in the sense that its competitive ratio attains the optimal rate of growth when the horizon uncertainty grows large.  ( 2 min )
  • Open

    Improved Certified Defenses against Data Poisoning with (Deterministic) Finite Aggregation. (arXiv:2202.02628v2 [cs.LG] UPDATED)
    Data poisoning attacks aim at manipulating model behaviors through distorting training data. Previously, an aggregation-based certified defense, Deep Partition Aggregation (DPA), was proposed to mitigate this threat. DPA predicts through an aggregation of base classifiers trained on disjoint subsets of data, thus restricting its sensitivity to dataset distortions. In this work, we propose an improved certified defense against general poisoning attacks, namely Finite Aggregation. In contrast to DPA, which directly splits the training set into disjoint subsets, our method first splits the training set into smaller disjoint subsets and then combines duplicates of them to build larger (but not disjoint) subsets for training base classifiers. This reduces the worst-case impacts of poison samples and thus improves certified robustness bounds. In addition, we offer an alternative view of our method, bridging the designs of deterministic and stochastic aggregation-based certified defenses. Empirically, our proposed Finite Aggregation consistently improves certificates on MNIST, CIFAR-10, and GTSRB, boosting certified fractions by up to 3.05%, 3.87% and 4.77%, respectively, while keeping the same clean accuracies as DPA's, effectively establishing a new state of the art in (pointwise) certified robustness against data poisoning.
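    The difference between the two subset constructions described in this abstract can be sketched as below. The sketch covers only how training subsets are formed, not the base classifiers or the certificate. The hash function, the use of k*d classifiers, and the cyclic rule for spreading buckets are assumptions chosen for illustration, not details confirmed by the abstract.

```python
import zlib

def bucket_of(sample, n_buckets):
    """Deterministically hash a sample's content to a bucket index."""
    return zlib.crc32(repr(sample).encode("utf-8")) % n_buckets

def dpa_subsets(data, k):
    """DPA-style split: k disjoint subsets, one per base classifier."""
    subsets = [[] for _ in range(k)]
    for sample in data:
        subsets[bucket_of(sample, k)].append(sample)
    return subsets

def finite_aggregation_subsets(data, k, d):
    """Split into k*d small disjoint buckets, then give each classifier
    the union of d buckets, so the training subsets overlap."""
    n_buckets = k * d
    buckets = dpa_subsets(data, n_buckets)
    # Hypothetical spreading rule: classifier i gets buckets i .. i+d-1 (cyclic).
    return [
        [s for j in range(d) for s in buckets[(i + j) % n_buckets]]
        for i in range(n_buckets)
    ]

def subsets_containing(sample, subsets):
    """Count how many training subsets include the given sample."""
    return sum(sample in subset for subset in subsets)

# Toy labelled dataset (hypothetical): 40 distinct (feature, label) pairs.
data = [(x, x % 2) for x in range(40)]
dpa = dpa_subsets(data, k=4)
fa = finite_aggregation_subsets(data, k=4, d=2)
```

    In this sketch a poisoned sample reaches exactly one of the 4 DPA classifiers, and exactly d = 2 of the 8 overlapping ones, which is the kind of bounded influence that aggregation-based certificates rely on.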
    Survival Kernets: Scalable and Interpretable Deep Kernel Survival Analysis with an Accuracy Guarantee. (arXiv:2206.10477v2 [cs.LG] UPDATED)
    Kernel survival analysis models estimate individual survival distributions with the help of a kernel function, which measures the similarity between any two data points. Such a kernel function can be learned using deep kernel survival models. In this paper, we present a new deep kernel survival model called a survival kernet, which scales to large datasets in a manner that is amenable to model interpretation and also theoretical analysis. Specifically, the training data are partitioned into clusters based on a recently developed training set compression scheme for classification and regression called kernel netting that we extend to the survival analysis setting. At test-time, each data point is represented as a weighted combination of these clusters, and each such cluster can be visualized. For a special case of survival kernets, we establish a finite-sample error bound on predicted survival distributions that is, up to a log factor, optimal. Whereas scalability at test time is achieved using the aforementioned kernel netting compression strategy, scalability during training is achieved by a warm-start procedure based on tree ensembles such as XGBoost and a heuristic approach to accelerating neural architecture search. On three standard survival analysis datasets of varying sizes (up to roughly 3 million data points), we show that survival kernets are highly competitive with the best of baselines tested in terms of concordance index. Our code is available at: https://github.com/georgehc/survival-kernets
    Fundamental Limits of Communication Efficiency for Model Aggregation in Distributed Learning: A Rate-Distortion Approach. (arXiv:2206.13984v1 [cs.IT])
    One of the main focuses in distributed learning is communication efficiency, since model aggregation at each round of training can consist of millions to billions of parameters. Several model compression methods, such as gradient quantization and sparsification, have been proposed to improve the communication efficiency of model aggregation. However, the information-theoretic minimum communication cost for a given distortion of gradient estimators is still unknown. In this paper, we study the fundamental limit of communication cost of model aggregation in distributed learning from a rate-distortion perspective. By formulating the model aggregation as a vector Gaussian CEO problem, we derive the rate region bound and sum-rate-distortion function for the model aggregation problem, which reveals the minimum communication rate at a particular gradient distortion upper bound. We also analyze the communication cost at each iteration and total communication cost based on the sum-rate-distortion function with the gradient statistics of real-world datasets. It is found that the communication gain by exploiting the correlation between worker nodes is significant for SignSGD, and a high distortion of gradient estimator can achieve low total communication cost in gradient compression.
    SurvTRACE: Transformers for Survival Analysis with Competing Events. (arXiv:2110.00855v2 [cs.LG] UPDATED)
    In medicine, survival analysis studies the time duration to events of interest such as mortality. One major challenge is how to deal with multiple competing events (e.g., multiple disease diagnoses). In this work, we propose a transformer-based model, namely SurvTRACE, that makes no assumption about the underlying survival distribution and is capable of handling competing events. We account for the implicit \emph{confounders} in the observational setting in multi-event scenarios, which cause selection bias as the predicted survival probability is influenced by irrelevant factors. To sufficiently utilize the survival data to train transformers from scratch, multiple auxiliary tasks are designed for multi-task learning. The model hence learns a strong shared representation from all these tasks, which in turn supports better survival analysis. We further demonstrate how to inspect covariate relevance and importance through the interpretable attention mechanisms of SurvTRACE, which shows great potential for enhancing clinical trial design and new treatment development. Experiments on METABRIC, SUPPORT, and SEER data with 470k patients validate the all-around superiority of our method.
    Equivariant Priors for Compressed Sensing with Unknown Orientation. (arXiv:2206.14069v1 [cs.LG])
    In compressed sensing, the goal is to reconstruct the signal from an underdetermined system of linear measurements. Thus, prior knowledge about the signal of interest and its structure is required. Additionally, in many scenarios, the signal has an unknown orientation prior to measurements. To address such recovery problems, we propose using equivariant generative models as a prior, which encapsulate orientation information in their latent space. Thereby, we show that signals with unknown orientations can be recovered with iterative gradient descent on the latent space of these models and provide additional theoretical recovery guarantees. We construct an equivariant variational autoencoder and use the decoder as generative prior for compressed sensing. We discuss additional potential gains of the proposed approach in terms of convergence and latency.
    Neural Tangent Kernel Analysis of Deep Narrow Neural Networks. (arXiv:2202.02981v2 [cs.LG] UPDATED)
    The tremendous recent progress in analyzing the training dynamics of overparameterized neural networks has primarily focused on wide networks and therefore does not sufficiently address the role of depth in deep learning. In this work, we present the first trainability guarantee of infinitely deep but narrow neural networks. We study the infinite-depth limit of a multilayer perceptron (MLP) with a specific initialization and establish a trainability guarantee using the NTK theory. We then extend the analysis to an infinitely deep convolutional neural network (CNN) and perform brief experiments.
    Exact Spectral Norm Regularization for Neural Networks. (arXiv:2206.13581v1 [stat.ML])
    We pursue a line of research that seeks to regularize the spectral norm of the Jacobian of the input-output mapping for deep neural networks. While previous work relies on upper bounding techniques, we provide a scheme that targets the exact spectral norm. We show that our algorithm achieves improved generalization performance compared to previous spectral regularization techniques while simultaneously maintaining a strong safeguard against natural and adversarial noise. Moreover, we further explore some previous reasoning concerning the strong adversarial protection that Jacobian regularization provides and show that it can be misleading.
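    For readers unfamiliar with the quantity being regularized, the spectral norm of a matrix (its largest singular value) can be estimated with power iteration, as in the generic sketch below. This is not the paper's algorithm: it works on an explicit small matrix rather than a network Jacobian, and the function names and the fixed iteration count are assumptions for illustration.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def transpose(matrix):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(column) for column in zip(*matrix)]

def spectral_norm(matrix, n_iter=500):
    """Estimate the largest singular value by power iteration on A^T A."""
    matrix_t = transpose(matrix)
    v = [1.0] * len(matrix[0])
    for _ in range(n_iter):
        w = matvec(matrix_t, matvec(matrix, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0  # the start vector was mapped to zero
        v = [x / norm for x in w]
    # v approximates the top right singular vector, so ||A v|| is the estimate.
    image = matvec(matrix, v)
    return math.sqrt(sum(x * x for x in image))

# Toy matrices (hypothetical) with known singular values {3, 1} and {2, 1}.
norm_a = spectral_norm([[3.0, 0.0], [0.0, 1.0]])
norm_b = spectral_norm([[0.0, 2.0], [1.0, 0.0]])
```

    Power iteration can stall if the start vector is orthogonal to the top singular vector, so practical implementations usually start from a random vector.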
    Entropy-based Characterization of Modeling Constraints. (arXiv:2206.14105v1 [stat.ME])
    In most data-scientific approaches, the principle of Maximum Entropy (MaxEnt) is used to a posteriori justify some parametric model which has already been chosen based on experience, prior knowledge or computational simplicity. In a perpendicular formulation to conventional model building, we start from the linear system of phenomenological constraints and asymptotically derive the distribution over all viable distributions that satisfy the provided set of constraints. The MaxEnt distribution plays a special role, as it is the most typical among all phenomenologically viable distributions, representing a good expansion point for large-N techniques. This enables us to consistently formulate hypothesis testing in a fully data-driven manner. The appropriate parametric model which is supported by the data can always be deduced at the end of model selection. In the MaxEnt framework, we recover major scores and selection procedures used in multiple applications and assess their ability to capture associations in the data-generating process and identify the most generalizable model. This data-driven counterpart of standard model selection demonstrates the unifying perspective of the deductive logic advocated by the MaxEnt principle, while potentially offering new insights into the inverse problem.
    Disentangling Embedding Spaces with Minimal Distributional Assumptions. (arXiv:2206.13872v1 [stat.ML])
    Interest in understanding and factorizing learned embedding spaces is growing. For instance, recent concept-based explanation techniques analyze a machine learning model in terms of interpretable latent components. Such components have to be discovered in the model's embedding space, e.g., through independent component analysis (ICA) or modern disentanglement learning techniques. While these unsupervised approaches offer a sound formal framework, they either require access to a data generating function or impose rigid assumptions on the data distribution, such as independence of components, that are often violated in practice. In this work, we link conceptual explainability for vision models with disentanglement learning and ICA. This enables us to provide first theoretical results on how components can be identified without requiring any distributional assumptions. From these insights, we derive the disjoint attributions (DA) concept discovery method that is applicable to a broader class of problems than current approaches but yet possesses a formal identifiability guarantee. In an extensive comparison against component analysis and over 300 state-of-the-art disentanglement models, DA stably maintains superior performance, even under varying distributions and correlation strengths.
    Detecting Distributional Differences in Labeled Sequence Data with Application to Tropical Cyclone Satellite Imagery. (arXiv:2202.02253v3 [stat.AP] UPDATED)
    Our goal is to quantify whether, and if so how, spatio-temporal patterns in tropical cyclone (TC) satellite imagery signal an upcoming rapid intensity change event. To address this question, we propose a new nonparametric test of association between a time series of images and a series of binary event labels. We ask whether there is a difference in distribution between (dependent but identically distributed) 24-h sequences of images preceding an event versus a non-event. By rewriting the statistical test as a regression problem, we leverage neural networks to infer modes of structural evolution of TC convection that are representative of the lead-up to rapid intensity change events. Dependencies between nearby sequences are handled by a bootstrap procedure that estimates the marginal distribution of the label series. We prove that type I error control is guaranteed as long as the distribution of the label series is well-estimated, which is made easier by the extensive historical data for binary TC event labels. We show empirical evidence that our proposed method identifies archetypes of infrared imagery associated with elevated rapid intensification risk, typically marked by deep or deepening core convection over time. Such results provide a foundation for improved forecasts of rapid intensification.
    Stochastic first-order methods for average-reward Markov decision processes. (arXiv:2205.05800v4 [cs.LG] UPDATED)
    We study the problem of average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy evaluation and optimization. Existing on-policy evaluation methods suffer from sub-optimal convergence rates as well as failure in handling insufficiently random policies, e.g., deterministic policies, for lack of exploration. To remedy these issues, we develop a novel variance-reduced temporal difference (VRTD) method with linear function approximation for randomized policies along with optimal convergence guarantees, and an exploratory variance-reduced temporal difference (EVRTD) method for insufficiently random policies with comparable convergence guarantees. We further establish linear convergence rate on the bias of policy evaluation, which is essential for improving the overall sample complexity of policy optimization. On the other hand, compared with intensive research interest in finite sample analysis of policy gradient methods for discounted MDPs, existing studies on policy gradient methods for AMDPs mostly focus on regret bounds under restrictive assumptions on the underlying Markov processes (see, e.g., Abbasi-Yadkori et al., 2019), and they often lack guarantees on the overall sample complexities. Towards this end, we develop an average-reward variant of the stochastic policy mirror descent (SPMD) (Lan, 2022). We establish the first $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for solving AMDPs with policy gradient method under both the generative model (with unichain assumption) and Markovian noise model (with ergodic assumption). This bound can be further improved to $\widetilde{\mathcal{O}}(\epsilon^{-1})$ for solving regularized AMDPs. Our theoretical advantages are corroborated by numerical experiments.
    A Global Stochastic Optimization Particle Filter Algorithm. (arXiv:2007.04803v8 [stat.ML] UPDATED)
    We introduce a new online algorithm for expected log-likelihood maximization in situations where the objective function is multi-modal and/or has saddle points, that we term G-PFSO. The key element underpinning G-PFSO is a probability distribution which (a) is shown to concentrate on the target parameter value as the sample size increases and (b) can be efficiently estimated by means of a standard particle filter algorithm. This distribution depends on a learning rate, where the faster the learning rate the quicker it concentrates on the desired element of the search space, but the less likely G-PFSO is to escape from a local optimum of the objective function. In order to achieve a fast convergence rate with a slow learning rate, G-PFSO exploits the acceleration property of averaging, well-known in the stochastic gradient literature. Considering several challenging estimation problems, the numerical experiments show that, with high probability, G-PFSO successfully finds the highest mode of the objective function and converges to its global maximizer at the optimal rate. While the focus of this work is expected log-likelihood maximization, the proposed methodology and its theory apply more generally for optimizing a function defined through an expectation.
    Offline Reinforcement Learning with Realizability and Single-policy Concentrability. (arXiv:2202.04634v3 [cs.LG] UPDATED)
    Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability). Despite the recent efforts on relaxing these assumptions, existing works are only able to relax one of the two factors, leaving the strong assumption on the other factor intact. As an important open problem, can we achieve sample-efficient offline RL with weak assumptions on both factors? In this paper we answer the question in the positive. We analyze a simple algorithm based on the primal-dual formulation of MDPs, where the dual variables (discounted occupancy) are modeled using a density-ratio function against offline data. With proper regularization, we show that the algorithm enjoys polynomial sample complexity, under only realizability and single-policy concentrability. We also provide alternative analyses based on different assumptions to shed light on the nature of primal-dual algorithms for offline RL.
    TACTiS: Transformer-Attentional Copulas for Time Series. (arXiv:2202.03528v2 [cs.LG] UPDATED)
    The estimation of time-varying quantities is a fundamental component of decision making in fields such as healthcare and finance. However, the practical utility of such estimates is limited by how accurately they quantify predictive uncertainty. In this work, we address the problem of estimating the joint predictive distribution of high-dimensional multivariate time series. We propose a versatile method, based on the transformer architecture, that estimates joint distributions using an attention-based decoder that provably learns to mimic the properties of non-parametric copulas. The resulting model has several desirable properties: it can scale to hundreds of time series, supports both forecasting and interpolation, can handle unaligned and non-uniformly sampled data, and can seamlessly adapt to missing data during training. We demonstrate these properties empirically and show that our model produces state-of-the-art predictions on multiple real-world datasets.
    Measure Estimation in the Barycentric Coding Model. (arXiv:2201.12195v2 [stat.ML] UPDATED)
    This paper considers the problem of measure estimation under the barycentric coding model (BCM), in which an unknown measure is assumed to belong to the set of Wasserstein-2 barycenters of a finite set of known measures. Estimating a measure under this model is equivalent to estimating the unknown barycentric coordinates. We provide novel geometrical, statistical, and computational insights for measure estimation under the BCM, consisting of three main results. Our first main result leverages the Riemannian geometry of Wasserstein-2 space to provide a procedure for recovering the barycentric coordinates as the solution to a quadratic optimization problem assuming access to the true reference measures. The essential geometric insight is that the parameters of this quadratic problem are determined by inner products between the optimal displacement maps from the given measure to the reference measures defining the BCM. Our second main result then establishes an algorithm for solving for the coordinates in the BCM when all the measures are observed empirically via i.i.d. samples. We prove precise rates of convergence for this algorithm -- determined by the smoothness of the underlying measures and their dimensionality -- thereby guaranteeing its statistical consistency. Finally, we demonstrate the utility of the BCM and associated estimation procedures in three application areas: (i) covariance estimation for Gaussian measures; (ii) image processing; and (iii) natural language processing.
    On the universality of the volatility formation process: when machine learning and rough volatility agree. (arXiv:2206.14114v1 [q-fin.ST])
    We train an LSTM network based on a pooled dataset made of hundreds of liquid stocks aiming to forecast the next daily realized volatility for all stocks. Showing the consistent outperformance of this universal LSTM relative to other asset-specific parametric models, we uncover nonparametric evidence of a universal volatility formation mechanism across assets relating past market realizations, including daily returns and volatilities, to current volatilities. A parsimonious parametric forecasting device combining the rough fractional stochastic volatility and quadratic rough Heston models with fixed parameters results in the same level of performance as the universal LSTM, which confirms the universality of the volatility formation process from a parametric perspective.
    Cost-Efficient Distributed Learning via Combinatorial Multi-Armed Bandits. (arXiv:2202.08302v2 [cs.IT] UPDATED)
    We consider the distributed SGD problem, where a main node distributes gradient calculations among $n$ workers. By assigning tasks to all the workers and waiting only for the $k$ fastest ones, the main node can trade off the algorithm's error with its runtime by gradually increasing $k$ as the algorithm evolves. However, this strategy, referred to as adaptive $k$-sync, neglects the cost of unused computations and of communicating models to workers that exhibit straggling behavior. We propose a cost-efficient scheme that assigns tasks only to $k$ workers, and gradually increases $k$. We introduce the use of a combinatorial multi-armed bandit model to learn which workers are the fastest while assigning gradient calculations. Assuming workers with exponentially distributed response times parameterized by different means, we give empirical and theoretical guarantees on the regret of our strategy, i.e., the extra time spent to learn the mean response times of the workers. Furthermore, we propose and analyze a strategy applicable to a large class of response time distributions. Compared to adaptive $k$-sync, our scheme achieves significantly lower errors with the same computational effort and less downlink communication while being inferior in terms of speed.
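    The runtime side of waiting for the $k$ fastest of $n$ workers can be illustrated with a small sketch. It makes a simplifying assumption that the abstract does not: all workers share the same mean response time, so the expected waiting time has the closed form $\text{mean} \cdot \sum_{i=0}^{k-1} 1/(n-i)$. The function names are hypothetical and this is not the paper's bandit scheme, only a check of how waiting time grows with $k$.

```python
import random


def expected_kth_fastest(n: int, k: int, mean: float = 1.0) -> float:
    """Expected time until the k-th of n i.i.d. exponential workers responds.

    Assumption (not from the abstract): every worker has the same mean.
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # Order-statistic identity for i.i.d. exponentials:
    # E[T_(k)] = mean * sum_{i=0}^{k-1} 1 / (n - i)
    return mean * sum(1.0 / (n - i) for i in range(k))


def simulate_kth_fastest(n: int, k: int, mean: float = 1.0,
                         trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, for comparison."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # expovariate takes the rate, i.e. 1 / mean
        times = sorted(rng.expovariate(1.0 / mean) for _ in range(n))
        total += times[k - 1]  # time at which the k-th response arrives
    return total / trials
```

    Comparing `expected_kth_fastest(10, k)` for increasing `k` shows the waiting time growing slowly at first and sharply as `k` approaches `n`, which is the runtime cost that a gradually increasing $k$ trades against error.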
    Continuous Treatment Recommendation with Deep Survival Dose Response Function. (arXiv:2108.10453v4 [stat.ML] UPDATED)
    We propose a general formulation for continuous treatment recommendation problems in settings with clinical survival data, which we call the Deep Survival Dose Response Function (DeepSDRF). That is, we consider the problem of learning the conditional average dose response (CADR) function solely from historical data in which observed factors (confounders) affect both observed treatment and time-to-event outcomes. The estimated treatment effect from DeepSDRF enables us to develop recommender algorithms with the correction for selection bias. We compared two recommender approaches based on random search and reinforcement learning and found similar performance in terms of patient outcome. We tested the DeepSDRF and the corresponding recommender on extensive simulation studies and the eICU Research Institute (eRI) database. To the best of our knowledge, this is the first time that causal models are used to address the continuous treatment effect with observational data in a medical context.
    Graph-Based Machine Learning Improves Just-in-Time Defect Prediction. (arXiv:2110.05371v2 [cs.SE] UPDATED)
    The increasing complexity of today's software requires the contribution of thousands of developers. This complex collaboration structure makes developers more likely to introduce defect-prone changes that lead to software faults. Determining when these defect-prone changes are introduced has proven challenging, and using traditional machine learning (ML) methods to make these determinations seems to have reached a plateau. In this work, we build contribution graphs consisting of developers and source files to capture the nuanced complexity of changes required to build software. By leveraging these contribution graphs, our research shows the potential of using graph-based ML to improve Just-In-Time (JIT) defect prediction. We hypothesize that features extracted from the contribution graphs may be better predictors of defect-prone changes than intrinsic features derived from software characteristics. We corroborate our hypothesis using graph-based ML for classifying edges that represent defect-prone changes. This new framing of the JIT defect prediction problem leads to remarkably better results. We test our approach on 14 open-source projects and show that our best model can predict whether or not a code change will lead to a defect with an F1 score as high as 77.55$\%$. This represents an increase of as much as 46.72$\%$ over the state-of-the-art in JIT defect prediction. We describe limitations, open challenges, and how this method can be used for operational JIT defect prediction.
    An Expert System for Redesigning Software for Cloud Applications. (arXiv:2109.14569v3 [cs.LG] UPDATED)
    Cloud-based software has many advantages. When services are divided into many independent components, they are easier to update. Also, during peak demand, it is easier to scale cloud services (just hire more CPUs). Hence, many organizations are partitioning their monolithic enterprise applications into cloud-based microservices. Recently there has been much work using machine learning to simplify this partitioning task. Despite much research, no single partitioning method can be recommended as generally useful. More specifically, those prior solutions are "brittle"; i.e. if they work well for one kind of goal in one dataset, then they can be sub-optimal if applied to many datasets and multiple goals. In order to find a generally useful partitioning method, we propose DEEPLY. This new algorithm extends the CO-GCN deep learning partition generator with (a) a novel loss function and (b) some hyper-parameter optimization. As shown by our experiments, DEEPLY generally outperforms prior work (including CO-GCN, and others) across multiple datasets and goals. To the best of our knowledge, this is the first report in SE of such stable hyper-parameter optimization. To aid reuse of this work, DEEPLY is available on-line at https://bit.ly/2WhfFlB.
    Generalized Policy Improvement Algorithms with Theoretically Supported Sample Reuse. (arXiv:2206.13714v1 [cs.LG])
    Real-world sequential decision making requires data-driven algorithms that provide practical guarantees on performance throughout training while also making efficient use of data. Model-free deep reinforcement learning represents a framework for such data-driven decision making, but existing algorithms typically only focus on one of these goals while sacrificing performance with respect to the other. On-policy algorithms guarantee policy improvement throughout training but suffer from high sample complexity, while off-policy algorithms make efficient use of data through sample reuse but lack theoretical guarantees. In order to balance these competing goals, we develop a class of Generalized Policy Improvement algorithms that combines the policy improvement guarantees of on-policy methods with the efficiency of theoretically supported sample reuse. We demonstrate the benefits of this new class of algorithms through extensive experimental analysis on a variety of continuous control tasks from the DeepMind Control Suite.
    Benchopt: Reproducible, efficient and collaborative optimization benchmarks. (arXiv:2206.13424v2 [cs.LG] UPDATED)
    Numerical validation is at the core of machine learning research as it allows one to assess the actual impact of new methods, and to confirm the agreement between theory and practice. Yet, the rapid development of the field poses several challenges: researchers are confronted with a profusion of methods to compare, limited transparency and consensus on best practices, as well as tedious re-implementation work. As a result, validation is often very partial, which can lead to wrong conclusions that slow down the progress of research. We propose Benchopt, a collaborative framework to automate, reproduce and publish optimization benchmarks in machine learning across programming languages and hardware architectures. Benchopt simplifies benchmarking for the community by providing an off-the-shelf tool for running, sharing and extending experiments. To demonstrate its broad usability, we showcase benchmarks on three standard learning tasks: $\ell_2$-regularized logistic regression, Lasso, and ResNet18 training for image classification. These benchmarks highlight key practical findings that give a more nuanced view of the state-of-the-art for these problems, showing that for practical evaluation, the devil is in the details. We hope that Benchopt will foster collaborative work in the community, hence improving the reproducibility of research findings.
    Towards a Grounded Theory of Causation for Embodied AI. (arXiv:2206.13973v1 [cs.AI])
    There exist well-developed frameworks for causal modelling, but these require rather a lot of human domain expertise to define causal variables and perform interventions. In order to enable autonomous agents to learn abstract causal models through interactive experience, the existing theoretical foundations need to be extended and clarified. Existing frameworks give no guidance regarding variable choice / representation, and more importantly, give no indication as to which behaviour policies or physical transformations of state space shall count as interventions. The framework sketched in this paper describes actions as transformations of state space, for instance induced by an agent running a policy. This makes it possible to describe in a uniform way both transformations of the micro-state space and abstract models thereof, and say when the latter is veridical / grounded / natural. We then introduce (causal) variables, define a mechanism as an invariant predictor, and say when an action can be viewed as a ``surgical intervention'', thus bringing the objective of causal representation & intervention skill learning into clearer focus.
    Integral Transforms in a Physics-Informed (Quantum) Neural Network setting: Applications & Use-Cases. (arXiv:2206.14184v1 [quant-ph])
    In many computational problems in engineering and science, function or model differentiation is essential, but also integration is needed. An important class of computational problems include so-called integro-differential equations which include both integrals and derivatives of a function. In another example, stochastic differential equations can be written in terms of a partial differential equation of a probability density function of the stochastic variable. To learn characteristics of the stochastic variable based on the density function, specific integral transforms, namely moments, of the density function need to be calculated. Recently, the machine learning paradigm of Physics-Informed Neural Networks emerged with increasing popularity as a method to solve differential equations by leveraging automatic differentiation. In this work, we propose to augment the paradigm of Physics-Informed Neural Networks with automatic integration in order to compute complex integral transforms on trained solutions, and to solve integro-differential equations where integrals are computed on-the-fly during training. Furthermore, we showcase the techniques in various application settings, numerically simulating quantum computer-based neural networks as well as classical neural networks.
    Supervised Learning with General Risk Functionals. (arXiv:2206.13648v1 [stat.ML])
    Standard uniform convergence results bound the generalization gap of the expected loss over a hypothesis class. The emergence of risk-sensitive learning requires generalization guarantees for functionals of the loss distribution beyond the expectation. While prior works specialize in uniform convergence of particular functionals, our work provides uniform convergence for a general class of H\"older risk functionals for which the closeness in the Cumulative Distribution Function (CDF) entails closeness in risk. We establish the first uniform convergence results for estimating the CDF of the loss distribution, yielding guarantees that hold simultaneously both over all H\"older risk functionals and over all hypotheses. Thus licensed to perform empirical risk minimization, we develop practical gradient-based methods for minimizing distortion risks (widely studied subset of H\"older risks that subsumes the spectral risks, including the mean, conditional value at risk, cumulative prospect theory risks, and others) and provide convergence guarantees. In experiments, we demonstrate the efficacy of our learning procedure, both in settings where uniform convergence results hold and in high-dimensional settings with deep networks.
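    As a concrete instance of a risk functional beyond the expectation, the conditional value at risk mentioned above can be estimated from a sample of losses. The sketch below is a simple plug-in estimator chosen for illustration (averaging the worst $\lceil (1-\alpha) n \rceil$ losses); the function name is hypothetical and it is not the gradient-based procedure the paper develops.

```python
import math


def empirical_cvar(losses, alpha):
    """Plug-in CVaR estimate: mean of the worst ceil((1 - alpha) * n) losses.

    alpha = 0 recovers the ordinary mean; alpha close to 1 focuses on the tail.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    n = len(losses)
    tail = max(1, math.ceil((1.0 - alpha) * n))  # number of tail samples kept
    worst = sorted(losses, reverse=True)[:tail]  # largest losses first
    return sum(worst) / tail
```

    Both this functional and the mean depend on the data only through the empirical loss distribution, which is why closeness of the estimated CDF carries over to closeness in risk.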
    Safe Exploration Incurs Nearly No Additional Sample Complexity for Reward-free RL. (arXiv:2206.14057v1 [cs.LG])
    While the primary goal of the exploration phase in reward-free reinforcement learning (RF-RL) is to reduce the uncertainty in the estimated model with a minimum number of trajectories, in practice, the agent often needs to abide by certain safety constraints at the same time. It remains unclear how such a safe exploration requirement would affect the corresponding sample complexity to achieve the desired optimality of the obtained policy in planning. In this work, we make a first attempt to answer this question. In particular, we consider the scenario where a safe baseline policy is known beforehand, and propose a unified Safe reWard-frEe ExploraTion (SWEET) framework. We then particularize the SWEET framework to the tabular and the low-rank MDP settings, and develop algorithms coined Tabular-SWEET and Low-rank-SWEET, respectively. Both algorithms leverage the concavity and continuity of the newly introduced truncated value functions, and are guaranteed to achieve zero constraint violation during exploration with high probability. Furthermore, both algorithms can provably find a near-optimal policy subject to any constraint in the planning phase. Remarkably, the sample complexities under both algorithms match or even outperform the state of the art in their constraint-free counterparts up to some constant factors, proving that the safety constraint hardly increases the sample complexity for RF-RL.
    Studying Generalization Through Data Averaging. (arXiv:2206.13669v1 [stat.ML])
    The generalization of machine learning models has a complex dependence on the data, model and learning algorithm. We study train and test performance, as well as the generalization gap given by the mean of their difference over different data set samples to understand their ``typical'' behavior. We derive an expression for the gap as a function of the covariance between the model parameter distribution and the train loss, and another expression for the average test performance, showing test generalization only depends on data-averaged parameter distribution and the data-averaged loss. We show that for a large class of model parameter distributions a modified generalization gap is always non-negative. By specializing further to parameter distributions produced by stochastic gradient descent (SGD), along with a few approximations and modeling considerations, we are able to predict some aspects of how the generalization gap and model train and test performance vary as a function of SGD noise. We evaluate these predictions empirically on the CIFAR-10 classification task based on a ResNet architecture.
    Differentially Private Algorithms for Statistical Verification of Cyber-Physical Systems. (arXiv:2004.00275v2 [cs.LG] UPDATED)
    Statistical model checking is a class of sequential algorithms that can verify specifications of interest on an ensemble of cyber-physical systems (e.g., whether 99% of cars from a batch meet a requirement on their energy efficiency). These algorithms infer the probability that given specifications are satisfied by the systems with provable statistical guarantees by drawing sufficient numbers of independent and identically distributed samples. During the process of statistical model checking, the values of the samples (e.g., a user's car energy efficiency) may be inferred by intruders, causing privacy concerns in consumer-level applications (e.g., automobiles and medical devices). This paper addresses the privacy of statistical model checking algorithms from the point of view of differential privacy. These algorithms are sequential, drawing samples until a condition on their values is met. We show that revealing the number of samples drawn can violate privacy. We also show that the standard exponential mechanism that randomizes the output of an algorithm to achieve differential privacy fails to do so in the context of sequential algorithms. Instead, we relax the conservative requirement in differential privacy that the sensitivity of the output of the algorithm should be bounded to any perturbation for any data set. We propose a new notion of differential privacy which we call expected differential privacy. Then, we propose a novel expected sensitivity analysis for the sequential algorithm and propose a corresponding exponential mechanism that randomizes the termination time to achieve the expected differential privacy. We apply the proposed mechanism to statistical model checking algorithms to preserve the privacy of the samples they draw. The utility of the proposed algorithm is demonstrated in a case study.
    Statistical inference with implicit SGD: proximal Robbins-Monro vs. Polyak-Ruppert. (arXiv:2206.12663v2 [stat.ML] UPDATED)
    The implicit stochastic gradient descent (ISGD), a proximal version of SGD, is gaining interest in the literature due to its stability over (explicit) SGD. In this paper, we conduct an in-depth analysis of the two modes of ISGD for smooth convex functions, namely proximal Robbins-Monro (proxRM) and proximal Polyak-Ruppert (proxPR) procedures, for their use in statistical inference on model parameters. Specifically, we derive non-asymptotic point estimation error bounds of both proxRM and proxPR iterates and their limiting distributions, and propose on-line estimators of their asymptotic covariance matrices that require only a single run of ISGD. The latter estimators are used to construct valid confidence intervals for the model parameters. Our analysis is free of the generalized linear model assumption that has limited the preceding analyses, and employs feasible procedures. Our on-line covariance matrix estimators appear to be the first of this kind in the ISGD literature.
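    To make the implicit update concrete, here is a minimal sketch for the squared loss $\tfrac12 (y - x^\top\theta)^2$, chosen because the implicit equation $\theta_n = \theta_{n-1} + \gamma_n (y_n - x_n^\top \theta_n) x_n$ then has the closed-form solution $\theta_n = \theta_{n-1} + \frac{\gamma_n}{1 + \gamma_n \|x_n\|^2} (y_n - x_n^\top \theta_{n-1}) x_n$. The squared loss, the step-size schedule $\gamma_n = \gamma_0 n^{-0.6}$ and the function name are assumptions made for illustration; the sketch returns both the last iterate and the running average of iterates, in the spirit of the two modes named above, and does not reproduce the paper's covariance estimators.

```python
def implicit_sgd_least_squares(data, dim, gamma0=1.0, decay=0.6):
    """Implicit (proximal) SGD for the squared loss.

    data: iterable of (x, y) pairs with x a list of length dim.
    Returns (last_iterate, averaged_iterate).
    """
    theta = [0.0] * dim      # last iterate
    theta_bar = [0.0] * dim  # running average of all iterates so far
    for n, (x, y) in enumerate(data, start=1):
        gamma = gamma0 / n ** decay  # assumed step-size schedule
        residual = y - sum(xi * ti for xi, ti in zip(x, theta))
        # Closed-form solution of the implicit equation: the explicit step
        # is shrunk by the factor 1 / (1 + gamma * ||x||^2).
        shrink = gamma / (1.0 + gamma * sum(xi * xi for xi in x))
        theta = [ti + shrink * residual * xi for ti, xi in zip(theta, x)]
        # Incremental mean update for the averaged iterate.
        theta_bar = [b + (t - b) / n for b, t in zip(theta_bar, theta)]
    return theta, theta_bar
```

    The shrinkage factor is what gives the implicit scheme its stability: however large `gamma` is, a single step never overshoots the point where the current residual vanishes.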
    Understanding Benign Overfitting in Nested Meta Learning. (arXiv:2206.13482v1 [cs.LG] CROSS LISTED)
    Meta learning has demonstrated tremendous success in few-shot learning with limited supervised data. In those settings, the meta model is usually overparameterized. While the conventional statistical learning theory suggests that overparameterized models tend to overfit, empirical evidence reveals that overparameterized meta learning methods still work well -- a phenomenon often called ``benign overfitting.'' To understand this phenomenon, we focus on the meta learning settings with a challenging nested structure that we term the nested meta learning, and analyze its generalization performance under an overparameterized meta learning model. While our analysis uses the relatively tractable linear models, our theory contributes to understanding the delicate interplay among data heterogeneity, model adaptation and benign overfitting in nested meta learning tasks. We corroborate our theoretical claims through numerical simulations.
    Topology-aware Generalization of Decentralized SGD. (arXiv:2206.12680v2 [cs.LG] UPDATED)
    This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is $\mathcal{O}{(m/N+1/m+\lambda^2)}$-stable in expectation in the non-convex non-smooth setting, where $N$ is the total sample size of the whole system, $m$ is the worker number, and $1-\lambda$ is the spectral gap that measures the connectivity of the communication topology. These results then deliver an $\mathcal{O}{(1/N+{({(m^{-1}\lambda^2)}^{\frac{\alpha}{2}}+ m^{-\alpha})}/{N^{1-\frac{\alpha}{2}}})}$ in-average generalization bound, which is non-vacuous even when $\lambda$ is close to $1$, in contrast to vacuous as suggested by existing literature on the projected version of D-SGD. Our theory indicates that the generalizability of D-SGD has a positive correlation with the spectral gap, and can explain why consensus control in initial training phase can ensure better generalization. Experiments with VGG-11 and ResNet-18 on CIFAR-10, CIFAR-100 and Tiny-ImageNet justify our theory. To the best of our knowledge, this is the first work on the topology-aware generalization of vanilla D-SGD. Code is available at https://github.com/Raiden-Zhu/Generalization-of-DSGD.  ( 2 min )
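    The spectral gap $1-\lambda$ can be computed explicitly for simple topologies. The sketch below assumes a ring of $m \ge 3$ workers in which each worker averages itself and its two neighbours with weight $1/3$, and takes $\lambda$ to be the largest absolute value among the non-leading eigenvalues of that gossip matrix; both choices are illustrative assumptions rather than the paper's setup. Because the matrix is circulant, its eigenvalues are $(1 + 2\cos(2\pi j/m))/3$ for $j = 0, \dots, m-1$.

```python
import math


def ring_spectral_gap(m: int) -> float:
    """Spectral gap 1 - lambda of a uniform-weight ring gossip matrix.

    Assumes each worker mixes itself and its two ring neighbours with
    weight 1/3 (hypothetical topology chosen for illustration).
    """
    if m < 3:
        raise ValueError("need at least 3 workers")
    # Eigenvalues of the circulant matrix, skipping j = 0 (eigenvalue 1).
    eigs = [(1.0 + 2.0 * math.cos(2.0 * math.pi * j / m)) / 3.0
            for j in range(1, m)]
    lam = max(abs(e) for e in eigs)
    return 1.0 - lam
```

    For this ring the gap shrinks as `m` grows, so $\lambda$ approaches $1$ for large sparse topologies, which is the regime where a bound that stays non-vacuous matters.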
    Online Bootstrap Inference For Policy Evaluation in Reinforcement Learning. (arXiv:2108.03706v3 [stat.ML] UPDATED)
    The recent emergence of reinforcement learning has created a demand for robust statistical inference methods for the parameter estimates computed using these algorithms. Existing methods for statistical inference in online learning are restricted to settings involving independently sampled observations, while existing statistical inference methods in reinforcement learning (RL) are limited to the batch setting. The online bootstrap is a flexible and efficient approach for statistical inference in linear stochastic approximation algorithms, but its efficacy in settings involving Markov noise, such as RL, has yet to be explored. In this paper, we study the use of the online bootstrap method for statistical inference in RL. In particular, we focus on the temporal difference (TD) learning and Gradient TD (GTD) learning algorithms, which are themselves special instances of linear stochastic approximation under Markov noise. The method is shown to be distributionally consistent for statistical inference in policy evaluation, and numerical experiments are included to demonstrate the effectiveness of this algorithm at statistical inference tasks across a range of real RL environments.  ( 3 min )
    Stochastic linear optimization never overfits with quadratically-bounded losses on general data. (arXiv:2202.06915v2 [cs.LG] UPDATED)
    This work provides test error bounds for iterative fixed point methods on linear predictors -- specifically, stochastic and batch mirror descent (MD), and stochastic temporal difference learning (TD) -- with two core contributions: (a) a single proof technique which gives high probability guarantees despite the absence of projections, regularization, or any equivalents, even when optima have large or infinite norm, for quadratically-bounded losses (e.g., providing unified treatment of squared and logistic losses); (b) locally-adapted rates which depend not on global problem structure (such as condition numbers and maximum margins), but rather on properties of low norm predictors which may suffer some small excess test error. The proof technique is an elementary and versatile coupling argument, and is demonstrated here in the following settings: stochastic MD under realizability; stochastic MD for general Markov data; batch MD for general IID data; stochastic MD on heavy-tailed data (still without projections); stochastic TD on Markov chains (all prior stochastic TD bounds are in expectation).  ( 2 min )
    Rankings from multimodal pairwise comparisons. (arXiv:2206.13580v1 [stat.ML])
    The task of ranking individuals or teams, based on a set of comparisons between pairs, arises in various contexts, including sporting competitions and the analysis of dominance hierarchies among animals and humans. Given data on which competitors beat which others, the challenge is to rank the competitors from best to worst. Here we study the problem of computing rankings when there are multiple, potentially conflicting modes of comparison, such as multiple types of dominance behaviors among animals. We assume that we do not know a priori what information each behavior conveys about the ranking, or even whether they convey any information at all. Nonetheless we show that it is possible to compute a ranking in this situation and present a fast method for doing so, based on a combination of an expectation-maximization algorithm and a modified Bradley-Terry model. We give a selection of example applications to both animal and human competition.  ( 2 min )
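    The abstract builds on the Bradley-Terry model. As a rough illustration only, the sketch below fits the plain, single-mode Bradley-Terry model with the classic minorization-maximization update; it is not the paper's modified multimodal EM method, and the function names and the `wins` input format are assumptions made for this example.

```python
from collections import defaultdict


def bradley_terry(wins, iterations=200):
    """Fit plain Bradley-Terry strengths with the classic MM update.

    wins: dict mapping (winner, loser) -> number of times winner beat loser.
    Returns a dict of strengths normalized to sum to 1.
    """
    players = sorted({p for pair in wins for p in pair})
    total_wins = defaultdict(float)  # W_i: total wins of player i
    games = defaultdict(float)       # n_ij: games played between i and j
    for (winner, loser), count in wins.items():
        total_wins[winner] += count
        games[(winner, loser)] += count
        games[(loser, winner)] += count

    strength = {p: 1.0 / len(players) for p in players}
    for _ in range(iterations):
        updated = {}
        for i in players:
            # Sum over opponents j of n_ij / (p_i + p_j).
            denom = sum(
                games[(i, j)] / (strength[i] + strength[j])
                for j in players
                if j != i and games[(i, j)] > 0
            )
            updated[i] = total_wins[i] / denom if denom > 0 else strength[i]
        norm = sum(updated.values())
        strength = {p: s / norm for p, s in updated.items()}
    return strength


def ranking(strength):
    """Order players from strongest to weakest."""
    return sorted(strength, key=strength.get, reverse=True)
```

    A player with no wins ends up with strength 0 in this sketch; practical implementations usually add a prior or regularization to avoid that.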
    Constrained Learning with Non-Convex Losses. (arXiv:2103.05134v4 [cs.LG] UPDATED)
    Though learning has become a core component of modern information processing, there is now ample evidence that it can lead to biased, unsafe, and prejudiced systems. The need to impose requirements on learning is therefore paramount, especially as it reaches critical applications in social, industrial, and medical domains. However, the non-convexity of most modern statistical problems is only exacerbated by the introduction of constraints. Whereas good unconstrained solutions can often be learned using empirical risk minimization, even obtaining a model that satisfies statistical constraints can be challenging. All the more so, a good one. In this paper, we overcome this issue by learning in the empirical dual domain, where constrained statistical learning problems become unconstrained and deterministic. We analyze the generalization properties of this approach by bounding the empirical duality gap -- i.e., the difference between our approximate, tractable solution and the solution of the original (non-convex) statistical problem -- and provide a practical constrained learning algorithm. These results establish a constrained counterpart to classical learning theory, enabling the explicit use of constraints in learning. We illustrate this theory and algorithm in rate-constrained learning applications arising in fairness and adversarial robustness.  ( 3 min )
    Feature Learning for Dimensionality Reduction toward Maximal Extraction of Hidden Patterns. (arXiv:2206.13891v1 [cs.LG])
    Dimensionality reduction (DR) plays a vital role in the visual analysis of high-dimensional data. One main aim of DR is to reveal hidden patterns that lie on intrinsic low-dimensional manifolds. However, DR often overlooks important patterns when the manifolds are strongly distorted or hidden by certain influential data attributes. This paper presents a feature learning framework, FEALM, designed to generate an optimized set of data projections for nonlinear DR in order to capture important patterns in the hidden manifolds. These projections produce maximally different nearest-neighbor graphs so that resultant DR outcomes are significantly different. To achieve such a capability, we design an optimization algorithm as well as introduce a new graph dissimilarity measure, called neighbor-shape dissimilarity. Additionally, we develop interactive visualizations to assist comparison of obtained DR results and interpretation of each DR result. We demonstrate FEALM's effectiveness through experiments using synthetic datasets and multiple case studies on real-world datasets.  ( 2 min )
    Memory Safe Computations with XLA Compiler. (arXiv:2206.14148v1 [cs.LG])
    Software packages like TensorFlow and PyTorch are designed to support linear algebra operations, and their speed and usability determine their success. However, by prioritising speed, they often neglect memory requirements. As a consequence, the implementations of memory-intensive algorithms that are convenient in terms of software design can often not be run for large problems due to memory overflows. Memory-efficient solutions require complex programming approaches with significant logic outside the computational framework. This impairs the adoption and use of such algorithms. To address this, we developed an XLA compiler extension that adjusts the computational data-flow representation of an algorithm according to a user-specified memory limit. We show that k-nearest neighbour and sparse Gaussian process regression methods can be run at a much larger scale on a single device, where standard implementations would have failed. Our approach leads to better use of hardware resources. We believe that further focus on removing memory constraints at a compiler level will widen the range of machine learning methods that can be developed in the future.  ( 2 min )
    Nonparametric, Nonasymptotic Confidence Bands with Paley-Wiener Kernels for Band-Limited Functions. (arXiv:2206.13629v1 [stat.ML])
    The paper introduces a method to construct confidence bands for bounded, band-limited functions based on a finite sample of input-output pairs. The approach is distribution-free w.r.t. the observation noises and only the knowledge of the input distribution is assumed. It is nonparametric, that is, it does not require a parametric model of the regression function and the regions have non-asymptotic guarantees. The algorithm is based on the theory of Paley-Wiener reproducing kernel Hilbert spaces. The paper first studies the fully observable variant, when there are no noises on the observations and only the inputs are random; then it generalizes the ideas to the noisy case using gradient-perturbation methods. Finally, numerical experiments demonstrating both cases are presented.  ( 2 min )
    Electronic-structure properties from atom-centered predictions of the electron density. (arXiv:2206.14087v1 [physics.chem-ph])
    The electron density of a molecule or material has recently received major attention as a target quantity of machine-learning models. A natural choice to construct a model that yields transferable and linear-scaling predictions is to represent the scalar field using a multi-centered atomic basis analogous to that routinely used in density fitting approximations. However, the non-orthogonality of the basis poses challenges for the learning exercise, as it requires accounting for all the atomic density components at once. We devise a gradient-based approach to directly minimize the loss function of the regression problem in an optimized and highly sparse feature space. In so doing, we overcome the limitations associated with adopting an atom-centered model to learn the electron density over arbitrarily complex datasets, obtaining extremely accurate predictions. The enhanced framework is tested on 32-molecule periodic cells of liquid water, presenting enough complexity to require an optimal balance between accuracy and computational efficiency. We show that starting from the predicted density a single Kohn-Sham diagonalization step can be performed to access total energy components that carry an error of just 0.1 meV/atom with respect to the reference density functional calculations. Finally, we test our method on the highly heterogeneous QM9 benchmark dataset, showing that a small fraction of the training data is enough to derive ground-state total energies within chemical accuracy.  ( 3 min )
    AutoInit: Automatic Initialization via Jacobian Tuning. (arXiv:2206.13568v1 [stat.ML])
    Good initialization is essential for training Deep Neural Networks (DNNs). Oftentimes such initialization is found through a trial and error approach, which has to be applied anew every time an architecture is substantially modified, or inherited from smaller size networks leading to sub-optimal initialization. In this work we introduce a new and cheap algorithm, that allows one to find a good initialization automatically, for general feed-forward DNNs. The algorithm utilizes the Jacobian between adjacent network blocks to tune the network hyperparameters to criticality. We solve the dynamics of the algorithm for fully connected networks with ReLU and derive conditions for its convergence. We then extend the discussion to more general architectures with BatchNorm and residual connections. Finally, we apply our method to ResMLP and VGG architectures, where the automatic one-shot initialization found by our method shows good performance on vision tasks.  ( 2 min )
    Dynamic Memory for Interpretable Sequential Optimisation. (arXiv:2206.13960v1 [cs.LG])
    Real-world applications of reinforcement learning for recommendation and experimentation face a practical challenge: the relative reward of different bandit arms can evolve over the lifetime of the learning agent. To deal with these non-stationary cases, the agent must forget some historical knowledge, as it may no longer be relevant to minimise regret. We present a solution to handling non-stationarity that is suitable for deployment at scale, to provide business operators with automated adaptive optimisation. Our solution aims to provide interpretable learning that can be trusted by humans, whilst responding to non-stationarity to minimise regret. To this end, we develop an adaptive Bayesian learning agent that employs a novel form of dynamic memory. It enables interpretability through statistical hypothesis testing, by targeting a set point of statistical power when comparing rewards and adjusting its memory dynamically to achieve this power. By design, the agent is agnostic to different kinds of non-stationarity. Using numerical simulations, we compare its performance against an existing proposal and show that, under multiple non-stationary scenarios, our agent correctly adapts to real changes in the true rewards. In all bandit solutions, there is an explicit trade-off between learning and achieving maximal performance. Our solution sits on a different point on this trade-off when compared to another similarly robust approach: we prioritise interpretability, which relies on more learning, at the cost of some regret. We describe the architecture of a large-scale deployment of automatic optimisation-as-a-service where our agent achieves interpretability whilst adapting to changing circumstances.  ( 3 min )

  • Open

    Yandex Open-Sources YaLM Model With 100 Billion Parameters
    Transformers are used for translation and text summarisation because they can analyze sequential input data such as natural language. They use self-attention to weight the importance of each component of the input differently. Large-scale transformer-based language models have recently gained a lot of popularity in computer vision and natural language processing (NLP). They keep growing in size and complexity, yet building them costs millions of dollars, requires top experts, and takes years. Because of this, many companies have been unable to use them, and only large IT organizations have access to this cutting-edge technology. To address these problems, Yandex has developed the largest YaLM model to date, which uses 100 billion parameters. This largest GPT-like neural network for English is currently available for free. The researchers used a pool of 800 A100 graphics cards, 1.7 TB of online materials, books, and countless other sources to train the model over the course of 65 days. They have published the model and relevant materials on GitHub under the Apache 2.0 license, allowing both academic and commercial use. Continue reading | Github submitted by /u/shobha-kakkar [link] [comments]  ( 84 min )
    AI Dream 58 - Unbelievable Explosive Midjourney
    submitted by /u/LordPewPew777 [link] [comments]  ( 82 min )
    How can I get free access to a computer server to use Ai and Photoshop there, my pc is very old, is there a service which provides a free trial?
    submitted by /u/TheblackRook3 [link] [comments]  ( 82 min )
    Google's latest image AI Parti beats Imagen, which is only four weeks old (and DALL-E 2 as well)
    submitted by /u/henlo_there_fren [link] [comments]  ( 83 min )
    First photo I've published from NightCafe
    submitted by /u/PineappleTreePro [link] [comments]  ( 82 min )
    Annotated KDD 2022 paper - Learning Backward Compatible Embeddings
    I read a super interesting KDD 2022 paper recently - "Learning Backward Compatible Embeddings". The paper tackles a common industry problem of ensuring compatibility of newer embeddings with an older downstream model. An annotated version of the paper - Annotated-ML-Papers/Learning Backward Compatible Embeddings.pdf submitted by /u/shreyansh26 [link] [comments]  ( 82 min )
    A New Technique to Train Diffusion Model in Latent Space Using Limited Computational Resources While Maintaining High-Resolution Quality
    In recent years, image synthesis has experienced exponential growth in performance. The two main approaches to this task have been autoregressive transformers (ARs) and generative adversarial networks (GANs). The former are trained for sequence prediction and generate images token by token, starting from the first one. The latter are based on the well-known generator-discriminator setup, where the generator tries to fool the discriminator by producing realistic samples. Nevertheless, both approaches have major limitations: ARs require billions of parameters to be trained, while GANs rely on the minimax loss, which has been shown to often lead to mode collapse and training instability. Diffusion models (DMs) have recently shown excellent results in different image synthesis tasks. They are based on two stages. In the first, noise is added to the data step by step as a Markov chain, meaning that each step depends solely on the previous step. This process is repeated until most of the information in the original sample is lost. Then a denoising process is applied, aiming to reconstruct the image from the noisy version. Continue reading | Checkout the paper and github https://i.redd.it/4w5te1twbe891.gif submitted by /u/shobha-kakkar [link] [comments]  ( 84 min )
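    The forward (noising) stage described above can be sketched in a few lines. This is a minimal sketch assuming the common Gaussian formulation, where each step computes x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise. The linear schedule, its default values, and all function names are assumptions for illustration and are not taken from the linked paper.

```python
import math
import random


def linear_beta_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Noise variances beta_1..beta_T, increasing linearly (assumed schedule)."""
    if steps == 1:
        return [beta_start]
    gap = (beta_end - beta_start) / (steps - 1)
    return [beta_start + gap * t for t in range(steps)]


def forward_diffuse(x0, betas, rng):
    """Run the Markov chain: each step depends only on the previous sample."""
    x = list(x0)
    for beta in betas:
        keep, noise = math.sqrt(1.0 - beta), math.sqrt(beta)
        x = [keep * v + noise * rng.gauss(0.0, 1.0) for v in x]
    return x


def remaining_signal(betas):
    """Factor multiplying the original sample after all steps."""
    return math.sqrt(math.prod(1.0 - b for b in betas))


betas = linear_beta_schedule(1000)
noisy = forward_diffuse([1.0, -1.0, 0.5], betas, random.Random(0))
```

    With this schedule `remaining_signal(betas)` is close to zero, which matches the statement that the chain runs until most of the original information is lost. The denoising stage, which needs a trained network, is not shown.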
    "A World Undone" collection so far | NFT's for environmental protection
    submitted by /u/VictorTuring [link] [comments]  ( 82 min )
    Getting started in AI that analyses data for someone who already knows programming...?
    Hey there! I've been making games and programming to do that for around 8 years now. I have pretty good know-how of most major programming languages, technologies, and techniques in that realm by now, but there's one thing that I've always struggled with: AI. Whenever I try to research it, I always seem to end up going down "buzz-word loopholes" as I like to call them, similar, to a certain extent, to how if you try to research VR game dev now you might end up looking into the "metaverse" and stuff... I find lots of articles / YouTube videos that either explain one very specific thing or go too abstract and explain how it works but not how to actually do it. What I'm really interested in is designing algorithms like YouTube's, TikTok's or Google's, the ones that analyse large datasets and alter the platform depending on the results. I know quite a bit of this is machine learning now, but I actually want to gain a good understanding of how to write these algorithms and how to actually implement machine learning to make them, since I can think of many use cases where AI like this could be used in game development and other areas I'm interested in - not to mention, this just sounds like a fun thing to learn! I'm happy to work with whatever languages and to learn new tools (of course!), but what I am really interested in is learning to create AI that analyses data specifically. All I've found so far when I wasn't just hitting those "buzz-word loopholes" was simple AI that can do things like solving sudoku or more complex ones like analysing images - but the issue is that, to a certain degree, I already understand those sub-topics, and it isn't really the kind of AI I'm interested in learning about. TL;DR: if anyone has any recommendations of resources specifically targeted at AI that analyses data, or if I'm completely wrong and need to learn something else first, please chuck us a comment, it'd be much appreciated! submitted by /u/Ping-and-Pong [link] [comments]  ( 86 min )
    Hi, is there an AI of some sort that I can feed a bunch of random images and let it create a sort of blend of them?
    submitted by /u/disnotmeiswear [link] [comments]  ( 83 min )
    AI Art Charity Project
    I am working on a Midjourney-based project to raise money for the AI for Good Foundation (ai4good.org). If anyone is interested in hearing more, please feel free to send me a DM. We are looking for volunteers in a few different areas: 1) (Extremely) part-time experts in ML/AI/GANs to answer people's questions 2) Artists to make AI/human collaboration art and post it in our "cyborg gallery" 3) People to make AI art using Midjourney (we have invites) and post it to Reddit etc. with a link to our Discord server. 4) (most important) people to handle prompt requests and generate/send people their results. This project was given the green light by a team member at MJ, so they are fine with our charity project. Thanks for reading, and once again, please reach out if you are interested! submitted by /u/Accomplished_Head5 [link] [comments]  ( 83 min )
    Human biases in Artificial Intelligence
    submitted by /u/HumanSeeing [link] [comments]  ( 83 min )
    What is the computation cost of a DALL-E image generation?
    submitted by /u/theo_champion [link] [comments]  ( 82 min )
    AI GENERATED ART (but it is horrifying)
    submitted by /u/CALP_is_holy [link] [comments]  ( 82 min )
    Google's powerful AI spotlights a human cognitive glitch: Mistaking fluent speech for fluent thought
    submitted by /u/bartturner [link] [comments]  ( 82 min )
    I Made an AI That Punishes Me if it Detects That I am Procrastinating on My Assignments
    submitted by /u/_ayushp_ [link] [comments]  ( 86 min )
    Elect Lamda
    submitted by /u/IwishIwasinOhio [link] [comments]  ( 83 min )
    BOSSCHAERT BOUQUET | 4K 24 FPS (FILM EDIT) | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 83 min )
  • Open

    [D] Showing the important design decisions in MLOPS you made in your past jobs
    I've heard a lot that it's not just the tools that matter the most in the field of machine learning and MLOps, but mostly the design decisions one has to make in order to make stable models be pushed to production. And the design pattern varies from one organisation to another. While applying for a new job, how does an MLOps engineer showcase the most important design decisions they made in their past career? submitted by /u/metalvendetta [link] [comments]  ( 84 min )
    Understanding the difference between Time Series Analysis and "normal" Prediction with Regression in Forecasting? [D]
    I am currently working on a dataset for earnings of a specific market segment. For that I have created a dataset with the earnings as my dependent variable and multiple market parameters as my independent variables. For that I have accumulated data on weather, avg speed and so on for each month in 30 years, in total 80 different variables. Now I wanted to create a forecast with different time shifts (1 month, 2, 3, 4...). The goal is to have a forecast for market movement with data some x months prior: "How could earnings look next month with the knowledge of today?" I used different regression methods, compared them and now have a model that can predict these values with an accuracy X. The results themselves aren't bad, but not as good as I imagined. However, I think that's normal regardin…  ( 87 min )
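    The time-shift setup described in the post (features known in month t, earnings in month t + x) can be written down as below. This is a sketch under assumptions: the function names, the list-of-rows data layout, and the 80/20 chronological split are illustrative and do not come from the post.

```python
def make_lagged_pairs(features, target, horizon):
    """Pair the feature row of month t with the target of month t + horizon."""
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    if len(features) != len(target):
        raise ValueError("features and target must have the same length")
    return [(features[t], target[t + horizon])
            for t in range(len(target) - horizon)]


def chronological_split(pairs, train_fraction=0.8):
    """Split without shuffling so the test months come after the training months."""
    cut = int(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]


# Toy data: one feature per month and an earnings series.
features = [[i] for i in range(6)]
earnings = [10 * i for i in range(6)]
pairs = make_lagged_pairs(features, earnings, horizon=2)
train, test = chronological_split(pairs)
```

    Keeping the split chronological is one practical difference from "normal" regression: a random split would let the model train on months that come after the ones it is tested on.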
    [D][P] YOLOv6: state-of-the-art object detection at 1242 FPS
    YOLOv6 has been making a lot of noise in the past 24 hours. Based on its performance - rightfully so. YOLOv6 is a single-stage object detection framework dedicated to industrial applications, with hardware-friendly efficient design and high performance. It outperforms YOLOv5 in accuracy and inference speed, making it the best OS version of YOLO architecture for production applications. I dived into the technical details published by the research group and made a quantitative and qualitative comparison between the results of YOLOv5 and YOLOv6. I invite you to read about all of these, with a bit of history on YOLO, in my new blog submitted by /u/RepresentativeCod613 [link] [comments]  ( 84 min )
    Creating and Analyzing a Dataset of Roe v. Wade Tweets Labeled by Abortion Stance [P]
    How do pro-choice vs. pro-life twitter users differ? I built a free, labeled dataset of #RoeVsWade tweets, and an ML classifier on top. Some insights: Pro-life users are 20.4x more likely to put "christ" and 16.1x more likely to put "maga" in their bio. Pro-choice users are 7.5x more likely to put "blm" and 6.5x more likely to put "she/her". Full analysis + link to raw dataset here. submitted by /u/BB4evaTB12 [link] [comments]  ( 84 min )
    [N] PyTorch 1.12: TorchArrow, Functional API for Modules and NvFuser
    PyTorch 1.12 Release Notes Highlights Backwards Incompatible Change New Features Improvements Performance Documentation Highlights We are excited to announce the release of PyTorch 1.12! This release is composed of over 3124 commits, 433 contributors. Along with 1.12, we are releasing beta versions of AWS S3 Integration, PyTorch Vision Models on Channels Last on CPU, Empowering PyTorch on Intel® Xeon® Scalable processors with Bfloat16 and FSDP API. We want to sincerely thank our dedicated community for your contributions. Summary: Functional Module API to functionally apply module computation with a given set of parameters Complex32 and Complex Convolutions in PyTorch DataPipes from TorchData fully backward compatible with DataLoader Functorch with improved coverage for APIs nvFuser a deep learning compiler for PyTorch Changes to float32 matrix multiplication precision on Ampere and later CUDA hardware TorchArrow, a new beta library for machine learning preprocessing over batch data https://github.com/pytorch/pytorch/releases/tag/v1.12.0 https://pytorch.org/blog/pytorch-1.12-released/ submitted by /u/DreamFlasher [link] [comments]  ( 85 min )
    [D] [P] Questions about the usability of Shapley values on large feature spaces.
    Hello! I am planning a research project which involves creating a classification DNN that takes in a frame from a molecular dynamics simulation of a protein, which encodes each amino acid's level of energetic interaction, and tries to predict whether that frame came from protein state "A" or protein state "B." I want to analyze the feature importance, that is, the importance of each amino acid's energetic interaction level for making the classification prediction. Although I have heard of some interesting applications of Shapley values to perform such an analysis of feature importance, the input layer of the model I am thinking of making would require 100+ neurons, as there are 100+ features. The feature space is so large because I am investigating how a model learns which amino acids are most important for predicting which state a protein is in, where the protein is 100+ amino acids in length. Can Shapley methods handle a model with a feature space that large, or would the computational cost of such a process be infeasible? Apologies if this question is a little unclear; let me know if anything needs to be clarified. Thanks! submitted by /u/ben_cow [link] [comments]  ( 85 min )
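    On the cost question: exact Shapley values need on the order of 2^n model evaluations for n features, which is out of reach for 100+ features, but they can be estimated by sampling random feature orderings. The sketch below shows that permutation-sampling estimator in plain Python. The `predict` callable, the single `baseline` input used to stand in for "absent" features, and the parameter names are assumptions for illustration, not the API of any particular library.

```python
import random


def shapley_sampling(predict, x, baseline, num_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal contributions.

    predict:  callable taking a list of feature values, returning a number.
    x:        the input to explain.
    baseline: reference input whose values represent "feature absent".
    Cost is num_permutations * (len(x) + 1) model calls instead of 2 ** len(x).
    """
    rng = random.Random(seed)
    n = len(x)
    phi = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]           # switch feature i on
            value = predict(current)
            phi[i] += value - previous  # marginal contribution of i
            previous = value
    return [p / num_permutations for p in phi]


# Toy linear "model": its exact Shapley values are w_i * (x_i - baseline_i).
weights = [2.0, -1.0, 0.5]
linear = lambda v: sum(w * f for w, f in zip(weights, v))
estimate = shapley_sampling(linear, [1.0, 2.0, 4.0], [0.0, 0.0, 0.0])
```

    With 100 features and 200 permutations this is roughly 20,000 forward passes per explained frame, which is usually feasible for a small DNN; the estimate gets noisier when features interact strongly.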
    [R] Annotated KDD 2022 paper - Learning Backward Compatible Embeddings
    I read a super interesting KDD 2022 paper recently - "Learning Backward Compatible Embeddings". The paper tackles a common industry problem of ensuring compatibility of newer embeddings with an older downstream model. An annotated version of the paper - Annotated-ML-Papers/Learning Backward Compatible Embeddings.pdf submitted by /u/shreyansh26 [link] [comments]  ( 84 min )
    [D] Run apps and dev environments in the cloud with a single command
    Hi everyone, I'm the creator of dstack, a tool that makes it easier to train models in the cloud. The tool can be extended with custom providers to support different languages, frameworks, etc. All the built-in providers are also open-source. Today, we've released a new update that extends the capabilities of dstack beyond training models, and now also allows users to quickly build and share apps with Streamlit, Gradio, and FastAPI in the cloud – in just a few clicks. Similar to apps, it's possible to run dev environments with the required hardware and data access, also with one command from the terminal. All you have to do is link your own AWS account to run commands. I invite everyone to read it and share their thoughts. Happy to discuss the approach and what would be great to have! Blog post: https://blog.dstack.ai/introducing-apps-and-dev-environments P.S.: Currently, it's possible to run models and apps only in the configured cloud. If you'd like the tool to also allow you to run it locally, and if you would like this part to be open-source too, please leave comments! 🤗 submitted by /u/cheptsov [link] [comments]  ( 85 min )
    [R] Softmax Linear Units
    submitted by /u/the_great_magician [link] [comments]  ( 83 min )
    [R] Probabilistic Numerics: Computation as Machine Learning (Free Book!)
    Abs: Probabilistic numerical computation formalises the connection between machine learning and applied mathematics. Numerical algorithms approximate intractable quantities from computable ones. They estimate integrals from evaluations of the integrand, or the path of a dynamical system described by differential equations from evaluations of the vector field. In other words, they infer a latent quantity from data. This book shows that it is thus formally possible to think of computational routines as learning machines, and to use the notion of Bayesian inference to build more flexible, efficient, or customised algorithms for computation. The text caters for Masters' and PhD students, as well as postgraduate researchers in artificial intelligence, computer science, statistics, and applied mathematics. Extensive background material is provided along with a wealth of figures, worked examples, and exercises (with solutions) to develop intuition. Link to book: https://www.probabilistic-numerics.org/textbooks/ submitted by /u/bikeskata [link] [comments]  ( 85 min )
    [D] Have compression techniques ever been applied to the likes of GPT-3 & DALLE-2?
    Large language models and the recent spurt of diffusion-based text-to-image models are gosh-darn fun to play with, but due to their size and expensive training costs, they're only accessible via an API or if you yourself have access to a large # of GPUs. Yet there are also a number of compression techniques like pruning and quantization that can drastically reduce the size (+90%), and thus the computational requirements, of a trained model. Has there been any work looking at applying such techniques to these gigantic models floating around to make them more accessible? submitted by /u/Farconion [link] [comments]  ( 86 min )
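    For readers unfamiliar with the two techniques named in the post, here is a toy sketch of each on a flat list of weights: symmetric 8-bit quantization and magnitude pruning. The function names and the per-tensor scaling choice are assumptions for illustration; real toolkits work on tensors and usually add calibration and fine-tuning steps that are not shown.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [scale * v for v in q]


def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


weights = [0.5, -1.27, 0.01, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
pruned = magnitude_prune(weights, 0.5)
```

    Storing `q` as 8-bit integers instead of 32-bit floats cuts the memory for these weights by about 4x on its own; the larger reductions mentioned in the post would come from combining it with pruning and sparse storage.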
    [P] Clustering long documents with Transformers in 10 minutes
    Transformers are awesome for so many things in 2022, but one thing I've found them to struggle with is generating embeddings for long documents. I put together a blog post going through some interesting techniques. Let me know if it helped you! Blog post submitted by /u/BlockDesigns [link] [comments]  ( 84 min )
    [N] Quaterion, a blazingly fast framework for similarity learning.
    Just released. Quaterion — an open source framework for training and fine-tuning similarity learning models. It enables you to train models significantly (100x) faster, and iterate over experiments in minutes instead of hours even with a laptop GPU. It takes advantage of the PyTorch Lightning backend to make a flexible and scalable learning pipeline. GitHub https://github.com/qdrant/quaterion Here is a demo of the caching functionality. https://i.redd.it/9qi8gf9n4d891.gif submitted by /u/devzaya [link] [comments]  ( 84 min )
    [p] RestifyML - AI/ML Tool for Developers to quickly experiment with data and generate AI/ML REST API to consume back into their application
    Developers can use RestifyML to: create DataScience experiments; create a Data Source and upload CSV data within the experiment; do Data Cleansing and Sanitization; visualize raw data using Data Exploration; select Features which would help in building models; build a Model, and save or export it; finally, deploy the Model and expose it as a REST API; consume the Machine Learning REST API from any Application. Profit! https://github.com/rebataur/RestifyML Feedback/Feature Requests appreciated. submitted by /u/rebataur [link] [comments]  ( 85 min )
    [D] Can a transformer neural network learn to predict sequences longer than it saw?
    Simple task: the transformer has to repeat a sequence of random integers (0-9) of varied length, like: sequence length=7: input [1, 3, 5, 6, 2, 4, 0] - output [1, 3, 5, 6, 2, 4, 0]; sequence length=3: input [5, 4, 9] - output [5, 4, 9]; sequence length=4: input [6, 3, 9, 8] - output [6, 3, 9, 8] ... Each integer (0-9) can be stored in an embedding layer so we can pass it to the transformer. I trained a transformer (generic PyTorch model with positional embeddings) on a dataset (1000 examples) of sequences of varied length (1 to 12) and it predicts sequences well within the range of 12. It fails to predict sequences longer than 12 - 13. sequence length=20: input [3, 3, 4, 0, 0, 7, 1, 5, 1, 0, 7, 1, 9, 0, 9, 1, 5, 2, 3, 6] - output [3, 3, 4, 0, 0, 7, 1, 5, 1, 0, 7, 1, 7, 1, 7, 1, 0, 7, 0, 7]. Is it considered an extrapolation task? Are there types of transformers (or other neural networks) that can handle the problem? Same issue with recurrent neural networks (RNN, LSTM, GRU). submitted by /u/InternationalVisito [link] [comments]  ( 89 min )
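    To make the experiment easy to reproduce, the copy task can be generated and scored with a few lines of plain Python. The sizes follow the post (1000 examples, lengths 1 to 12, digits 0-9); the function names and the seed are assumptions, and the model itself is left out.

```python
import random


def make_copy_dataset(num_examples=1000, min_len=1, max_len=12,
                      vocab_size=10, seed=0):
    """Build (input, target) pairs where the target repeats the input."""
    rng = random.Random(seed)
    data = []
    for _ in range(num_examples):
        length = rng.randint(min_len, max_len)  # inclusive on both ends
        seq = [rng.randrange(vocab_size) for _ in range(length)]
        data.append((seq, list(seq)))
    return data


def first_mismatch(predicted, target):
    """Index of the first wrong token, or None if the sequences are identical."""
    for i, (p, t) in enumerate(zip(predicted, target)):
        if p != t:
            return i
    if len(predicted) != len(target):
        return min(len(predicted), len(target))
    return None


train_set = make_copy_dataset()

# The length-20 example from the post: input and the model's output.
source = [3, 3, 4, 0, 0, 7, 1, 5, 1, 0, 7, 1, 9, 0, 9, 1, 5, 2, 3, 6]
output = [3, 3, 4, 0, 0, 7, 1, 5, 1, 0, 7, 1, 7, 1, 7, 1, 0, 7, 0, 7]
break_point = first_mismatch(output, source)
```

    For the example in the post the first error is at index 12, the first position beyond the longest training sequence, which suggests the model's handling of positions it never saw in training is what breaks.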
    [N] PyTorch 1.12 released
    PyTorch 1.12 is available through the pytorch conda channel and pypi Release notes Issue tracker Highlights We are excited to announce the release of PyTorch 1.12! This release is composed of over 3124 commits, 433 contributors. Along with 1.12, we are releasing beta versions of AWS S3 Integration, PyTorch Vision Models on Channels Last on CPU, Empowering PyTorch on Intel® Xeon® Scalable processors with Bfloat16 and FSDP API. We want to sincerely thank our dedicated community for your contributions. Summary: Functional Module API to functionally apply module computation with a given set of parameters Complex32 and Complex Convolutions in PyTorch DataPipes from TorchData fully backward compatible with DataLoader Functorch with improved coverage for APIs nvFuser a deep learning compiler for PyTorch Changes to float32 matrix multiplication precision on Ampere and later CUDA hardware TorchArrow, a new beta library for machine learning preprocessing over batch data Other notable changes CUDA 11.6 wheels torch.amp module submitted by /u/M4mb0 [link] [comments]  ( 85 min )
    [P] DALL-E Mini stripped to its bare essentials and converted to PyTorch
    submitted by /u/pcaversaccio [link] [comments]  ( 86 min )
    [R] Welcome to my continuous, free live machine learning class with intermediate mathematics
    Dear all, you are welcome to join my ongoing ML knowledge dissemination class via Zoom. I will continue to explain machine learning using intermediate-level mathematics. It takes place every second Thursday at 11:00 GMT (HK 7pm / SYD 9pm); the next class is on June 30. The current topic is "Determinantal Point Process". I'll fully explain its beautiful mathematics over a few sessions. This is a powerful model for modeling diverse subsets, yet it is not as commonly used as it should be! You can find my notes on my GitHub site: https://github.com/roboticcam/machine-learning-notes/ The Determinantal Point Process notes can be found at: https://github.com/roboticcam/machine-learning-notes/blob/master/files/dpp_new.pdf You need a solid understanding of linear algebra, calculus, probability and statistics. But if you just want to get a feel for how DPP works, for example, and meet like-minded people, please come too! To join, sign up for whichever of the meetup groups you see fit: https://www.meetup.com/machine-learning-hong-kong/ https://www.meetup.com/deep-learning-sydney/ https://www.meetup.com/Deep-Learning-Melbourne/ https://www.meetup.com/machine-learning-athens/ submitted by /u/MLknowledge [link] [comments]  ( 85 min )
    [D] Surface rendering in Diffusion Probability Text-to-Image Generators.
    Two diffusion text-to-image generators are Google's Imagen and OpenAI's DALLE.2. DALLE.2 uses a multimodal large language model called CLIP to encode an input text prompt. The output is produced by a reverse encoder called a diffusion probability model. Diffusion models have previously seen huge successes in image super resolution and denoising. One peculiar aspect of DALLE.2's output is that it is capable of generating light sources in certain (seemingly) 3D locations in the scene, then correctly lighting the objects based on their implied location. DALLE.2 can also perform image completions from a starting image prompt. The two examples below are a Spongebob dish sponge in a sink and Vermeer's famous earring painting. https://i.imgur.com/vVI6IOI.png . https://i.imgur.com/8h48lTg.png . One plausible explanation for these physically perfect surface reflections is that DALLE.2 performs a phase where the image is reverse-encoded into a 3D scene. That scene is then rendered back into a 2D output image. However, when consulting the primary literature, no such conversion to a 3D model is seen anywhere along the DALLE.2 workflow. The implication is that DALLE.2 must contain a wealth of priors related to light transport, gleaned from 2D training images alone. This means these priors are being applied (mostly correctly) to particular instantiations of objects and surfaces in scenes. This application is performed even to the point where wet metallic surfaces have correct blurring in reflections. Further investigations of this phenomenon would involve finding some user prompts that generated a scene containing light casting a sharp shadow onto a flat surface. Another would be requesting a reflective object in the text prompt itself. Your thoughts? submitted by /u/moschles [link] [comments]  ( 87 min )
    [P] First-class Dims - a generalization of einops and named tensors
    Jupyter Notebook: https://colab.research.google.com/drive/1BsVkddtVMX35aZAvo2GyI-wSFPVBCWuA Github: https://github.com/facebookresearch/torchdim Some tweet threads about it Mine: https://twitter.com/cHHillee/status/1541536627746426881 Sasha Rush: https://twitter.com/srush_nlp/status/1541526906113298433 submitted by /u/programmerChilli [link] [comments]  ( 84 min )
    [D] How to evaluate the gain of a new feature without training?
    When evaluating the effectiveness of a new feature, it is common to train a model with/without this feature to compare the difference. But sometimes training a model based on huge amounts of data is both time and energy consuming. I was wondering if there are some lightweight ways to estimate the importance of the new feature without training? Computing descriptive statistics such as feature coverage, histogram and correlation matrix might be necessary, are there other pre-processing methods? submitted by /u/fishiwhj [link] [comments]  ( 84 min )
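One cheap check along the lines the post mentions (correlation) can be sketched as follows. This is only an illustration under assumptions: the function names `pearson` and `residual_correlation` are hypothetical, and correlating the new feature with the residuals of an already-trained model is a heuristic that only picks up linear signal, not a substitute for retraining.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def residual_correlation(new_feature, labels, current_predictions):
    """Correlate the candidate feature with what the current model still gets wrong.

    Assumption: `current_predictions` come from the existing model (without the
    new feature), evaluated on held-out rows.
    """
    residuals = [y - p for y, p in zip(labels, current_predictions)]
    return pearson(new_feature, residuals)
```

A value near zero does not prove the feature is useless (the relationship may be non-linear or an interaction), so a result like this is better read as a quick screen than as a verdict.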
    [D] Laplacian positional encodings
    I just finished reading "Benchmarking Graph Neural Networks" (Dwivedi et al. 2020) and "A Generalization of Transformer Networks to Graphs" (also Dwivedi et al. 2020), and came across the claim that the eigenvectors of the Laplacian of a graph "represent a natural generalization of the Transformer (Vaswani et al., 2017) positional encodings (PE)". Xavier Bresson tweeted the same thing. So I worked out the eigenvectors of the Laplacian of a path graph (a line of vertices connected by edges like so: v-v-v-...-v), which is the kind of graph used in NLP to represent a sequence of tokens, and found that the ith eigenvector's kth entry is v_i(k) = cos(πik/n − πi/2n) where n is the number of tokens in the sequence, which is very different from the sinusoidal PEs used in transformers in NLP. I tried working out a change of variables, but nothing's worked so far. Are Laplacian eigenvectors just not the generalizations they're claimed to be, or am I missing something here? submitted by /u/hegelian_waffle [link] [comments]  ( 85 min )
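The closed form quoted in the post can be checked numerically. The sketch below assumes the indexing k = 1..n for vertices and i = 0..n-1 for eigenvectors (the post does not state it), and uses the eigenvalue 2 − 2cos(πi/n), which is the standard value for the path-graph Laplacian but is not given in the post. The function names are made up for illustration.

```python
import math


def path_laplacian(n):
    """Combinatorial Laplacian L = D - A of the path graph v-v-...-v on n vertices."""
    L = [[0.0] * n for _ in range(n)]
    for k in range(n - 1):
        L[k][k + 1] = L[k + 1][k] = -1.0  # edge between vertex k and k+1
        L[k][k] += 1.0
        L[k + 1][k + 1] += 1.0
    return L


def closed_form_eigvec(n, i):
    """Entries v_i(k) = cos(pi*i*k/n - pi*i/(2n)) for k = 1..n (assumed indexing)."""
    return [math.cos(math.pi * i * k / n - math.pi * i / (2 * n))
            for k in range(1, n + 1)]


def max_residual(n):
    """Largest entry of |L v_i - lambda_i v_i| over all i, with lambda_i = 2 - 2cos(pi*i/n)."""
    L = path_laplacian(n)
    worst = 0.0
    for i in range(n):
        v = closed_form_eigvec(n, i)
        lam = 2.0 - 2.0 * math.cos(math.pi * i / n)
        for row, v_k in zip(L, v):
            Lv_k = sum(a * b for a, b in zip(row, v))
            worst = max(worst, abs(Lv_k - lam * v_k))
    return worst
```

Under these assumptions the residual is at floating-point level, so the formula in the post holds. These vectors are the DCT-II basis: cosines with linearly spaced frequencies πi/n and a half-sample phase shift, whereas the sinusoidal encodings of Vaswani et al. use sine/cosine pairs at geometrically spaced frequencies.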
  • Open

    Exploring emerging topics in artificial intelligence policy
    The second AI Policy Forum Symposium convened global stakeholders across sectors to discuss critical policy questions in artificial intelligence.  ( 7 min )
  • Open

    Gaussian Processes for Cartpole Environments
    Good day all, I have previously seen some Fitted Q Iteration tutorials in a cart pole environment in which neural networks were used to update Q values (e.g. https://github.com/seungjaeryanlee/implementations-nfq/tree/master/nfq). I am interested in doing something similar, except that I have to replace those neural network estimators with Gaussian processes. Can anyone please recommend some useful tutorials (free or paid) on using Gaussian processes for a Cartpole setup? I have some, but they are a little too theoretical, with little or no practical programming. I would also appreciate links to libraries or repos that provide more insight on the subject. submitted by /u/Thin-Ad9581 [link] [comments]  ( 83 min )
    "DALL·E 2 Pre-Training Mitigations", Nichol 2022 (how OA censored it: heavy filtering by training a classifier w/active-learning; reweighting; dupe deletion)
    submitted by /u/gwern [link] [comments]  ( 83 min )
    Animo Island makes machine learning fun and easy to learn so that anyone can harness the power of reinforcement learning! 🤖 🏝️
    submitted by /u/AnimoIsland [link] [comments]  ( 83 min )
    Suicidal Agents (blog post)
    Hey guys, I wrote my first blog post on RL about changing the reward function by a constant and how this can result in a different policy. At first thought this feels strange since the constant should not affect the expected sum of returns! Please let me know what you think. https://ea-aguilar.gitbook.io/rl-vault/food-for-thought/suicidal-agents Also, I'm not such a big fan of medium bc I want to keep the option to write more equations, but it seems it's the de-facto place to blog about ML/RL. Do you recommend also posting there? context: A couple of years ago I made a career switch into RL - and recently have been wanting to write more. So as an exercise, I want to start writing down some cute observations/thoughts about RL. I figure this could also help some people out there who are just now venturing into the field. submitted by /u/EdAlexAguilar [link] [comments]  ( 85 min )
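The effect described in the post can be illustrated with a toy calculation. This is not taken from the blog post: it is a minimal sketch assuming an undiscounted episodic task with two options, a 5-step walk to a goal worth +1 or an immediate one-step exit, and all names and numbers are hypothetical.

```python
def episode_return(num_steps, step_reward, terminal_bonus, shift):
    """Undiscounted return when a constant `shift` is added to every step's reward."""
    return num_steps * (step_reward + shift) + terminal_bonus


def best_action(shift, steps_to_goal=5, goal_bonus=1.0):
    """Compare walking to the goal with ending the episode after a single step."""
    returns = {
        "walk_to_goal": episode_return(steps_to_goal, 0.0, goal_bonus, shift),
        "quit_now": episode_return(1, 0.0, 0.0, shift),
    }
    return max(returns, key=returns.get), returns
```

With `shift=0.0` walking wins (return 1 versus 0); with `shift=-1.0` quitting wins (return -1 versus -4). The constant is multiplied by the episode length, so it only cancels out when every policy produces episodes of the same length.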
    Simple continuous environment with spaceship but yet challenging for RL algorithms (like SAC, TD3)
    Hello All. We have designed a set of continuous reinforcement learning environments with locomotion tasks in space. The goal is to navigate a (planar) spaceship to reach prescribed goals or enter a prescribed orbit. The tasks in general seem simple, but we were surprised that they pose a serious challenge for vanilla RL approaches. We learned a lot from the environment design process. We find it particularly challenging to shape the reward function appropriately so that the RL algorithm converges to a satisfactory control. We used the stable-baselines3 implementations of SAC, TD3 and PPO with default hyperparameters (tuned for MuJoCo). One set of environments is about reaching consecutive goals (regenerated randomly). With 2 planets, the SAC agent performs perfectly and matches the human baseline score (we have a keyboard-controlled agent) of 4715 +- 799. [Figure: SAC & TD3 evaluation curves for 2 planet goal env] [Figure: the best agent (SAC) for 2 planet goal env] When an additional planet is added, the SAC agent performs poorly, and its performance is far from the human baseline score of 4659 +- 747. [Figure: SAC & TD3 evaluation curves for 2 planet goal env] [Figure: the best agent (SAC) for 3 planet goal env] With 4 planets the performance drops even further. We could not explain the dramatic performance drop when increasing the number of planets from 2 to more (3, 4). The AI seemingly could not learn the principles of the gravitational force here. Any ideas which RL algorithm would do better here? We plan to take a look at Physics-Informed RL. In case you want to take a look, the envs are published here: https://github.com/MIMUW-RL/space-gym Best, Jacek submitted by /u/dzako1 [link] [comments]  ( 85 min )
    Actions that you can only take once
    We are working on developing a DQN approach to sequence actions. The actions can only be taken once. I have read in several threads that you can prevent illegal actions from being selected both during learning (taking the max value over only legal actions) and during actual policy implementation (same), and like this your policy stays always legal. But my question is: do you need to supply the list of "exhausted actions" as part of the state? How would the q network be able to know what value to expect, when the remaining actions are completely determined by the already taken actions, if they are not supplied as part of the state at the input of the network? I have not found a single reference where the need to input the exhausted actions as part of the state is described. Any help or guidance would be greatly appreciated. C submitted by /u/Fresh-Literature-623 [link] [comments]  ( 85 min )
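The two ingredients discussed in the post, taking the max only over legal actions and exposing the already-used actions to the network, can be sketched as below. This is a minimal illustration with hypothetical helper names, not a reference implementation. Appending a binary "used" vector to the observation is one common way to keep the state Markov when the remaining actions cannot be inferred from the observation alone.

```python
import math


def masked_argmax(q_values, used):
    """Index of the highest Q-value among actions that have not been used yet.

    Returns None when every action is exhausted (episode should terminate).
    """
    best_action, best_q = None, -math.inf
    for action, q in enumerate(q_values):
        if action in used:
            continue  # illegal: this action was already taken
        if q > best_q:
            best_action, best_q = action, q
    return best_action


def state_with_mask(observation, used, num_actions):
    """Concatenate the observation with a 0/1 vector marking exhausted actions."""
    mask = [1.0 if action in used else 0.0 for action in range(num_actions)]
    return list(observation) + mask
```

The same masked maximum would be applied when forming the bootstrap target for the next state, so that the target never relies on the value of an action that is no longer available.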
    I am new to RL, problem understanding on how to apply it
    Hey! I am very new to reinforcement learning and I am writing my bachelor thesis on a game where I have to use a learning method, but I don't really understand how to approach it. I hope the question is fine. I have to read a paper and implement it, then turn it into a repeated game and apply simple learning to it. The paper is about a pirate-farmer game where there are 3 islands. The farmer chooses an island and plants flowers on it. The islands have different sizes: one holds 3 flowers, one holds 4 and the last holds 8. If the pirate chooses the same island as the farmer, he gets the flowers; otherwise the farmer keeps them. This game is then played for multiple rounds, and the paper basically talks about the Nash equilibrium, i.e. the probability with which each player chooses each island. I have talked to my tutor about it and she told me to try to apply Q-learning to it, but I don't exactly understand how to do that. When I read about Q-learning and watched videos about it, people mostly used it for a treasure hunt, to find the shortest path from one location to another. However, I don't understand how to make the game repeated without changing the game itself, if that makes sense? Sorry if the question doesn't make a lot of sense; like I said, I am still pretty new to this. submitted by /u/False-Bluebird-3538 [link] [comments]  ( 85 min )
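One way to read "repeated game plus simple learning" is to let both players be independent, stateless learners that play the same one-shot game every round. The sketch below makes several assumptions not found in the post or the paper: payoffs are simply the number of flowers won or kept, the Q-learning update is reduced to its single-state (bandit-style) form with no next-state term, and exploration is epsilon-greedy. All names are hypothetical.

```python
import random

FLOWERS = [3, 4, 8]  # capacity of each island, as described in the post


def payoffs(farmer_island, pirate_island):
    """Return (farmer_reward, pirate_reward) for one round."""
    planted = FLOWERS[farmer_island]
    if farmer_island == pirate_island:
        return 0, planted  # the pirate found the flowers
    return planted, 0      # the farmer keeps them


def q_update(q, action, reward, alpha=0.1):
    """Stateless Q-learning step: move Q[action] toward the observed reward."""
    q[action] += alpha * (reward - q[action])


def epsilon_greedy(q, epsilon, rng):
    """Pick a random island with probability epsilon, else the current best."""
    if rng.random() < epsilon:
        return rng.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])


def play(rounds=1000, alpha=0.1, epsilon=0.1, seed=0):
    """Repeat the one-shot game; each player learns only from its own rewards."""
    rng = random.Random(seed)
    q_farmer = [0.0] * len(FLOWERS)
    q_pirate = [0.0] * len(FLOWERS)
    for _ in range(rounds):
        f = epsilon_greedy(q_farmer, epsilon, rng)
        p = epsilon_greedy(q_pirate, epsilon, rng)
        r_farmer, r_pirate = payoffs(f, p)
        q_update(q_farmer, f, r_farmer, alpha)
        q_update(q_pirate, p, r_pirate, alpha)
    return q_farmer, q_pirate
```

The game itself is unchanged from round to round; only the Q-tables carry over. Independent learners of this kind are not guaranteed to settle on the mixed Nash equilibrium, so comparing the empirical island frequencies with the paper's equilibrium would be part of the experiment rather than an expected outcome.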
  • Open

    Create audio for content in multiple languages with the same TTS voice persona in Amazon Polly
    Amazon Polly is a leading cloud-based service that converts text into lifelike speech. Following the adoption of Neural Text-to-Speech (NTTS), we have continuously expanded our portfolio of available voices in order to provide a wide selection of distinct speakers in supported languages. Today, we are pleased to announce four new additions: Pedro speaking US Spanish, […]  ( 5 min )
    New built-in Amazon SageMaker algorithms for tabular data modeling: LightGBM, CatBoost, AutoGluon-Tabular, and TabTransformer
    Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started on training and deploying ML models quickly. You can use these algorithms and models for both supervised and unsupervised learning. They can process various types of input data, including tabular, […]  ( 7 min )
    Semantic segmentation data labeling and model training using Amazon SageMaker
    In computer vision, semantic segmentation is the task of classifying every pixel in an image with a class from a known set of labels such that pixels with the same label share certain characteristics. It generates a segmentation mask of the input images. For example, the following images show a segmentation mask of the cat […]  ( 9 min )
    Deep demand forecasting with Amazon SageMaker
    Every business needs the ability to predict the future accurately in order to make better decisions and give the company a competitive advantage. With historical data, businesses can understand trends, make predictions of what might happen and when, and incorporate that information into their future plans, from product demand to inventory planning and staffing. If […]  ( 10 min )
  • Open

    DALL·E 2 Pre-Training Mitigations
    In order to share the magic of DALL·E 2 with a broad audience, we needed to reduce the risks associated with powerful image generation models. To this end, we put various guardrails in place to prevent generated images from violating our content policy. This post focuses on pre-training  ( 13 min )
  • Open

    NVIDIA Teams With HPE to Take AI From Edge to Cloud
    Enterprises now have a new option for quickly getting started with NVIDIA AI software: the HPE GreenLake edge-to-cloud platform. The NVIDIA AI Enterprise software suite is an end-to-end, cloud-native suite of AI and data analytics software. It’s optimized to enable any organization to use AI, and doesn’t require deep AI expertise. Fully supported by NVIDIA, Read article > The post NVIDIA Teams With HPE to Take AI From Edge to Cloud appeared first on NVIDIA Blog.  ( 5 min )
    Detect to Protect: Taiwan Hospital Deploys Real-Time AI Risk Prediction for Kidney Patients
    Taiwan has nearly 85,000 kidney dialysis patients — the highest prevalence in the world based on population density. Taipei Veterans General Hospital (TVGH) is working to improve outcomes for these patients with an AI model that predicts heart failure risk in real time during dialysis procedures. Cardiovascular disease is the leading cause of death for Read article > The post Detect to Protect: Taiwan Hospital Deploys Real-Time AI Risk Prediction for Kidney Patients appeared first on NVIDIA Blog.  ( 7 min )
  • Open

    time series classification
    Does anyone know any good books or tutorials on time series classification using recurrent neural networks (LSTM)? I am currently working on an EHR dataset and need to classify/predict disease. I know I can use standard classifiers such as SVM or XGBoost, but I wanted to avoid the feature engineering that comes with them and thought neural networks would be the way to go. I just need good guidance, via a book or tutorial, on how to go about implementing it. Much appreciated. submitted by /u/Abeokuta_ [link] [comments]  ( 84 min )
    Converting TensorFlow Keras model API to model subclassing
    For a simple TF2 Object detection CNN architecture defined using Keras's functional API, a batch of data is obtained as: example, label = next(data_generator(batch_size = 32)) example.keys() # dict_keys(['image']) image = example['image'] image.shape # (32, 144, 144, 3) label.keys() # dict_keys(['class_out', 'box_out']) label['class_out'].shape, label['box_out'].shape # ((32, 9), (32, 2)) The CNN architecture defined using Keras's functional API is: input_ = Input(shape = (144, 144, 3), name = 'image') # name - An optional name string for the Input layer. Should be unique in # a model (do not reuse the same name twice). It will be autogenerated if it isn't provided. # Here 'image' is the Python3 dict's key used to map the data to one of the layer in the model. x = input_ # Define a c…  ( 85 min )
  • Open

    What is Social Media Content Moderation and how Moderation Companies use various Techniques to…
    Moderation is the process of controlling the wanted contents from the online platforms like social media networking sites. And it is…  ( 8 min )
    DALL·E 2 — The AI artist that can create and edit images for you!
    “Homer Simpson reacting to the crash of Bitcoin” Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 10 min )

  • Open

    made with starryai
    submitted by /u/rikusorasephiroth [link] [comments]  ( 83 min )
    When your phone knows you get no bitches
    submitted by /u/asscheeseterps710 [link] [comments]  ( 82 min )
    Secured AI-related position within my current company, plan on moving internally in the Fall. How do I negotiate salary in this case?
    Hi, I currently work in the auto industry, and have recently solidified an opportunity to transfer from my current role in Manufacturing to a role within AI, specifically focusing on Autonomous Driving. I currently work as a data scientist, and am responsible for setting up pipelines, modeling, forecasting, etc. I am fluent in Python and have some basic introductory experience in Neural Nets and using image tensors (a capstone project in undergrad, graduate school). I currently make 70k but would obviously like to aim higher, given the current market and skills required for this job. I have 3.5 years of experience when looking at my career from a general "computer science" point of view. What is a reasonable amount to expect in a position like this? Should I throw out a large number and work backwards with my company from there? The position is Remote, but based out of MI, USA. I live in the States. Thanks. submitted by /u/Mr15ization [link] [comments]  ( 84 min )
    Weekly China AI News: CVPR 2022 Recap; Meituan Proposes YOLOv6; Tencent Invests in Data Processing Unit Firm
    submitted by /u/trcytony [link] [comments]  ( 82 min )
    An Artificial Intelligence chatbot powers profitability for a multinational bank
    submitted by /u/Diana-RS [link] [comments]  ( 82 min )
    Last Week in AI: AI learns to do tasks in Minecraft, Instagram AI scans faces for age verification, Amazon launches AI pair programming tool, and more!
    submitted by /u/regalalgorithm [link] [comments]  ( 83 min )
    LaMDA’s Sentience is Nonsense - Here’s Why
    submitted by /u/regalalgorithm [link] [comments]  ( 82 min )
    The possibility of general Artificial Intelligence
    submitted by /u/Diana-RS [link] [comments]  ( 82 min )
    Device42: AI Webinar Tomorrow
    Hey All, Just wanted to throw out a quick reminder that Device42 is hosting an AI webinar with award-winning author Steve Shwartz (Evil Robots, Killer Computers, and Other Myths) and our CMO Yama Habibzai tomorrow, June 28th at 11 AM EDT, where they will discuss the impact of AI in IT and how you can leverage it to achieve more. Save your seat today. Cheers. submitted by /u/Device42_Phil [link] [comments]  ( 83 min )
    A Tutorial on Generating Images from Text Prompts with VQGAN-Clip, Python, and TensorFlow
    View the tutorial HERE. This tutorial teaches you how to convert any text prompt to an image using VQGAN-Clip. For example, you could use the prompt "A spray painting of a waiting computer and a bedroom in the style of Edgar Degas and Art Nouveau". This would generate the following image: https://imgur.com/J3qGlc4 Let me know if you have any questions or comments. submitted by /u/mshriver2 [link] [comments]  ( 83 min )
    Two MSc options, not sure which one to go for. Any advice would be appreciated
    Hi all Hope everyone is having a good Monday (Tuesday depending on where you are). I have an undergraduate degree in Economics. I have spent a few years working now (mainly in economic analysis), but I am looking at doing an MSc in the DS/ML/AI space as I think it could help my current job, but equally, I have a genuine interest in the fields so could utilise it for a career switch. The two courses I am looking at are: https://online.essex.ac.uk/courses/msc-data-science/#overview https://online.essex.ac.uk/courses/msc-artificial-intelligence/ I am unsure whether it is better to go a bit more general and opt for the DS course, or go for the AI course. Personally, some of the DS modules I feel I already have the skills in such as the data visualisation, which is pushing me towards AI. Furthermore, when I have done some basic NLP, I have enjoyed that a lot and that's available in the AI course but not the DS course (you can see in the course structure then module choice what is included). In terms of future career, as I said above potential career switch (even if it means I have to go into an entry level job before climbing up the ladder again). A PhD would also be an option. Do you have any advice or general thoughts on the two courses above? Cheers all submitted by /u/chickenparmo [link] [comments]  ( 85 min )
    Large language models have a reasoning problem
    submitted by /u/bendee983 [link] [comments]  ( 83 min )
    AI That Passes the Turing Test Doesn't Guarantee Consciousness (3-minute audio clip from Lex Fridman & Sam Harris)
    submitted by /u/justine01923 [link] [comments]  ( 83 min )
    GPT-3 Powered Mac Writing App - Works Across All Applications
    Hello everyone, I have recently soft-launched a Mac app called Elephas that lets you write faster across applications on your Mac. It was even trending on HackerNews for a few hours. See the GIFs attached to see how it works. [GIF: Email] https://i.redd.it/iv3y0gdqm5891.gif It uses your own OpenAI keys and works across almost all applications, such as Mail, Message, Pages, Google Docs, and Gmail/Outlook. It also has features for: sentence rewriting (such as professional and friendly modes); fixing grammar mistakes; translation support. I hope you will find it helpful. Feel free to share your feedback :) submitted by /u/juliarmg [link] [comments]  ( 83 min )
    How do I start (multi-question post)
    How would one go about building a "real-life" Jarvis? I am interested in learning and attempting to build an AI. It may sound very stupid of me, but I would like to learn to build something that can do everything on its own. Would it be possible for me to build an AI that I could teach like I would a child? I have all these ideas in my head of making an AI that could do anything, one that starts off as a child and that you teach until it is what you want it to be. Where would I even start, and is it even possible for me to do something like this? submitted by /u/ITZSELLABGAMING [link] [comments]  ( 84 min )
    Bootcamp or Master's for learning AI?
    I've wanted to learn Data Science/AI for a long time now. My question is: is a bootcamp or a master's degree better for learning? If a bootcamp is better, is there one you could recommend? Thanks in advance :) submitted by /u/ale3x_ [link] [comments]  ( 82 min )
    Google engineer identifies anonymous faces in WWII photos with AI facial recognition
    submitted by /u/bartturner [link] [comments]  ( 82 min )
    What kind of artificial intelligence can I use to rewrite and summarize texts, cookbooks, non-fiction books, etc.?
    If possible, the AI should remove all non-factual things from the cookbooks/non-fiction books to create a short text that is full of content and does not contain useless words or private stories to make the book more expensive. submitted by /u/xXLisa28Xx [link] [comments]  ( 82 min )
    When you were in school, arriving at the correct answer with the wrong method wouldn’t get you credit…
    And now one of artificial intelligence’s greatest strengths is that it can solve things, arriving at a valid answer, and we have no idea how it did it. It’s called efficient when AI does it! submitted by /u/shawster [link] [comments]  ( 83 min )
    Learning to Play Minecraft with Video PreTraining (VPT)
    submitted by /u/AChickenInAHole [link] [comments]  ( 82 min )
    It just walks!
    submitted by /u/FreeFriedMen [link] [comments]  ( 83 min )
    How the AI be walking on the 17th generation
    submitted by /u/PedroRibs [link] [comments]  ( 84 min )
  • Open

    Data & Analytics Regression Playbook: Make Your Data Work Harder…And Smarter!
    With a potential recession lurking on the horizon, 99% of companies will make the same old “safe” mistakes: hunker down, let people go, shrink, and hope to hold on for dear life. However, growth-oriented organizations will see this as a business opportunity – an opportunity to leverage their data to “do more with less”. You… The post Data & Analytics Regression Playbook: Make Your Data Work Harder…And Smarter! appeared first on Data Science Central.  ( 22 min )
  • Open

    [R] Theoretical Open Research Areas
    Hello everyone, my goal is to do research in the field of machine learning for motion planning/robotics in general. I'm really interested in the theoretical/mathematical side of the field. However, I noticed that the majority of the field consists of very experimental papers where architectures are built and benchmarked without any thorough underlying theory. So my question is: are there any theoretical research areas in machine learning for motion planning/robotics in general? It would be nice if someone could also point me to some labs/researchers working in that direction. Thank you very much. submitted by /u/-aplusib- [link] [comments]  ( 84 min )
    "A Path Towards Autonomous Machine Intelligence" - Yann LeCun
    submitted by /u/s7v7nsilver [link] [comments]  ( 84 min )
    [R] Can I use whole-protein embeddings on isolated domains?
    I'm interested in studying properties of particular protein domains. One idea is to take advantage of state-of-the-art protein embedding models, such as this, most of which are based on transformers. Some of the domains I'm studying are found in large proteins, which have multiple other domains in the same chain. Therefore, I believe it might be more informative to obtain embeddings not of each protein as a whole, but of just the domains. However, I worry that the embeddings would be off, since the model expects a complete sequence. Has anyone tried this before? Are there pre-trained domain-level embeddings? submitted by /u/OmOshIroIdEs [link] [comments]  ( 84 min )
    [N] Inverse Scaling Prize: $250k in prizes for finding tasks where larger language models do worse
    We're used to finding that task performance scales well with large increases in sizes of language models. But for real-world applications, it's also very meaningful to search for failure cases preemptively to fix the underlying issues. Can you find and convincingly demonstrate these failure cases where language models scale inversely, with larger models behaving worse? You don't necessarily need to have extra deep knowledge of ML or language models in order to participate and win, because all models are frozen and you only need to come up with the right data. Check out these resources to learn more! Announcement Twitter thread, contest details on Github. The deadline for the first round of the contest is August 27, 2022. submitted by /u/alexlyzhov [link] [comments]  ( 87 min )
    [Discussion] [computer vision] Instant NeRF create quality depth maps?
    Surprised I haven't seen more chatter about this. What do you think about Nvidia's Instant NeRF, which turns 2D into 3D based on these techniques: https://arxiv.org/abs/2003.10016 Does the output of a NeRF give a depth map that's comparable to what you'd get from a Kinect? Can these be used to create 3D models one would use in Unreal or Blender? submitted by /u/KalloDotIO [link] [comments]  ( 84 min )
    [D] Do you have any suggestions for a crowd-sourced annotation tool?
    We're currently doing research on computational social science, specifically on online toxicity. We have lots of text data, but we don't have annotations. As part of the research, we are thinking of annotating the text using a crowd-sourcing approach. Do any of you know of any open-source tool that we could employ to ease up the process? submitted by /u/vigneshwaranpersonal [link] [comments]  ( 84 min )
    [D] Stack - Seamless data collaboration and versioning
    Hey r/MachineLearning! We are the co-founders of Stack, a hub for data collaboration and versioning. We are developing this tool to help ML teams automatically track changes in their data seamlessly. We are opening a waiting list for our beta, which we aim to release soon. You can sign up at: https://www.getstack.ai/ We are also actively looking for feedback. Feel free to share any comments or thoughts! submitted by /u/baceituno [link] [comments]  ( 84 min )
    [P] I published a tutorial about ML model deployment
    The deployment of ML models in production is a delicate process filled with challenges. You can deploy a model via a REST API, on an edge device, or as an off-line unit used for batch processing. You can build the deployment pipeline from scratch, or use ML deployment frameworks. In my new mini-series, you'll learn best practices to deploy your ML models. I try to concentrate everything in 2 videos, to keep the series short and sweet. The first video provides a theoretical overview of ML deployment. You'll learn about: different strategies to deploy ML in production; the main ML deployment tools on the market (TF Serving, MLFlow Model, Seldon Deploy, KServe from Kubeflow); BentoML and its features. Here's the video: https://www.youtube.com/watch?v=Mrv3CZNWYEg submitted by /u/diabulusInMusica [link] [comments]  ( 85 min )
    [D] Has anyone trained the latent diffusion models by OpenAI(CompVis)? Need some help
    I am trying to train a latent-diffusion model by following the instructions on the repo, however I am running into errors while sampling from the checkpointed models. Can someone help? I am getting Errors while trying to sample using sample_diffusion.py from a custom model trained on LSUN churches submitted by /u/icelebratefestivus [link] [comments]  ( 85 min )
    [D] IBM Zurich Research Plagiarised Our Paper and got it published on CVPR 2022. Is "copy texts" is plagiarism, "copy idea" is not plagiarism?
    I am Xianbiao Qi, a computer vision researcher with more than ten years of research experience. I am writing this blog to complain of a serious case of deliberate plagiarism of our paper by the employees from IBM Zurich Research. They did not copy texts, they copied the idea. Our preprint paper on Arxiv is "Jiaquan Ye, Xianbiao Qi, Yelin He, and etc."PingAn-VCGroup's Solution for ICDAR 2021 Competition on Scientific Literature Parsing Task B: Table Recognition to HTML." arXiv preprint arXiv:2105.01848, May 2021" and the code was also released. Our paper (Ye et al. arXiv: 2105.01848) was plagiarised by a team in IBM Zurich Research: "Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, and Peter Staar, "TableFormer: Table Structure Understanding with Transformers." In Proceedings of the IE…  ( 97 min )
    [D] State-of-the-art permutation-invariant graph embeddings
    Suppose I have a data set consisting of weighted undirected simple graphs. I would like to learn a vector representation of these graphs. What are the state-of-the-art (2022) architectures/methods for learning such representations? Ideally, the representations are permutation-invariant. For what it's worth, I am only interested in the case where graphs (vertices, edges, and their respective weights) are fully observed; I'm not interested in cases with unobserved nodes. An additional requirement is that the embedding must have a lower dimension than the number of nodes. submitted by /u/heylibrarian [link] [comments]  ( 87 min )
    [P] Skipgram: neural network instead of lookup table
    I'm looking for papers which use the skipgram model but, instead of a lookup table, use a neural network. The use case is that, instead of sentences of words, I want to use sequences of human behavior where additional information is available, e.g. think sequences of visited Amazon products. Cold-start also happens to be very common, and I'm thinking that using a neural network instead of a lookup embeddings table would be better. Updated with more context: The typical usage of skipgram is for learning word embeddings in text, where each word has an embedding which is learned through skipgram. However, there is nothing limiting the usage of skipgram to text. A popular way to use skipgram in i2i recommendation systems is to treat a session of products browsed by the user as a sequence and to have an embedding per product. (E.g., see the KDD 2018 winning paper from Airbnb.) However, the question I have here is: instead of having one embedding per product, can we use a neural network where the output layer is the embedding layer? This way we can backprop through the neural network. The reason is that we have more information for products than we do for words. submitted by /u/curiousML5 [link] [comments]  ( 86 min )
    [D] For Perceiver (IO) with single-channel audio, are position encodings even necessary?
    I've been looking into using the Perceiver for a project that involves single-channel (mono) audio. From the existing implementations and tutorials, I can't find one that only does audio. It seems like in the papers they rearrange the audio into patches and add position encodings, but this is a hack to bring the audio modality into the same size tensor as other modalities. If only using 1D audio, is there any need for position encodings at all? submitted by /u/WigglyHypersurface [link] [comments]  ( 84 min )
  • Open

    Using autograd in TensorFlow to Solve a Regression Problem
    We usually use TensorFlow to build a neural network. However, TensorFlow is not limited to this. Behind the scenes, TensorFlow is a tensor library with automatic differentiation capability. Hence, we can easily use it to solve a numerical optimization problem with gradient descent. In this post, we are going to show how TensorFlow’s automatic differentiation […] The post Using autograd in TensorFlow to Solve a Regression Problem appeared first on Machine Learning Mastery.  ( 16 min )
  • Open

    What is Naive Bayes?
    An introduction to machine learning algorithms  ( 8 min )
  • Open

    Generating Images from Text Prompts with VQGAN-Clip, Python, and TensorFlow [TUT]
    View the tutorial here: HERE This tutorial teaches you how to convert any text prompt to an image using VQGAN-Clip. For example you could use the prompt "A spray painting of a waiting computer and a bedroom in the style of Edgar Degas and Art Nouveau". This would generate the following image: https://imgur.com/J3qGlc4 Let me know if you have any questions or comments. submitted by /u/mshriver2 [link] [comments]  ( 83 min )
    Object Localization from scratch TF2
    Object localization trained from scratch on an emoji dataset in TensorFlow 2.8, reaching IoU = 0.5969 and classification accuracy = 100%. The code can be found here. Though in fairness, I am using only 9 classes out of the emoji dataset. Thoughts? submitted by /u/grid_world [link] [comments]  ( 82 min )
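    For readers unfamiliar with the IoU figure quoted in this post, the following is a minimal sketch of intersection over union for two axis-aligned boxes. The `(x_min, y_min, x_max, y_max)` box layout and the function name are assumptions made for illustration; the post does not say how its boxes are encoded or how its metric is averaged.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    # Guard against degenerate (zero-area) inputs.
    return inter / union if union > 0 else 0.0


predicted = (0.0, 0.0, 2.0, 2.0)
ground_truth = (1.0, 1.0, 3.0, 3.0)
score = iou(predicted, ground_truth)
```

    A score of 1 means the predicted and ground-truth boxes coincide, and 0 means they do not overlap at all.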
    Machine Learning AI Goes Through Race Track
    submitted by /u/Plazmeer [link] [comments]  ( 82 min )
  • Open

    [ReReading Reinforcement Learning by Sutton and Barto] Chapter 1 - Introduction
    As some people liked the idea, let's read this together! :) As mentioned in the previous post, the plan is to read one chapter per week (new chapters on Mondays), so it will be a 17-week endeavour. For those who don't know: the latest version of the book can be found here for free: http://incompleteideas.net/book/the-book-2nd.html. Code, errata and other materials can be found there as well. The first week starts off mildly with the introduction chapter, which has only 13 pages. It may be worthwhile to use that week to think about how you want to read the book (noteworthy book summary on reading well: https://en.wikipedia.org/wiki/How_to_Read_a_Book) and what you want to do with what you read. Personally, I'm planning to focus on getting the facts written down as Anki flashcards (I can make them available online if people are interested) and following the math and algorithms by hand, so I'll get a notebook with lots of space for errors... Also, some people asked for a Discord server to connect with others, but I personally have no idea how to moderate a Discord server. Dexdev08 was so kind as to recommend the RL Group Discord Server (https://discord.gg/RGsYwkJY) and asked there if we could get a channel for our cause (no answer on that yet, though). I hope that satisfies the need for a Discord server. Happy reading, I hope for some lively discussions. :) submitted by /u/Accomplished-Ninja31 [link] [comments]  ( 84 min )
    In the MADDPG paper, there is a line "The algorithm does not assume a differentiable model of the environment dynamics or any particular structure on the communication method between agents..". Can someone explain what differentiable means?
    Also, I am getting confused: how does the equality hold here? https://preview.redd.it/jd3m2plza6891.png?width=820&format=png&auto=webp&s=55a3900aa9bd1e50dff31673d0150f6c14acd030 submitted by /u/aabra__ka__daabra [link] [comments]  ( 85 min )
    (Re)Reading Reinforcement Learning by Sutton and Barto
    I'm going to reread the book from start to finish again; maybe some people want to join? I will go for one chapter per week. If people want to join and discuss (and perhaps share notes?), I'd create a new post dedicated to that end every Monday. What do you think? Edit: Here we go! submitted by /u/Accomplished-Ninja31 [link] [comments]  ( 84 min )
    MuJoCo Mesh: how can I rotate the orientation of the middle segment in reference to its geometry?
    submitted by /u/disdisinform [link] [comments]  ( 85 min )
  • Open

    Inspect your data labels with a visual, no code tool to create high-quality training datasets with Amazon SageMaker Ground Truth Plus
    Launched at AWS re:Invent 2021, Amazon SageMaker Ground Truth Plus helps you create high-quality training datasets by removing the undifferentiated heavy lifting associated with building data labeling applications and managing the labeling workforce. All you do is share data along with labeling requirements, and Ground Truth Plus sets up and manages your data labeling workflow […]  ( 6 min )
  • Open

    NASA and conformal maps
    A couple years ago I wrote about how NASA was interested in regions bounded by curves of the form For example, here’s a plot for A = 2, B = 1, α = 2.5 and β = 6. That post mentioned a technical report from NASA that explains why these shapes are important in application, […] NASA and conformal maps first appeared on John D. Cook.  ( 6 min )
  • Open

    Megapixel Image Generation with Step-Unrolled Denoising Autoencoders. (arXiv:2206.12351v1 [cs.CV])
    An ongoing trend in generative modelling research has been to push sample resolutions higher whilst simultaneously reducing computational requirements for training and sampling. We aim to push this trend further by combining techniques, each representing the current pinnacle of efficiency in its respective area. These include vector-quantized GAN (VQ-GAN), a vector-quantization (VQ) model capable of high levels of lossy - but perceptually insignificant - compression; hourglass transformers, a highly scalable self-attention model; and step-unrolled denoising autoencoders (SUNDAE), a non-autoregressive (NAR) text generative model. Unexpectedly, our method highlights weaknesses in the original formulation of hourglass transformers when applied to multidimensional data. In light of this, we propose modifications to the resampling mechanism, applicable in any task applying hierarchical transformers to multidimensional data. Additionally, we demonstrate the scalability of SUNDAE to long sequence lengths - four times longer than prior work. Our proposed framework scales to high resolutions ($1024 \times 1024$) and trains quickly (2-4 days). Crucially, the trained model produces diverse and realistic megapixel samples in approximately 2 seconds on a consumer-grade GPU (GTX 1080Ti). In general, the framework is flexible: supporting an arbitrary number of sampling steps, sample-wise self-stopping, self-correction capabilities, conditional generation, and a NAR formulation that allows for arbitrary inpainting masks. We obtain FID scores of 10.56 on FFHQ256 - close to the original VQ-GAN in less than half the sampling steps - and 21.85 on FFHQ1024 in only 100 sampling steps.
    Phasic Self-Imitative Reduction for Sparse-Reward Goal-Conditioned Reinforcement Learning. (arXiv:2206.12030v1 [cs.LG])
    It has been a recent trend to leverage the power of supervised learning (SL) towards more effective reinforcement learning (RL) methods. We propose a novel phasic approach by alternating online RL and offline SL for tackling sparse-reward goal-conditioned problems. In the online phase, we perform RL training and collect rollout data while in the offline phase, we perform SL on those successful trajectories from the dataset. To further improve sample efficiency, we adopt additional techniques in the online phase including task reduction to generate more feasible trajectories and a value-difference-based intrinsic reward to alleviate the sparse-reward issue. We call this overall algorithm, PhAsic self-Imitative Reduction (PAIR). PAIR substantially outperforms both non-phasic RL and phasic SL baselines on sparse-reward goal-conditioned robotic control problems, including a challenging stacking task. PAIR is the first RL method that learns to stack 6 cubes with only 0/1 success rewards from scratch.
    STREAMLINE: A Simple, Transparent, End-To-End Automated Machine Learning Pipeline Facilitating Data Analysis and Algorithm Comparison. (arXiv:2206.12002v1 [cs.LG])
    Machine learning (ML) offers powerful methods for detecting and modeling associations often in data with large feature spaces and complex associations. Many useful tools/packages (e.g. scikit-learn) have been developed to make the various elements of data handling, processing, modeling, and interpretation accessible. However, it is not trivial for most investigators to assemble these elements into a rigorous, replicatable, unbiased, and effective data analysis pipeline. Automated machine learning (AutoML) seeks to address these issues by simplifying the process of ML analysis for all. Here, we introduce STREAMLINE, a simple, transparent, end-to-end AutoML pipeline designed as a framework to easily conduct rigorous ML modeling and analysis (limited initially to binary classification). STREAMLINE is specifically designed to compare performance between datasets, ML algorithms, and other AutoML tools. It is unique among other autoML tools by offering a fully transparent and consistent baseline of comparison using a carefully designed series of pipeline elements including: (1) exploratory analysis, (2) basic data cleaning, (3) cross validation partitioning, (4) data scaling and imputation, (5) filter-based feature importance estimation, (6) collective feature selection, (7) ML modeling with `Optuna' hyperparameter optimization across 15 established algorithms (including less well-known Genetic Programming and rule-based ML), (8) evaluation across 16 classification metrics, (9) model feature importance estimation, (10) statistical significance comparisons, and (11) automatically exporting all results, plots, a PDF summary report, and models that can be easily applied to replication data.
    Debiasing Learning for Membership Inference Attacks Against Recommender Systems. (arXiv:2206.12401v1 [cs.IR])
    Learned recommender systems may inadvertently leak information about their training data, leading to privacy violations. We investigate privacy threats faced by recommender systems through the lens of membership inference. In such attacks, an adversary aims to infer whether a user's data is used to train the target recommender. To achieve this, previous work has used a shadow recommender to derive training data for the attack model, and then predicted membership by calculating difference vectors between users' historical interactions and recommended items. State-of-the-art methods face two challenging problems: (1) training data for the attack model is biased due to the gap between shadow and target recommenders, and (2) hidden states in recommenders are not observable, resulting in inaccurate estimations of difference vectors. To address the above limitations, we propose a Debiasing Learning for Membership Inference Attacks against recommender systems (DL-MIA) framework that has four main components: (1) a difference vector generator, (2) a disentangled encoder, (3) a weight estimator, and (4) an attack model. To mitigate the gap between recommenders, a variational auto-encoder (VAE) based disentangled encoder is devised to identify recommender-invariant and recommender-specific features. To reduce the estimation bias, we design a weight estimator, assigning a truth-level score for each difference vector to indicate estimation accuracy. We evaluate DL-MIA against both general recommenders and sequential recommenders on three real-world datasets. Experimental results show that DL-MIA effectively alleviates training and estimation biases simultaneously, and achieves state-of-the-art attack performance.
    RankSim: Ranking Similarity Regularization for Deep Imbalanced Regression. (arXiv:2205.15236v2 [cs.LG] UPDATED)
    Data imbalance, in which a plurality of the data samples come from a small proportion of labels, poses a challenge in training deep neural networks. Unlike classification, in regression the labels are continuous, potentially boundless, and form a natural ordering. These distinct features of regression call for new techniques that leverage the additional information encoded in label-space relationships. This paper presents the RankSim (ranking similarity) regularizer for deep imbalanced regression, which encodes an inductive bias that samples that are closer in label space should also be closer in feature space. In contrast to recent distribution smoothing based approaches, RankSim captures both nearby and distant relationships: for a given data sample, RankSim encourages the sorted list of its neighbors in label space to match the sorted list of its neighbors in feature space. RankSim is complementary to conventional imbalanced learning techniques, including re-weighting, two-stage training, and distribution smoothing, and lifts the state-of-the-art performance on three imbalanced regression benchmarks: IMDB-WIKI-DIR, AgeDB-DIR, and STS-B-DIR.
    SECLEDS: Sequence Clustering in Evolving Data Streams via Multiple Medoids and Medoid Voting. (arXiv:2206.12190v1 [cs.LG])
    Sequence clustering in a streaming environment is challenging because it is computationally expensive, and the sequences may evolve over time. K-medoids or Partitioning Around Medoids (PAM) is commonly used to cluster sequences since it supports alignment-based distances, and the k-centers being actual data items helps with cluster interpretability. However, offline k-medoids has no support for concept drift, while also being prohibitively expensive for clustering data streams. We therefore propose SECLEDS, a streaming variant of the k-medoids algorithm with constant memory footprint. SECLEDS has two unique properties: i) it uses multiple medoids per cluster, producing stable high-quality clusters, and ii) it handles concept drift using an intuitive Medoid Voting scheme for approximating cluster distances. Unlike existing adaptive algorithms that create new clusters for new concepts, SECLEDS follows a fundamentally different approach, where the clusters themselves evolve with an evolving stream. Using real and synthetic datasets, we empirically demonstrate that SECLEDS produces high-quality clusters regardless of drift, stream size, data dimensionality, and number of clusters. We compare against three popular stream and batch clustering algorithms. The state-of-the-art BanditPAM is used as an offline benchmark. SECLEDS achieves comparable F1 score to BanditPAM while reducing the number of required distance computations by 83.7%. Importantly, SECLEDS outperforms all baselines by 138.7% when the stream contains drift. We also cluster real network traffic, and provide evidence that SECLEDS can support network bandwidths of up to 1.08 Gbps while using the (expensive) dynamic time warping distance.
    Dynamic network congestion pricing based on deep reinforcement learning. (arXiv:2206.12188v1 [eess.SY])
    Traffic congestion is a serious problem in urban areas. Dynamic congestion pricing is one useful scheme for eliminating traffic congestion at the strategic scale. In reality, however, an optimal dynamic congestion pricing is very difficult or impossible to determine theoretically, because road networks are usually large and complicated, and the behavior of road users is uncertain. To address this challenge, this work proposes a dynamic congestion pricing method using deep reinforcement learning (DRL). It is designed to eliminate traffic congestion based on observable data in general large-scale road networks, by leveraging the data-driven nature of deep reinforcement learning. One of the novel elements of the proposed method is the distributed and cooperative learning scheme. Specifically, the DRL is implemented in a spatio-temporally distributed manner, and cooperation among DRL agents is established by novel techniques we call spatially shared reward and temporally switching learning. This enables fast and computationally efficient learning in large-scale networks. The numerical experiments using the Sioux Falls network showed that the proposed method works well thanks to the novel learning scheme.
    The Digital Twin Landscape at the Crossroads of Predictive Maintenance, Machine Learning and Physics Based Modeling. (arXiv:2206.10462v2 [cs.LG] UPDATED)
    The concept of a digital twin has exploded in popularity over the past decade, yet confusion around its plurality of definitions, its novelty as a new technology, and its practical applicability still exists, all despite numerous reviews, surveys, and press releases. The history of the term digital twin is explored, as well as its initial context in the fields of product life cycle management, asset maintenance, and equipment fleet management, operations, and planning. A definition for a minimally viable framework to utilize a digital twin is also provided based on seven essential elements. A brief tour through DT applications and industries where DT methods are employed is also outlined. The application of a digital twin framework is highlighted in the field of predictive maintenance, along with its extensions utilizing machine learning and physics based modeling. Employing the combination of machine learning and physics based modeling to form hybrid digital twin frameworks may synergistically alleviate the shortcomings of each method when used in isolation. Key challenges of implementing digital twin models in practice are additionally discussed. As digital twin technology experiences rapid growth and matures, its great promise to substantially enhance tools and solutions for intelligent upkeep of complex equipment is expected to materialize.
    Data Leakage in Federated Averaging. (arXiv:2206.12395v1 [cs.LG])
    Recent attacks have shown that user data can be reconstructed from FedSGD updates, thus breaking privacy. However, these attacks are of limited practical relevance as federated learning typically uses the FedAvg algorithm. It is generally accepted that reconstructing data from FedAvg updates is much harder than FedSGD as: (i) there are unobserved intermediate weight updates, (ii) the order of inputs matters, and (iii) the order of labels changes every epoch. In this work, we propose a new optimization-based attack which successfully attacks FedAvg by addressing the above challenges. First, we solve the optimization problem using automatic differentiation that forces a simulation of the client's update for the reconstructed labels and inputs so as to match the received client update. Second, we address the unknown input order by treating images at different epochs as independent during optimization, while relating them with a permutation invariant prior. Third, we reconstruct the labels by estimating the parameters of existing FedSGD attacks at every FedAvg step. On the popular FEMNIST dataset, we demonstrate that on average we successfully reconstruct >45% of the client's images from realistic FedAvg updates computed on 10 local epochs of 10 batches each with 5 images, compared to only <10% using the baseline. These findings indicate that many real-world federated learning implementations based on FedAvg are vulnerable.
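    The difference between the two update types named in this abstract can be illustrated with a toy scalar model. This is only a sketch under stated assumptions, not the paper's attack or its experimental setup: the model `y = w * x`, the function names, the learning rate, and the fixed (unshuffled) batch order are all hypothetical simplifications.

```python
def grad(w, batch):
    """Gradient of the mean squared error for the scalar model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def fedsgd_update(w_global, data):
    """FedSGD: the client reports a single gradient evaluated at the global weight."""
    return grad(w_global, data)


def fedavg_update(w_global, data, epochs, batch_size, lr):
    """FedAvg: the client runs several local SGD steps and reports only the final weight.

    The intermediate weights are returned here for inspection only; in a real
    deployment the server never sees them.
    """
    w = w_global
    intermediates = []
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            w -= lr * grad(w, data[start:start + batch_size])
            intermediates.append(w)
    return w, intermediates


client_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by w = 2
g = fedsgd_update(0.0, client_data)
w_final, hidden = fedavg_update(0.0, client_data, epochs=10, batch_size=2, lr=0.01)
```

    With FedSGD the server receives `g`, a direct function of the client's data at a known weight. With FedAvg it receives only `w_final`, while the 20 weights in `hidden` stay on the client, which corresponds to challenge (i) listed in the abstract.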
    Predicting the Stability of Hierarchical Triple Systems with Convolutional Neural Networks. (arXiv:2206.12402v1 [astro-ph.EP])
    Understanding the long-term evolution of hierarchical triple systems is challenging due to their inherently chaotic nature, and it requires computationally expensive simulations. Here we propose a convolutional neural network model to predict the stability of hierarchical triples by looking at their evolution during the first $5 \times 10^5$ inner binary orbits. We employ the regularized few-body code \textsc{tsunami} to simulate $5\times 10^6$ hierarchical triples, from which we generate a large training and test dataset. We develop twelve different network configurations that use different combinations of the triples' orbital elements and compare their performances. Our best model uses 6 time-series, namely, the semimajor axes ratio, the inner and outer eccentricities, the mutual inclination and the arguments of pericenter. This model achieves an area under the curve of over $95\%$ and indicates which parameters are relevant for studying the stability of triple systems. All trained models are made publicly available, making it possible to predict the stability of hierarchical triple systems $200$ times faster than pure $N$-body methods.
    Multi-Exit Semantic Segmentation Networks. (arXiv:2106.03527v2 [cs.CV] UPDATED)
    Semantic segmentation arises as the backbone of many vision systems, spanning from self-driving cars and robot navigation to augmented reality and teleconferencing. Because these systems frequently operate under stringent latency constraints within a limited resource envelope, optimising for efficient execution becomes important. At the same time, the heterogeneous capabilities of the target platforms and diverse constraints of different applications require the design and training of multiple target-specific segmentation models, leading to excessive maintenance costs. To this end, we propose a framework for converting state-of-the-art segmentation CNNs to Multi-Exit Semantic Segmentation (MESS) networks: specially trained models that employ parametrised early exits along their depth to i) dynamically save computation during inference on easier samples and ii) save training and maintenance cost by offering a post-training customisable speed-accuracy trade-off. Designing and training such networks naively can hurt performance. Thus, we propose a novel two-stage training scheme for multi-exit networks. Furthermore, the parametrisation of MESS enables co-optimising the number, placement and architecture of the attached segmentation heads along with the exit policy, upon deployment via exhaustive search in <1 GPUh. This allows MESS to rapidly adapt to the device capabilities and application requirements for each target use-case, offering a train-once-deploy-everywhere solution. MESS variants achieve latency gains of up to 2.83x with the same accuracy, or 5.33 pp higher accuracy for the same computational budget, compared to the original backbone network. Lastly, MESS delivers orders of magnitude faster architecture selection, compared to state-of-the-art techniques.
    Analyzing the impact of SARS-CoV-2 variants on respiratory sound signals. (arXiv:2206.12309v1 [eess.AS])
    The COVID-19 outbreak resulted in multiple waves of infections that have been associated with different SARS-CoV-2 variants. Studies have reported a differential impact of the variants on the respiratory health of patients. We explore whether acoustic signals, collected from COVID-19 subjects, show computationally distinguishable acoustic patterns suggesting a possibility to predict the underlying virus variant. We analyze the Coswara dataset, which is collected from three subject pools, namely, i) healthy subjects, ii) COVID-19 subjects recorded during the delta variant dominant period, and iii) COVID-19 subjects recorded during the omicron surge. Our findings suggest that multiple sound categories, such as cough, breathing, and speech, indicate significant acoustic feature differences when comparing COVID-19 subjects with omicron and delta variants. The classification areas-under-the-curve are significantly above chance for differentiating subjects infected by omicron from those infected by delta. Using a score fusion from multiple sound categories, we obtained an area-under-the-curve of 89% and 52.4% sensitivity at 95% specificity. Additionally, a hierarchical three-class approach was used to classify the acoustic data into healthy and COVID-19 positive, and further to classify COVID-19 subjects into delta and omicron variants, providing a high level of 3-class classification accuracy. These results suggest new ways for designing sound based COVID-19 diagnosis approaches.
    MSR-NV: Neural Vocoder Using Multiple Sampling Rates. (arXiv:2109.13714v3 [eess.AS] UPDATED)
    The development of neural vocoders (NVs) has resulted in the high-quality and fast generation of waveforms. However, conventional NVs target a single sampling rate and require re-training when applied to different sampling rates. A suitable sampling rate varies from application to application due to the trade-off between speech quality and generation speed. In this study, we propose a method to handle multiple sampling rates in a single NV, called the MSR-NV. By generating waveforms step-by-step starting from a low sampling rate, MSR-NV can efficiently learn the characteristics of each frequency band and synthesize high-quality speech at multiple sampling rates. It can be regarded as an extension of the previously proposed NVs, and in this study, we extend the structure of Parallel WaveGAN (PWG). Experimental evaluation results demonstrate that the proposed method achieves remarkably higher subjective quality than the original PWG trained separately at 16, 24, and 48 kHz, without increasing the inference time. We also show that MSR-NV can leverage speech with lower sampling rates to further improve the quality of the synthetic speech.
    Provably Confidential Language Modelling. (arXiv:2205.01863v2 [cs.CL] UPDATED)
    Large language models have been shown to memorize private information such as social security numbers in training data. Given the sheer scale of the training corpus, it is challenging to screen and filter such private data, either manually or automatically. In this paper, we propose Confidentially Redacted Training (CRT), a method to train language generation models while protecting the confidential segments. We borrow ideas from differential privacy (which solves a related but distinct problem) and show that our method is able to provably prevent unintended memorization by randomizing parts of the training process. Moreover, we show that redaction with an approximately correct screening policy amplifies the confidentiality guarantee. We implement the method for both LSTM and GPT language models. Our experimental results show that the models trained by CRT obtain almost the same perplexity while preserving strong confidentiality.
    Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss. (arXiv:2106.04156v7 [cs.LG] UPDATED)
    Recent works in self-supervised learning have advanced the state-of-the-art by relying on the contrastive learning paradigm, which learns representations by pushing positive pairs, or similar examples from the same class, closer together while keeping negative pairs far apart. Despite the empirical successes, theoretical foundations are limited -- prior analyses assume conditional independence of the positive pairs given the same class label, but recent empirical applications use heavily correlated positive pairs (i.e., data augmentations of the same image). Our work analyzes contrastive learning without assuming conditional independence of positive pairs using a novel concept of the augmentation graph on data. Edges in this graph connect augmentations of the same data, and ground-truth classes naturally form connected sub-graphs. We propose a loss that performs spectral decomposition on the population augmentation graph and can be succinctly written as a contrastive learning objective on neural net representations. Minimizing this objective leads to features with provable accuracy guarantees under linear probe evaluation. By standard generalization bounds, these accuracy guarantees also hold when minimizing the training contrastive loss. Empirically, the features learned by our objective can match or outperform several strong baselines on benchmark vision datasets. In all, this work provides the first provable analysis for contrastive learning where guarantees for linear probe evaluation can apply to realistic empirical settings.
    How to train accurate BNNs for embedded systems?. (arXiv:2206.12322v1 [cs.LG])
    A key enabler of deploying convolutional neural networks on resource-constrained embedded systems is the binary neural network (BNN). BNNs save on memory and simplify computation by binarizing both features and weights. Unfortunately, binarization is inevitably accompanied by a severe decrease in accuracy. To reduce the accuracy gap between binary and full-precision networks, many repair methods have been proposed in the recent past, which we have classified and put into a single overview in this chapter. The repair methods are divided into two main branches, training techniques and network topology changes, which can further be split into smaller categories. The latter category introduces additional cost (energy consumption or additional area) for an embedded system, while the former does not. From our overview, we observe that progress has been made in reducing the accuracy gap, but BNN papers are not aligned on what repair methods should be used to get highly accurate BNNs. Therefore, this chapter contains an empirical review that evaluates the benefits of many repair methods in isolation over the ResNet-20 & CIFAR10 and ResNet-18 & CIFAR100 benchmarks. We found three repair categories most beneficial: feature binarizer, feature normalization, and double residual. Based on this review we discuss future directions and research opportunities. We sketch the benefit and costs associated with BNNs on embedded systems because it remains to be seen whether BNNs will be able to close the accuracy gap while staying highly energy-efficient on resource-constrained embedded systems.
    Graph-Coupled Oscillator Networks. (arXiv:2202.02296v2 [cs.LG] UPDATED)
    We propose Graph-Coupled Oscillator Networks (GraphCON), a novel framework for deep learning on graphs. It is based on discretizations of a second-order system of ordinary differential equations (ODEs), which model a network of nonlinear controlled and damped oscillators, coupled via the adjacency structure of the underlying graph. The flexibility of our framework permits any basic GNN layer (e.g. convolutional or attentional) as the coupling function, from which a multi-layer deep neural network is built up via the dynamics of the proposed ODEs. We relate the oversmoothing problem, commonly encountered in GNNs, to the stability of steady states of the underlying ODE and show that zero-Dirichlet energy steady states are not stable for our proposed ODEs. This demonstrates that the proposed framework mitigates the oversmoothing problem. Moreover, we prove that GraphCON mitigates the exploding and vanishing gradients problem to facilitate training of deep multi-layer GNNs. Finally, we show that our approach offers competitive performance with respect to the state-of-the-art on a variety of graph-based learning tasks.
    SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech. (arXiv:2206.12132v1 [eess.AS])
    In this paper, we present SANE-TTS, a stable and natural end-to-end multilingual TTS model. Because a multilingual corpus is difficult to obtain for a given speaker, training a multilingual TTS model on monolingual corpora is unavoidable. We introduce a speaker regularization loss that improves speech naturalness during cross-lingual synthesis, alongside domain adversarial training, which is also applied in other multilingual TTS models. Furthermore, with the speaker regularization loss added, replacing the speaker embedding with a zero vector in the duration predictor stabilizes cross-lingual inference. With this replacement, our model generates speech with moderate rhythm regardless of the source speaker in cross-lingual synthesis. In MOS evaluation, SANE-TTS achieves a naturalness score above 3.80 in both cross-lingual and intralingual synthesis, where the ground truth score is 3.99. Also, SANE-TTS maintains speaker similarity close to that of the ground truth even in cross-lingual inference. Audio samples are available on our web page.
    Achievement and Fragility of Long-term Equitability. (arXiv:2206.12333v1 [math.OC])
    Equipping current decision-making tools with notions of fairness, equitability, or other ethically motivated outcomes is one of the top priorities in recent research efforts in machine learning, AI, and optimization. In this paper, we investigate how to allocate limited resources to locally interacting communities in a way that maximizes a pertinent notion of equitability. In particular, we look at the dynamic setting where the allocation is repeated across multiple periods (e.g., yearly), the local communities evolve in the meantime (driven by the provided allocation), and the allocations are modulated by feedback coming from the communities themselves. We employ recent mathematical tools stemming from data-driven feedback online optimization, by which communities can learn their (possibly unknown) evolution and satisfaction, and can share information with the deciding bodies. We design dynamic policies that converge to an allocation that maximizes equitability in the long term. We further demonstrate our model and methodology with realistic examples of healthcare and education subsidies design in Sub-Saharan countries. One of the key empirical takeaways from our setting is that long-term equitability is fragile, in the sense that it can be easily lost when deciding bodies weigh in other factors (e.g., equality in allocation) in the allocation strategy. Moreover, a naive compromise, while not providing significant advantage to the communities, can promote inequality in social outcomes.
    Animal Behavior Classification via Deep Learning on Embedded Systems. (arXiv:2111.12295v2 [cs.LG] UPDATED)
    We develop an end-to-end deep-neural-network-based algorithm for classifying animal behavior using accelerometry data on the embedded system of an artificial intelligence of things (AIoT) device installed in a wearable collar tag. The proposed algorithm jointly performs feature extraction and classification utilizing a set of infinite-impulse-response (IIR) and finite-impulse-response (FIR) filters together with a multilayer perceptron. The utilized IIR and FIR filters can be viewed as specific types of recurrent and convolutional neural network layers, respectively. We evaluate the performance of the proposed algorithm via two real-world datasets collected from a total of eighteen grazing beef cattle using collar tags. The results show that the proposed algorithm offers good intra- and inter-dataset classification accuracy and outperforms its closest contenders, including two state-of-the-art convolutional-neural-network-based time-series classification algorithms, which are significantly more complex. We implement the proposed algorithm on the embedded system of the utilized collar tags' AIoT device to perform in-situ classification of animal behavior. We achieve real-time in-situ behavior inference from accelerometry data without imposing any strain on the available computational, memory, or energy resources of the embedded system.
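The remark that FIR and IIR filters resemble convolutional and recurrent layers can be seen in a small sketch: an FIR filter is a weighted sum over a sliding window of inputs, while an IIR filter feeds its own previous output back in. The filter orders, coefficients, and the toy accelerometer trace below are illustrative assumptions, not the paper's configuration.

```python
def fir_filter(signal, taps):
    """FIR filter: y[n] = sum_k taps[k] * x[n - k], with x[n] = 0 for n < 0.

    This is a 1-D causal convolution, i.e. a convolutional layer with one kernel.
    """
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, b in enumerate(taps):
            if n - k >= 0:
                acc += b * signal[n - k]
        out.append(acc)
    return out


def iir_filter(signal, a):
    """First-order IIR filter: y[n] = a * y[n - 1] + (1 - a) * x[n], with y[-1] = 0.

    The dependence on the previous output is what makes it recurrent.
    """
    out, prev = [], 0.0
    for x in signal:
        prev = a * prev + (1.0 - a) * x
        out.append(prev)
    return out


accel = [0.0, 1.0, 1.0, 1.0, 0.0]        # hypothetical accelerometer samples
smoothed = fir_filter(accel, [0.5, 0.5])  # two-tap moving average
tracked = iir_filter(accel, 0.5)          # exponential smoothing
```

In a learned setting the taps and the feedback coefficient would be trainable parameters rather than fixed constants.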
    Using Autoencoders on Differentially Private Federated Learning GANs. (arXiv:2206.12270v1 [cs.LG])
    Machine learning has been applied to almost all fields of computer science over the past decades. The introduction of GANs allowed for new possibilities in fields of medical research and text prediction. However, these new fields work with ever more privacy-sensitive data. In order to maintain user privacy, a combination of federated learning, differential privacy and GANs can be used to work with private data without giving away users' privacy. Recently, two implementations of such combinations have been published: DP-Fed-Avg GAN and GS-WGAN. This paper compares their performance and introduces an alternative version of DP-Fed-Avg GAN that makes use of denoising techniques to combat the loss in accuracy that generally occurs when applying differential privacy and federated learning to GANs. We also compare the novel adaptation of denoised DP-Fed-Avg GAN to the state-of-the-art implementations in this field.
    NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling. (arXiv:2104.02321v2 [eess.AS] CROSS LISTED)
    In this work, we introduce NU-Wave, the first neural audio upsampling model to produce waveforms of sampling rate 48kHz from coarse 16kHz or 24kHz inputs, while prior works could generate only up to 16kHz. NU-Wave is the first diffusion probabilistic model for audio super-resolution which is engineered based on neural vocoders. NU-Wave generates high-quality audio that achieves high performance in terms of signal-to-noise ratio (SNR), log-spectral distance (LSD), and accuracy of the ABX test. In all cases, NU-Wave outperforms the baseline models despite the substantially smaller model capacity (3.0M parameters) than baselines (5.4-21%). The audio samples of our model are available at https://mindslab-ai.github.io/nuwave, and the code will be made available soon.
    Score-based Generative Models for Calorimeter Shower Simulation. (arXiv:2206.11898v1 [hep-ph])
    Score-based generative models are a new class of generative algorithms that have been shown to produce realistic images even in high dimensional spaces, currently surpassing other state-of-the-art models for different benchmark categories and applications. In this work we introduce CaloScore, a score-based generative model for collider physics applied to calorimeter shower generation. Three different diffusion models are investigated using the Fast Calorimeter Simulation Challenge 2022 dataset. CaloScore is the first application of a score-based generative model in collider physics and is able to produce high-fidelity calorimeter images for all datasets, providing an alternative paradigm for calorimeter shower simulation.
    Federated learning: Applications, challenges and future directions. (arXiv:2205.09513v2 [cs.LG] UPDATED)
    Federated learning (FL) is a system in which a central aggregator coordinates the efforts of multiple clients to solve machine learning problems. This setting allows training data to be dispersed in order to protect privacy. The purpose of this paper is to provide an overview of FL systems with a focus on healthcare. FL is evaluated here based on its frameworks, architectures, and applications. It is shown here that FL solves the preceding issues with a shared global deep learning (DL) model via a central aggregator server. This paper examines recent developments and provides a comprehensive list of unresolved issues, inspired by the rapid growth of FL research. In the context of FL, several privacy methods are described, including secure multiparty computation, homomorphic encryption, differential privacy, and stochastic gradient descent. Furthermore, a review of various FL classes, such as horizontal and vertical FL and federated transfer learning, is provided. FL has applications in wireless communication, service recommendation, intelligent medical diagnosis systems, and healthcare, all of which are discussed in this paper. We also present a thorough review of existing FL challenges, such as privacy protection, communication cost, system heterogeneity, and unreliable model upload, followed by future research directions.
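A minimal sketch of what a central aggregator might do in one round: combine locally trained parameter vectors into a shared global model without ever seeing the clients' raw data. The abstract does not specify the aggregation rule, so the size-weighted average used here (in the style of federated averaging), the function name, and the toy numbers are assumptions.

```python
def federated_average(client_weights, client_sizes):
    """Weighted average of client parameter vectors.

    Each client's contribution is proportional to its number of local
    training examples; only parameters, not data, reach the aggregator.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[d] * n for w, n in zip(client_weights, client_sizes)) / total
        for d in range(dim)
    ]


# Three hypothetical clients, each holding a two-parameter local model.
clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes = [10, 30, 60]
global_model = federated_average(clients, sizes)
```

In a full system this step would alternate with local training on each client, and the privacy methods listed in the abstract would be layered on top of the exchanged updates.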
    On Certifying and Improving Generalization to Unseen Domains. (arXiv:2206.12364v1 [cs.LG])
    Domain Generalization (DG) aims to learn models whose performance remains high on unseen domains encountered at test-time by using data from multiple related source domains. Many existing DG algorithms reduce the divergence between source distributions in a representation space to potentially align the unseen domain close to the sources. This is motivated by the analysis that explains generalization to unseen domains using distributional distance (such as the Wasserstein distance) to the sources. However, due to the openness of the DG objective, it is challenging to evaluate DG algorithms comprehensively using a few benchmark datasets. In particular, we demonstrate that the accuracy of the models trained with DG methods varies significantly across unseen domains, generated from popular benchmark datasets. This highlights that the performance of DG methods on a few benchmark datasets may not be representative of their performance on unseen domains in the wild. To overcome this roadblock, we propose a universal certification framework based on distributionally robust optimization (DRO) that can efficiently certify the worst-case performance of any DG method. This enables a data-independent evaluation of a DG method complementary to the empirical evaluations on benchmark datasets. Furthermore, we propose a training algorithm that can be used with any DG method to provably improve their certified performance. Our empirical evaluation demonstrates the effectiveness of our method at significantly improving the worst-case loss (i.e., reducing the risk of failure of these models in the wild) without incurring a significant performance drop on benchmark datasets.
    AdAUC: End-to-end Adversarial AUC Optimization Against Long-tail Problems. (arXiv:2206.12169v1 [cs.LG])
    It is well-known that deep learning models are vulnerable to adversarial examples. Existing studies of adversarial training have made great progress against this challenge. As a typical trait, they often assume that the class distribution is overall balanced. However, long-tail datasets are ubiquitous in a wide spectrum of applications, where the number of head-class instances is larger than that of the tail classes. Under such a scenario, AUC is a much more reasonable metric than accuracy since it is insensitive toward class distribution. Motivated by this, we present an early trial to explore adversarial training methods to optimize AUC. The main challenge lies in that the positive and negative examples are tightly coupled in the objective function. As a direct result, one cannot generate adversarial examples without a full scan of the dataset. To address this issue, based on a concavity regularization scheme, we reformulate the AUC optimization problem as a saddle point problem, where the objective becomes an instance-wise function. This leads to an end-to-end training protocol. Furthermore, we provide a convergence guarantee of the proposed algorithm. Our analysis differs from the existing studies since the algorithm is asked to generate adversarial examples by calculating the gradient of a min-max problem. Finally, the extensive experimental results show the performance and robustness of our algorithm on three long-tail datasets.
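The coupling between positive and negative examples is easy to see in the standard pairwise definition of AUC: the fraction of (positive, negative) pairs that the model ranks correctly. The sketch below computes it directly; the scores and labels are made-up values, and this is the plain metric, not the paper's saddle-point reformulation.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Every positive is compared against
    every negative, so no example can be scored in isolation.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Imbalanced toy data: two positives (tail class), four negatives (head class).
scores = [0.9, 0.4, 0.8, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 0, 0, 0]
auc = pairwise_auc(scores, labels)
```

Because the double loop touches every pair, perturbing a single input changes terms involving all examples of the opposite class, which is why a naive adversarial attack on this objective would need a full pass over the dataset.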
    Affinity-Aware Graph Networks. (arXiv:2206.11941v1 [cs.LG])
    Graph Neural Networks (GNNs) have emerged as a powerful technique for learning on relational data. Owing to the relatively limited number of message passing steps they perform -- and hence a smaller receptive field -- there has been significant interest in improving their expressivity by incorporating structural aspects of the underlying graph. In this paper, we explore the use of affinity measures as features in graph neural networks, in particular measures arising from random walks, including effective resistance, hitting and commute times. We propose message passing networks based on these features and evaluate their performance on a variety of node and graph property prediction tasks. Our architecture has lower computational complexity, while our features are invariant to the permutations of the underlying graph. The measures we compute allow the network to exploit the connectivity properties of the graph, thereby allowing us to outperform relevant benchmarks for a wide variety of tasks, often with significantly fewer message passing steps. On one of the largest publicly available graph regression datasets, OGB-LSC-PCQM4Mv1, we obtain the best known single-model validation MAE at the time of writing.
    Multi-Agent Deep Reinforcement Learning for Cost- and Delay-Sensitive Virtual Network Function Placement and Routing. (arXiv:2206.12146v1 [cs.AI])
    This paper proposes an effective and novel multiagent deep reinforcement learning (MADRL)-based method for solving the joint virtual network function (VNF) placement and routing (P&R), where multiple service requests with differentiated demands are delivered at the same time. The differentiated demands of the service requests are reflected by their delay- and cost-sensitive factors. We first construct a VNF P&R problem to jointly minimize a weighted sum of service delay and resource consumption cost, which is NP-complete. Then, the joint VNF P&R problem is decoupled into two iterative subtasks: placement subtask and routing subtask. Each subtask consists of multiple concurrent parallel sequential decision processes. By invoking the deep deterministic policy gradient method and multi-agent technique, an MADRL-P&R framework is designed to perform the two subtasks. The new joint reward and internal rewards mechanism is proposed to match the goals and constraints of the placement and routing subtasks. We also propose the parameter migration-based model-retraining method to deal with changing network topologies. Corroborated by experiments, the proposed MADRL-P&R framework is superior to its alternatives in terms of service cost and delay, and offers higher flexibility for personalized service demands. The parameter migration-based model-retraining method can efficiently accelerate convergence under moderate network topology changes.
    Learning sparse features can lead to overfitting in neural networks. (arXiv:2206.12314v1 [stat.ML])
    It is widely believed that the success of deep networks lies in their ability to learn a meaningful representation of the features of the data. Yet, understanding when and how this feature learning improves performance remains a challenge: for example, it is beneficial for modern architectures trained to classify images, whereas it is detrimental for fully-connected networks trained for the same task on the same data. Here we propose an explanation for this puzzle, by showing that feature learning can perform worse than lazy training (via random feature kernel or the NTK) as the former can lead to a sparser neural representation. Although sparsity is known to be essential for learning anisotropic data, it is detrimental when the target function is constant or smooth along certain directions of input space. We illustrate this phenomenon in two settings: (i) regression of Gaussian random functions on the d-dimensional unit sphere and (ii) classification of benchmark datasets of images. For (i), we compute the scaling of the generalization error with number of training points, and show that methods that do not learn features generalize better, even when the dimension of the input space is large. For (ii), we show empirically that learning features can indeed lead to sparse and thereby less smooth representations of the image predictors. This fact is plausibly responsible for deteriorating the performance, which is known to be correlated with smoothness along diffeomorphisms.
    Mitigating Neural Network Overconfidence with Logit Normalization. (arXiv:2205.09310v2 [cs.LG] UPDATED)
    Detecting out-of-distribution inputs is critical for safe deployment of machine learning models in the real world. However, neural networks are known to suffer from the overconfidence issue, where they produce abnormally high confidence for both in- and out-of-distribution inputs. In this work, we show that this issue can be mitigated through Logit Normalization (LogitNorm) -- a simple fix to the cross-entropy loss -- by enforcing a constant vector norm on the logits in training. Our method is motivated by the analysis that the norm of the logit keeps increasing during training, leading to overconfident output. Our key idea behind LogitNorm is thus to decouple the influence of output's norm during network optimization. Trained with LogitNorm, neural networks produce highly distinguishable confidence scores between in- and out-of-distribution data. Extensive experiments demonstrate the superiority of LogitNorm, reducing the average FPR95 by up to 42.30% on common benchmarks.
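The idea of enforcing a constant norm on the logits can be sketched as follows: divide the logit vector by its L2 norm before applying the usual cross-entropy, so that simply scaling up the logits no longer lowers the loss. The `temperature` parameter, the small epsilon, and the function names are assumptions added for illustration; the abstract only states that a constant vector norm is enforced.

```python
import math


def cross_entropy(logits, target):
    """Standard softmax cross-entropy for one example (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def logitnorm_loss(logits, target, temperature=1.0):
    """Cross-entropy on logits rescaled to a fixed L2 norm of 1 / temperature."""
    norm = math.sqrt(sum(z * z for z in logits)) + 1e-7  # epsilon avoids division by zero
    return cross_entropy([z / (norm * temperature) for z in logits], target)


small = [2.0, 1.0, 0.0]
large = [20.0, 10.0, 0.0]   # same direction, ten times the norm

plain_small = cross_entropy(small, 0)
plain_large = cross_entropy(large, 0)     # much lower: a larger norm is rewarded
norm_small = logitnorm_loss(small, 0)
norm_large = logitnorm_loss(large, 0)     # essentially unchanged
```

With the plain loss, growing the logit norm is a cheap way to reduce the training objective, which matches the abstract's explanation of overconfidence; with the normalized loss, only the direction of the logit vector matters.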
    Improved-Mask R-CNN: Towards an Accurate Generic MSK MRI instance segmentation platform (Data from the Osteoarthritis Initiative). (arXiv:2107.12889v2 [eess.IV] UPDATED)
    Objective assessment of Magnetic Resonance Imaging (MRI) scans of osteoarthritis (OA) can address the limitation of the current OA assessment. Segmentation of bone, cartilage, and joint fluid is necessary for the OA objective assessment. Most of the proposed segmentation methods do not perform instance segmentation and suffer from class imbalance problems. This study deployed Mask R-CNN instance segmentation and improved it (improved-Mask R-CNN (iMaskRCNN)) to obtain a more accurate generalized segmentation for OA-associated tissues. Training and validation of the method were performed using 500 knee MRI scans from the Osteoarthritis Initiative (OAI) dataset and 97 MRI scans of patients with symptomatic hip OA. Three modifications to Mask R-CNN yielded the iMaskRCNN: adding a 2nd ROIAligned block, adding an extra decoder layer to the mask-header, and connecting them by a skip connection. The results were assessed using Hausdorff distance, dice score, and coefficients of variation (CoV). The iMaskRCNN led to improved bone and cartilage segmentation compared to Mask R-CNN, as indicated by the increase in dice score from 95% to 98% for the femur, 95% to 97% for the tibia, 71% to 80% for femoral cartilage, and 81% to 82% for tibial cartilage. For the effusion detection, dice improved from 71% with Mask R-CNN to 72% with iMaskRCNN. The CoV values for effusion detection between Reader1 and Mask R-CNN (0.33), Reader1 and iMaskRCNN (0.34), Reader2 and Mask R-CNN (0.22), Reader2 and iMaskRCNN (0.29) are close to CoV between two readers (0.21), indicating a high agreement between the human readers and both Mask R-CNN and iMaskRCNN. Mask R-CNN and iMaskRCNN can reliably and simultaneously extract different scale articular tissues involved in OA, forming the foundation for automated assessment of OA. The iMaskRCNN results show that the modification improved the network performance around the edges.
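For readers unfamiliar with the dice score reported above, it measures the overlap between a predicted mask and a reference mask as twice the intersection divided by the sum of the two mask sizes. The sketch below uses flattened binary masks with made-up values; it illustrates the metric only, not the study's evaluation pipeline.

```python
def dice_score(pred, truth):
    """Dice coefficient between two flattened binary masks (lists of 0/1).

    Returns 1.0 when both masks are empty, since they then agree perfectly.
    """
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(pred) + sum(truth)
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Hypothetical six-pixel masks: three pixels each, two of them shared.
pred = [0, 1, 1, 1, 0, 0]
truth = [0, 0, 1, 1, 1, 0]
dice = dice_score(pred, truth)  # 2 * 2 / (3 + 3)
```

A score of 1.0 means perfect overlap and 0.0 means none, so the reported move from 71% to 80% for femoral cartilage corresponds to a noticeably larger shared region relative to mask size.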
    Leverage Score Sampling for Tensor Product Matrices in Input Sparsity Time. (arXiv:2202.04515v2 [cs.LG] UPDATED)
    We propose an input sparsity time sampling algorithm that can spectrally approximate the Gram matrix corresponding to the $q$-fold column-wise tensor product of $q$ matrices using a nearly optimal number of samples, improving upon all previously known methods by poly$(q)$ factors. Furthermore, for the important special case of the $q$-fold self-tensoring of a dataset, which is the feature matrix of the degree-$q$ polynomial kernel, the leading term of our method's runtime is proportional to the size of the input dataset and has no dependence on $q$. Previous techniques either incur poly$(q)$ slowdowns in their runtime or remove the dependence on $q$ at the expense of having sub-optimal target dimension, and depend quadratically on the number of data-points in their runtime. Our sampling technique relies on a collection of $q$ partially correlated random projections which can be simultaneously applied to a dataset $X$ in total time that only depends on the size of $X$, and at the same time their $q$-fold Kronecker product acts as a near-isometry for any fixed vector in the column span of $X^{\otimes q}$. We also show that our sampling methods generalize to other classes of kernels beyond polynomial, such as Gaussian and Neural Tangent kernels.
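The link between $q$-fold self-tensoring and the degree-$q$ polynomial kernel rests on the identity $\langle x^{\otimes q}, y^{\otimes q}\rangle = \langle x, y\rangle^q$. The sketch below checks it by brute force on small integer vectors. It builds the explicit tensor features, whose length grows as $d^q$, which is exactly the cost that sampling and sketching methods aim to avoid; the function names and inputs are illustrative assumptions.

```python
from itertools import product


def self_tensor(x, q):
    """Explicit q-fold tensor product of x with itself, flattened to length len(x) ** q."""
    feats = []
    for idx in product(range(len(x)), repeat=q):
        value = 1
        for i in idx:
            value *= x[i]
        feats.append(value)
    return feats


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


x = [1, 2, 3]
y = [2, 0, 1]
q = 3

explicit = dot(self_tensor(x, q), self_tensor(y, q))  # inner product in tensor space
kernel = dot(x, y) ** q                               # polynomial kernel value
```

The two quantities agree, so the Gram matrix of the tensored dataset is the polynomial kernel matrix, even though the explicit features here already have 27 coordinates for a 3-dimensional input.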
    Geometric Policy Iteration for Markov Decision Processes. (arXiv:2206.05809v2 [cs.LG] UPDATED)
    Recently discovered polyhedral structures of the value function for finite state-action discounted Markov decision processes (MDP) shed light on understanding the success of reinforcement learning. We investigate the value function polytope in greater detail and characterize the polytope boundary using a hyperplane arrangement. We further show that the value space is a union of finitely many cells of the same hyperplane arrangement and relate it to the polytope of the classical linear programming formulation for MDPs. Inspired by these geometric properties, we propose a new algorithm, Geometric Policy Iteration (GPI), to solve discounted MDPs. GPI updates the policy of a single state by switching to an action that is mapped to the boundary of the value function polytope, followed by an immediate update of the value function. This new update rule aims at a faster value improvement without compromising computational efficiency. Moreover, our algorithm allows asynchronous updates of state values which is more flexible and advantageous compared to traditional policy iteration when the state set is large. We prove that the complexity of GPI achieves the best known bound $\mathcal{O}\left(\frac{|\mathcal{A}|}{1 - \gamma}\log \frac{1}{1-\gamma}\right)$ of policy iteration and empirically demonstrate the strength of GPI on MDPs of various sizes.
    Towards Representative Subset Selection for Self-Supervised Speech Recognition. (arXiv:2203.09829v2 [cs.LG] UPDATED)
    Self-supervised speech recognition models require considerable labeled training data for learning high-fidelity representations for Automatic Speech Recognition (ASR) which is computationally demanding and time-consuming, thereby hindering the usage of these models in resource-constrained environments. We consider the task of identifying an optimal subset of data to train self-supervised speech models for ASR. We make a surprising observation that the dataset pruning strategies used in vision tasks for sampling the most informative examples do not perform better than random subset selection on the task of fine-tuning self-supervised ASR. We then present the COWERAGE algorithm for better subset selection in self-supervised ASR, which is based on our finding that ensuring the coverage of examples based on training Word Error Rate (WER) in the early training epochs leads to better generalization performance. Extensive experiments on the wav2vec 2.0 model and TIMIT, Librispeech, and LJSpeech datasets show the effectiveness of COWERAGE, with up to 17% absolute WER improvement over existing dataset pruning methods and random sampling. We also demonstrate that the coverage of training instances in terms of WER ensures inclusion of phonemically diverse examples which leads to better test accuracy in self-supervised speech recognition models.
    Property Unlearning: A Defense Strategy Against Property Inference Attacks. (arXiv:2205.08821v2 [cs.CR] UPDATED)
    During the training of machine learning models, they may store or "learn" more information about the training data than what is actually needed for the prediction or classification task. This is exploited by property inference attacks which aim at extracting statistical properties from the training data of a given model without having access to the training data itself. These properties may include the quality of pictures to identify the camera model, the age distribution to reveal the target audience of a product, or the included host types to refine a malware attack in computer networks. This attack is especially accurate when the attacker has access to all model parameters, i.e., in a white-box scenario. By defending against such attacks, model owners are able to ensure that their training data, associated properties, and thus their intellectual property stays private, even if they deliberately share their models, e.g., to train collaboratively, or if models are leaked. In this paper, we introduce property unlearning, an effective defense mechanism against white-box property inference attacks, independent of the training data type, model task, or number of properties. Property unlearning mitigates property inference attacks by systematically changing the trained weights and biases of a target model such that an adversary cannot extract chosen properties. We empirically evaluate property unlearning on three different data sets, including tabular and image data, and two types of artificial neural networks. Our results show that property unlearning is both efficient and reliable to protect machine learning models against property inference attacks, with a good privacy-utility trade-off. Furthermore, our approach indicates that this mechanism is also effective to unlearn multiple properties.
    Deep learning algorithms for solving high dimensional nonlinear backward stochastic differential equations. (arXiv:2010.01319v3 [math.NA] UPDATED)
    In this work, we propose a new deep learning-based scheme for solving high dimensional nonlinear backward stochastic differential equations (BSDEs). The idea is to reformulate the problem as a global optimization, where the local loss functions are included. Essentially, we approximate the unknown solution of a BSDE using a deep neural network and its gradient with automatic differentiation. The approximations are performed by globally minimizing the quadratic local loss function defined at each time step, which always includes the terminal condition. Loss functions of this kind are obtained by iterating the Euler discretization of the time integrals with the terminal condition. Our formulation can prompt the stochastic gradient descent algorithm not only to take the accuracy at each time layer into account, but also to converge to a good local minimum. To demonstrate the performance of our algorithm, several high-dimensional nonlinear BSDEs, including pricing problems in finance, are provided.
    Correlation Clustering via Strong Triadic Closure Labeling: Fast Approximation Algorithms and Practical Lower Bounds. (arXiv:2111.10699v2 [cs.DS] UPDATED)
    Correlation clustering is a widely studied framework for clustering based on pairwise similarity and dissimilarity scores, but its best approximation algorithms rely on impractical linear programming relaxations. We present faster approximation algorithms that avoid these relaxations, for two well-studied special cases: cluster editing and cluster deletion. We accomplish this by drawing new connections to edge labeling problems related to the principle of strong triadic closure. This leads to faster and more practical linear programming algorithms, as well as extremely scalable combinatorial techniques, including the first combinatorial approximation algorithm for cluster deletion. In practice, our algorithms produce approximate solutions that nearly match the best algorithms in quality, while scaling to problems that are orders of magnitude larger.
    Deep Reinforcement Learning Guided Graph Neural Networks for Brain Network Analysis. (arXiv:2203.10093v3 [cs.LG] UPDATED)
    Modern neuroimaging techniques, such as diffusion tensor imaging (DTI) and functional magnetic resonance imaging (fMRI), enable us to model the human brain as a brain network or connectome. Capturing brain networks' structural information and hierarchical patterns is essential for understanding brain functions and disease states. Recently, the promising network representation learning capability of graph neural networks (GNNs) has prompted many GNN-based methods for brain network analysis to be proposed. Specifically, these methods apply feature aggregation and global pooling to convert brain network instances into meaningful low-dimensional representations used for downstream brain network analysis tasks. However, existing GNN-based methods often neglect that brain networks of different subjects may require various aggregation iterations and use GNN with a fixed number of layers to learn all brain networks. Therefore, how to fully release the potential of GNNs to promote brain network analysis is still non-trivial. To solve this problem, we propose a novel brain network representation framework, namely BN-GNN, which searches for the optimal GNN architecture for each brain network. Concretely, BN-GNN employs deep reinforcement learning (DRL) to train a meta-policy to automatically determine the optimal number of feature aggregations (reflected in the number of GNN layers) required for a given brain network. Extensive experiments on eight real-world brain network datasets demonstrate that our proposed BN-GNN improves the performance of traditional GNNs on different brain network analysis tasks.
    Zero-shot Transfer Learning on Heterogeneous Graphs via Knowledge Transfer Networks. (arXiv:2203.02018v3 [cs.LG] UPDATED)
    Data continuously emitted from industrial ecosystems such as social or commerce platforms are commonly represented as heterogeneous graphs (HG) composed of multiple node/edge types. State-of-the-art graph learning methods for HGs known as heterogeneous graph neural networks (HGNNs) are applied to learn deep context-informed node representations. However, many HG datasets from industrial applications suffer from label imbalance between node types. As there is no direct way to learn using labels rooted at different node types, HGNNs have been applied to only a few node types with abundant labels. We propose a zero-shot transfer learning module for HGNNs called a Knowledge Transfer Network (KTN) that transfers knowledge from label-abundant node types to zero-labeled node types through rich relational information given in the HG. KTN is derived from the theoretical relationship, which we introduce in this work, between distinct feature extractors for each node type given in an HGNN model. KTN improves the performance of 6 different types of HGNN models by up to 960% for inference on zero-labeled node types and outperforms state-of-the-art transfer learning baselines by up to 73% across 18 different transfer learning tasks on HGs.
    A Mixed-Integer Programming Approach to Training Dense Neural Networks. (arXiv:2201.00723v2 [cs.LG] UPDATED)
    Artificial Neural Networks (ANNs) are prevalent machine learning models that are applied across various real-world classification tasks. However, training ANNs is time-consuming and the resulting models take a lot of memory to deploy. In order to train more parsimonious ANNs, we propose a novel mixed-integer programming (MIP) formulation for training fully-connected ANNs. Our formulations can account for both binary and rectified linear unit (ReLU) activations, and for the use of a log-likelihood loss. We present numerical experiments comparing our MIP-based methods against existing approaches and show that we are able to achieve competitive out-of-sample performance with more parsimonious models.
    Virtual Homogeneity Learning: Defending against Data Heterogeneity in Federated Learning. (arXiv:2206.02465v2 [cs.LG] UPDATED)
    In federated learning (FL), model performance typically suffers from client drift induced by data heterogeneity, and mainstream works focus on correcting client drift. We propose a different approach named virtual homogeneity learning (VHL) to directly "rectify" the data heterogeneity. In particular, VHL conducts FL with a virtual homogeneous dataset crafted to satisfy two conditions: containing no private information and being separable. The virtual dataset can be generated from pure noise shared across clients, aiming to calibrate the features from the heterogeneous clients. Theoretically, we prove a generalization guarantee for VHL on the natural distribution. Empirically, we demonstrate that VHL endows FL with drastically improved convergence speed and generalization performance. VHL is the first attempt towards using a virtual dataset to address data heterogeneity, offering a new and effective approach to FL.
    Regret Bounds for Noise-Free Kernel-Based Bandits. (arXiv:2002.05096v2 [stat.ML] UPDATED)
    The kernel-based bandit is an extensively studied black-box optimization problem, in which the objective function is assumed to live in a known reproducing kernel Hilbert space. While nearly optimal regret bounds (up to logarithmic factors) are established in the noisy setting, surprisingly, less is known about the noise-free setting (when the exact values of the underlying function are accessible without observation noise). We discuss several upper bounds on regret, none of which seems order optimal, and provide a conjecture on the order optimal regret bound.
    Efficient End-to-End AutoML via Scalable Search Space Decomposition. (arXiv:2206.09423v2 [cs.LG] UPDATED)
    End-to-end AutoML, which automatically searches for ML pipelines in a space induced by feature engineering, algorithm/model selection, and hyper-parameter tuning, has attracted intensive interest from both academia and industry. Existing AutoML systems, however, suffer from scalability issues when applied to application domains with large, high-dimensional search spaces. We present VolcanoML, a scalable and extensible framework that facilitates systematic exploration of large AutoML search spaces. VolcanoML introduces and implements basic building blocks that decompose a large search space into smaller ones, and allows users to utilize these building blocks to compose an execution plan for the AutoML problem at hand. VolcanoML further supports a Volcano-style execution model -- akin to the one supported by modern database systems -- to execute the plan constructed. Our evaluation demonstrates that, not only does VolcanoML raise the level of expressiveness for search space decomposition in AutoML, it also leads to actual findings of decomposition strategies that are significantly more efficient than the ones employed by state-of-the-art AutoML systems such as auto-sklearn. This paper is the extended version of the initial VolcanoML paper that appeared in VLDB 2021.
    Inductive Biases and Variable Creation in Self-Attention Mechanisms. (arXiv:2110.10090v2 [cs.LG] UPDATED)
    Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. This work provides a theoretical analysis of the inductive biases of self-attention modules. Our focus is to rigorously establish which functions and long-range dependencies self-attention blocks prefer to represent. Our main result shows that bounded-norm Transformer networks "create sparse variables": a single self-attention head can represent a sparse function of the input sequence, with sample complexity scaling only logarithmically with the context length. To support our analysis, we present synthetic experiments to probe the sample complexity of learning sparse Boolean functions with Transformers.
    Generalizing to New Physical Systems via Context-Informed Dynamics Model. (arXiv:2202.01889v3 [cs.LG] UPDATED)
    Data-driven approaches to modeling physical systems fail to generalize to unseen systems that share the same general dynamics with the learning domain, but correspond to different physical contexts. We propose a new framework for this key problem, context-informed dynamics adaptation (CoDA), which takes into account the distributional shift across systems for fast and efficient adaptation to new dynamics. CoDA leverages multiple environments, each associated with a different dynamic, and learns to condition the dynamics model on contextual parameters specific to each environment. The conditioning is performed via a hypernetwork, learned jointly with a context vector from observed data. The proposed formulation constrains the search hypothesis space to foster fast adaptation and better generalization across environments. We theoretically motivate our approach and show state-of-the-art generalization results on a set of nonlinear dynamics, representative of a variety of application domains. We also show, on these systems, that new system parameters can be inferred from context vectors with minimal supervision. Code is available at https://github.com/yuan-yin/CoDA .
    Deep Reinforcement Learning for Optimal Power Flow with Renewables Using Graph Information. (arXiv:2112.11461v2 [cs.LG] UPDATED)
    Renewable energy resources (RERs) have been increasingly integrated into large-scale distributed power systems. Considering uncertainties and voltage fluctuation issues introduced by RERs, in this paper, we propose a deep reinforcement learning (DRL)-based strategy leveraging spatial-temporal (ST) graphical information of power systems, to dynamically search for the optimal operation, i.e., optimal power flow (OPF), of power systems with a high uptake of RERs. Specifically, we formulate the OPF problem as a multi-objective optimization problem considering generation cost, voltage fluctuation, and transmission loss, and employ deep deterministic policy gradient (DDPG) to learn an optimal allocation strategy for OPF. Moreover, given that the nodes in power systems are self-correlated and interrelated in temporal and spatial views, we develop a multi-grained attention-based spatial-temporal graph convolution network (MG-ASTGCN) for extracting ST graphical correlations and features, aiming to provide prior knowledge of power systems for its sequential DDPG algorithm to more effectively solve OPF. We validate our algorithm on modified IEEE 33, 69, and 118-bus radial distribution systems and demonstrate that our algorithm outperforms other benchmark algorithms. Our experimental results also reveal that our MG-ASTGCN can significantly accelerate DDPG's training process and performance in solving OPF.
    Turning Your Strength against You: Detecting and Mitigating Robust and Universal Adversarial Patch Attacks. (arXiv:2108.05075v3 [cs.CR] UPDATED)
    Adversarial patch attacks, which inject arbitrary distortions within a bounded region of an image, can trigger misclassification in deep neural networks (DNNs). These attacks are robust (i.e., physically realizable) and universally malicious, and hence represent a severe security threat to real-world DNN-based systems. This work proposes Jujutsu, a two-stage technique to detect and mitigate robust and universal adversarial patch attacks. We first observe that patch attacks often exert a large influence on the prediction output in order to dominate the prediction on any input, and Jujutsu is built to expose this behavior for effective attack detection. For mitigation, we observe that patch attacks corrupt only a localized region while the remaining contents are unperturbed. Based on this observation, Jujutsu leverages GAN-based image inpainting to synthesize the semantic contents in the pixels that are corrupted by the attacks and reconstruct the "clean" image for correct prediction. We evaluate Jujutsu on four diverse datasets and show that it achieves superior performance and significantly outperforms four leading defenses. Jujutsu can further defend against physical-world attacks, attacks that target diverse classes, and adaptive attacks. Our code is available at https://github.com/DependableSystemsLab/Jujutsu.
    Bugs in Machine Learning-based Systems: A Faultload Benchmark. (arXiv:2206.12311v1 [cs.SE])
    The rapid growth in applications of Machine Learning (ML) across various domains has drawn more attention to the quality of ML components. Consequently, a growing number of techniques and tools aim to improve the quality of ML components and integrate them safely into ML-based systems. Although most of these tools use bugs' lifecycle, there is no standard benchmark of bugs to assess their performance, compare them, and discuss their advantages and weaknesses. In this study, we first investigate the reproducibility and verifiability of the bugs in ML-based systems and show the most important factors in each one. Then, we explore the challenges of generating a benchmark of bugs in ML-based software systems and provide a bug benchmark, named defect4ML, that satisfies all criteria of a standard benchmark, i.e., relevance, reproducibility, fairness, verifiability, and usability. This faultload benchmark contains 113 bugs reported by ML developers on GitHub and Stack Overflow, using two of the most popular ML frameworks: TensorFlow and Keras. defect4ML also addresses important challenges in Software Reliability Engineering of ML-based software systems, such as: 1) fast changes in frameworks, by providing various bugs for different versions of frameworks, 2) code portability, by delivering similar bugs in different ML frameworks, 3) bug reproducibility, by providing fully reproducible bugs with complete information about required dependencies and data, and 4) lack of detailed information on bugs, by presenting links to the bugs' origins. defect4ML can be of interest to practitioners and researchers working on ML-based systems who wish to assess their testing tools and techniques.
    GNNSampler: Bridging the Gap between Sampling Algorithms of GNN and Hardware. (arXiv:2108.11571v2 [cs.LG] UPDATED)
    Sampling is a critical operation in Graph Neural Network (GNN) training that helps reduce the cost. Previous literature has explored improving sampling algorithms via mathematical and statistical methods. However, there is a gap between sampling algorithms and hardware. Without consideration of hardware, algorithm designers optimize sampling only at the algorithm level, missing the great potential of improving the efficiency of existing sampling algorithms by leveraging hardware features. In this paper, we first propose a unified programming model for mainstream sampling algorithms, termed GNNSampler, covering the critical processes of sampling algorithms in various categories. Second, to leverage hardware features, we choose data locality as a case study, and explore the data locality among nodes and their neighbors in a graph to alleviate irregular memory access in sampling. Third, we implement locality-aware optimizations in GNNSampler for various sampling algorithms to optimize the general sampling process. Finally, we conduct experiments on large graph datasets to analyze the relationship among training time, accuracy, and hardware-level metrics. Extensive experiments show that our method applies to mainstream sampling algorithms and helps significantly reduce the training time, especially on large-scale graphs.
    Cluster Attack: Query-based Adversarial Attacks on Graphs with Graph-Dependent Priors. (arXiv:2109.13069v2 [cs.LG] UPDATED)
    While deep neural networks have achieved great success in graph analysis, recent work has shown that they are vulnerable to adversarial attacks. Compared with adversarial attacks on image classification, performing adversarial attacks on graphs is more challenging because of the discrete and non-differentiable nature of the adjacency matrix of a graph. In this work, we propose Cluster Attack -- a Graph Injection Attack (GIA) on node classification, which injects fake nodes into the original graph to degrade the performance of graph neural networks (GNNs) on certain victim nodes while affecting the other nodes as little as possible. We demonstrate that a GIA problem can be equivalently formulated as a graph clustering problem; thus, the discrete optimization problem of the adjacency matrix can be solved in the context of graph clustering. In particular, we propose to measure the similarity between victim nodes by a metric of Adversarial Vulnerability, which is related to how the victim nodes will be affected by the injected fake node, and to cluster the victim nodes accordingly. Our attack is performed in a practical and unnoticeable query-based black-box manner in which only a few nodes on the graph can be accessed. Theoretical analysis and extensive experiments demonstrate the effectiveness of our method by fooling the node classifiers with only a small number of queries.
    Learning to Predict Graphs with Fused Gromov-Wasserstein Barycenters. (arXiv:2202.03813v3 [stat.ML] UPDATED)
    This paper introduces a novel and generic framework to solve the flagship task of supervised labeled graph prediction by leveraging Optimal Transport tools. We formulate the problem as regression with the Fused Gromov-Wasserstein (FGW) loss and propose a predictive model relying on an FGW barycenter whose weights depend on inputs. First, we introduce a non-parametric estimator based on kernel ridge regression for which theoretical results such as consistency and an excess risk bound are proved. Next, we propose an interpretable parametric model where the barycenter weights are modeled with a neural network and the graphs on which the FGW barycenter is calculated are additionally learned. Numerical experiments show the strength of the method and its ability to interpolate in the labeled graph space on simulated data and on a difficult metabolic identification problem where it can reach very good performance with very little engineering.
    Out of distribution robustness with pre-trained Bayesian neural networks. (arXiv:2206.12361v1 [cs.LG])
    We develop ShiftMatch, a new training-data-dependent likelihood for out of distribution (OOD) robustness in Bayesian neural networks (BNNs). ShiftMatch is inspired by the training-data-dependent "EmpCov" priors from Izmailov et al. (2021a) and efficiently matches test-time spatial correlations to those at training time. Critically, ShiftMatch is designed to leave neural network training unchanged, allowing it to use publicly available samples from pretrained BNNs. Using pre-trained HMC samples, ShiftMatch gives strong performance improvements on CIFAR-10-C, outperforms EmpCov priors, and is perhaps the first Bayesian method capable of convincingly outperforming plain deep ensembles. ShiftMatch can be integrated with non-Bayesian methods like deep ensembles, where it offers smaller, but still considerable, performance improvements. Overall, Bayesian ShiftMatch gave slightly better accuracy than ensembles with ShiftMatch, though they both had very similar log-likelihoods.
    Adversarially Robust Models may not Transfer Better: Sufficient Conditions for Domain Transferability from the View of Regularization. (arXiv:2202.01832v2 [cs.LG] UPDATED)
    Machine learning (ML) robustness and domain generalization are fundamentally correlated: they essentially concern data distribution shifts under adversarial and natural settings, respectively. On one hand, recent studies show that more robust (adversarially trained) models are more generalizable. On the other hand, there is a lack of theoretical understanding of their fundamental connections. In this paper, we explore the relationship between regularization and domain transferability considering different factors such as norm regularization and data augmentations (DA). We propose a general theoretical framework proving that factors involving the model function class regularization are sufficient conditions for relative domain transferability. Our analysis implies that "robustness" is neither necessary nor sufficient for transferability; rather, regularization is a more fundamental perspective for understanding domain transferability. We then discuss popular DA protocols (including adversarial training) and show when they can be viewed as the function class regularization under certain conditions and therefore improve generalization. We conduct extensive experiments to verify our theoretical findings and show several counterexamples where robustness and generalization are negatively correlated on different datasets.
    Channel Estimation for RIS-Empowered Multi-User MISO Wireless Communications. (arXiv:2008.01459v2 [cs.IT] UPDATED)
    Reconfigurable Intelligent Surfaces (RISs) have been recently considered as an energy-efficient solution for future wireless networks due to their fast and low-power configuration, which has increased potential in enabling massive connectivity and low-latency communications. Accurate and low-overhead channel estimation in RIS-based systems is one of the most critical challenges due to the usually large number of RIS unit elements and their distinctive hardware constraints. In this paper, we focus on the uplink of a RIS-empowered multi-user Multiple Input Single Output (MISO) communication system and propose a channel estimation framework based on the parallel factor decomposition to unfold the resulting cascaded channel model. We present two iterative estimation algorithms for the channels between the base station and RIS, as well as the channels between RIS and users. One is based on alternating least squares (ALS), while the other uses vector approximate message passing to iteratively reconstruct two unknown channels from the estimated vectors. To theoretically assess the performance of the ALS-based algorithm, we derive its estimation Cramér-Rao Bound (CRB). We also discuss the downlink achievable sum rate computation with estimated channels and different precoding schemes for the base station. Our extensive simulation results show that our algorithms outperform benchmark schemes and that the ALS technique achieves the CRB. It is also demonstrated that the sum rate using the estimated channels always reaches that of perfect channels under various settings, thus verifying the effectiveness and robustness of the proposed estimation algorithms.
    Source Localization of Graph Diffusion via Variational Autoencoders for Graph Inverse Problems. (arXiv:2206.12327v1 [cs.LG])
    Graph diffusion problems such as the propagation of rumors, computer viruses, or smart grid failures are ubiquitous and societally significant. Hence it is usually crucial to identify diffusion sources according to the current graph diffusion observations. Despite its tremendous necessity and significance in practice, source localization, as the inverse problem of graph diffusion, is extremely challenging as it is ill-posed: different sources may lead to the same graph diffusion patterns. Unlike most traditional source localization methods, this paper takes a probabilistic approach to account for the uncertainty of different candidate sources. Such endeavors require overcoming several challenges: 1) the uncertainty in graph diffusion source localization is hard to quantify; 2) the complex patterns of the graph diffusion sources are difficult to characterize probabilistically; 3) generalization under arbitrary underlying diffusion patterns is hard to impose. To solve the above challenges, this paper presents a generic framework: Source Localization Variational AutoEncoder (SL-VAE) for locating the diffusion sources under arbitrary diffusion patterns. Particularly, we propose a probabilistic model that leverages the forward diffusion estimation model along with deep generative models to approximate the diffusion source distribution for quantifying the uncertainty. SL-VAE further utilizes prior knowledge of the source-observation pairs to characterize the complex patterns of diffusion sources by a learned generative prior. Lastly, a unified objective that integrates the forward diffusion estimation model is derived to enforce the model to generalize under arbitrary diffusion patterns. Extensive experiments are conducted on 7 real-world datasets to demonstrate the superiority of SL-VAE in reconstructing the diffusion sources, outperforming other methods by 20% on average in AUC score.
    Simplified and Unified Analysis of Various Learning Problems by Reduction to Multiple-Instance Learning. (arXiv:1911.05999v4 [cs.LG] UPDATED)
    In statistical learning, many problem formulations have been proposed so far, such as multi-class learning, complementarily labeled learning, multi-label learning, and multi-task learning, which provide theoretical models for various real-world tasks. Although they have been extensively studied, the relationship among them has not been fully investigated. In this work, we focus on a particular problem formulation called Multiple-Instance Learning (MIL), and show that various learning problems, including all the problems mentioned above along with some new problems, can be reduced to MIL with theoretically guaranteed generalization bounds, where the reductions are established under a new reduction scheme we provide as a by-product. The results imply that the MIL-reduction gives a simplified and unified framework for designing and analyzing algorithms for various learning problems. Moreover, we show that the MIL-reduction framework can be kernelized.
    HANF: Hyperparameter And Neural Architecture Search in Federated Learning. (arXiv:2206.12342v1 [cs.LG])
    Automated machine learning (AutoML) is an important step toward making machine learning models widely applicable to real-world problems. Despite numerous research advances, machine learning methods are not fully utilized by industries, mainly due to data privacy and security regulations, the high cost of storing and processing increasing amounts of data at a central location, and, most importantly, a lack of expertise. Hence, we introduce a novel framework, HANF (Hyperparameter And Neural architecture search in Federated learning), as a step towards building an AutoML framework for data distributed across several data owner servers without any need to bring the data to a central location. HANF jointly optimizes a neural architecture and the non-architectural hyperparameters of a learning algorithm, using gradient-based neural architecture search and an n-armed bandit approach, respectively, in a data-distributed setting. We show that HANF efficiently finds the optimized neural architecture and also tunes the hyperparameters on data owner servers. Additionally, HANF can be applied in both federated and non-federated settings. Empirically, we show that HANF converges towards well-suited architectures and non-architectural hyperparameter sets using image-classification tasks.
    A Framework of Inertial Alternating Direction Method of Multipliers for Non-Convex Non-Smooth Optimization. (arXiv:2102.05433v2 [math.OC] UPDATED)
    In this paper, we propose an algorithmic framework, dubbed inertial alternating direction methods of multipliers (iADMM), for solving a class of nonconvex nonsmooth multiblock composite optimization problems with linear constraints. Our framework employs the general minimization-majorization (MM) principle to update each block of variables so as to not only unify the convergence analysis of previous ADMM that use specific surrogate functions in the MM step, but also lead to new efficient ADMM schemes. To the best of our knowledge, in the nonconvex nonsmooth setting, ADMM used in combination with the MM principle to update each block of variables, and ADMM combined with inertial terms for the primal variables have not been studied in the literature. Under standard assumptions, we prove the subsequential convergence and global convergence for the generated sequence of iterates. We illustrate the effectiveness of iADMM on a class of nonconvex low-rank representation problems.
    ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings. (arXiv:2206.12403v1 [cs.CV])
    We present a scalable approach for learning open-world object-goal navigation (ObjectNav) -- the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., "find a sink"). Our approach is entirely zero-shot -- i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we train on the image-goal navigation (ImageNav) task, in which agents find the location where a picture (i.e., goal image) was captured. Specifically, we encode goal images into a multimodal, semantic embedding space to enable training semantic-goal navigation (SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D). After training, SemanticNav agents can be instructed to find objects described in free-form natural language (e.g., "sink", "bathroom sink", etc.) by projecting language goals into the same multimodal, semantic embedding space. As a result, our approach enables open-world ObjectNav. We extensively evaluate our agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observe absolute improvements in success of 4.2% - 20.0% over existing zero-shot methods. For reference, these gains are similar to or better than the 5% improvement in success between the Habitat 2020 and 2021 ObjectNav challenge winners. In an open-world setting, we discover that our agents can generalize to compound instructions with a room explicitly mentioned (e.g., "Find a kitchen sink") and when the target room can be inferred (e.g., "Find a sink and a stove").
    Hard hat wearing detection based on head keypoint localization. (arXiv:2106.10944v2 [cs.CV] UPDATED)
    In recent years, much attention has been paid to deep learning methods in the context of vision-based construction site safety systems, especially regarding personal protective equipment. However, despite all this attention, there is still no reliable way to establish the relationship between workers and their hard hats. To address this problem, this article proposes a combination of deep learning, object detection, and head keypoint localization with simple rule-based reasoning. In tests, this solution surpassed the previous methods based on the relative bounding box position of different instances, as well as direct detection of hard hat wearers and non-wearers. The results show that combining novel deep learning methods with human-interpretable rule-based systems can yield a solution that is both reliable and able to mimic manual, on-site supervision. This work is the next step in the development of fully autonomous construction site safety systems and shows that there is still room for improvement in this area.
    Empirical and Instance-Dependent Estimation of Markov Chain and Mixing Time. (arXiv:1912.06845v3 [math.PR] UPDATED)
    We tackle the problem of estimating the mixing time of a Markov chain from a single trajectory of observations. In contrast with previous works which considered Hilbert space methods to estimate spectral gaps, we opt for an approach based on contraction with respect to total variation. Specifically, we define and estimate a generalized contraction coefficient based on Dobrushin's. We show that this quantity -- unlike the spectral gap -- controls the mixing time up to strong universal constants and remains valid for non-reversible chains. We design fully data-dependent confidence intervals around the coefficient, which are both easier to compute and thinner than their spectral counterparts. Furthermore, we initiate the beyond worst-case analysis, by showing how to leverage additional information about the transition matrix in order to obtain instance-dependent rates for its estimation with respect to the induced uniform norm, as well as some of its mixing properties.
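    For orientation, the classical Dobrushin contraction coefficient of a known transition matrix P is the largest total variation distance between any two rows of P, and it bounds how much one step of the chain contracts total variation between any two starting distributions. The sketch below computes that classical quantity for a small hypothetical matrix. It is not the generalized coefficient or the single-trajectory estimator from the paper, and the function names and example matrix are assumptions made for illustration.

```python
from itertools import combinations


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def dobrushin_coefficient(P):
    """Classical Dobrushin coefficient: max TV distance between two rows of P."""
    pairs = combinations(range(len(P)), 2)
    return max((total_variation(P[i], P[j]) for i, j in pairs), default=0.0)


# Hypothetical 3-state transition matrix (each row sums to 1).
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]

kappa = dobrushin_coefficient(P)
print(kappa)  # a value below 1 means each step strictly contracts TV distance
```

    A coefficient of 0 means all rows are identical (the chain mixes in one step), while a coefficient of 1 gives no one-step contraction guarantee.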
    Socially-Compatible Behavior Design of Autonomous Vehicles with Verification on Real Human Data. (arXiv:2010.14712v8 [cs.RO] UPDATED)
    As more and more autonomous vehicles (AVs) are being deployed on public roads, designing socially compatible behaviors for them is becoming increasingly important. In order to generate safe and efficient actions, AVs need to not only predict the future behaviors of other traffic participants, but also be aware of the uncertainties associated with such behavior prediction. In this paper, we propose an uncertain-aware integrated prediction and planning (UAPP) framework. It allows the AVs to infer the characteristics of other road users online and generate behaviors optimizing not only their own rewards, but also their courtesy to others, and their confidence regarding the prediction uncertainties. We first propose the definitions for courtesy and confidence. Based on that, their influences on the behaviors of AVs in interactive driving scenarios are explored. Moreover, we evaluate the proposed algorithm on naturalistic human driving data by comparing the generated behavior against ground truth. Results show that the online inference can significantly improve the human-likeness of the generated behaviors. Furthermore, we find that human drivers show great courtesy to others, even for those without right-of-way. We also find that such driving preferences vary significantly in different cultures.
    How many labelers do you have? A closer look at gold-standard labels. (arXiv:2206.12041v1 [math.ST])
    The construction of most supervised learning datasets revolves around collecting multiple labels for each instance, then aggregating the labels to form a type of "gold-standard." We question the wisdom of this pipeline by developing a (stylized) theoretical model of this process and analyzing its statistical consequences, showing how access to non-aggregated label information can make training well-calibrated models easier or -- in some cases -- even feasible, whereas it is impossible with only gold-standard labels. The entire story, however, is subtle, and the contrasts between aggregated and fuller label information depend on the particulars of the problem: estimators that use aggregated information exhibit robust but slower rates of convergence, while estimators that can effectively leverage all labels converge more quickly if they have fidelity to (or can learn) the true labeling process. The theory we develop in the stylized model makes several predictions for real-world datasets, including when non-aggregate labels should improve learning performance, which we test to corroborate the validity of our predictions.
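    To make the contrast between aggregated and non-aggregated labels concrete, the sketch below compares a majority-vote "gold-standard" label with the fraction of positive votes for the same instance. It only illustrates what information aggregation discards. It is not the stylized model or any estimator from the paper, and the instance names and votes are made up for the example.

```python
from collections import Counter


def majority_vote(labels):
    """Aggregate several labeler votes into one gold-standard label."""
    return Counter(labels).most_common(1)[0][0]


def positive_fraction(labels, positive=1):
    """Keep the non-aggregated signal: share of labelers voting positive."""
    return sum(1 for y in labels if y == positive) / len(labels)


# Hypothetical votes from five labelers per instance (binary labels).
votes = {
    "x1": [1, 1, 1, 1, 1],  # unanimous
    "x2": [1, 1, 1, 0, 0],  # contested, but same majority label as x1
    "x3": [0, 0, 0, 0, 1],
}

gold = {name: majority_vote(v) for name, v in votes.items()}
soft = {name: positive_fraction(v) for name, v in votes.items()}

for name in votes:
    print(name, gold[name], soft[name])
```

    Here x1 and x2 receive the same gold-standard label even though the labelers were far less certain about x2; that uncertainty is the kind of information a calibrated model could use.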
    Quantifying Inherent Randomness in Machine Learning Algorithms. (arXiv:2206.12353v1 [stat.ML])
    Most machine learning (ML) algorithms have several stochastic elements, and their performances are affected by these sources of randomness. This paper uses an empirical study to systematically examine the effects of two sources: randomness in model training and randomness in the partitioning of a dataset into training and test subsets. We quantify and compare the magnitude of the variation in predictive performance for the following ML algorithms: Random Forests (RFs), Gradient Boosting Machines (GBMs), and Feedforward Neural Networks (FFNNs). Among the different algorithms, randomness in model training causes larger variation for FFNNs compared to tree-based methods. This is to be expected as FFNNs have more stochastic elements that are part of their model initialization and training. We also found that random splitting of datasets leads to higher variation compared to the inherent randomness from model training. The variation from data splitting can be a major issue if the original dataset has considerable heterogeneity. Keywords: Model Training, Reproducibility, Variation
    Data-driven discovery of novel 2D materials by deep generative models. (arXiv:2206.12159v1 [cond-mat.mtrl-sci])
    Efficient algorithms to generate candidate crystal structures with good stability properties can play a key role in data-driven materials discovery. Here we show that a crystal diffusion variational autoencoder (CDVAE) is capable of generating two-dimensional (2D) materials of high chemical and structural diversity and formation energies mirroring the training structures. Specifically, we train the CDVAE on 2615 2D materials with energy above the convex hull $\Delta H_{\mathrm{hull}}< 0.3$ eV/atom, and generate 5003 materials that we relax using density functional theory (DFT). We also generate 14192 new crystals by systematic element substitution of the training structures. We find that the generative model and lattice decoration approach are complementary and yield materials with similar stability properties but very different crystal structures and chemical compositions. In total we find 11630 predicted new 2D materials, where 8599 of these have $\Delta H_{\mathrm{hull}}< 0.3$ eV/atom as the seed structures, while 2004 are within 50 meV of the convex hull and could potentially be synthesized. The relaxed atomic structures of all the materials are available in the open Computational 2D Materials Database (C2DB). Our work establishes the CDVAE as an efficient and reliable crystal generation machine, and significantly expands the space of 2D materials.
    Set Norm and Equivariant Skip Connections: Putting the Deep in Deep Sets. (arXiv:2206.11925v1 [cs.LG])
    Permutation invariant neural networks are a promising tool for making predictions from sets. However, we show that existing permutation invariant architectures, Deep Sets and Set Transformer, can suffer from vanishing or exploding gradients when they are deep. Additionally, layer norm, the normalization of choice in Set Transformer, can hurt performance by removing information useful for prediction. To address these issues, we introduce the clean path principle for equivariant residual connections and develop set norm, a normalization tailored for sets. With these, we build Deep Sets++ and Set Transformer++, models that reach high depths with comparable or better performance than their original counterparts on a diverse suite of tasks. We additionally introduce Flow-RBC, a new single-cell dataset and real-world application of permutation invariant prediction. We open-source our data and code here: https://github.com/rajesh-lab/deep_permutation_invariant.
    Physically Consistent Learning of Conservative Lagrangian Systems with Gaussian Processes. (arXiv:2206.12272v1 [cs.LG])
    This paper proposes a physically consistent Gaussian Process (GP) enabling the identification of uncertain Lagrangian systems. The function space is tailored according to the energy components of the Lagrangian and the differential equation structure, analytically guaranteeing physical and mathematical properties such as energy conservation and quadratic form. The novel formulation of Cholesky decomposed matrix kernels allows the probabilistic preservation of positive definiteness. Only differential input-to-output measurements of the function map are required while Gaussian noise is permitted in torques, velocities, and accelerations. We demonstrate the effectiveness of the approach in numerical simulation.
    Three Applications of Conformal Prediction for Rating Breast Density in Mammography. (arXiv:2206.12008v1 [eess.IV])
    Breast cancer is one of the most common cancers, and early detection from mammography screening is crucial in improving patient outcomes. Assessing mammographic breast density is clinically important, as denser breasts carry higher risk and are more likely to occlude tumors. Manual assessment by experts is both time-consuming and subject to inter-rater variability. As such, there has been increased interest in the development of deep learning methods for mammographic breast density assessment. Despite deep learning having demonstrated impressive performance in several prediction tasks for applications in mammography, clinical deployment of deep learning systems is still relatively rare; historically, mammography Computer-Aided Diagnoses (CAD) have over-promised and failed to deliver. This is in part due to the inability to intuitively quantify uncertainty of the algorithm for the clinician, which would greatly enhance usability. Conformal prediction is well suited to increase reliability and trust in deep learning tools, but it lacks realistic evaluations on medical datasets. In this paper, we present a detailed analysis of three possible applications of conformal prediction applied to medical imaging tasks: distribution shift characterization, prediction quality improvement, and subgroup fairness analysis. Our results show the potential of distribution-free uncertainty quantification techniques to enhance trust in AI algorithms and expedite their translation to usage.
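    As an illustration of the distribution-free uncertainty quantification mentioned in this abstract, the following is a minimal sketch of split conformal prediction for a classifier. It is not the authors' code: the function names, the nonconformity score (one minus the probability assigned to the true class), and the toy probabilities are all assumptions made for the example.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Score threshold from a held-out calibration set (assumed score: 1 - p(true class))."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: every class must be included.
        return float("inf")
    return scores[rank - 1]


def prediction_set(probs, q_hat):
    """All classes whose nonconformity score does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= q_hat]


# Toy calibration data for a two-class problem (hypothetical numbers).
cal_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
cal_labels = [0, 0, 1]
q_hat = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
print(prediction_set([0.85, 0.15], q_hat))
```

    A larger prediction set signals higher uncertainty for that input, which is the quantity a clinician-facing tool could surface; with real data the calibration set would need to be far larger than in this toy case.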
    On the Limitations of Elo: Real-World Games, are Transitive, not Additive. (arXiv:2206.12301v1 [cs.GT])
    Real-world competitive games, such as chess, go, or StarCraft II, rely on Elo models to measure the strength of their players. Since these games are not fully transitive, using Elo implicitly assumes they have a strong transitive component that can correctly be identified and extracted. In this study, we investigate the challenge of identifying the strength of the transitive component in games. First, we show that Elo models can fail to extract this transitive component, even in elementary transitive games. Then, based on this observation, we propose an extension of the Elo score: we end up with a disc ranking system that assigns each player two scores, which we refer to as skill and consistency. Finally, we propose an empirical validation on payoff matrices coming from real-world games played by bots and humans.
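    For readers unfamiliar with the baseline being extended, here is a minimal sketch of the standard Elo model, assuming the conventional logistic form with a scale of 400 and a K-factor of 32 (both are common defaults, not values taken from the paper). It does not implement the two-score extension the abstract proposes.

```python
def expected_score(r_a, r_b):
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated ratings after one game; score_a is 1 (win), 0.5 (draw) or 0 (loss)."""
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


# Two equally rated players; A wins.
print(elo_update(1500.0, 1500.0, 1.0))
```

    The prediction depends only on the rating difference, so each player is summarised by a single scalar; this is the additive structure that the title contrasts with transitivity.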
    Deep Stable neural networks: large-width asymptotics and convergence rates. (arXiv:2108.02316v2 [cs.LG] UPDATED)
    In modern deep learning, there is a recent and growing literature on the interplay between large-width asymptotic properties of deep Gaussian neural networks (NNs), i.e. deep NNs with Gaussian-distributed weights, and Gaussian stochastic processes (SPs). Such an interplay has proved to be critical in Bayesian inference under Gaussian SP priors, kernel regression for infinitely wide deep NNs trained via gradient descent, and information propagation within infinitely wide NNs. Motivated by empirical analyses that show the potential of replacing Gaussian distributions with Stable distributions for the NN's weights, in this paper we present a rigorous analysis of the large-width asymptotic behaviour of (fully connected) feed-forward deep Stable NNs, i.e. deep NNs with Stable-distributed weights. We show that as the width goes to infinity jointly over the NN's layers, i.e. the ``joint growth'' setting, a rescaled deep Stable NN converges weakly to a Stable SP whose distribution is characterized recursively through the NN's layers. Because of the non-triangular structure of the NN, this is a non-standard asymptotic problem, to which we propose an inductive approach of independent interest. Then, we establish sup-norm convergence rates of the rescaled deep Stable NN to the Stable SP, under the ``joint growth'' and a ``sequential growth'' of the width over the NN's layers. Such a result provides the difference between the ``joint growth'' and the ``sequential growth'' settings, showing that the former leads to a slower rate than the latter, depending on the depth of the layer and the number of inputs of the NN. Our work extends some recent results on infinitely wide limits for deep Gaussian NNs to the more general deep Stable NNs, providing the first result on convergence rates in the ``joint growth'' setting.
    Indecision Trees: Learning Argument-Based Reasoning under Quantified Uncertainty. (arXiv:2206.12252v1 [cs.LG])
    Using Machine Learning systems in the real world can often be problematic, with inexplicable black-box models, the assumed certainty of imperfect measurements, or providing a single classification instead of a probability distribution. This paper introduces Indecision Trees, a modification to Decision Trees which learn under uncertainty, can perform inference under uncertainty, provide a robust distribution over the possible labels, and can be disassembled into a set of logical arguments for use in other reasoning systems.
    Content Popularity Prediction Based on Quantized Federated Bayesian Learning in Fog Radio Access Networks. (arXiv:2206.12258v1 [cs.LG])
    In this paper, we investigate the content popularity prediction problem in cache-enabled fog radio access networks (F-RANs). In order to predict the content popularity with high accuracy and low complexity, we propose a Gaussian process based regressor to model the content request pattern. Firstly, the relationship between content features and popularity is captured by our proposed model. Then, we utilize Bayesian learning to train the model parameters, which is robust to overfitting. However, Bayesian methods are usually unable to find a closed-form expression of the posterior distribution. To tackle this issue, we apply a stochastic variance reduced gradient Hamiltonian Monte Carlo (SVRG-HMC) method to approximate the posterior distribution. To utilize the computing resources of other fog access points (F-APs) and to reduce the communications overhead, we propose a quantized federated learning (FL) framework combined with Bayesian learning. The quantized federated Bayesian learning framework allows each F-AP to send gradients to the cloud server after quantizing and encoding. It can effectively achieve a tradeoff between prediction accuracy and communications overhead. Simulation results show that our proposed policy outperforms the existing policies.
    End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue. (arXiv:2206.12040v1 [eess.AS])
    Recent text-to-speech (TTS) has achieved quality comparable to that of humans; however, its application in spoken dialogue has not been widely studied. This study aims to realize a TTS that closely resembles human dialogue. First, we record and transcribe actual spontaneous dialogues. Then, the proposed dialogue TTS is trained in two stages. In the first stage, variational autoencoder (VAE)-VITS or Gaussian mixture variational autoencoder (GMVAE)-VITS is trained, which introduces an utterance-level latent variable into variational inference with adversarial learning for end-to-end text-to-speech (VITS), a recently proposed end-to-end TTS model. A style encoder that extracts a latent speaking style representation from speech is trained jointly with TTS. In the second stage, a style predictor is trained to predict the speaking style to be synthesized from dialogue history. During inference, by passing the speaking style representation predicted by the style predictor to VAE/GMVAE-VITS, speech can be synthesized in a style appropriate to the context of the dialogue. Subjective evaluation results demonstrate that the proposed method outperforms the original VITS in terms of dialogue-level naturalness.
    Symbolic-Regression Boosting. (arXiv:2206.12082v1 [cs.NE])
    Modifying standard gradient boosting by replacing the embedded weak learner with a strong(er) one, we present SyRBo: Symbolic-Regression Boosting. Experiments over 98 regression datasets show that by adding a small number of boosting stages -- between 2 and 5 -- to a symbolic regressor, statistically significant improvements can often be attained. We note that coding SyRBo on top of any symbolic regressor is straightforward, and the added cost is simply a few more evolutionary rounds. SyRBo is essentially a simple add-on that can be readily added to an extant symbolic regressor, often with beneficial results.
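    The boosting loop this abstract builds on can be sketched as follows: each stage fits a base learner to the residuals left by the previous stages (squared-error boosting without shrinkage, which is an assumption here). The base learner below is a 1-D threshold stump used purely as a self-contained stand-in; in SyRBo it would be a symbolic regressor, which is not reproduced here. All names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Stand-in base learner: best single-threshold split on a 1-D input.

    Requires at least two distinct values in xs.
    """
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, fit, stages):
    """Fit `stages` learners, each on the residuals of the running ensemble."""
    models = []
    residuals = list(ys)
    for _ in range(stages):
        model = fit(xs, residuals)
        models.append(model)
        residuals = [r - model(x) for x, r in zip(xs, residuals)]
    return lambda x: sum(m(x) for m in models)


xs, ys = [0.0, 1.0, 2.0, 3.0], [0.0, 0.0, 1.0, 3.0]
ensemble = boost(xs, ys, fit_stump, stages=3)
print([round(ensemble(x), 3) for x in xs])
```

    Because `fit` is passed in as a callable, swapping the stump for a stronger regressor changes nothing in the loop itself, which is consistent with the abstract's remark that the method is a simple add-on.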
    Iterative Sound Source Localization for Unknown Number of Sources. (arXiv:2206.12273v1 [eess.AS])
    Sound source localization aims to find the direction of arrival (DOA) of all sound sources from the observed multi-channel audio. For the practical problem of an unknown number of sources, existing localization algorithms attempt to predict a likelihood-based coding (i.e., spatial spectrum) and employ a pre-determined threshold to detect the source number and corresponding DOA value. However, these threshold-based algorithms are not stable since they depend on a careful choice of threshold. To address this problem, we propose an iterative sound source localization approach called ISSL, which can iteratively extract each source's DOA without a threshold until the termination criterion is met. Unlike threshold-based algorithms, ISSL designs an active source detector network based on a binary classifier to accept the residual spatial spectrum and decide whether to stop the iteration. By doing so, our ISSL can deal with an arbitrary number of sources, even more than the number of sources seen during the training stage. The experimental results show that our ISSL achieves significant performance improvements in both DOA estimation and source number detection compared with the existing threshold-based algorithms.
    From Tensor Network Quantum States to Tensorial Recurrent Neural Networks. (arXiv:2206.12363v1 [quant-ph])
    We show that any matrix product state (MPS) can be exactly represented by a recurrent neural network (RNN) with a linear memory update. We generalize this RNN architecture to 2D lattices using a multilinear memory update. It supports perfect sampling and wave function evaluation in polynomial time, and can represent an area law of entanglement entropy. Numerical evidence shows that it can encode the wave function using a bond dimension lower by orders of magnitude when compared to MPS, with an accuracy that can be systematically improved by increasing the bond dimension.
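    The first claim in this abstract can be illustrated with a minimal sketch: the amplitude of a matrix product state with open boundary vectors is a chain of matrix products, which can be evaluated site by site as a recurrent cell whose hidden state is updated linearly. The data layout (`tensors[site][x]` as a list-of-rows matrix), the boundary vectors, and the function names are assumptions made for the example; the paper's 2D generalisation and sampling procedure are not shown.

```python
def matvec_T(matrix, vec):
    """Compute vec^T @ matrix for a list-of-rows matrix, returned as a flat list."""
    cols = len(matrix[0])
    return [sum(matrix[i][j] * vec[i] for i in range(len(vec))) for j in range(cols)]


def mps_amplitude_rnn(tensors, left, right, config):
    """Evaluate left^T A_1[x_1] ... A_n[x_n] right as a recurrence over sites."""
    h = list(left)  # initial hidden state is the left boundary vector
    for site, x in enumerate(config):
        # Linear memory update: the input x only selects which matrix acts on h.
        h = matvec_T(tensors[site][x], h)
    return sum(hi * ri for hi, ri in zip(h, right))


identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
tensors = [[identity, swap], [identity, swap]]  # two sites, bond dimension 2
print(mps_amplitude_rnn(tensors, [1.0, 0.0], [1.0, 0.0], [1, 1]))
```

    The hidden state has the size of the bond dimension, so increasing the bond dimension corresponds to widening the recurrent memory.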
    RARTS: An Efficient First-Order Relaxed Architecture Search Method. (arXiv:2008.03901v2 [cs.LG] UPDATED)
    Differentiable architecture search (DARTS) is an effective method for data-driven neural network design based on solving a bilevel optimization problem. Despite its success in many architecture search tasks, there are still some concerns about the accuracy of first-order DARTS and the efficiency of second-order DARTS. In this paper, we formulate a single-level alternative and a relaxed architecture search (RARTS) method that utilizes the whole dataset in architecture learning via both data and network splitting, without involving mixed second derivatives of the corresponding loss functions like DARTS. In our formulation of network splitting, two networks with different but related weights cooperate in search of a shared architecture. The advantage of RARTS over DARTS is justified by a convergence theorem and an analytically solvable model. Moreover, RARTS outperforms DARTS and its variants in accuracy and search efficiency, as shown in adequate experimental results. For the task of searching topological architecture, i.e., the edges and the operations, RARTS obtains higher accuracy and a 60% reduction in computational cost compared with second-order DARTS on CIFAR-10. RARTS continues to outperform DARTS upon transfer to ImageNet and is on par with recent variants of DARTS even though our innovation is purely on the training algorithm without modifying the search space. For the task of searching width, i.e., the number of channels in convolutional layers, RARTS also outperforms the traditional network pruning benchmarks. Further experiments on public architecture search benchmarks such as NATS-Bench also support the preeminence of RARTS.
    MPClan: Protocol Suite for Privacy-Conscious Computations. (arXiv:2206.12224v1 [cs.CR])
    The growing volumes of data being collected and its analysis to provide better services are creating worries about digital privacy. To address privacy concerns and give practical solutions, the literature has relied on secure multiparty computation. However, recent research has mostly focused on the small-party honest-majority setting of up to four parties, noting efficiency concerns. In this work, we extend the strategies to support a larger number of participants in an honest-majority setting with efficiency at the center stage. Cast in the preprocessing paradigm, our semi-honest protocol improves the online complexity of the decade-old state-of-the-art protocol of Damgård and Nielsen (CRYPTO'07). In addition to having an improved online communication cost, we can shut down almost half of the parties in the online phase, thereby saving up to 50% in the system's operational costs. Our maliciously secure protocol also enjoys similar benefits and requires only half of the parties, except for one-time verification, towards the end. To showcase the practicality of the designed protocols, we benchmark popular applications such as deep neural networks, graph neural networks, genome sequence matching, and biometric matching using prototype implementations. Our improved protocols aid in bringing up to 60-80% savings in monetary cost over prior work.
    Multi-Frequency Joint Community Detection and Phase Synchronization. (arXiv:2206.12276v1 [cs.SI])
    This paper studies the joint community detection and phase synchronization problem on the stochastic block model with relative phase, where each node is associated with a phase. This problem, with a variety of real-world applications, aims to recover community memberships and associated phases simultaneously. By studying the maximum likelihood estimation formulation, we show that this problem exhibits a ``multi-frequency'' structure. To this end, two simple yet efficient algorithms that leverage information across multiple frequencies are proposed. The former is a spectral method based on the novel multi-frequency column-pivoted QR factorization, and the latter is an iterative multi-frequency generalized power method. Numerical experiments indicate that our proposed algorithms outperform state-of-the-art algorithms in recovering community memberships and associated phases.
    Adversarial Robustness of Deep Neural Networks: A Survey from a Formal Verification Perspective. (arXiv:2206.12227v1 [cs.CR])
    Neural networks have been widely applied in security applications such as spam and phishing detection, intrusion prevention, and malware detection. This black-box method, however, often has uncertainty and poor explainability in applications. Furthermore, neural networks themselves are often vulnerable to adversarial attacks. For those reasons, there is a high demand for trustworthy and rigorous methods to verify the robustness of neural network models. Adversarial robustness, which concerns the reliability of a neural network when dealing with maliciously manipulated inputs, is one of the hottest topics in security and machine learning. In this work, we survey existing literature in adversarial robustness verification for neural networks and collect 39 diversified research works across machine learning, security, and software engineering domains. We systematically analyze their approaches, including how robustness is formulated, what verification techniques are used, and the strengths and limitations of each technique. We provide a taxonomy from a formal verification perspective for a comprehensive understanding of this topic. We classify the existing techniques based on property specification, problem reduction, and reasoning strategies. We also demonstrate representative techniques that have been applied in existing studies with a sample model. Finally, we discuss open questions for future research.
    ModLaNets: Learning Generalisable Dynamics via Modularity and Physical Inductive Bias. (arXiv:2206.12325v1 [cs.LG])
    Deep learning models are able to approximate one specific dynamical system but struggle at learning generalisable dynamics, where dynamical systems obey the same laws of physics but contain different numbers of elements (e.g., double- and triple-pendulum systems). To relieve this issue, we propose the Modular Lagrangian Network (ModLaNet), a structural neural network framework with modularity and physical inductive bias. This framework models the energy of each element using modularity and then constructs the target dynamical system via Lagrangian mechanics. Modularity is beneficial for reusing trained networks and reducing the scale of networks and datasets. As a result, our framework can learn from the dynamics of simpler systems and extend to more complex ones, which is not feasible using other relevant physics-informed neural networks. We examine our framework for modelling double-pendulum or three-body systems with small training datasets, where our models achieve the best data efficiency and accuracy performance compared with counterparts. We also reorganise our models as extensions to model multi-pendulum and multi-body systems, demonstrating the intriguing reusable feature of our framework.
    A Manifold-based Airfoil Geometric-feature Extraction and Discrepant Data Fusion Learning Method. (arXiv:2206.12254v1 [cs.LG])
    The geometrical shape of an airfoil, together with the corresponding flight conditions, is a crucial factor in aerodynamic performance prediction. The airfoil geometrical features obtained by most existing approaches (e.g., geometrical parameter extraction, polynomial description and deep learning) lie in Euclidean space. State-of-the-art studies have shown that the curves or surfaces of an airfoil form a manifold in Riemannian space. Therefore, the features extracted by existing methods are not sufficient to reflect the geometric features of airfoils. Meanwhile, flight conditions and geometric features are data of very different types, so the influence of these two factors on the final aerodynamic performance predictions must be evaluated and learned to improve prediction accuracy. Motivated by the advantages of manifold theory and multi-task learning, we propose a manifold-based airfoil geometric-feature extraction and discrepant data fusion learning method (MDF) to extract geometric features of airfoils in Riemannian space (we call them manifold-features) and further fuse the manifold-features with flight conditions to predict aerodynamic performance. Experimental results show that our method extracts geometric features of airfoils more accurately than existing methods: the average MSE of re-built airfoils is reduced by 56.33%, and, while keeping the same prediction accuracy for CL, the MSE of CD predicted by MDF is further reduced by 35.37%.
    zPROBE: Zero Peek Robustness Checks for Federated Learning. (arXiv:2206.12100v1 [cs.LG])
    Privacy-preserving federated learning allows multiple users to jointly train a model with coordination of a central server. The server only learns the final aggregation result, thereby preventing leakage of the users' (private) training data from the individual model updates. However, keeping the individual updates private allows malicious users to perform Byzantine attacks and degrade the model accuracy without being detected. Best existing defenses against Byzantine workers rely on robust rank-based statistics, e.g., the median, to find malicious updates. However, implementing privacy-preserving rank-based statistics is nontrivial and unscalable in the secure domain, as it requires sorting of all individual updates. We establish the first private robustness check that uses high break point rank-based statistics on aggregated model updates. By exploiting randomized clustering, we significantly improve the scalability of our defense without compromising privacy. We leverage the derived statistical bounds in zero-knowledge proofs to detect and remove malicious updates without revealing the private user updates. Our novel framework, zPROBE, enables Byzantine resilient and secure federated learning. Empirical evaluations demonstrate that zPROBE provides a low overhead solution to defend against state-of-the-art Byzantine attacks while preserving privacy.
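    To see why rank-based statistics are attractive as a Byzantine defence, the sketch below compares a coordinate-wise mean with a coordinate-wise median on plaintext updates. This only illustrates the statistic; it does not implement zPROBE's secure aggregation, randomized clustering, or zero-knowledge proofs, and the update values are made up.

```python
from statistics import mean, median


def aggregate(updates, reducer):
    """Apply `reducer` independently to each coordinate across all user updates."""
    return [reducer(coord) for coord in zip(*updates)]


# Four honest updates and one malicious outlier (hypothetical values).
updates = [
    [1.0, 2.0],
    [1.2, 1.8],
    [0.8, 2.2],
    [1.1, 2.1],
    [100.0, -100.0],
]
print("mean:  ", aggregate(updates, mean))
print("median:", aggregate(updates, median))
```

    The median stays close to the honest updates, while the mean is pulled far away by a single outlier. Computing the median requires ordering the individual updates, which is the step the abstract describes as difficult to do privately.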
    InfoAT: Improving Adversarial Training Using the Information Bottleneck Principle. (arXiv:2206.12292v1 [cs.LG])
    Adversarial training (AT) has shown excellent performance in defending against adversarial examples. Recent studies demonstrate that examples are not equally important to the final robustness of models during AT, that is, the so-called hard examples that can be attacked easily exhibit more influence than robust examples on the final robustness. Therefore, guaranteeing the robustness of hard examples is crucial for improving the final robustness of the model. However, defining effective heuristics to search for hard examples is still difficult. In this article, inspired by the information bottleneck (IB) principle, we uncover that an example with high mutual information of the input and its associated latent representation is more likely to be attacked. Based on this observation, we propose a novel and effective adversarial training method (InfoAT). InfoAT is encouraged to find examples with high mutual information and exploit them efficiently to improve the final robustness of models. Experimental results show that InfoAT achieves the best robustness among different datasets and models in comparison with several state-of-the-art methods.
    Synthesizing Rolling Bearing Fault Samples in New Conditions: A framework based on a modified CGAN. (arXiv:2206.12076v1 [cs.LG])
    Bearings are one of the vital components of rotating machines that are prone to unexpected faults. Therefore, bearing fault diagnosis and condition monitoring is essential for reducing operational costs and downtime in numerous industries. In various production conditions, bearings can be operated under a range of loads and speeds, which causes different vibration patterns associated with each fault type. Normal data is ample as systems usually work in desired conditions. On the other hand, fault data is rare, and in many conditions, there is no data recorded for the fault classes. Accessing fault data is crucial for developing data-driven fault diagnosis tools that can improve both the performance and safety of operations. To this end, a novel algorithm based on Conditional Generative Adversarial Networks (CGANs) is introduced. Trained on the normal and fault data on any actual fault conditions, this algorithm generates fault data from normal data of target conditions. The proposed method is validated on a real-world bearing dataset, and fault data are generated for different conditions. Several state-of-the-art classifiers and visualization models are implemented to evaluate the quality of the synthesized data. The results demonstrate the efficacy of the proposed algorithm.
    AnyMorph: Learning Transferable Polices By Inferring Agent Morphology. (arXiv:2206.12279v1 [cs.LG])
    The prototypical approach to reinforcement learning involves training policies tailored to a particular agent from scratch for every new morphology. Recent work aims to eliminate the re-training of policies by investigating whether a morphology-agnostic policy, trained on a diverse set of agents with similar task objectives, can be transferred to new agents with unseen morphologies without re-training. This is a challenging problem that required previous approaches to use hand-designed descriptions of the new agent's morphology. Instead of hand-designing this description, we propose a data-driven method that learns a representation of morphology directly from the reinforcement learning objective. Ours is the first reinforcement learning algorithm that can train a policy to generalize to new agent morphologies without requiring a description of the agent's morphology in advance. We evaluate our approach on the standard benchmark for agent-agnostic control, and improve over the current state of the art in zero-shot generalization to new agents. Importantly, our method attains good performance without an explicit description of morphology.
    Multi-modal Sensor Data Fusion for In-situ Classification of Animal Behavior Using Accelerometry and GNSS Data. (arXiv:2206.12078v1 [cs.LG])
    We examine using data from multiple sensing modes, i.e., accelerometry and global navigation satellite system (GNSS), for classifying animal behavior. We extract three new features from the GNSS data, namely, the distance from the water point, median speed, and median estimated horizontal position error. We consider two approaches for combining the information available from the accelerometry and GNSS data. The first approach is based on concatenating the features extracted from both sensor data and feeding the concatenated feature vector into a multi-layer perceptron (MLP) classifier. The second approach is based on fusing the posterior probabilities predicted by two MLP classifiers each taking the features extracted from the data of one sensor as input. We evaluate the performance of the developed multi-modal animal behavior classification algorithms using two real-world datasets collected via smart cattle collar and ear tags. The leave-one-animal-out cross-validation results show that both approaches improve the classification performance appreciably compared with using the data from only one sensing mode, in particular, for the infrequent but important behaviors of walking and drinking. The algorithms developed based on both approaches require rather small computational and memory resources hence are suitable for implementation on embedded systems of our collar and ear tags. However, the multi-modal animal behavior classification algorithm based on posterior probability fusion is preferable to the one based on feature concatenation as it delivers better classification accuracy, has less computational and memory complexity, is more robust to sensor data failure, and enjoys better modularity.
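    The second approach in this abstract, fusing the posteriors of two per-sensor classifiers, can be sketched as below. The abstract does not state the fusion rule, so the normalised product used here (which treats the two sensors as conditionally independent under a uniform class prior) is an assumption, as are the function name and the fallback to a single sensor when the other is unavailable.

```python
def fuse_posteriors(p_accel, p_gnss=None):
    """Combine two class-posterior vectors by a normalised element-wise product.

    If the GNSS posterior is missing (e.g. sensor failure), fall back to accelerometry alone.
    """
    if p_gnss is None:
        return list(p_accel)
    product = [a * g for a, g in zip(p_accel, p_gnss)]
    total = sum(product)
    return [v / total for v in product]


# Accelerometry is undecided between two behaviours; GNSS favours the first.
print(fuse_posteriors([0.5, 0.5], [0.8, 0.2]))
print(fuse_posteriors([0.5, 0.5]))  # GNSS unavailable
```

    Keeping the two classifiers separate means either one can be retrained or dropped without touching the other, which matches the modularity and robustness to sensor failure that the abstract attributes to posterior fusion.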
    Computational Complexity Evaluation of Neural Network Applications in Signal Processing. (arXiv:2206.12191v1 [eess.SP])
    In this paper, we provide a systematic approach for assessing and comparing the computational complexity of neural network layers in digital signal processing. We provide and link four software-to-hardware complexity measures, defining how the different complexity metrics relate to the layers' hyper-parameters. This paper explains how to compute these four metrics for feed-forward and recurrent layers, and defines in which case we ought to use a particular metric depending on whether we characterize a more soft- or hardware-oriented application. One of the four metrics, called `the number of additions and bit shifts (NABS)', is newly introduced for heterogeneous quantization. NABS characterizes the impact of not only the bitwidth used in the operation but also the type of quantization used in the arithmetical operations. We intend this work to serve as a baseline for the different levels (purposes) of complexity estimation related to the neural networks' application in real-time digital signal processing, aiming at unifying the computational complexity estimation.
    Adversarial Zoom Lens: A Novel Physical-World Attack to DNNs. (arXiv:2206.12251v1 [cs.CR])
    Although deep neural networks (DNNs) are known to be fragile, no one has studied the effects of zooming-in and zooming-out of images in the physical world on DNN performance. In this paper, we demonstrate a novel physical adversarial attack technique called Adversarial Zoom Lens (AdvZL), which uses a zoom lens to zoom in and out of pictures of the physical world, fooling DNNs without changing the characteristics of the target object. The proposed method is so far the only adversarial attack technique that does not add a physical adversarial perturbation to attack DNNs. In a digital environment, we construct a data set based on AdvZL to verify the antagonism of equal-scale enlarged images to DNNs. In the physical environment, we manipulate the zoom lens to zoom in and out of the target object, and generate adversarial samples. The experimental results demonstrate the effectiveness of AdvZL in both digital and physical environments. We further analyze the antagonism of the proposed data set to the improved DNNs. On the other hand, we provide a guideline for defense against AdvZL by means of adversarial training. Finally, we look into the threat possibilities of the proposed approach to future autonomous driving and variant attack ideas similar to the proposed attack.
    Towards FPGA Implementation of Neural Network-Based Nonlinearity Mitigation Equalizers in Coherent Optical Transmission Systems. (arXiv:2206.12180v1 [eess.SP])
    For the first time, recurrent and feedforward neural network-based equalizers for nonlinearity compensation are implemented in an FPGA, with a level of complexity comparable to that of a dispersion equalizer. We demonstrate that the NN-based equalizers can outperform a 1 step-per-span DBP.
    CoSP: Co-supervised pretraining of pocket and ligand. (arXiv:2206.12241v1 [cs.LG])
    Can we inject the pocket-ligand interaction knowledge into the pre-trained model and jointly learn their chemical space? Pretraining molecules and proteins has attracted considerable attention in recent years, while most of these approaches focus on learning one of the chemical spaces and lack the injection of biological knowledge. We propose a co-supervised pretraining (CoSP) framework to simultaneously learn 3D pocket and ligand representations. We use a gated geometric message passing layer to model both 3D pockets and ligands, where each node's chemical features, geometric position and orientation are considered. To learn biological meaningful embeddings, we inject the pocket-ligand interaction knowledge into the pretraining model via contrastive loss. Considering the specificity of molecules, we further propose a chemical similarity-enhanced negative sampling strategy to improve the contrastive learning performance. Through extensive experiments, we conclude that CoSP can achieve competitive results in pocket matching, molecule property predictions, and virtual screening.
    Reinforcement learning based adaptive metaheuristics. (arXiv:2206.12233v1 [cs.NE])
    Parameter adaptation, that is, the capability to automatically adjust an algorithm's hyperparameters depending on the problem being faced, is one of the main trends in evolutionary computation applied to numerical optimization. While several handcrafted adaptation policies have been proposed over the years to address this problem, only a few attempts have been made so far to apply machine learning to learn such policies. Here, we introduce a general-purpose framework for performing parameter adaptation in continuous-domain metaheuristics based on state-of-the-art reinforcement learning algorithms. We demonstrate the applicability of this framework on two algorithms, namely Covariance Matrix Adaptation Evolution Strategies (CMA-ES) and Differential Evolution (DE), for which we learn, respectively, adaptation policies for the step-size (for CMA-ES), and the scale factor and crossover rate (for DE). We train these policies on a set of 46 benchmark functions at different dimensionalities, with various inputs to the policies, in two settings: one policy per function, and one global policy for all functions. Compared, respectively, to the Cumulative Step-size Adaptation (CSA) policy and to two well-known adaptive DE variants (iDE and jDE), our policies are able to produce competitive results in the majority of cases, especially in the case of DE.
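    To make the two DE parameters named above concrete, here is a minimal sketch of how the scale factor and crossover rate enter one step of the classic DE/rand/1/bin scheme. It is a textbook variant chosen for illustration, not the paper's framework; the function name `de_trial_vector` and the toy population are hypothetical.

```python
import random


def de_trial_vector(population, target_idx, scale_factor, crossover_rate, rng):
    """Build one DE/rand/1/bin trial vector for population[target_idx].

    Illustrative sketch only: `scale_factor` and `crossover_rate` are the two
    parameters an adaptation policy would adjust; here they are fixed inputs.
    """
    target = population[target_idx]
    dim = len(target)
    # Pick three distinct individuals, all different from the target.
    candidates = [i for i in range(len(population)) if i != target_idx]
    a, b, c = rng.sample(candidates, 3)
    # Mutation: base vector plus the scaled difference of two other vectors.
    mutant = [
        population[a][j] + scale_factor * (population[b][j] - population[c][j])
        for j in range(dim)
    ]
    # Binomial crossover: each gene comes from the mutant with probability
    # `crossover_rate`; one forced index guarantees at least one mutant gene.
    forced = rng.randrange(dim)
    trial = [
        mutant[j] if (rng.random() < crossover_rate or j == forced) else target[j]
        for j in range(dim)
    ]
    return trial, mutant


rng = random.Random(0)
population = [[rng.uniform(-5.0, 5.0) for _ in range(4)] for _ in range(6)]
trial, mutant = de_trial_vector(
    population, 0, scale_factor=0.5, crossover_rate=1.0, rng=rng
)
```

    With a crossover rate of 1.0 the trial vector is the mutant itself; with a rate of 0.0 only the single forced gene is taken from the mutant, which is why the two parameters jointly control how aggressively DE explores.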
    World Value Functions: Knowledge Representation for Learning and Planning. (arXiv:2206.11940v1 [cs.AI])
    We propose world value functions (WVFs), a type of goal-oriented general value function that represents how to solve not just a given task, but any other goal-reaching task in an agent's environment. This is achieved by equipping an agent with an internal goal space defined as all the world states where it experiences a terminal transition. The agent can then modify the standard task rewards to define its own reward function, which provably drives it to learn how to achieve all reachable internal goals, and the value of doing so in the current task. We demonstrate two key benefits of WVFs in the context of learning and planning. In particular, given a learned WVF, an agent can compute the optimal policy in a new task by simply estimating the task's reward function. Furthermore, we show that WVFs also implicitly encode the transition dynamics of the environment, and so can be used to perform planning. Experimental results show that WVFs can be learned faster than regular value functions, while their ability to infer the environment's dynamics can be used to integrate learning and planning methods to further improve sample efficiency.
    Cyclic Graph Attentive Match Encoder (CGAME): A Novel Neural Network For OD Estimation. (arXiv:2111.14625v3 [cs.LG] UPDATED)
    Origin-Destination (OD) estimation plays an important role in traffic management and traffic simulation in the era of Intelligent Transportation Systems (ITS). Nevertheless, previous model-based methods face an under-determination challenge and therefore urgently need additional assumptions and extra data. Deep learning provides an ideal data-based method for connecting inputs and outputs by probabilistic distribution transformation. However, research applying deep learning to OD estimation remains limited because of the challenges of transforming data across representation spaces, especially from a dynamic spatial-temporal space to a heterogeneous graph. To address this, we propose Cyclic Graph Attentive Matching Encoder (C-GAME) based on a novel Graph Matcher with a double-layer attention mechanism. It realizes effective information exchange and establishes a coupling relationship across the underlying feature spaces. The proposed model achieves state-of-the-art results in experiments, and offers a novel framework for inference tasks across spaces in prospective applications.
    Classifying Unstructured Clinical Notes via Automatic Weak Supervision. (arXiv:2206.12088v1 [cs.CL])
    Healthcare providers usually record detailed notes of the clinical care delivered to each patient for clinical, research, and billing purposes. Due to the unstructured nature of these narratives, providers employ dedicated staff to assign diagnostic codes to patients' diagnoses using the International Classification of Diseases (ICD) coding system. This manual process is not only time-consuming but also costly and error-prone. Prior work demonstrated potential utility of Machine Learning (ML) methodology in automating this process, but it has relied on large quantities of manually labeled data to train the models. Additionally, diagnostic coding systems evolve with time, which makes traditional supervised learning strategies unable to generalize beyond local applications. In this work, we introduce a general weakly-supervised text classification framework that learns from class-label descriptions only, without the need to use any human-labeled documents. It leverages the linguistic domain knowledge stored within pre-trained language models and the data programming framework to assign code labels to individual texts. We demonstrate the efficacy and flexibility of our method by comparing it to state-of-the-art weak text classifiers across four real-world text classification datasets, in addition to assigning ICD codes to medical notes in the publicly available MIMIC-III database.
    Exploring System Performance of Continual Learning for Mobile and Embedded Sensing Applications. (arXiv:2110.13290v2 [cs.LG] UPDATED)
    Continual learning approaches help deep neural network models adapt and learn incrementally by trying to solve catastrophic forgetting. However, whether these existing approaches, applied traditionally to image-based tasks, work with the same efficacy on the sequential time series data generated by mobile or embedded sensing systems remains an unanswered question. To address this void, we conduct the first comprehensive empirical study that quantifies the performance of three predominant continual learning schemes (i.e., regularization, replay, and replay with exemplars) on six datasets from three mobile and embedded sensing applications in a range of scenarios having different learning complexities. More specifically, we implement an end-to-end continual learning framework on edge devices. Then we investigate the generalizability, trade-offs between performance, storage, computational costs, and memory footprint of different continual learning methods. Our findings suggest that replay with exemplars-based schemes such as iCaRL have the best performance trade-offs, even in complex scenarios, at the expense of some storage space (a few MBs) for training examples (1% to 5%). We also demonstrate for the first time that it is feasible and practical to run continual learning on-device with a limited memory budget. In particular, the latency on two types of mobile and embedded devices suggests that both incremental learning time (a few seconds to 4 minutes) and training time (1 to 75 minutes) across datasets are acceptable, as training could happen on the device when the embedded device is charging, thereby ensuring complete data privacy. Finally, we present some guidelines for practitioners who want to apply a continual learning paradigm for mobile sensing tasks.
    On making optimal transport robust to all outliers. (arXiv:2206.11988v1 [stat.ML])
    Optimal transport (OT) is known to be sensitive to outliers because of its marginal constraints. Outlier robust OT variants have been proposed based on the definition that outliers are samples which are expensive to move. In this paper, we show that this definition is restrictive by considering the case where outliers are closer to the target measure than clean samples. We show that outlier robust OT fully transports these outliers, leading to poor performance in practice. To tackle these outliers, we propose to detect them by relying on a classifier trained with adversarial training to classify source and target samples. A sample is then considered as an outlier if the prediction from the classifier is different from its assigned label. To decrease the influence of these outliers in the transport problem, we propose to either remove them from the problem or to increase the cost of moving them by using the classifier prediction. We show that we successfully detect these outliers and that they do not influence the transport problem on several experiments such as gradient flows, generative models and label propagation.
    How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections. (arXiv:2206.12037v1 [cs.LG])
    Linear time-invariant state space models (SSM) are a classical model from engineering and statistics, that have recently been shown to be very promising in machine learning through the Structured State Space sequence model (S4). A core component of S4 involves initializing the SSM state matrix to a particular matrix called a HiPPO matrix, which was empirically important for S4's ability to handle long sequences. However, the specific matrix that S4 uses was actually derived in previous work for a particular time-varying dynamical system, and the use of this matrix as a time-invariant SSM had no known mathematical interpretation. Consequently, the theoretical mechanism by which S4 models long-range dependencies actually remains unexplained. We derive a more general and intuitive formulation of the HiPPO framework, which provides a simple mathematical interpretation of S4 as a decomposition onto exponentially-warped Legendre polynomials, explaining its ability to capture long dependencies. Our generalization introduces a theoretically rich class of SSMs that also lets us derive more intuitive S4 variants for other bases such as the Fourier basis, and explains other aspects of training S4, such as how to initialize the important timescale parameter. These insights improve S4's performance to 86% on the Long Range Arena benchmark, with 96% on the most difficult Path-X task.
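    For readers unfamiliar with the matrix in question, the sketch below builds it explicitly. It assumes the HiPPO-LegS form commonly quoted in the S4 literature (lower triangular, with entries -sqrt((2n+1)(2k+1)) below the diagonal and -(n+1) on the diagonal); the abstract itself does not state the formula, and the function name `hippo_legs_matrix` is hypothetical.

```python
import math


def hippo_legs_matrix(n_states):
    """Return the n_states x n_states HiPPO-LegS state matrix as nested lists.

    Assumed form (not given in the abstract above):
        A[n][k] = -sqrt((2n+1)(2k+1))  if n > k
        A[n][k] = -(n+1)               if n == k
        A[n][k] = 0                    if n < k
    """
    matrix = [[0.0] * n_states for _ in range(n_states)]
    for n in range(n_states):
        for k in range(n_states):
            if n > k:
                matrix[n][k] = -math.sqrt((2 * n + 1) * (2 * k + 1))
            elif n == k:
                matrix[n][k] = -(n + 1.0)
            # Entries above the diagonal stay zero.
    return matrix


A = hippo_legs_matrix(4)
for row in A:
    print(["{:7.3f}".format(value) for value in row])
```

    The matrix is lower triangular with a strictly negative diagonal, so all of its eigenvalues are negative real numbers.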
    Self Supervised Learning for Few Shot Hyperspectral Image Classification. (arXiv:2206.12117v1 [cs.CV])
    Deep learning has proven to be a very effective approach for Hyperspectral Image (HSI) classification. However, deep neural networks require large annotated datasets to generalize well. This limits the applicability of deep learning for HSI classification, where manually labelling thousands of pixels for every scene is impractical. In this paper, we propose to leverage Self Supervised Learning (SSL) for HSI classification. We show that by pre-training an encoder on unlabeled pixels using Barlow-Twins, a state-of-the-art SSL algorithm, we can obtain accurate models with a handful of labels. Experimental results demonstrate that this approach significantly outperforms vanilla supervised learning.
    Implicit Channel Learning for Machine Learning Applications in 6G Wireless Networks. (arXiv:2206.12127v1 [eess.SP])
    With the deployment of the fifth generation (5G) wireless systems gathering momentum across the world, possible technologies for 6G are under active research discussions. In particular, the role of machine learning (ML) in 6G is expected to enhance and aid emerging applications such as virtual and augmented reality, vehicular autonomy, and computer vision. This will result in large segments of wireless data traffic comprising image, video and speech. The ML algorithms process these for classification/recognition/estimation through the learning models located on cloud servers. This requires wireless transmission of data from edge devices to the cloud server. Channel estimation, handled separately from the recognition step, is critical for accurate learning performance. Toward combining the learning for both the channel and the ML data, we introduce implicit channel learning to perform the ML tasks without estimating the wireless channel. Here, the ML models are trained with channel-corrupted datasets in place of nominal data. Without channel estimation, the proposed approach exhibits approximately 60% improvement in image and speech classification tasks for diverse scenarios such as millimeter wave and IEEE 802.11p vehicular channels.
    TreeDRNet: A Robust Deep Model for Long Term Time Series Forecasting. (arXiv:2206.12106v1 [cs.LG])
    Various deep learning models, especially some of the latest Transformer-based approaches, have greatly improved the state-of-the-art performance for long-term time series forecasting. However, those transformer-based models suffer a severe deterioration in performance with prolonged input length, which prohibits them from using extended historical information. Moreover, these methods tend to handle complex examples in long-term forecasting with increased model complexity, which often leads to a significant increase in computation and less robustness in performance (e.g., overfitting). We propose a novel neural network architecture, called TreeDRNet, for more effective long-term forecasting. Inspired by robust regression, we introduce a doubly residual link structure to make prediction more robust. Built upon the Kolmogorov-Arnold representation theorem, we explicitly introduce feature selection, model ensemble, and a tree structure to further utilize the extended input sequence, which improves the robustness and representation power of TreeDRNet. Unlike previous deep models for sequential forecasting work, TreeDRNet is built entirely on multilayer perceptrons and thus enjoys high computational efficiency. Our extensive empirical studies show that TreeDRNet is significantly more effective than state-of-the-art methods, reducing prediction errors by 20% to 40% for multivariate time series. In particular, TreeDRNet is over 10 times more efficient than transformer-based methods. The code will be released soon.
    On Structural Explanation of Bias in Graph Neural Networks. (arXiv:2206.12104v1 [cs.LG])
    Graph Neural Networks (GNNs) have shown satisfying performance in various graph analytical problems. Hence, they have become the de facto solution in a variety of decision-making scenarios. However, GNNs could yield biased results against certain demographic subgroups. Some recent works have empirically shown that the biased structure of the input network is a significant source of bias for GNNs. Nevertheless, no studies have systematically scrutinized which part of the input network structure leads to biased predictions for any given node. The low transparency on how the structure of the input network influences the bias in GNN outcome largely limits the safe adoption of GNNs in various decision-critical scenarios. In this paper, we study a novel research problem of structural explanation of bias in GNNs. Specifically, we propose a novel post-hoc explanation framework to identify two edge sets that can maximally account for the exhibited bias and maximally contribute to the fairness level of the GNN prediction for any given node, respectively. Such explanations not only provide a comprehensive understanding of bias/fairness of GNN predictions but also have practical significance in building an effective yet fair GNN model. Extensive experiments on real-world datasets validate the effectiveness of the proposed framework towards delivering effective structural explanations for the bias of GNNs. Open-source code can be found at https://github.com/yushundong/REFEREE.
    MULTI-FLGANs: Multi-Distributed Adversarial Networks for Non-IID distribution. (arXiv:2206.12178v1 [cs.LG])
    Federated learning is an emerging concept in the domain of distributed machine learning. This concept has enabled GANs to benefit from the rich distributed training data while preserving privacy. However, in a non-iid setting, current federated GAN architectures are unstable, struggling to learn the distinct features and vulnerable to mode collapse. In this paper, we propose a novel architecture MULTI-FLGAN to solve the problem of low-quality images, mode collapse and instability for non-iid datasets. Our results show that MULTI-FLGAN is four times as stable and performant (i.e. high inception score) on average over 20 clients compared to baseline FLGAN.
    Knowledge Distillation via Weighted Ensemble of Teaching Assistants. (arXiv:2206.12005v1 [cs.LG])
    Knowledge distillation in machine learning is the process of transferring knowledge from a large model called the teacher to a smaller model called the student. Knowledge distillation is one of the techniques to compress the large network (teacher) to a smaller network (student) that can be deployed in small devices such as mobile phones. When the network size gap between the teacher and student increases, the performance of the student network decreases. To solve this problem, an intermediate model, known as the teaching assistant model, is employed between the teacher model and the student model to bridge the gap between the two. In this research, we show that the student model (the smaller model) can be further improved by using multiple teaching assistant models. We combine these multiple teaching assistant models using weighted ensemble learning, where we use a differential evolution optimization algorithm to generate the weight values.
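    The weighted-ensemble idea can be sketched in a few lines. The example below is an illustration under stated assumptions, not the paper's implementation: the ensemble is a convex combination of temperature-softened teaching-assistant outputs, the student is scored with a cross-entropy against that target, and the weights are fixed by hand rather than found by the optimization algorithm the abstract mentions. All function and variable names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a distillation temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_targets(ta_logits, weights, temperature=2.0):
    """Weighted average of the teaching assistants' softened predictions."""
    weight_sum = sum(weights)
    normalized = [w / weight_sum for w in weights]  # make weights sum to 1
    probs = [softmax(logits, temperature) for logits in ta_logits]
    n_classes = len(ta_logits[0])
    return [
        sum(w * p[c] for w, p in zip(normalized, probs))
        for c in range(n_classes)
    ]


def distillation_loss(student_logits, soft_targets, temperature=2.0):
    """Cross-entropy between the ensemble target and the softened student."""
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(soft_targets, student_probs))


# Two hypothetical teaching assistants scoring one 3-class example.
ta_logits = [[2.0, 0.0, -1.0], [1.0, 1.0, 0.0]]
ta_weights = [3.0, 1.0]  # assumed values; in practice these would be tuned
targets = ensemble_soft_targets(ta_logits, ta_weights)
loss = distillation_loss([1.5, 0.2, -0.5], targets)
```

    Because the weights are normalized before use, any non-negative weight vector with a positive sum yields a valid probability distribution as the distillation target.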
    Aggregated Multi-output Gaussian Processes with Knowledge Transfer Across Domains. (arXiv:2206.12141v1 [stat.ML])
    Aggregate data often appear in various fields such as socio-economics and public security. The aggregate data are associated not with points but with supports (e.g., spatial regions in a city). Since the supports may have various granularities depending on attributes (e.g., poverty rate and crime rate), modeling such data is not straightforward. This article offers a multi-output Gaussian process (MoGP) model that infers functions for attributes using multiple aggregate datasets of respective granularities. In the proposed model, the function for each attribute is assumed to be a dependent GP modeled as a linear mixing of independent latent GPs. We design an observation model with an aggregation process for each attribute; the process is an integral of the GP over the corresponding support. We also introduce a prior distribution of the mixing weights, which allows a knowledge transfer across domains (e.g., cities) by sharing the prior. This is advantageous in situations where the spatially aggregated dataset in a city is too coarse to interpolate; the proposed model can still make accurate predictions of attributes by utilizing aggregate datasets in other cities. The inference of the proposed model is based on variational Bayes, which enables one to learn the model parameters using the aggregate datasets from multiple domains. The experiments demonstrate that the proposed model outperforms in the task of refining coarse-grained aggregate data on real-world datasets: time series of air pollutants in Beijing and various kinds of spatial datasets from New York City and Chicago.
    Discrete-Continuous Smoothing and Mapping. (arXiv:2204.11936v2 [cs.RO] UPDATED)
    We describe a general approach to smoothing and mapping with a class of discrete-continuous factor graphs commonly encountered in robotics applications. While there are openly available tools providing flexible and easy-to-use interfaces for specifying and solving optimization problems formulated in terms of either discrete or continuous graphical models, at present, no similarly general tools exist enabling the same functionality for hybrid discrete-continuous problems. We aim to address this problem. In particular, we provide a library, DC-SAM, extending existing tools for optimization problems defined in terms of factor graphs to the setting of discrete-continuous models. A key contribution of our work is a novel solver for efficiently recovering approximate solutions to discrete-continuous optimization problems. The key insight to our approach is that while joint inference over continuous and discrete state spaces is often hard, many commonly encountered discrete-continuous problems can naturally be split into a "discrete part" and a "continuous part" that can individually be solved easily. Leveraging this structure, we optimize discrete and continuous variables in an alternating fashion. In consequence, our proposed work enables straightforward representation of and approximate inference in discrete-continuous graphical models. We also provide a method to recover the uncertainty in estimates of both discrete and continuous variables. We demonstrate the versatility of our approach through its application to three distinct robot perception applications: point-cloud registration, robust pose graph optimization, and object-based mapping and localization.
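    The alternating scheme described above can be illustrated on a toy problem. The sketch below is not the DC-SAM API and does not use factor graphs; it is a hypothetical example in which the discrete part is a per-sample inlier/outlier label and the continuous part is a scalar mean, with each part solved exactly while the other is held fixed. The cost being minimized is the sum over samples of the squared residual for inliers and a constant `outlier_cost` for outliers.

```python
import statistics


def alternating_robust_mean(samples, outlier_cost, max_iters=20):
    """Alternate between discrete labels and a continuous mean estimate.

    Toy illustration only: every name here is hypothetical.
    """
    mu = statistics.median(samples)  # continuous variable, robust start
    inlier = [True] * len(samples)   # discrete variables
    for _ in range(max_iters):
        # Discrete step: with mu fixed, label each sample independently.
        new_inlier = [(x - mu) ** 2 < outlier_cost for x in samples]
        kept = [x for x, keep in zip(samples, new_inlier) if keep]
        if not kept:
            break  # no inliers left; keep the previous estimate
        # Continuous step: with labels fixed, the optimum is the inlier mean.
        new_mu = sum(kept) / len(kept)
        if new_inlier == inlier and new_mu == mu:
            break  # neither part changed, so the alternation has converged
        inlier, mu = new_inlier, new_mu
    return mu, inlier


samples = [1.0, 1.2, 0.8, 1.1, 0.9, 10.0]
mu, inlier = alternating_robust_mean(samples, outlier_cost=4.0)
```

    Each step cannot increase the cost, which is why this kind of alternation converges, although only to a local optimum in general.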
    Supervised learning of random quantum circuits via scalable neural networks. (arXiv:2206.10348v2 [quant-ph] UPDATED)
    Predicting the output of quantum circuits is a hard computational task that plays a pivotal role in the development of universal quantum computers. Here we investigate the supervised learning of output expectation values of random quantum circuits. Deep convolutional neural networks (CNNs) are trained to predict single-qubit and two-qubit expectation values using databases of classically simulated circuits. These circuits are represented via an appropriately designed one-hot encoding of the constituent gates. The prediction accuracy for previously unseen circuits is analyzed, also making comparisons with small-scale quantum computers available from the free IBM Quantum program. The CNNs often outperform the quantum devices, depending on the circuit depth, on the network depth, and on the training set size. Notably, our CNNs are designed to be scalable. This allows us to exploit transfer learning and to perform extrapolations to circuits larger than those included in the training set. These CNNs also demonstrate remarkable resilience against noise, namely, they remain accurate even when trained on (simulated) expectation values averaged over very few measurements.
    F3: Fair and Federated Face Attribute Classification with Heterogeneous Data. (arXiv:2109.02351v3 [cs.LG] UPDATED)
    Fairness across different demographic groups is an essential criterion for face-related tasks, Face Attribute Classification (FAC) being a prominent example. Apart from this trend, Federated Learning (FL) is increasingly gaining traction as a scalable paradigm for distributed training. Existing FL approaches require data homogeneity to ensure fairness. However, this assumption is too restrictive in real-world settings. We propose F3, a novel FL framework for fair FAC under data heterogeneity. F3 adopts multiple heuristics to improve fairness across different demographic groups without requiring a data homogeneity assumption. We demonstrate the efficacy of F3 by reporting empirically observed fairness measures and accuracy guarantees on popular face datasets. Our results suggest that F3 strikes a practical balance between accuracy and fairness for FAC.
    Parallel Deep Neural Networks Have Zero Duality Gap. (arXiv:2110.06482v2 [cs.LG] UPDATED)
    Training deep neural networks is a well-known highly non-convex problem. Recent works have shown that there is no duality gap for regularized two-layer neural networks with ReLU activation, which enables global optimization via convex programs. For multi-layer linear networks with vector outputs, we formulate convex dual problems and demonstrate that the duality gap is non-zero for networks of depth three and deeper. However, by modifying the deep networks to more powerful parallel architectures, we show that the duality gap is exactly zero. Therefore, strong convex duality holds, and hence there exist equivalent convex programs that enable training deep networks to global optimality. We also demonstrate that the weight decay regularization in the parameters explicitly encourages low-rank solutions via closed-form expressions. For three-layer non-parallel ReLU networks, we show that strong duality holds for rank-1 data matrices; however, the duality gap is non-zero for whitened data matrices. Similarly, by transforming the neural network architecture into a corresponding parallel version, the duality gap vanishes.
    Segmentation-free PVC for Cardiac SPECT using a Densely-connected Multi-dimensional Dynamic Network. (arXiv:2206.12344v1 [eess.IV])
    In nuclear imaging, limited resolution causes partial volume effects (PVEs) that affect image sharpness and quantitative accuracy. Partial volume correction (PVC) methods incorporating high-resolution anatomical information from CT or MRI have been demonstrated to be effective. However, such anatomical-guided methods typically require tedious image registration and segmentation steps. Accurately segmented organ templates are also hard to obtain, particularly in cardiac SPECT imaging, due to the lack of hybrid SPECT/CT scanners with high-end CT and associated motion artifacts. Slight mis-registration/mis-segmentation would result in severe degradation in image quality after PVC. In this work, we develop a deep-learning-based method for fast cardiac SPECT PVC without anatomical information and associated organ segmentation. The proposed network involves a densely-connected multi-dimensional dynamic mechanism, allowing the convolutional kernels to be adapted based on the input images, even after the network is fully trained. Intramyocardial blood volume (IMBV) is introduced as an additional clinical-relevant loss function for network optimization. The proposed network demonstrated promising performance on 28 canine studies acquired on a GE Discovery NM/CT 570c dedicated cardiac SPECT scanner with a 64-slice CT using Technetium-99m-labeled red blood cells. This work showed that the proposed network with densely-connected dynamic mechanism produced superior results compared with the same network without such mechanism. Results also showed that the proposed network without anatomical information could produce images with statistically comparable IMBV measurements to the images generated by anatomical-guided PVC methods, which could be helpful in clinical translation.
    A Spatio-temporal Track Association Algorithm Based on Marine Vessel Automatic Identification System Data. (arXiv:2010.15921v2 [cs.LG] UPDATED)
    Tracking multiple moving objects in real-time in a dynamic threat environment is an important element in national security and surveillance systems. It helps pinpoint and distinguish potential candidates posing threats from other normal objects and monitor the anomalous trajectories until intervention. To locate the anomalous pattern of movements, one needs to have an accurate data association algorithm that can associate the sequential observations of locations and motion with the underlying moving objects, and therefore build the trajectories of the objects as the objects are moving. In this work, we develop a spatio-temporal approach for tracking maritime vessels as the vessels' location and motion observations are collected by an Automatic Identification System. The proposed approach is developed as an effort to address a data association challenge in which the number of vessels as well as the vessel identification are purposely withheld and time gaps are created in the datasets to mimic the real-life operational complexities under a threat environment. Three training datasets and five test sets are provided in the challenge, and a set of quantitative performance metrics is devised by the data challenge organizer for evaluating and comparing the methods developed by participants. When our proposed track association algorithm is applied to the five test sets, the algorithm achieves very competitive performance.
    A Disability Lens towards Biases in GPT-3 Generated Open-Ended Languages. (arXiv:2206.11993v1 [cs.CL])
    Language models (LMs) are becoming prevalent in many language-based application spaces globally. Although these LMs are improving our day-to-day interactions with digital products, concerns remain about whether open-ended languages or text generated from these models reveal any biases toward a specific group of people, thereby risking the usability of a certain product. There is a need to identify whether these models possess bias in order to improve their fairness. This gap motivates our ongoing work, where we measured the two aspects of bias in GPT-3 generated text through a disability lens.
    Similarity-aware Positive Instance Sampling for Graph Contrastive Pre-training. (arXiv:2206.11959v1 [cs.LG])
    Graph instance contrastive learning has proven to be an effective task for Graph Neural Network (GNN) pre-training. However, one key issue may seriously impede the representative power in existing works: positive instances created by current methods often miss crucial information of graphs or even yield illegal instances (such as non-chemically-aware graphs in molecular generation). To remedy this issue, we propose to select positive graph instances directly from existing graphs in the training set, which ultimately maintains the legality and similarity to the target graphs. Our selection is based on certain domain-specific pair-wise similarity measurements as well as sampling from a hierarchical graph encoding similarity relations among graphs. In addition, we develop an adaptive node-level pre-training method to dynamically mask nodes to distribute them evenly in the graph. We conduct extensive experiments on 13 graph classification and node classification benchmark datasets from various domains. The results demonstrate that the GNN models pre-trained by our strategies can outperform those trained-from-scratch models as well as the variants obtained by existing methods.
    Computationally Efficient PAC RL in POMDPs with Latent Determinism and Conditional Embeddings. (arXiv:2206.12081v1 [cs.LG])
    We study reinforcement learning with function approximation for large-scale Partially Observable Markov Decision Processes (POMDPs) where the state space and observation space are large or even continuous. Particularly, we consider Hilbert space embeddings of POMDPs where the feature of latent states and the feature of observations admit a conditional Hilbert space embedding of the observation emission process, and the latent state transition is deterministic. Under the function approximation setup where the optimal latent state-action $Q$-function is linear in the state feature, and the optimal $Q$-function has a gap in actions, we provide a computationally and statistically efficient algorithm for finding the exact optimal policy. We show our algorithm's computational and statistical complexities scale polynomially with respect to the horizon and the intrinsic dimension of the feature on the observation space. Furthermore, we show both the deterministic latent transitions and gap assumptions are necessary to avoid statistical complexity exponential in horizon or dimension. Since our guarantee does not have an explicit dependence on the size of the state and observation spaces, our algorithm provably scales to large-scale POMDPs.
    Neural Networks with A La Carte Selection of Activation Functions. (arXiv:2206.12166v1 [cs.NE])
    Activation functions (AFs), which are pivotal to the success (or failure) of a neural network, have received increased attention in recent years, with researchers seeking to design novel AFs that improve some aspect of network performance. In this paper we take another direction, wherein we combine a slew of known AFs into successful architectures, proposing three methods to do so beneficially: 1) generate AF architectures at random, 2) use Optuna, an automatic hyper-parameter optimization software framework, with a Tree-structured Parzen Estimator (TPE) sampler, and 3) use Optuna with a Covariance Matrix Adaptation Evolution Strategy (CMA-ES) sampler. We show that all methods often produce significantly better results for 25 classification problems when compared with a standard network composed of ReLU hidden units and a softmax output unit. Optuna with the TPE sampler emerged as the best AF architecture-producing method.  ( 2 min )
    Provably Efficient Reinforcement Learning in Partially Observable Dynamical Systems. (arXiv:2206.12020v1 [cs.LG])
    We study Reinforcement Learning for partially observable dynamical systems using function approximation. We propose a new \textit{Partially Observable Bilinear Actor-Critic framework}, that is general enough to include models such as observable tabular Partially Observable Markov Decision Processes (POMDPs), observable Linear-Quadratic-Gaussian (LQG), Predictive State Representations (PSRs), as well as a newly introduced model Hilbert Space Embeddings of POMDPs and observable POMDPs with latent low-rank transition. Under this framework, we propose an actor-critic style algorithm that is capable of performing agnostic policy learning. Given a policy class that consists of memory based policies (that look at a fixed-length window of recent observations), and a value function class that consists of functions taking both memory and future observations as inputs, our algorithm learns to compete against the best memory-based policy in the given policy class. For certain examples such as undercomplete observable tabular POMDPs, observable LQGs and observable POMDPs with latent low-rank transition, by implicitly leveraging their special properties, our algorithm is even capable of competing against the globally optimal policy without paying an exponential dependence on the horizon in its sample complexity.  ( 2 min )
    Bilateral Network with Channel Splitting Network and Transformer for Thermal Image Super-Resolution. (arXiv:2206.12046v1 [cs.CV])
    In recent years, the Thermal Image Super-Resolution (TISR) problem has become an attractive research topic. TISR can be used in a wide range of fields, including military, medical, agricultural and animal ecology. Owing to the success of the PBVS-2020 and PBVS-2021 workshop challenges, TISR results keep improving and attract more researchers to sign up for the PBVS-2022 challenge. In this paper, we introduce the technical details of our submission to the PBVS-2022 challenge, in which we design a Bilateral Network with Channel Splitting Network and Transformer (BN-CSNT) to tackle the TISR problem. First, we design a context branch based on a channel splitting network with a transformer to obtain sufficient context information. Second, we design a spatial branch with a shallow transformer to extract low-level features that preserve spatial information. Finally, to fuse the features from the channel splitting network and the transformer within the context branch, we propose an attention refinement module; the features from the context branch and the spatial branch are then fused by the proposed feature fusion module. The proposed method achieves PSNR=33.64, SSIM=0.9263 for x4 and PSNR=21.08, SSIM=0.7803 for x2 on the PBVS-2022 challenge test dataset.
    Learning quantum symmetries with interactive quantum-classical variational algorithms. (arXiv:2206.11970v1 [quant-ph])
    A symmetry of a state $\lvert \psi \rangle$ is a unitary operator of which $\lvert \psi \rangle$ is an eigenvector. When $\lvert \psi \rangle$ is an unknown state supplied by a black-box oracle, the state's symmetries serve to characterize it, and often relay much of the desired information about $\lvert \psi \rangle$. In this paper, we develop a variational hybrid quantum-classical learning scheme to systematically probe for symmetries of $\lvert \psi \rangle$ with no a priori assumptions about the state. This procedure can be used to learn various symmetries at the same time. In order to avoid re-learning already known symmetries, we introduce an interactive protocol with a classical deep neural net. The classical net thereby regularizes against repetitive findings and allows our algorithm to terminate empirically with all possible symmetries found. Our scheme can be implemented efficiently on average with non-local SWAP gates; we also give a less efficient algorithm with only local operations, which may be more appropriate for current noisy quantum devices. We demonstrate our algorithm on representative families of states.
    The Real Deal: A Review of Challenges and Opportunities in Moving Reinforcement Learning-Based Traffic Signal Control Systems Towards Reality. (arXiv:2206.11996v1 [cs.AI])
    Traffic signal control (TSC) is a high-stakes domain that is growing in importance as traffic volume grows globally. An increasing number of works are applying reinforcement learning (RL) to TSC; RL can draw on an abundance of traffic data to improve signalling efficiency. However, RL-based signal controllers have never been deployed. In this work, we provide the first review of challenges that must be addressed before RL can be deployed for TSC. We focus on four challenges involving (1) uncertainty in detection, (2) reliability of communications, (3) compliance and interpretability, and (4) heterogeneous road users. We show that the literature on RL-based TSC has made some progress towards addressing each challenge. However, more work should take a systems thinking approach that considers the impacts of other pipeline components on RL.  ( 2 min )
    PSP: Million-level Protein Sequence Dataset for Protein Structure Prediction. (arXiv:2206.12240v1 [q-bio.BM])
    Proteins are an essential component of human life, and their structures are important for function and mechanism analysis. Recent work has shown the potential of AI-driven methods for protein structure prediction. However, the development of new models is restricted by the lack of datasets and benchmark training procedures. To the best of our knowledge, the existing open source datasets fall far short of the needs of modern protein sequence-structure related research. To solve this problem, we present the first million-level protein structure prediction dataset with high coverage and diversity, named PSP. This dataset consists of 570k true structure sequences (10TB) and 745k complementary distillation sequences (15TB). In addition, we provide a benchmark training procedure for a SOTA protein structure prediction model on this dataset. We validate the utility of this dataset for training by participating in the CAMEO contest, in which our model won first place. We hope our PSP dataset together with the training benchmark can enable a broader community of AI/biology researchers to pursue AI-driven protein related research.
    BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping. (arXiv:2206.12038v1 [cs.SD])
    Methods for extracting audio and speech features have been studied since pioneering work on spectrum analysis decades ago. Recent efforts are guided by the ambition to develop general-purpose audio representations. For example, deep neural networks can extract optimal embeddings if they are trained on large audio datasets. This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the effects of using different pre-training datasets. Lastly, we present a novel training framework to come up with a hybrid audio representation, which combines handcrafted and data-driven learned audio features. All the proposed representations were evaluated within the HEAR NeurIPS 2021 challenge for auditory scene classification and timestamp detection tasks. Our results indicate that the hybrid model with a convolutional transformer as the encoder yields superior performance in most HEAR challenge tasks.  ( 2 min )
    Efficient and Accurate Top-$K$ Recovery from Choice Data. (arXiv:2206.11995v1 [cs.LG])
    The intersection of learning to rank and choice modeling is an active area of research with applications in e-commerce, information retrieval and the social sciences. In some applications such as recommendation systems, the statistician is primarily interested in recovering the set of the top ranked items from a large pool of items as efficiently as possible using passively collected discrete choice data, i.e., the user picks one item from a set of multiple items. Motivated by this practical consideration, we propose the choice-based Borda count algorithm as a fast and accurate ranking algorithm for top-$K$ recovery, i.e., correctly identifying all of the top $K$ items. We show that the choice-based Borda count algorithm has optimal sample complexity for top-$K$ recovery under a broad class of random utility models. We prove that in the limit, the choice-based Borda count algorithm produces the same top-$K$ estimate as the commonly used Maximum Likelihood Estimate method, but the former's speed and simplicity bring considerable advantages in practice. Experiments on both synthetic and real datasets show that the counting algorithm is competitive with commonly used ranking algorithms in terms of accuracy while being several orders of magnitude faster.
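    The counting idea behind a choice-based Borda score can be sketched in a few lines. This is a minimal illustration, not the paper's exact estimator: it assumes each observation is a pair (offered set, chosen item) and scores an item by the fraction of its appearances in which it was chosen. The function name `choice_borda_top_k` and the tie-breaking rule are hypothetical.

```python
from collections import Counter


def choice_borda_top_k(choices, k):
    """Rank items by empirical win rate and return the top k.

    choices: iterable of (offered_items, chosen_item) pairs.
    Assumption: score = (times chosen) / (times offered); the paper's
    exact normalisation may differ.
    """
    wins = Counter()
    shown = Counter()
    for offered, chosen in choices:
        for item in offered:
            shown[item] += 1
        wins[chosen] += 1

    scores = {item: wins[item] / shown[item] for item in shown}
    # Sort by descending score; break ties by item label for determinism.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:k], scores
```

    A single pass over the data suffices, which is where the speed advantage over iterative likelihood maximisation would come from in a sketch like this.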
    Sampling Enclosing Subgraphs for Link Prediction. (arXiv:2206.12004v1 [cs.LG])
    Link prediction is a fundamental problem for graph-structured data (e.g., social networks, drug side-effect networks, etc.). Graph neural networks have offered robust solutions for this problem, specifically by learning the representation of the subgraph enclosing the target link (i.e., pair of nodes). However, these solutions do not scale well to large graphs as extraction and operation on enclosing subgraphs are computationally expensive, especially for large graphs. This paper presents a scalable link prediction solution, that we call ScaLed, which utilizes sparse enclosing subgraphs to make predictions. To extract sparse enclosing subgraphs, ScaLed takes multiple random walks from a target pair of nodes, then operates on the sampled enclosing subgraph induced by all visited nodes. By leveraging the smaller sampled enclosing subgraph, ScaLed can scale to larger graphs with much less overhead while maintaining high accuracy. ScaLed further provides the flexibility to control the trade-off between computation overhead and accuracy. Through comprehensive experiments, we have shown that ScaLed can produce comparable accuracy to those reported by the existing subgraph representation learning frameworks while being less computationally demanding.  ( 2 min )
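    The sampling step described above can be sketched as follows. This is a rough illustration under stated assumptions, not the ScaLed implementation: the graph is an adjacency dictionary, walks are uniform and unbiased, and the function name and parameters (`num_walks`, `walk_len`, `seed`) are hypothetical.

```python
import random


def sample_enclosing_subgraph(adj, u, v, num_walks, walk_len, seed=0):
    """Return the subgraph induced by nodes visited on random walks from u and v.

    adj: dict mapping each node to a list of its neighbours.
    Assumption: uniform random walks; the real method may sample differently.
    """
    rng = random.Random(seed)
    visited = {u, v}
    for start in (u, v):
        for _ in range(num_walks):
            node = start
            for _ in range(walk_len):
                neighbours = adj.get(node, [])
                if not neighbours:
                    break  # dead end: stop this walk early
                node = rng.choice(neighbours)
                visited.add(node)

    # Keep only edges whose endpoints were both visited (induced subgraph).
    return {n: [m for m in adj.get(n, []) if m in visited] for n in visited}
```

    In this sketch, `num_walks` and `walk_len` play the role of the knobs that trade computation for accuracy: the sampled node set contains at most 2 + 2 * num_walks * walk_len nodes regardless of graph size.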
    Approximating 1-Wasserstein Distance with Trees. (arXiv:2206.12116v1 [stat.ML])
    Wasserstein distance, which measures the discrepancy between distributions, shows efficacy in various types of natural language processing (NLP) and computer vision (CV) applications. One of the challenges in estimating Wasserstein distance is that it is computationally expensive and does not scale well for many distribution comparison tasks. In this paper, we aim to approximate the 1-Wasserstein distance by the tree-Wasserstein distance (TWD), where TWD is a 1-Wasserstein distance with tree-based embedding and can be computed in linear time with respect to the number of nodes on a tree. More specifically, we propose a simple yet efficient L1-regularized approach to learning the weights of the edges in a tree. To this end, we first show that the 1-Wasserstein approximation problem can be formulated as a distance approximation problem using the shortest path distance on a tree. We then show that the shortest path distance can be represented by a linear model and can be formulated as a Lasso-based regression problem. Owing to the convex formulation, we can obtain a globally optimal solution efficiently. Moreover, we propose a tree-sliced variant of these methods. Through experiments, we demonstrated that the weighted TWD can accurately approximate the original 1-Wasserstein distance.  ( 2 min )
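    The linear-time claim follows from the closed form of the tree-Wasserstein distance: each edge contributes its weight times the absolute difference in mass that the two distributions place on the subtree below it. The sketch below assumes a rooted tree stored as a parent array with node 0 as the root and every parent index smaller than its child's index; the edge weights are given as inputs here, whereas the paper learns them. The function name is hypothetical.

```python
def tree_wasserstein(parent, weight, mu, nu):
    """Tree-Wasserstein distance between distributions mu and nu on tree nodes.

    parent[i]: parent of node i (node 0 is the root; parent[0] is ignored).
    weight[i]: weight of the edge (i, parent[i]); weight[0] is ignored.
    Assumption: parent[i] < i for all i > 0, so a reverse scan visits
    children before their parents.
    """
    n = len(parent)
    # diff[i] accumulates (mu - nu) mass over the subtree rooted at i.
    diff = [mu[i] - nu[i] for i in range(n)]
    total = 0.0
    for i in range(n - 1, 0, -1):
        total += weight[i] * abs(diff[i])
        diff[parent[i]] += diff[i]
    return total
```

    Each node is visited once, so the cost is linear in the number of nodes, matching the complexity stated in the abstract.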
    Measuring Representational Robustness of Neural Networks Through Shared Invariances. (arXiv:2206.11939v1 [cs.LG])
    A major challenge in studying robustness in deep learning is defining the set of ``meaningless'' perturbations to which a given Neural Network (NN) should be invariant. Most work on robustness implicitly uses a human as the reference model to define such perturbations. Our work offers a new view on robustness by using another reference NN to define the set of perturbations a given NN should be invariant to, thus generalizing the reliance on a reference ``human NN'' to any NN. This makes measuring robustness equivalent to measuring the extent to which two NNs share invariances, for which we propose a measure called STIR. STIR re-purposes existing representation similarity measures to make them suitable for measuring shared invariances. Using our measure, we are able to gain insights into how shared invariances vary with changes in weight initialization, architecture, loss functions, and training dataset. Our implementation is available at: \url{https://github.com/nvedant07/STIR}.
    "You Can't Fix What You Can't Measure": Privately Measuring Demographic Performance Disparities in Federated Learning. (arXiv:2206.12183v1 [cs.LG])
    Federated learning allows many devices to collaborate in the training of machine learning models. As in traditional machine learning, there is a growing concern that models trained with federated learning may exhibit disparate performance for different demographic groups. Existing solutions to measure and ensure equal model performance across groups require access to information about group membership, but this access is not always available or desirable, especially under the privacy aspirations of federated learning. We study the feasibility of measuring such performance disparities while protecting the privacy of the user's group membership and the federated model's performance on the user's data. Protecting both is essential for privacy, because they may be correlated, and thus learning one may reveal the other. On the other hand, from the utility perspective, the privacy-preserved data should maintain the correlation to ensure the ability to perform accurate measurements of the performance disparity. We achieve both of these goals by developing locally differentially private mechanisms that preserve the correlations between group membership and model performance. To analyze the effectiveness of the mechanisms, we bound their error in estimating the disparity when optimized for a given privacy budget, and validate these bounds on synthetic data. Our results show that the error rapidly decreases for realistic numbers of participating clients, demonstrating that, contrary to what prior work suggested, protecting the privacy of protected attributes is not necessarily in conflict with identifying disparities in the performance of federated models.
    Task-Adaptive Few-shot Node Classification. (arXiv:2206.11972v1 [cs.LG])
    Node classification is of great importance among various graph mining tasks. In practice, real-world graphs generally follow the long-tail distribution, where a large number of classes only consist of limited labeled nodes. Although Graph Neural Networks (GNNs) have achieved significant improvements in node classification, their performance decreases substantially in such a few-shot scenario. The main reason can be attributed to the vast generalization gap between meta-training and meta-test due to the task variance caused by different node/class distributions in meta-tasks (i.e., node-level and class-level variance). Therefore, to effectively alleviate the impact of task variance, we propose a task-adaptive node classification framework under the few-shot learning setting. Specifically, we first accumulate meta-knowledge across classes with abundant labeled nodes. Then we transfer such knowledge to the classes with limited labeled nodes via our proposed task-adaptive modules. In particular, to accommodate the different node/class distributions among meta-tasks, we propose three essential modules to perform \emph{node-level}, \emph{class-level}, and \emph{task-level} adaptations in each meta-task, respectively. In this way, our framework can conduct adaptations to different meta-tasks and thus advance the model generalization performance on meta-test tasks. Extensive experiments on four prevalent node classification datasets demonstrate the superiority of our framework over the state-of-the-art baselines. Our code is provided at https://github.com/SongW-SW/TENT.
    Equiformer: Equivariant Graph Attention Transformer for 3D Atomistic Graphs. (arXiv:2206.11990v1 [cs.LG])
    3D-related inductive biases like translational invariance and rotational equivariance are indispensable to graph neural networks operating on 3D atomistic graphs such as molecules. Inspired by the success of Transformers in various domains, we study how to incorporate these inductive biases into Transformers. In this paper, we present Equiformer, a graph neural network leveraging the strength of Transformer architectures and incorporating $SE(3)/E(3)$-equivariant features based on irreducible representations (irreps). Irreps features encode equivariant information in channel dimensions without complicating graph structures. The simplicity enables us to directly incorporate them by replacing original operations with equivariant counterparts. Moreover, to better adapt Transformers to 3D graphs, we propose a novel equivariant graph attention, which considers both content and geometric information such as relative position contained in irreps features. To improve expressivity of the attention, we replace dot product attention with multi-layer perceptron attention and include non-linear message passing. We benchmark Equiformer on two quantum properties prediction datasets, QM9 and OC20. For QM9, among models trained with the same data partition, Equiformer achieves best results on 11 out of 12 regression tasks. For OC20, under the setting of training with IS2RE data and optionally IS2RS data, Equiformer improves upon state-of-the-art models. Code reproducing all main results will be available soon.
  • Open

    Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss. (arXiv:2106.04156v7 [cs.LG] UPDATED)
    Recent works in self-supervised learning have advanced the state-of-the-art by relying on the contrastive learning paradigm, which learns representations by pushing positive pairs, or similar examples from the same class, closer together while keeping negative pairs far apart. Despite the empirical successes, theoretical foundations are limited -- prior analyses assume conditional independence of the positive pairs given the same class label, but recent empirical applications use heavily correlated positive pairs (i.e., data augmentations of the same image). Our work analyzes contrastive learning without assuming conditional independence of positive pairs using a novel concept of the augmentation graph on data. Edges in this graph connect augmentations of the same data, and ground-truth classes naturally form connected sub-graphs. We propose a loss that performs spectral decomposition on the population augmentation graph and can be succinctly written as a contrastive learning objective on neural net representations. Minimizing this objective leads to features with provable accuracy guarantees under linear probe evaluation. By standard generalization bounds, these accuracy guarantees also hold when minimizing the training contrastive loss. Empirically, the features learned by our objective can match or outperform several strong baselines on benchmark vision datasets. In all, this work provides the first provable analysis for contrastive learning where guarantees for linear probe evaluation can apply to realistic empirical settings.
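    For orientation, the spectral contrastive objective is commonly written in the following form. The notation is an assumption, not taken from the abstract: $f$ is the representation network, $(x, x^+)$ is a positive pair of augmentations of the same datum, and $x, x'$ are independent augmentations of independently drawn data.

```latex
\mathcal{L}(f) \;=\; -2\,\mathbb{E}_{(x,\,x^+)}\!\left[ f(x)^\top f(x^+) \right]
\;+\; \mathbb{E}_{x,\,x'}\!\left[ \left( f(x)^\top f(x') \right)^2 \right]
```

    The first term pulls positive pairs together, and the second penalises large inner products between independent samples, which is what lets the loss be read as a low-rank factorisation of the augmentation graph's normalised adjacency matrix.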
    RARTS: An Efficient First-Order Relaxed Architecture Search Method. (arXiv:2008.03901v2 [cs.LG] UPDATED)
    Differentiable architecture search (DARTS) is an effective method for data-driven neural network design based on solving a bilevel optimization problem. Despite its success in many architecture search tasks, there are still some concerns about the accuracy of first-order DARTS and the efficiency of second-order DARTS. In this paper, we formulate a single-level alternative and a relaxed architecture search (RARTS) method that utilizes the whole dataset in architecture learning via both data and network splitting, without involving mixed second derivatives of the corresponding loss functions like DARTS. In our formulation of network splitting, two networks with different but related weights cooperate in search of a shared architecture. The advantage of RARTS over DARTS is justified by a convergence theorem and an analytically solvable model. Moreover, RARTS outperforms DARTS and its variants in accuracy and search efficiency, as shown by the experimental results. For the task of searching topological architecture, i.e., the edges and the operations, RARTS obtains a higher accuracy and a 60\% reduction in computational cost compared with second-order DARTS on CIFAR-10. RARTS continues to outperform DARTS upon transfer to ImageNet and is on par with recent variants of DARTS even though our innovation is purely on the training algorithm without modifying the search space. For the task of searching width, i.e., the number of channels in convolutional layers, RARTS also outperforms the traditional network pruning benchmarks. Further experiments on public architecture search benchmarks such as NATS-Bench also support the preeminence of RARTS.
    Multi-Frequency Joint Community Detection and Phase Synchronization. (arXiv:2206.12276v1 [cs.SI])
    This paper studies the joint community detection and phase synchronization problem on the \textit{stochastic block model with relative phase}, where each node is associated with a phase. This problem, with a variety of real-world applications, aims to recover community memberships and associated phases simultaneously. By studying the maximum likelihood estimation formulation, we show that this problem exhibits a \textit{``multi-frequency''} structure. To this end, two simple yet efficient algorithms that leverage information across multiple frequencies are proposed. The former is a spectral method based on the novel multi-frequency column-pivoted QR factorization, and the latter is an iterative multi-frequency generalized power method. Numerical experiments indicate our proposed algorithms outperform state-of-the-art algorithms, in recovering community memberships and associated phases.  ( 2 min )
    Simplified and Unified Analysis of Various Learning Problems by Reduction to Multiple-Instance Learning. (arXiv:1911.05999v4 [cs.LG] UPDATED)
    In statistical learning, many problem formulations have been proposed so far, such as multi-class learning, complementarily labeled learning, multi-label learning, and multi-task learning, which provide theoretical models for various real-world tasks. Although they have been extensively studied, the relationship among them has not been fully investigated. In this work, we focus on a particular problem formulation called Multiple-Instance Learning (MIL), and show that various learning problems, including all the problems mentioned above along with some new problems, can be reduced to MIL with theoretically guaranteed generalization bounds, where the reductions are established under a new reduction scheme we provide as a by-product. The results imply that the MIL-reduction gives a simplified and unified framework for designing and analyzing algorithms for various learning problems. Moreover, we show that the MIL-reduction framework can be kernelized.
    A Unified Statistical Learning Model for Rankings and Scores with Application to Grant Panel Review. (arXiv:2201.02539v2 [stat.ME] UPDATED)
    Rankings and scores are two common data types used by judges to express preferences and/or perceptions of quality in a collection of objects. Numerous models exist to study data of each type separately, but no unified statistical model captures both data types simultaneously without first performing data conversion. We propose the Mallows-Binomial model to close this gap, which combines a Mallows' $\phi$ ranking model with Binomial score models through shared parameters that quantify object quality, a consensus ranking, and the level of consensus between judges. We propose an efficient tree-search algorithm to calculate the exact MLE of model parameters, study statistical properties of the model both analytically and through simulation, and apply our model to real data from an instance of grant panel review that collected both scores and partial rankings. Furthermore, we demonstrate how model outputs can be used to rank objects with confidence. The proposed model is shown to sensibly combine information from both scores and rankings to quantify object quality and measure consensus with appropriate levels of statistical uncertainty.  ( 2 min )
    Unified field theoretical approach to deep and recurrent neuronal networks. (arXiv:2112.05589v3 [cond-mat.dis-nn] UPDATED)
    Understanding capabilities and limitations of different network architectures is of fundamental importance to machine learning. Bayesian inference on Gaussian processes has proven to be a viable approach for studying recurrent and deep networks in the limit of infinite layer width, $n\to\infty$. Here we present a unified and systematic derivation of the mean-field theory for both architectures that starts from first principles by employing established methods from statistical physics of disordered systems. The theory elucidates that while the mean-field equations are different with regard to their temporal structure, they yet yield identical Gaussian kernels when readouts are taken at a single time point or layer, respectively. Bayesian inference applied to classification then predicts identical performance and capabilities for the two architectures. Numerically, we find that convergence towards the mean-field theory is typically slower for recurrent networks than for deep networks and the convergence speed depends non-trivially on the parameters of the weight prior as well as the depth or number of time steps, respectively. Our method exposes that Gaussian processes are but the lowest order of a systematic expansion in $1/n$ and we compute next-to-leading-order corrections which turn out to be architecture-specific. The formalism thus paves the way to investigate the fundamental differences between recurrent and deep architectures at finite widths $n$.  ( 3 min )
    Inductive Biases and Variable Creation in Self-Attention Mechanisms. (arXiv:2110.10090v2 [cs.LG] UPDATED)
    Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. This work provides a theoretical analysis of the inductive biases of self-attention modules. Our focus is to rigorously establish which functions and long-range dependencies self-attention blocks prefer to represent. Our main result shows that bounded-norm Transformer networks "create sparse variables": a single self-attention head can represent a sparse function of the input sequence, with sample complexity scaling only logarithmically with the context length. To support our analysis, we present synthetic experiments to probe the sample complexity of learning sparse Boolean functions with Transformers.  ( 2 min )
    Quantifying Inherent Randomness in Machine Learning Algorithms. (arXiv:2206.12353v1 [stat.ML])
    Most machine learning (ML) algorithms have several stochastic elements, and their performances are affected by these sources of randomness. This paper uses an empirical study to systematically examine the effects of two sources: randomness in model training and randomness in the partitioning of a dataset into training and test subsets. We quantify and compare the magnitude of the variation in predictive performance for the following ML algorithms: Random Forests (RFs), Gradient Boosting Machines (GBMs), and Feedforward Neural Networks (FFNNs). Among the different algorithms, randomness in model training causes larger variation for FFNNs compared to tree-based methods. This is to be expected as FFNNs have more stochastic elements that are part of their model initialization and training. We also found that random splitting of datasets leads to higher variation compared to the inherent randomness from model training. The variation from data splitting can be a major issue if the original dataset has considerable heterogeneity. Keywords: Model Training, Reproducibility, Variation  ( 2 min )
    Generalizing to New Physical Systems via Context-Informed Dynamics Model. (arXiv:2202.01889v3 [cs.LG] UPDATED)
    Data-driven approaches to modeling physical systems fail to generalize to unseen systems that share the same general dynamics with the learning domain, but correspond to different physical contexts. We propose a new framework for this key problem, context-informed dynamics adaptation (CoDA), which takes into account the distributional shift across systems for fast and efficient adaptation to new dynamics. CoDA leverages multiple environments, each associated to a different dynamic, and learns to condition the dynamics model on contextual parameters, specific to each environment. The conditioning is performed via a hypernetwork, learned jointly with a context vector from observed data. The proposed formulation constrains the search hypothesis space to foster fast adaptation and better generalization across environments. We theoretically motivate our approach and show state-of-the-art generalization results on a set of nonlinear dynamics, representative of a variety of application domains. We also show, on these systems, that new system parameters can be inferred from context vectors with minimal supervision. Code is available at https://github.com/yuan-yin/CoDA .  ( 2 min )
    Affinity-Aware Graph Networks. (arXiv:2206.11941v1 [cs.LG])
    Graph Neural Networks (GNNs) have emerged as a powerful technique for learning on relational data. Owing to the relatively limited number of message passing steps they perform -- and hence a smaller receptive field -- there has been significant interest in improving their expressivity by incorporating structural aspects of the underlying graph. In this paper, we explore the use of affinity measures as features in graph neural networks, in particular measures arising from random walks, including effective resistance, hitting and commute times. We propose message passing networks based on these features and evaluate their performance on a variety of node and graph property prediction tasks. Our architecture has lower computational complexity, while our features are invariant to the permutations of the underlying graph. The measures we compute allow the network to exploit the connectivity properties of the graph, thereby allowing us to outperform relevant benchmarks for a wide variety of tasks, often with significantly fewer message passing steps. On one of the largest publicly available graph regression datasets, OGB-LSC-PCQM4Mv1, we obtain the best known single-model validation MAE at the time of writing.  ( 2 min )
    Graph-Coupled Oscillator Networks. (arXiv:2202.02296v2 [cs.LG] UPDATED)
    We propose Graph-Coupled Oscillator Networks (GraphCON), a novel framework for deep learning on graphs. It is based on discretizations of a second-order system of ordinary differential equations (ODEs), which model a network of nonlinear controlled and damped oscillators, coupled via the adjacency structure of the underlying graph. The flexibility of our framework permits any basic GNN layer (e.g. convolutional or attentional) as the coupling function, from which a multi-layer deep neural network is built up via the dynamics of the proposed ODEs. We relate the oversmoothing problem, commonly encountered in GNNs, to the stability of steady states of the underlying ODE and show that zero-Dirichlet energy steady states are not stable for our proposed ODEs. This demonstrates that the proposed framework mitigates the oversmoothing problem. Moreover, we prove that GraphCON mitigates the exploding and vanishing gradients problem to facilitate training of deep multi-layer GNNs. Finally, we show that our approach offers competitive performance with respect to the state-of-the-art on a variety of graph-based learning tasks.  ( 2 min )
    On the Limitations of Elo: Real-World Games, are Transitive, not Additive. (arXiv:2206.12301v1 [cs.GT])
    Real-world competitive games, such as chess, go, or StarCraft II, rely on Elo models to measure the strength of their players. Since these games are not fully transitive, using Elo implicitly assumes they have a strong transitive component that can correctly be identified and extracted. In this study, we investigate the challenge of identifying the strength of the transitive component in games. First, we show that Elo models can fail to extract this transitive component, even in elementary transitive games. Then, based on this observation, we propose an extension of the Elo score: we end up with a disc ranking system that assigns each player two scores, which we refer to as skill and consistency. Finally, we propose an empirical validation on payoff matrices coming from real-world games played by bots and humans.  ( 2 min )
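    For readers unfamiliar with the rating model the abstract above refers to, here is a minimal sketch of the classical Elo expected-score and update rule. It is not the two-score extension the paper proposes; the function names, the 400-point scale and the K-factor of 32 are conventional defaults assumed here for illustration.

```python
def elo_expected(r_a, r_b, scale=400.0):
    """Probability that player A beats player B under the classical Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))


def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated (r_a, r_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k is the K-factor (an assumed default, not taken from the paper).
    """
    delta = k * (score_a - elo_expected(r_a, r_b))
    # Elo is zero-sum: whatever A gains, B loses.
    return r_a + delta, r_b - delta
```

    Because each player is summarized by a single number, the predicted win probability depends only on the rating difference, which is the additive structure the paper's title alludes to.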
    The MELODIC family for simultaneous binary logistic regression in a reduced space. (arXiv:2102.08232v2 [stat.ME] UPDATED)
    Logistic regression is a commonly used method for binary classification. Researchers often have more than a single binary response variable and simultaneous analysis is beneficial because it provides insight into the dependencies among response variables as well as between the predictor variables and the responses. Moreover, in such a simultaneous analysis the equations can lend each other strength, which might increase predictive accuracy. In this paper, we propose the MELODIC family for simultaneous binary logistic regression modeling. In this family, the regression models are defined in a Euclidean space of reduced dimension, based on a distance rule. The model may be interpreted in terms of logistic regression coefficients or in terms of a biplot. We discuss a fast iterative majorization (or MM) algorithm for parameter estimation. Two applications are shown in detail: one relating personality characteristics to drug consumption profiles and one relating personality characteristics to depressive and anxiety disorders. We present a thorough comparison of our MELODIC family with alternative approaches for multivariate binary data.  ( 2 min )
    Accelerated Information Gradient flow. (arXiv:1909.02102v3 [math.OC] UPDATED)
    We present a framework for Nesterov's accelerated gradient flows in probability space to design efficient mean-field Markov chain Monte Carlo (MCMC) algorithms for Bayesian inverse problems. Here four examples of information metrics are considered, including Fisher-Rao metric, Wasserstein-2 metric, Kalman-Wasserstein metric and Stein metric. For both Fisher-Rao and Wasserstein-2 metrics, we prove convergence properties of accelerated gradient flows. In implementations, we propose a sampling-efficient discrete-time algorithm for Wasserstein-2, Kalman-Wasserstein and Stein accelerated gradient flows with a restart technique. We also formulate a kernel bandwidth selection method, which learns the gradient of logarithm of density from Brownian-motion samples. Numerical experiments, including Bayesian logistic regression and Bayesian neural network, show the strength of the proposed methods compared with state-of-the-art algorithms.  ( 2 min )
    Computationally Efficient PAC RL in POMDPs with Latent Determinism and Conditional Embeddings. (arXiv:2206.12081v1 [cs.LG])
    We study reinforcement learning with function approximation for large-scale Partially Observable Markov Decision Processes (POMDPs) where the state space and observation space are large or even continuous. Particularly, we consider Hilbert space embeddings of POMDP where the feature of latent states and the feature of observations admit a conditional Hilbert space embedding of the observation emission process, and the latent state transition is deterministic. Under the function approximation setup where the optimal latent state-action $Q$-function is linear in the state feature, and the optimal $Q$-function has a gap in actions, we provide a computationally and statistically efficient algorithm for finding the exact optimal policy. We show our algorithm's computational and statistical complexities scale polynomially with respect to the horizon and the intrinsic dimension of the feature on the observation space. Furthermore, we show both the deterministic latent transitions and gap assumptions are necessary to avoid statistical complexity exponential in horizon or dimension. Since our guarantee does not have an explicit dependence on the size of the state and observation spaces, our algorithm provably scales to large-scale POMDPs.  ( 2 min )
    From Tensor Network Quantum States to Tensorial Recurrent Neural Networks. (arXiv:2206.12363v1 [quant-ph])
    We show that any matrix product state (MPS) can be exactly represented by a recurrent neural network (RNN) with a linear memory update. We generalize this RNN architecture to 2D lattices using a multilinear memory update. It supports perfect sampling and wave function evaluation in polynomial time, and can represent an area law of entanglement entropy. Numerical evidence shows that it can encode the wave function using a bond dimension lower by orders of magnitude when compared to MPS, with an accuracy that can be systematically improved by increasing the bond dimension.  ( 2 min )
    Deep learning algorithms for solving high dimensional nonlinear backward stochastic differential equations. (arXiv:2010.01319v3 [math.NA] UPDATED)
    In this work, we propose a new deep learning-based scheme for solving high dimensional nonlinear backward stochastic differential equations (BSDEs). The idea is to reformulate the problem as a global optimization, where the local loss functions are included. Essentially, we approximate the unknown solution of a BSDE using a deep neural network and its gradient with automatic differentiation. The approximations are performed by globally minimizing the quadratic local loss function defined at each time step, which always includes the terminal condition. Loss functions of this kind are obtained by iterating the Euler discretization of the time integrals with the terminal condition. Our formulation can prompt the stochastic gradient descent algorithm not only to take the accuracy at each time layer into account, but also to converge to a good local minimum. In order to demonstrate the performance of our algorithm, several high-dimensional nonlinear BSDEs, including pricing problems in finance, are provided.  ( 2 min )
    Animal Behavior Classification via Deep Learning on Embedded Systems. (arXiv:2111.12295v2 [cs.LG] UPDATED)
    We develop an end-to-end deep-neural-network-based algorithm for classifying animal behavior using accelerometry data on the embedded system of an artificial intelligence of things (AIoT) device installed in a wearable collar tag. The proposed algorithm jointly performs feature extraction and classification utilizing a set of infinite-impulse-response (IIR) and finite-impulse-response (FIR) filters together with a multilayer perceptron. The utilized IIR and FIR filters can be viewed as specific types of recurrent and convolutional neural network layers, respectively. We evaluate the performance of the proposed algorithm via two real-world datasets collected from a total of eighteen grazing beef cattle using collar tags. The results show that the proposed algorithm offers good intra- and inter-dataset classification accuracy and outperforms its closest contenders, including two state-of-the-art convolutional-neural-network-based time-series classification algorithms, which are significantly more complex. We implement the proposed algorithm on the embedded system of the utilized collar tags' AIoT device to perform in-situ classification of animal behavior. We achieve real-time in-situ behavior inference from accelerometry data without imposing any strain on the available computational, memory, or energy resources of the embedded system.  ( 2 min )
    Learning to Predict Graphs with Fused Gromov-Wasserstein Barycenters. (arXiv:2202.03813v3 [stat.ML] UPDATED)
    This paper introduces a novel and generic framework to solve the flagship task of supervised labeled graph prediction by leveraging Optimal Transport tools. We formulate the problem as regression with the Fused Gromov-Wasserstein (FGW) loss and propose a predictive model relying on a FGW barycenter whose weights depend on inputs. First we introduce a non-parametric estimator based on kernel ridge regression for which theoretical results such as consistency and excess risk bound are proved. Next we propose an interpretable parametric model where the barycenter weights are modeled with a neural network and the graphs on which the FGW barycenter is calculated are additionally learned. Numerical experiments show the strength of the method and its ability to interpolate in the labeled graph space on simulated data and on a difficult metabolic identification problem where it can reach very good performance with very little engineering.  ( 2 min )
    Empirical and Instance-Dependent Estimation of Markov Chain and Mixing Time. (arXiv:1912.06845v3 [math.PR] UPDATED)
    We tackle the problem of estimating the mixing time of a Markov chain from a single trajectory of observations. In contrast with previous works which considered Hilbert space methods to estimate spectral gaps, we opt for an approach based on contraction with respect to total variation. Specifically, we define and estimate a generalized contraction coefficient based on Dobrushin's. We show that this quantity -- unlike the spectral gap -- controls the mixing time up to strong universal constants and remains valid for non-reversible chains. We design fully data-dependent confidence intervals around the coefficient, which are both easier to compute and thinner than their spectral counterparts. Furthermore, we initiate the beyond worst-case analysis, by showing how to leverage additional information about the transition matrix in order to obtain instance-dependent rates for its estimation with respect to the induced uniform norm, as well as some of its mixing properties.  ( 2 min )
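    As background for the abstract above, the classical Dobrushin contraction coefficient of a finite transition matrix is the largest total variation distance between any two of its rows. The sketch below computes that classical quantity only; the paper works with a generalized coefficient and with estimation from a single trajectory, neither of which is reproduced here. The function names are illustrative.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def dobrushin_coefficient(P):
    """Classical Dobrushin coefficient: max over row pairs of TV(P[i], P[j]).

    P is a row-stochastic matrix given as a list of rows.
    A value strictly below 1 means one step of the chain contracts
    total variation distance between any two starting distributions.
    """
    n = len(P)
    return max(
        (total_variation(P[i], P[j]) for i in range(n) for j in range(i + 1, n)),
        default=0.0,
    )
```

    For example, a matrix whose rows are all identical has coefficient 0 (the chain mixes in one step), while the identity matrix has coefficient 1 (no contraction at all).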
    Learning sparse features can lead to overfitting in neural networks. (arXiv:2206.12314v1 [stat.ML])
    It is widely believed that the success of deep networks lies in their ability to learn a meaningful representation of the features of the data. Yet, understanding when and how this feature learning improves performance remains a challenge: for example, it is beneficial for modern architectures trained to classify images, whereas it is detrimental for fully-connected networks trained for the same task on the same data. Here we propose an explanation for this puzzle, by showing that feature learning can perform worse than lazy training (via random feature kernel or the NTK) as the former can lead to a sparser neural representation. Although sparsity is known to be essential for learning anisotropic data, it is detrimental when the target function is constant or smooth along certain directions of input space. We illustrate this phenomenon in two settings: (i) regression of Gaussian random functions on the d-dimensional unit sphere and (ii) classification of benchmark datasets of images. For (i), we compute the scaling of the generalization error with number of training points, and show that methods that do not learn features generalize better, even when the dimension of the input space is large. For (ii), we show empirically that learning features can indeed lead to sparse and thereby less smooth representations of the image predictors. This fact is plausibly responsible for deteriorating the performance, which is known to be correlated with smoothness along diffeomorphisms.  ( 2 min )
    Aggregated Multi-output Gaussian Processes with Knowledge Transfer Across Domains. (arXiv:2206.12141v1 [stat.ML])
    Aggregate data often appear in various fields such as socio-economics and public security. The aggregate data are associated not with points but with supports (e.g., spatial regions in a city). Since the supports may have various granularities depending on attributes (e.g., poverty rate and crime rate), modeling such data is not straightforward. This article offers a multi-output Gaussian process (MoGP) model that infers functions for attributes using multiple aggregate datasets of respective granularities. In the proposed model, the function for each attribute is assumed to be a dependent GP modeled as a linear mixing of independent latent GPs. We design an observation model with an aggregation process for each attribute; the process is an integral of the GP over the corresponding support. We also introduce a prior distribution of the mixing weights, which allows a knowledge transfer across domains (e.g., cities) by sharing the prior. This is advantageous in such a situation where the spatially aggregated dataset in a city is too coarse to interpolate; the proposed model can still make accurate predictions of attributes by utilizing aggregate datasets in other cities. The inference of the proposed model is based on variational Bayes, which enables one to learn the model parameters using the aggregate datasets from multiple domains. The experiments demonstrate that the proposed model outperforms in the task of refining coarse-grained aggregate data on real-world datasets: Time series of air pollutants in Beijing and various kinds of spatial datasets from New York City and Chicago.  ( 3 min )
    Regret Bounds for Noise-Free Kernel-Based Bandits. (arXiv:2002.05096v2 [stat.ML] UPDATED)
    Kernel-based bandit is an extensively studied black-box optimization problem, in which the objective function is assumed to live in a known reproducing kernel Hilbert space. While nearly optimal regret bounds (up to logarithmic factors) are established in the noisy setting, surprisingly, less is known about the noise-free setting (when the exact values of the underlying function are accessible without observation noise). We discuss several upper bounds on regret, none of which seems order optimal, and provide a conjecture on the order optimal regret bound.  ( 2 min )
    Provably Efficient Reinforcement Learning in Partially Observable Dynamical Systems. (arXiv:2206.12020v1 [cs.LG])
    We study Reinforcement Learning for partially observable dynamical systems using function approximation. We propose a new Partially Observable Bilinear Actor-Critic framework that is general enough to include models such as observable tabular Partially Observable Markov Decision Processes (POMDPs), observable Linear-Quadratic-Gaussian (LQG), Predictive State Representations (PSRs), as well as a newly introduced model Hilbert Space Embeddings of POMDPs and observable POMDPs with latent low-rank transition. Under this framework, we propose an actor-critic style algorithm that is capable of performing agnostic policy learning. Given a policy class that consists of memory based policies (that look at a fixed-length window of recent observations), and a value function class that consists of functions taking both memory and future observations as inputs, our algorithm learns to compete against the best memory-based policy in the given policy class. For certain examples such as undercomplete observable tabular POMDPs, observable LQGs and observable POMDPs with latent low-rank transition, by implicitly leveraging their special properties, our algorithm is even capable of competing against the globally optimal policy without paying an exponential dependence on the horizon in its sample complexity.  ( 2 min )
    Deep Stable neural networks: large-width asymptotics and convergence rates. (arXiv:2108.02316v2 [cs.LG] UPDATED)
    In modern deep learning, there is a recent and growing literature on the interplay between large-width asymptotic properties of deep Gaussian neural networks (NNs), i.e. deep NNs with Gaussian-distributed weights, and Gaussian stochastic processes (SPs). Such an interplay has proved to be critical in Bayesian inference under Gaussian SP priors, kernel regression for infinitely wide deep NNs trained via gradient descent, and information propagation within infinitely wide NNs. Motivated by empirical analyses that show the potential of replacing Gaussian distributions with Stable distributions for the NN's weights, in this paper we present a rigorous analysis of the large-width asymptotic behaviour of (fully connected) feed-forward deep Stable NNs, i.e. deep NNs with Stable-distributed weights. We show that as the width goes to infinity jointly over the NN's layers, i.e. the "joint growth" setting, a rescaled deep Stable NN converges weakly to a Stable SP whose distribution is characterized recursively through the NN's layers. Because of the non-triangular structure of the NN, this is a non-standard asymptotic problem, to which we propose an inductive approach of independent interest. Then, we establish sup-norm convergence rates of the rescaled deep Stable NN to the Stable SP, under the "joint growth" and a "sequential growth" of the width over the NN's layers. Such a result provides the difference between the "joint growth" and the "sequential growth" settings, showing that the former leads to a slower rate than the latter, depending on the depth of the layer and the number of inputs of the NN. Our work extends some recent results on infinitely wide limits for deep Gaussian NNs to the more general deep Stable NNs, providing the first result on convergence rates in the "joint growth" setting.  ( 3 min )
    On making optimal transport robust to all outliers. (arXiv:2206.11988v1 [stat.ML])
    Optimal transport (OT) is known to be sensitive to outliers because of its marginal constraints. Outlier robust OT variants have been proposed based on the definition that outliers are samples which are expensive to move. In this paper, we show that this definition is too restrictive by considering the case where outliers are closer to the target measure than clean samples. We show that outlier robust OT fully transports these outliers, leading to poor performance in practice. To tackle these outliers, we propose to detect them by relying on a classifier trained with adversarial training to classify source and target samples. A sample is then considered an outlier if the prediction from the classifier is different from its assigned label. To decrease the influence of these outliers in the transport problem, we propose to either remove them from the problem or to increase the cost of moving them by using the classifier prediction. We show that we successfully detect these outliers and that they do not influence the transport problem on several experiments such as gradient flows, generative models and label propagation.  ( 2 min )
    Approximating 1-Wasserstein Distance with Trees. (arXiv:2206.12116v1 [stat.ML])
    Wasserstein distance, which measures the discrepancy between distributions, shows efficacy in various types of natural language processing (NLP) and computer vision (CV) applications. One of the challenges in estimating Wasserstein distance is that it is computationally expensive and does not scale well for many distribution comparison tasks. In this paper, we aim to approximate the 1-Wasserstein distance by the tree-Wasserstein distance (TWD), where TWD is a 1-Wasserstein distance with tree-based embedding and can be computed in linear time with respect to the number of nodes on a tree. More specifically, we propose a simple yet efficient L1-regularized approach to learning the weights of the edges in a tree. To this end, we first show that the 1-Wasserstein approximation problem can be formulated as a distance approximation problem using the shortest path distance on a tree. We then show that the shortest path distance can be represented by a linear model and can be formulated as a Lasso-based regression problem. Owing to the convex formulation, we can obtain a globally optimal solution efficiently. Moreover, we propose a tree-sliced variant of these methods. Through experiments, we demonstrated that the weighted TWD can accurately approximate the original 1-Wasserstein distance.  ( 2 min )
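    The linear-time claim in the abstract above rests on the closed form of the tree-Wasserstein distance: the sum over edges of the edge weight times the absolute difference in mass that the two distributions place on the subtree below that edge. The sketch below computes that closed form for fixed edge weights; it does not implement the paper's Lasso-based weight learning. The array layout (root at index 0, every parent index smaller than its child's) is an assumption made to keep the example short.

```python
def tree_wasserstein(parent, weight, mu, nu):
    """Tree-Wasserstein distance between mu and nu on a rooted tree.

    parent[v] : index of the parent of node v (parent[0] is ignored; node 0 is the root)
    weight[v] : weight of the edge between v and parent[v] (weight[0] is ignored)
    mu, nu    : mass placed on each node by the two distributions

    Assumes parent[v] < v for every non-root node, so that iterating
    indices from high to low visits children before their parents.
    """
    diff = [m - n for m, n in zip(mu, nu)]
    total = 0.0
    for v in range(len(parent) - 1, 0, -1):
        # diff[v] now holds the mass difference of the whole subtree rooted at v.
        total += weight[v] * abs(diff[v])
        diff[parent[v]] += diff[v]
    return total
```

    Each node is visited once, so the cost is linear in the number of nodes. On a path 0-1-2 with edge weights 1 and 2, moving a unit of mass from node 0 to node 2 costs 3, which matches the shortest-path distance between the two nodes.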

  • Open

    [D] Paper Explained - Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos (Video Analysis)
    https://youtu.be/oz5yZc9ULAc Minecraft is one of the harder challenges any RL agent could face. Episodes are long, and the world is procedurally generated, complex, and huge. Further, the action space is a keyboard and a mouse, which has to be operated only given the game's video input. OpenAI tackles this challenge using Video PreTraining, leveraging a small set of contractor data in order to pseudo-label a giant corpus of scraped footage of gameplay. The pre-trained model is highly capable in basic game mechanics and can be fine-tuned much better than a blank slate model. This is the first Minecraft agent that achieves the elusive goal of crafting a diamond pickaxe all by itself. OUTLINE: 0:00 - Intro 3:50 - How to spend money most effectively? 8:20 - Getting a large dataset with labels 14:40 - Model architecture 19:20 - Experimental results and fine-tuning 25:40 - Reinforcement Learning to the Diamond Pickaxe 30:00 - Final comments and hardware Blog: https://openai.com/blog/vpt/ Paper: https://arxiv.org/abs/2206.11795 Code & Model weights: https://github.com/openai/Video-Pre-Training submitted by /u/ykilcher [link] [comments]  ( 85 min )
    [D] How to not commit code copyright violation with GitHub Copilot?
    At our workplace, many of our ML researchers are starting to use GitHub Copilot to save time. The issue is that there is no provenance for the code generated by Copilot. If I understand correctly, Copilot is trained on public GitHub repositories, many of which might have specific copyright and license clauses. Our research, when published, would also put the code on GitHub publicly. What would you suggest to prevent potential code copyright violation in this case? I have sent a request to GitHub to provide a provenance tracking feature, but I assume that's gonna take a while to implement (that is, if they decide to implement it). Are you using GitHub Copilot and worrying about similar issues? submitted by /u/leboulevardier [link] [comments]  ( 85 min )
    [D] Will this mode work for practicing paper reviews? Can we get in-depth feedback on our draft?
    Some opinions were collected about mock ML paper reviews. Link to the thread: https://www.reddit.com/r/MachineLearning/comments/u967sy/d_opinions_needed_anyone_interested_in_mock_peer/?utm_source=share&utm_medium=web2x&context=3 To summarize, many people are interested. The common opinions are: (1) people prefer private review to public review; (2) the number of papers to review is not a concern, but every couple of months would be a good pace; (3) plagiarism and stealing are, of course, the biggest concern. To address this, I suggest the following mode: (1) It is ONLY open to people who want to exchange paper reviews. Enthusiastic reviewers with no paper draft to be reviewed can wait. (2) It is ONLY open to people who are really interested in a mock paper review prior to formal journal/conference submission. (3) Join a Discord community (already established). (4) In the PRIVATE "Introduce yourself" channel, people introduce themselves using true information and offer a very brief paper abstract and ML category. (5) Chat openly or privately to find the right review partners. (6) In the "paper-review-exchange" channel, announce your paper reviewer upon agreement (from both sides). (7) Exchange your drafts privately and preferably with official email addresses. (8) (Optional) When the review work is done, announce it too. Note that plagiarism and stealing can be minimized in this mode but could still happen. When conference reviews do not offer much nowadays, a mock review might give you more TRUE input. Good luck! submitted by /u/DouBlindDotCOM [link] [comments]  ( 85 min )
    [R] Can explainability improve model accuracy?
    https://preview.redd.it/okh7r16770891.jpg?width=1200&format=pjpg&auto=webp&s=9f0fe7605453a945682d27eab65d866dce3f126c Black-box Deep learning models are mostly uninterpretable and far too complex. • One strategy is to learn the nonlinear relation of input features. However, there are so many features to learn from. https://preview.redd.it/muotby5s70891.png?width=782&format=png&auto=webp&s=1cbc3dece747d061e3ab96dea8b309c3fae5b8ce • Research shows a set of important features can improve the learning process. Therefore, we can focus on the most correlated features. • Paper📜: https://arxiv.org/abs/2203.04383 submitted by /u/AshkanF [link] [comments]  ( 84 min )
    [D] Clarification question related to prompting
    What is the difference between prompt engineering and prompt learning? I recently heard a talk where the presenter said that ‘we freeze the parameters of the model and only do prompt learning’. To me that seems more like engineering than learning. submitted by /u/QadriShyaari [link] [comments]  ( 83 min )
    [P] A drawing application called Vizcom that uses GANs to help automate color, shading, and rendering.
    submitted by /u/AquaHug [link] [comments]  ( 85 min )
    [R] [D] How can one rigorously and efficiently deal with binary classification problems on multi-label data?
    To be clearer, I'd like to start learning about some techniques or the literature about this particular type of binary classification problem. Please share if you happen to know about this (keywords, links, articles, etc. are all appreciated). So, the problem is supervised binary classification. In general, there is nothing special about the dataset apart from the fact that the train/val data from one of the 2 label classes (from now on, let's say it's the negative class) are already further labeled into multiple subclasses. From there, the problem has an additional goal (other than binary classification): to maximize the number of subclasses that are classified well by the model. By "classified well", I mean that, for example, if one restricts the negative side of the dataset to one of such subclasses, the performance of the model is higher than some close-to-perfect threshold. Furthermore, there might be complications in both directions: there might be some subclasses that are easy for the model to classify, and there might be some subclasses that are impossible for the model to classify (e.g. the XOR problem with linear classifiers). The key here is that, in the end, at test time, one should only use one "small" (relatively, of course) "model" (a combination of shallow neural nets is OK too) to classify all testing data. Additionally, I'm open to learning about approaches beyond the supervised paradigm. submitted by /u/anvinhnd [link] [comments]  ( 85 min )
    [Discussion] Doubt regarding text vector difference image manipulation method of Dalle-2.
    I was going through the (updated) paper, which describes an image manipulation method based on text differences. It goes like this: z_i := CLIP embedding of the original image; z_t := CLIP embedding of the new text (the text for the current image manipulation); z_t0 := CLIP embedding of the original image's corresponding text, or of the text 'a photo', or an empty embedding; z_d := l2_norm(z_t - z_t0), the text difference vector, where l2_norm means normalising a vector by dividing it by its p-norm (here the 2-norm); z_new (or z_theta) := spherical_interpolation(z_i, z_d, theta), with theta in (0, 0.5), the new image's CLIP embedding vector. What I don't understand is that the CLIP image and text embedding vectors are supposed to be similar vectors (since they are trained with cosine similarity), and the difference between the text embedding vectors of two similar texts will be somewhat perpendicular to either of the text vectors. Therefore the text difference vector should be very different from the image embedding, and hence the spherical interpolation shouldn't give any meaningful result. What am I missing? I am unable to understand why this text difference method works. submitted by /u/OddSandwich969 [link] [comments]  ( 85 min )
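    The definitions in the post can be sketched in a few lines. This is a minimal illustration with plain Python lists standing in for CLIP embeddings; the function names are hypothetical and the code is not taken from the DALL-E 2 paper or any CLIP implementation. It only shows the arithmetic of the normalised text difference and the spherical interpolation.

```python
import math


def l2_normalize(v):
    """Divide a vector by its Euclidean (2-) norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def slerp(a, b, theta):
    """Spherical interpolation between unit vectors along a and b.

    theta = 0 returns the direction of a, theta = 1 the direction of b.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)          # angle between the two directions
    s = math.sin(omega)
    if s < 1e-8:                    # (anti)parallel: interpolation is degenerate
        return a
    wa = math.sin((1.0 - theta) * omega) / s
    wb = math.sin(theta * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]


def text_diff_edit(z_i, z_t, z_t0, theta):
    """z_d = l2_norm(z_t - z_t0); z_theta = slerp(z_i, z_d, theta)."""
    z_d = l2_normalize([t - t0 for t, t0 in zip(z_t, z_t0)])
    return slerp(z_i, z_d, theta)
```

    Two properties follow directly from the formula: the result always has unit norm, and for small theta it stays close to the direction of z_i, so the interpolation rotates the image embedding toward z_d rather than replacing it.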
    [R] How well do sparse ImageNet models transfer? Prune once and deploy anywhere for inference performance speedups! (arxiv link in comments)
    submitted by /u/markurtz [link] [comments]  ( 84 min )
    [D] Why do some competition organizers hide the leaderboard? (Regarding my experience in IEEE SP Cup 2022)
    I don't understand why the organisers of a competition would hide the leaderboard, especially in a machine learning and signal processing related competition. We participated in IEEE SP Cup 2022, sacrificing nearly 2 months of our time and some sleepless nights. The organizers never said anything about keeping the leaderboard of the competition hidden. In the first round they gave us access to a website where we could submit our predictions and get our score privately. There wasn't a leaderboard (well, there was one that generated some random scores whenever we made a submission, but I don't understand the use of it). After the first round, we and several other teams requested the leaderboard. At first, the organizers said they couldn't reveal it because some teams would not like other teams seeing their position on the leaderboard (a weird reason, because all teams were given a separate name to make submissions on the website and no team knew the names of the others 🙄). After many teams replied that they would like to see the leaderboard and that this would not be a problem for them, the organizers asked us to create a poll in Piazza to see which teams would like to see the leaderboard and said they would reveal the leaderboard after the competition was over. The funny thing is, there was no option for students to create a poll in Piazza 😂. Even though we mentioned it to the organizers, we didn't get any reply. Now it has been almost a month since the competition concluded and the organizers have totally ghosted us. This is really discouraging after spending several months on a competition without even getting to know how far our efforts have come. Why would organizers hide the leaderboard like this? Couldn't they at least reveal the top 10 teams? submitted by /u/TransitionWhich5018 [link] [comments]  ( 85 min )
    GMM latent space [D]
    Hi, I would love to know if there is any ongoing work (or the latest) on mixture of Gaussians as latent space for GANs, or other generative models. Does anyone have any experience on it and/or opinions on why it is not popular? (or doesn't work) submitted by /u/huehue9812 [link] [comments]  ( 84 min )
    [D] Derivation of path dependent attribution in Tree SHAP
    I was reading the TreeSHAP paper by Lundberg & Lee. There they propose that every path can be considered an individual model, and due to the additivity property of SHAP we can directly add the attributions for each path, which gives us the attribution for that tree. I can follow this much: if a feature doesn't lie on the path, then that feature's attribution for that path is zero; if a feature lies on the path and also lies on the path of Xf, then its attribution is positive; if a feature lies on the path but doesn't lie on the path covered by Xf, then its attribution is negative. But I can't get my head around the quantification of these contributions, especially the weighting, i.e., POS = W(|Sp|-1, |Np|)*v ; NEG = -W(|Sp|, |Np|)*v ; where v is the leaf's update. I have many questions, but to begin with, can someone please help me understand how we get these attribution values? submitted by /u/Ok-Seesaw9702 [link] [comments]  ( 84 min )
    [D] Sequence Modelling Technique
    Let's say we have a time series problem where we are trying to use past information to predict future inputs, like stock prices, heart rates, or a language model that receives one word at a time. In theory you would want each output at t to contain the maximum amount of predictive information about label t+1. Let's say you attach a second network to this RNN, which tries to predict hidden state t+1 from hidden state t, and add its error as an auxiliary loss. You could call it a "lookahead reconstruction loss". I believe this should make the RNN learn in a way that maximises the network's understanding of the future. Has anybody experimented with this technique, or read about implementations of it? I'd be interested in hearing opinions from fellow practitioners. submitted by /u/RodObr [link] [comments]  ( 84 min )
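    A minimal sketch of the auxiliary loss described above, assuming hidden states are plain lists of floats and the second network is any callable that maps h_t to a guess for h_{t+1}. The names `lookahead_loss`, `total_loss`, and `aux_weight` are hypothetical, and a real implementation would live inside an autodiff framework so that the loss can be backpropagated.

```python
def lookahead_loss(hidden_states, predictor):
    """Mean squared error between predictor(h_t) and the actual h_{t+1}."""
    total, count = 0.0, 0
    for h_t, h_next in zip(hidden_states, hidden_states[1:]):
        predicted = predictor(h_t)
        total += sum((p - q) ** 2 for p, q in zip(predicted, h_next))
        count += len(h_next)
    return total / count


def total_loss(task_loss, hidden_states, predictor, aux_weight=0.1):
    """Main task loss plus the weighted auxiliary lookahead term."""
    return task_loss + aux_weight * lookahead_loss(hidden_states, predictor)


# Toy hidden states where every component grows by 1 per step.
hidden_states = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
perfect = lambda h: [x + 1.0 for x in h]   # predicts h_{t+1} exactly
identity = lambda h: list(h)               # just repeats h_t

print(lookahead_loss(hidden_states, perfect))
print(lookahead_loss(hidden_states, identity))
```

    One design question the sketch leaves open is whether gradients from the auxiliary term should flow into the target h_{t+1} as well as into the prediction; stopping them on the target is one option worth testing.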
    I made a robot that punishes me if it detects that I am procrastinating on my assignments [P]
    submitted by /u/_ayushp_ [link] [comments]  ( 90 min )
    [R] CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 84 min )
  • Open

    Does anyone know what AI text to voice «anicapped» uses on youtube
    Maybe you haven’t heard this; this is the voice: https://youtu.be/FAvcn_8OuMk It sounds really good. I'm wondering if anyone knows which AI is used for the text to voice? submitted by /u/Basic_Pay7859 [link] [comments]  ( 82 min )
    Deepfakes and investment fraud
    Found a fraud that is using deep fake photos to generate a “credible” website: https://nilssonhedge.com/2022/06/25/this-manager-does-not-exist-the-sequel/ submitted by /u/Interesting-Wing-829 [link] [comments]  ( 82 min )
    How does the data input work in a chat bot?
    I am new to AI and chatbots in general. I can't find any good explanations of how data mining/inputs work with chat bots. Do I need data from real people? Could I create a series of questions and answers and have the AI use that to expand on? submitted by /u/linuxman1929 [link] [comments]  ( 82 min )
    AI Image Filling with OpenAI DALL-E 2
    submitted by /u/dulldata [link] [comments]  ( 82 min )
    In less than 5 minutes, you will know how the transformer architecture can be applied to computer vision with a new paper called the Swin Transformer
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 82 min )
    OpenAI's DALL-E 2 may now generate faces
    submitted by /u/henlo_there_fren [link] [comments]  ( 82 min )
    Just posted a huge update to my neural-net artificial life sim! Temperature tracking, scent system, skin patterns and more!
    submitted by /u/urocyon_dev [link] [comments]  ( 84 min )
    Instagram bot
    These days you see bots that always DM people about someone's page or a weird link. How can I build a bot like that? I also want to spam people's DMs like that for some business. Hope it is not illegal lmao. submitted by /u/Ekonshy [link] [comments]  ( 82 min )
    AI Makes Strides in Virtual Worlds More Like Our Own | Quanta Magazine
    submitted by /u/nick7566 [link] [comments]  ( 82 min )
    What is pruning a deep neural network? After reading many papers, I've created a guide on github in an attempt to map the many pruning and sparsity techniques
    submitted by /u/IntelligentHat1657 [link] [comments]  ( 82 min )
    SANDCASTLES BONANZA | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
    DALL-E mini is amazing / music by me
    submitted by /u/Shaftershafter [link] [comments]  ( 82 min )
    Community for AI Generated Mech/Robot Concept Art
    Hi - I've recently started up a discord community to generate free mech / robot concept art for the art & design community. We have a number of categorized sections that we are filling up with unusual and inspiring designs and plan to run weekly competitions based around novel themes. This is a non-profit initiative and not at all associated with the blockchain or NFTs. The intent is purely to inspire people in their own art projects and give people some building blocks to work from by harnessing AI. We've got a number of users with access to Midjourney and Disco Diffusion, with invites periodically becoming available for active contributors to the discord. Here's the discord link and some example images for anyone that wants to join the project or just to pop in, say hi and get inspired: https://discord.gg/WcR5YCmP https://twitter.com/AIMechCollect/status/1540767027815645184?t=uDIppoThajlVx9603tsnUw&s=19 Please delete this if this Reddit group doesn't allow this type of promotion. submitted by /u/Rabeeeto [link] [comments]  ( 83 min )
  • Open

    Rationale for updating Value Function multiple times with same observations in spinningup's VPG-GAE implementation
    Hi there, In OpenAI's spinningup's VPG-GAE implementation, the authors update the value function V(s_t) multiple times at every epoch using the same batch of observations. Copying their code (line 237 onwards in link above):
        def update():
            # Get loss and info values before update
            # ...
            # Train policy with a single step of gradient descent
            # ...
            # Value function learning
            for i in range(train_v_iters):  # <--- STARTING HERE
                vf_optimizer.zero_grad()
                loss_v = compute_loss_v(data)  # <--- data is unchanged
                loss_v.backward()
                mpi_avg_grads(ac.v)  # average grads across MPI processes
                vf_optimizer.step()
    What's the rationale for doing so? My interpretation is that this is done to accelerate learning and that, presumably, this is more stable than using a higher learning rate on a single pass through the data. So: What's the rationale (am I missing something)? Is this common practice in policy optimisation models? Why does the same rationale not apply to the policy updates? Thank you all for your help! submitted by /u/desperateEfforts1 [link] [comments]  ( 83 min )
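    One common reading, offered here as an assumption and not as something the quoted code states, is that the value update is ordinary supervised regression onto fixed return targets, so taking several gradient steps on the same batch is just a better fit of the same regression problem. The vanilla policy gradient estimate, by contrast, is only valid for data sampled from the current policy, so it becomes stale after one step. The toy below illustrates only the first half, with a hypothetical one-parameter value function V(s) = w * s.

```python
def fit_value(states, returns, w=0.0, lr=0.1, train_v_iters=1):
    """Run `train_v_iters` gradient steps on the SAME batch for V(s) = w * s."""
    n = len(states)
    for _ in range(train_v_iters):
        # Gradient of the mean squared error (w*s - R)^2 with respect to w.
        grad = sum(2.0 * (w * s - r) * s for s, r in zip(states, returns)) / n
        w -= lr * grad
    return w


def mse(states, returns, w):
    """Mean squared error of V(s) = w * s against the return targets."""
    return sum((w * s - r) ** 2 for s, r in zip(states, returns)) / len(states)


states = [1.0, 2.0, 3.0]
returns = [2.0, 4.0, 6.0]  # toy targets, exactly fitted by w = 2

w_one = fit_value(states, returns, train_v_iters=1)
w_many = fit_value(states, returns, train_v_iters=80)
print(mse(states, returns, w_one), mse(states, returns, w_many))
```

    The targets do not depend on `w`, which is why reusing the batch is harmless here; nothing in this sketch says how many iterations are appropriate in practice.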
    "The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models", Pan et al 2022 ("phase transitions: capability thresholds at which the agent's behavior qualitatively shifts")
    submitted by /u/gwern [link] [comments]  ( 83 min )
    "Deep Reinforcement Learning for Closed-Loop Blood Glucose Control", Fox et al 2020
    submitted by /u/gwern [link] [comments]  ( 82 min )
    Are there any guides on writing technical ML papers?
    I read them in hopes to one day contribute, but they seem to range in practice. Some are overly nuanced and detract from the point while others avoid jargon altogether so I'm wondering if there are any guidelines. submitted by /u/XecutionStyle [link] [comments]  ( 83 min )
    "AI-Guided Robots Are Ready to Sort Your Recyclables"
    submitted by /u/gwern [link] [comments]  ( 84 min )
    Resources for off/on-policy RL
    Hello, I am trying to understand the math of off-policy and on-policy RL. Like what exactly allows the use of previous experiences in off-policy RL, and why is that not possible in on-policy RL. Any resources that could help with that? submitted by /u/AhmedNizam_ [link] [comments]  ( 83 min )
  • Open

    Transformations of Olympic rings
    The previous post gave the details of how Möbius transformations m(z) = (az + b)/(cz + d) transform circles. The image of a circle under a Möbius transformation is either a line or a circle, and in our examples the image will always be a line. We start with an approximation of the Olympic rings […] Transformations of Olympic rings first appeared on John D. Cook.  ( 4 min )
    Circles and lines under a Möbius transformation
    This post will revisit a previous post in more detail. I’ve written before about how Möbius transformations map circles and lines to circles and lines. In this post I’d like to explore how you’d calculate which circle or line a given circle or line goes to. Given an equation of a line or circle, what […] Circles and lines under a Möbius transformation first appeared on John D. Cook.  ( 6 min )
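    A small numerical sketch of the calculation described above, not taken from the post itself: since three points determine a circle or a line, one way to find the image is to push three points of the original circle through m and fit a circle through the results. The helper names `mobius`, `circle_through`, and `image_of_circle` are hypothetical.

```python
import cmath


def mobius(a, b, c, d):
    """Return the map m(z) = (a z + b) / (c z + d)."""
    return lambda z: (a * z + b) / (c * z + d)


def circle_through(z1, z2, z3, tol=1e-9):
    """Center and radius of the circle through three points, or None if collinear."""
    u, v = z2 - z1, z3 - z1
    cross = (u.conjugate() * v).imag  # zero exactly when the points are collinear
    if abs(cross) < tol:
        return None
    # Circumcenter of the triangle 0, u, v, then shifted back by z1.
    w = (abs(u) ** 2 * v - abs(v) ** 2 * u) / (2j * cross)
    return z1 + w, abs(w)


def image_of_circle(m, center, radius, angles=(0.5, 2.5, 4.5)):
    """Image of a circle under m; None means the image is a line.

    The sample angles are arbitrary and must avoid the pole of m.
    """
    points = [center + radius * cmath.exp(1j * t) for t in angles]
    return circle_through(*(m(p) for p in points))


# The unit circle under z -> 2z + 1 should be the circle |w - 1| = 2.
scaled = image_of_circle(mobius(2, 1, 0, 1), 0, 1)
# The unit circle passes through the pole of z -> 1/(z - 1), so the image is a line.
line = image_of_circle(mobius(0, 1, 1, -1), 0, 1)
print(scaled, line)
```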

  • Open

    First line of AI Designed Graphic Tees - GraphicAI
    submitted by /u/cityofgoul [link] [comments]  ( 82 min )
    "Sunset" 🌅 - Created on Pixelz AI
    submitted by /u/pixelz_ai [link] [comments]  ( 82 min )
    THE ACCUSER | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
    Is there any way of using a text editor with Kaggle or Google Colab notebooks? [Discussion]
    submitted by /u/yapoinder [link] [comments]  ( 84 min )
    AI Advances Nuclear Fusion R&D | New Amazon Robot Proteus Automation | AI Outperforms Crypto Markets | Robotic Fireflies
    submitted by /u/getrich_or_diemining [link] [comments]  ( 82 min )
    Bullitt chase scene upscaled to 50 FPS Using DAIN-APP (free Artificial Intelligence software)
    submitted by /u/the_anonymizer [link] [comments]  ( 82 min )
    Generative AI resource
    I came across this course about Generative AI/ generative models and I find it quite interesting. I wanted to share this resource, since I’m struggling to find good and up to date material on GAI. https://www.udemy.com/course/generative-ai/?referralCode=6A16021D86142A4EAB93 submitted by /u/gggingerbean [link] [comments]  ( 82 min )
    Yandex open sources 100B GPT-like model
    submitted by /u/binaryfor [link] [comments]  ( 82 min )
    AI made art
    submitted by /u/Accomplished_Head5 [link] [comments]  ( 83 min )
    AI Dream 58 - AI EPIC Midjourney through Space
    submitted by /u/LordPewPew777 [link] [comments]  ( 82 min )
    What is the best free AI voice synthesis program out there?
    What I'm looking for is a program that can take raw voice clips (~10 minutes of actual mp3 recording) and create a synthesised voice from that. I ask because I want to make some custom voices of fictional characters, a bit like what 15.ai does. I've had experience working with AI programs, so something on GitHub is fine as well, as long as it's not too much of a pain to set up. submitted by /u/Cyberfunk3 [link] [comments]  ( 82 min )
    Yann LeCun has a bold new vision for the future of AI
    submitted by /u/nick7566 [link] [comments]  ( 84 min )
    How are these videos made?
    submitted by /u/niIbert [link] [comments]  ( 82 min )
    Are there any free AI story generators like inferkit for Android?
    submitted by /u/ScottABoutizis [link] [comments]  ( 83 min )
    have a nice trip!)
    submitted by /u/nalr00n [link] [comments]  ( 82 min )
  • Open

    [N] CVPR Hugging Face Gradio event is open until June 30th. A hackathon type event with prizes in which we will create interactive web demos for CVPR papers.
    We are happy to invite you to the Hugging Face Gradio CVPR event - a community event in which we will create interactive demos for CVPR papers. Demos are powerful because they allow anyone — not just ML engineers — to try out models in the browser, give feedback on predictions, identify trustworthy models. The event is open until June 30th, 2022 (AOE Time Zone). We are organizing this event on Huggingface: https://huggingface.co/CVPR. Prizes will be given at the end of the event. Demos will be built with Gradio and we encourage using the new Gradio Blocks API. Blocks allows you to build web-based demos in a flexible way using the Gradio library. Gradio is a popular choice for building demos for machine learning models, as it allows you to create web-based UIs all in Python. For example, here is a Gradio Demo for FLAVA: A Foundational Language And Vision Alignment Model: https://reddit.com/link/vkqmhu/video/48cnmkfiku791/player submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 84 min )
    [Discussion] What's the best way to prevent data leak?
    I heard the phrase data leak or feature leak, and the solution seems to be a point-in-time join. Maybe because I haven't built a lot of ML applications, I never actually knew about it. So do you see it often? When do you usually see it? How do you avoid it? Are there any tools to avoid it, or to do things right without data leakage? Thanks! submitted by /u/rubick5 [link] [comments]  ( 84 min )
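    A minimal sketch of what a point-in-time join does, assuming feature values are stored per entity as a timestamp-sorted history. The function name and data layout are hypothetical, not the API of any particular feature store. Each training row only sees the latest feature value recorded at or before its label time, so nothing from the future leaks in.

```python
from bisect import bisect_right


def point_in_time_join(label_events, feature_history):
    """Attach to each label the latest feature value known at label time.

    label_events: list of (entity, label_time, label)
    feature_history: dict entity -> list of (timestamp, value), sorted by timestamp
    """
    rows = []
    for entity, label_time, label in label_events:
        history = feature_history.get(entity, [])
        times = [t for t, _ in history]
        # Index of the last record with timestamp <= label_time, or -1 if none.
        idx = bisect_right(times, label_time) - 1
        value = history[idx][1] if idx >= 0 else None
        rows.append((entity, label_time, value, label))
    return rows


feature_history = {"u1": [(1, 10.0), (5, 20.0), (9, 30.0)]}
label_events = [("u1", 6, 1), ("u1", 0, 0)]
rows = point_in_time_join(label_events, feature_history)
print(rows)
```

    A naive join on entity alone would attach the value 30.0 (recorded at time 9) to the label observed at time 6, which is exactly the leak in question.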
    [D] Is it time to retire the FID?
    I know the main metric used to measure the quality of generative models is the FID. However, it seems to me that some problems arise when evaluating a generative model using another model. A couple that come to mind:
    - Inception v3 itself is 7 years old at this point. Nowadays, we have models with much higher ImageNet classification accuracy, which presumably translates to better internal representations. Why are we still using Inception v3 instead of, for instance, ViT or some more recent model?
    - The ImageNet dataset that is commonly used to pretrain Inception v3, while being quite comprehensive, is still limited to 1000 classes. If I want to train a model to generate classes that are semantically distant from ImageNet classes, what guarantees do I have that the activations of Inception v3 will be meaningful? This is even more problematic with models like DALL-E, which are trained on much larger datasets and can essentially generate from the open set.
    Perhaps I am misinterpreting things, but it seems to me that the FID is a case of "good enough" that sort of stuck around. What are your thoughts? submitted by /u/MurlocXYZ [link] [comments]  ( 86 min )
    Is there any way of using a text editor with Kaggle or Google Colab notebooks? [Discussion]
    UPDATE: SOLVED The lovely people in the comments guided me to a better method: using github and cloning my repository in the kaggle runtime with the !git clone command. I was unaware you could clone a github repository and run a python file this way. I was even able to create an anaconda environment and run everything smoothly. So everything is running smoothly again :D <3 <3 :D
    -------------
    I am training a video classification neural network which involves opencv based image augmentation, and after the training completes I run a series of tests with my test datasets. With all of that functionality the code base is close to 6k lines of code. This is really hard to work with in the current notebook cell format: if I want to make any changes I have to scroll a lot, and I often get confused since my python classes are thousands of lines each with many functions built in. Using an editor like VSCODE is 10000x easier than working with notebooks. Has anyone figured this one out? Yes, I realize I can work in VSCODE on my local computer and then manually transfer the code to kaggle, but this is incredibly tedious when making small changes to file paths and general code changes. I'm shocked there isn't a better way around this !!! I mean c'mon, how do we expect AI to be adopted by the masses if we can't have a streamlined way of developing software? I guess the alternative is to buy a $6000 GPU and build a pc lol, I'm a broke student paying off student debt :( I am grateful for the free GPU with Kaggle, I JUST WANT A SIMPLE TEXT EDITOR... is that too much to ask? submitted by /u/yapoinder [link] [comments]  ( 89 min )
    [P] Frechet Inception Distance
    I'm currently looking into quantifying GANs and from my current understanding, the way to go is the FID (Frechet Inception Distance) as a key metric. I read into it and have a basic understanding of how it works based on comparing the feature vectors of the Inception Model. In all the tutorials, I saw detailed implementations, but they stopped after computing an FID between two images. In all of the papers I saw, there is one FID score used to compare entire GAN architectures, and I'm a bit lost about how many images they generate to compare and whether generated images get randomly paired for an average FID score. TL;DR: The procedure behind comparing GAN architectures is unclear to me based on the FID. submitted by /u/FitWin7383 [link] [comments]  ( 86 min )
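    As far as the standard definition goes, the FID is computed between two sets of feature vectors, not between image pairs: a Gaussian is fitted to the real features and another to the generated features, and the score is ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)). The sketch below makes a loud simplifying assumption that both covariances are diagonal, so it is not the real metric, which uses full covariance matrices and a matrix square root. The name `diagonal_fid` is hypothetical and the inputs are plain lists standing in for Inception activations.

```python
from math import sqrt
from statistics import fmean, pvariance


def diagonal_fid(real_feats, fake_feats):
    """Frechet distance between two feature SETS, assuming diagonal covariances.

    With diagonal covariances the trace term reduces, per dimension,
    to (sqrt(var_real) - sqrt(var_fake)) ** 2.
    """
    score = 0.0
    for d in range(len(real_feats[0])):
        real_col = [f[d] for f in real_feats]
        fake_col = [f[d] for f in fake_feats]
        mean_gap = fmean(real_col) - fmean(fake_col)
        std_gap = sqrt(pvariance(real_col)) - sqrt(pvariance(fake_col))
        score += mean_gap ** 2 + std_gap ** 2
    return score


real = [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]]
shifted = [[3.0, 0.0], [5.0, 0.0], [7.0, 0.0]]  # same spread, mean moved by 3
print(diagonal_fid(real, real), diagonal_fid(real, shifted))
```

    Because only the set statistics enter the formula, there is no pairing of individual real and generated images; how many images each paper uses is a separate choice that the score depends on.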
    [Research] Not all our papers get published, therefore it is enjoyable to see our released papers become a true foundation for other works
    I read a post in linkedin (see links at the end) and find a similar case on our side: “Not all our papers get published, therefore it is enjoyable to see our released papers become a true foundation for other works”. Our work: (1) IMAE demonstrates a robust loss could be unbounded, asymmetric; (2) Derivative Manipulation proposes gradient normalisation and emphasis density functions. * IMAE for Noise-Robust Learning: Mean Absolute Error Does Not Treat Examples Equally and Gradient Magnitude's Variance Matters: https://arxiv.org/pdf/1903.12141.pdf * Derivative Manipulation for General Example Weighting: https://arxiv.org/pdf/1905.11233.pdf The following works: ICML-20: Normalized Loss Functions for Deep Learning with Noisy Labels: http://proceedings.mlr.press/v119/ma20c/ma20c.pdf ICML-21: Asymmetric Loss Functions for Learning with Noisy Labels https://proceedings.mlr.press/v139/zhou21f ​ More details and original source: https://www.linkedin.com/posts/xinshaowang_the-probabilistic-normal-epipolar-constraint-activity-6944535197044367360-jpu5?utm_source=linkedin_share&utm_medium=member_desktop_web https://www.linkedin.com/posts/laurent-kneip-72518658_the-probabilistic-normal-epipolar-constraint-activity-6944331307514531840-vQb1?utm_source=linkedin_share&utm_medium=member_desktop_web submitted by /u/XinshaoWang [link] [comments]  ( 85 min )
    [P] Synthetic Images Anomaly Detection with CLIP
    You have just generated a bunch of synthetic images with your favorite generative model. Most of them look great, but some look really bad. These are outliers. Since a GAN, the most popular generative model structure, doesn’t produce a likelihood score for generated images, you cannot know which of the images it generated are outliers. With the following method, you can inspect your synthetic dataset more efficiently than by just looking at all the images. First blog post on Medium. Let me know what you think. Synthetic Images Anomaly Detection with CLIP submitted by /u/Realistic_Ad_8107 [link] [comments]  ( 84 min )
    [P] Waymo Motion Prediction Challenge 2022: solution with report and code
    submitted by /u/Just_Ad8110 [link] [comments]  ( 84 min )
    [P] Oddly thresholded confidence scores on scaled yolov4 csp
    All object detections on the scaled yolov4 csp model have a confidence below 0.5, while it should range from 0 to 1. Does anything come to mind as to what the problem might be? Info:
    - I'm using a branch of the author's PyTorch repo
    - Predictions are otherwise pretty good in terms of bbox placement
    - I'm training on a single gpu
    - Darknet coco weights are converted to ".pt" PyTorch weights for training
    - A custom dataset is used with a single prediction class
    - Data is augmented before training starts, most of the dataloader's data augmentation is disabled
    submitted by /u/mrwafflezzz [link] [comments]  ( 84 min )
    [D] Single camera MOT person tracklet re-identification: most suitable approaches?
    I have a pipeline that does object detection on video frames (YOLOX) and multi-object tracking (i.e., MOT) between person bounding boxes (ByteTrack). To be specific, given a single input video consisting of a single fixed position camera without cuts, I obtain a list of tracklets, where each tracklet tends to consist of a sequence of tens or hundreds of bounding boxes of the same person (and very rarely a mistaken doppelganger). The MOT model used is SOTA, and each tracklet is accurate enough; but given long videos, long occlusions and out-of-frame movement still often result in the same person getting spread out across multiple separate tracklets. Clearly I'd like to find a way to merge tracklets that actually correspond to the same person. In other words, a re-id problem. However, 99% of the re-id literature seems to be mainly concerned with multi-camera re-id. (Probably driven by 1984-esque surveillance camera wet dreams, but that's a different topic.) What is the SOTA for unsupervised (or online self-supervised) single camera re-id, preferably utilizing whole per tracklet latent space? Or is this case approachable with something fairly vanilla like a similarity algo such as triplet margin loss? Any suggestions in how to approach this grey area in-between MOT and Re-id much appreciated. submitted by /u/WouldNotLickYourAnus [link] [comments]  ( 86 min )
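    One "fairly vanilla" baseline for the question above, sketched under the assumption that some appearance model (not shown, and not specified in the post) already gives an embedding per bounding box: average the embeddings of each tracklet, compare tracklets by cosine similarity, and never merge two tracklets that overlap in time, since with a single camera one person cannot be in two places at once. The data layout, the names, and the threshold are all hypothetical.

```python
from math import sqrt


def mean_embedding(embeddings):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def merge_candidates(tracklets, threshold=0.9):
    """Pairs (i, j) of tracklets that may show the same person.

    Each tracklet is a dict with 'start' and 'end' frame indices and
    'embeddings', a list of per-box appearance vectors.
    """
    means = [mean_embedding(t["embeddings"]) for t in tracklets]
    pairs = []
    for i in range(len(tracklets)):
        for j in range(i + 1, len(tracklets)):
            a, b = tracklets[i], tracklets[j]
            if a["start"] <= b["end"] and b["start"] <= a["end"]:
                continue  # overlapping in time: cannot be the same person
            if cosine(means[i], means[j]) >= threshold:
                pairs.append((i, j))
    return pairs


tracklets = [
    {"start": 0, "end": 10, "embeddings": [[1.0, 0.0], [0.9, 0.1]]},
    {"start": 20, "end": 30, "embeddings": [[1.0, 0.05]]},
    {"start": 5, "end": 25, "embeddings": [[1.0, 0.0]]},   # overlaps both above
    {"start": 40, "end": 50, "embeddings": [[0.0, 1.0]]},  # looks different
]
pairs = merge_candidates(tracklets)
print(pairs)
```

    Turning candidate pairs into final identities still needs a clustering step that keeps the time-overlap constraint across whole groups, which this sketch does not attempt.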
    [D] How do you guys usually go about normalizing sales data? Opinion on neural networks for business data...
    Working on a project right now, and I have sales amounts as a column. Normally I would throw this into XGBoost and let it rip, but I am thinking this might benefit from a DNN.
    - For those who have used neural networks for business data, what was your experience using it?
    - How did you normalize values like sales data? Did you just divide by the max, or not normalize at all?
    submitted by /u/ElongatedMuskrat122 [link] [comments]  ( 85 min )
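    One common option for heavy-tailed, non-negative amounts like sales, offered as a sketch and not as the only reasonable choice: take log(1 + x) to compress the tail, then standardize using statistics fitted on the training split only. The function names are hypothetical.

```python
from math import log1p
from statistics import fmean, pstdev


def fit_log_standardizer(train_sales):
    """Mean and standard deviation of log(1 + x), fitted on training data only."""
    logged = [log1p(x) for x in train_sales]
    return fmean(logged), pstdev(logged)


def transform(sales, mean, std):
    """Apply the fitted log + standardize transform to any split."""
    return [(log1p(x) - mean) / std for x in sales]


train_sales = [0.0, 10.0, 100.0, 1000.0, 100000.0]
mean, std = fit_log_standardizer(train_sales)
scaled = transform(train_sales, mean, std)
print(scaled)
```

    Dividing by the max, by contrast, lets a single large sale squash every other value toward zero, which is one reason the log transform is often tried first.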
  • Open

    Pharmacy Management: How it is Impacted by AI
    Pharmacy as a business continues to face challenges, and how it contributes value to the overall healthcare industry will help determine its ongoing success. A key component might turn out to be the effective use of technology, specifically artificial intelligence. Ever since AI has become a mainstream technology, there have been… Read More »Pharmacy Management: How it is Impacted by AI The post Pharmacy Management: How it is Impacted by AI appeared first on Data Science Central.  ( 19 min )
    From Text To Speech: An Overview
    Text-to-speech software converts digital text into speech. For instance, text can be highlighted, the play button pressed, and the reader reads the content aloud. The added features and voices offered in TTS programs differ, but the core premise remains the same. They allow auditory rather than visual consumption of a digital… Read More »From Text To Speech: An Overview The post From Text To Speech: An Overview appeared first on Data Science Central.  ( 20 min )
    Healthcare AI Chatbots: Impact on Patient Journey
    Artificial intelligence has been making waves in the global market for a while now; however, it is the applications of this technology in the world of healthcare that have evinced the most interest from all quarters. Now, there are of course countless ways in which one can use AI in healthcare but we will focus… Read More »Healthcare AI Chatbots: Impact on Patient Journey The post Healthcare AI Chatbots: Impact on Patient Journey appeared first on Data Science Central.  ( 19 min )
    Datasets and Data Annotation — The Building Blocks for Healthcare AI
    Data annotation is at the forefront of the recent revolution in healthcare AI, driving continuous progress in the field through continuous innovation through the idea of Artificial Intelligence. A computer program can use human intelligence to perform many tasks that humans carry out today. The concept is called artificial intelligence (AI).  Finding tumors, discovering kidney… Read More »Datasets and Data Annotation — The Building Blocks for Healthcare AI The post Datasets and Data Annotation — The Building Blocks for Healthcare AI appeared first on Data Science Central.  ( 22 min )
  • Open

    Reciprocal of a circle
    Let C be a circle in the complex plane with center c and radius r. Assume C does not pass through the origin. Let f(z) = 1/z. Then f(C) is also a circle. We will derive the center and radius of f(C) in this post. *** Our circle C is the set of points z satisfying […] Reciprocal of a circle first appeared on John D. Cook.  ( 4 min )
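    A quick numerical check of the standard result, not the post's own derivation: for a circle with center c and radius r that does not pass through the origin, the image under f(z) = 1/z has center conj(c) / (|c|^2 - r^2) and radius r / | |c|^2 - r^2 |. The function name `reciprocal_circle` is hypothetical.

```python
import cmath


def reciprocal_circle(c, r):
    """Center and radius of the image of the circle |z - c| = r under z -> 1/z."""
    c = complex(c)
    k = abs(c) ** 2 - r ** 2
    if k == 0:
        raise ValueError("circle passes through the origin; its image is a line")
    return c.conjugate() / k, r / abs(k)


# Map sample points of one circle and measure how far they land from the
# predicted image circle.
c, r = 1 + 2j, 0.5
center, radius = reciprocal_circle(c, r)
max_error = max(
    abs(abs(1 / (c + r * cmath.exp(1j * t)) - center) - radius)
    for t in (0.0, 1.0, 2.0, 4.0)
)
print(center, radius, max_error)
```

    For example, the circle with center 2 and radius 1 crosses the real axis at 1 and 3, which map to 1 and 1/3, so the image has center 2/3 and radius 1/3, matching the formula.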
  • Open

    Just posted a huge update to my neural-net artificial life sim! Temperature tracking, scent system, skin patterns and more!
    submitted by /u/urocyon_dev [link] [comments]  ( 83 min )
    AI Advances Nuclear Fusion R&D | New Amazon Robot Proteus Automation | AI Outperforms Crypto Markets
    submitted by /u/tohelpyou88 [link] [comments]  ( 82 min )
  • Open

    Overview of Some Deep Learning Libraries
    Machine learning is a broad topic. Deep learning, in particular, is a way of using neural networks for machine learning. The neural network is probably a concept older than machine learning, dating back to the 1950s. Unsurprisingly, many libraries have been created for it. In the following, we will give an overview of some of the famous […] The post Overview of Some Deep Learning Libraries appeared first on Machine Learning Mastery.  ( 16 min )
  • Open

    "AI Makes Strides in Virtual Worlds More Like Our Own: Intelligent beings learn by interacting with the world. Artificial intelligence researchers have adopted a similar strategy to teach their virtual agents new skills" (learning in simulations)
    submitted by /u/gwern [link] [comments]  ( 82 min )
    Why average action works better in the PPO2 RL model?
    I am using the PPO2 model with numerical actions. To test the model, I run the same observation, for example, 100 times, save each action, and then average the actions to get the action for this interaction. Here is the small part of my code showing what I am actually doing:
        actt = []
        for h in range(100):
            action, _states = model.predict(obs_test, deterministic=False)
            actt.append(action[0][0])
        action = [[np.mean(actt)]]
        obs_test, rewards, dones, info = env_test.step(action)
    This gives me more robust results; the action fluctuation is smaller. What is the explanation? submitted by /u/Mariam_Dundua [link] [comments]  ( 83 min )
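    A plausible explanation, assuming the policy samples continuous actions from a Gaussian around a predicted mean (an assumption about the setup, not something shown in the post): averaging n independent samples estimates that mean, and the spread of the average shrinks by roughly a factor of sqrt(n). The simulation below uses a hypothetical fixed mean and standard deviation in place of a real policy.

```python
import random
from statistics import fmean, pstdev


def sample_action(mu, sigma, rng):
    """One stochastic action, standing in for predict(..., deterministic=False)."""
    return rng.gauss(mu, sigma)


def averaged_action(mu, sigma, rng, n=100):
    """Average of n sampled actions for the same observation."""
    return fmean([sample_action(mu, sigma, rng) for _ in range(n)])


rng = random.Random(0)   # fixed seed so the run is repeatable
mu, sigma = 0.3, 0.5     # hypothetical policy mean and standard deviation

single = [sample_action(mu, sigma, rng) for _ in range(2000)]
averaged = [averaged_action(mu, sigma, rng) for _ in range(2000)]

print(pstdev(single), pstdev(averaged))
```

    If that assumption holds, calling predict with deterministic=True should give roughly the same action in a single call, since the averaging is just a noisy estimate of the policy mean.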
    In A Latest Deep Reinforcement Learning Research, Deepmind AI Team Pursues An Alternative Approach In Which RL Agents Can Utilise Large-Scale Context Sensitive Database Lookups To Support Their Parametric Computations
    DeepMind Researchers recently expressed concern about how reinforcement learning (RL) agents might use pertinent information to guide their judgments. They have published a new paper titled Large-Scale Retrieval for Reinforcement Learning, which presents a novel method that significantly increases the amount of information that reinforcement learning (RL) agents can access. This method enables RL agents to attend to millions of information pieces, incorporate new information without retraining, and learn how to use this information in their decision-making end-to-end. Gradient descent on training losses is the traditional method for helping deep reinforcement learning (RL) agents make better decisions by progressively amortizing the knowledge they learn from their experiences. However, this approach makes it difficult to adapt to unexpected conditions and necessitates the creation of ever-larger models to handle ever-more complicated contexts. There is no end-to-end solution for enabling agents to attend to information outside their working memory to guide their actions, despite adding information sources that can improve agent performance. Continue reading | Checkout the paper submitted by /u/Embarrassed-Fee5513 [link] [comments]  ( 84 min )

  • Open

    Google Insider Says Company's AI Could "Escape Control" and "Do Bad Things"
    submitted by /u/estasfuera [link] [comments]  ( 82 min )
    Anatomy of an AI System [Infographic]
    https://anatomyof.ai/img/ai-anatomy-map.pdf A beautiful infographic that explains the whole process submitted by /u/Supremefigur [link] [comments]  ( 82 min )
    SUMMER SOLSTICE OF WONDERS | FAST MODE! | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
    Who needs midjourney invites
    Giving out invites hmu! submitted by /u/Chemical-Exchange466 [link] [comments]  ( 82 min )
    ⛽️ “Petrol station on Jupiter” AI generated art created on PixlelzAI
    submitted by /u/pixelz_ai [link] [comments]  ( 82 min )
    Where can I chat with LaMDA online?
    I'm searching google, and only finding news articles, with no links to actually try the chat for myself. submitted by /u/AlbertFindShrine [link] [comments]  ( 82 min )
    Adobe and Meta Decry Misuse of User Studies in Computer Vision Research
    submitted by /u/DaveBowman1975 [link] [comments]  ( 82 min )
    A curated list of the latest breakthroughs in AI in 2022 with video demo, article, and code [work in progress]
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 82 min )
    List of remote-first AI/ML companies hiring now
    submitted by /u/ai_jobs [link] [comments]  ( 82 min )
    Video: Can a machine ever be conscious? A look from quantum physics, philosophy, and neuroscience
    submitted by /u/DavidKShapiro [link] [comments]  ( 82 min )
    Hi there! I posted here an article on Google chatbot (automated Google's Business Messages). Now I'm back with more insights on consumers and best practices with automation as I did a podcast episode with Google.
    Here's the list of questions we covered. ❓Who will benefit the most from Google's Business Messages? ❓How does Google's Business Messages differ from other solutions? (Like WhatsApp) ❓What are the most beneficial features of Google's Business Messages? ❓And for pre-purchase research/pre-sales product support? ❓What problems businesses can solve and better not solve using Google's Business Messages? ❓What questions/experience should be automated and what it's better to handle with agents? ❓What are the first steps to integrate Google's Business Messages into customer experience (CX) strategy and workflows? ❓How do you keep the human touch when automation is involved? ❓What are some "rookie mistakes" when it comes to implementing Google's Business Messages? If you found a question you're interested in, here's the link where you can read some insights and listen to the episode. Hope you'll enjoy our conversation! submitted by /u/Avandegraund [link] [comments]  ( 83 min )
    How to Implement AI self-checkout like Amazon [Podcast]
    Hey, I wanted to share with you a podcast on implementing AI-based self-checkout like Amazon. Stores where shoppers can enter, select items and simply leave the store without having to queue. Everything happens automatically. The speakers discuss how difficult it is to implement this. https://youtu.be/HV4IfiQjRTo submitted by /u/Data-Power [link] [comments]  ( 82 min )
    I'm trying out the StarryAI app. Thoughts thus far?
    submitted by /u/rikusorasephiroth [link] [comments]  ( 68 min )
    2022 We Owe AI an Apology! | Power of Artificial Intelligence (AI) in Real Life~
    submitted by /u/VanceAI-Andy [link] [comments]  ( 82 min )
    PSYCHADELIC GALAXY TRIP | GALAXY OF WONDERS | 4K 24 FPS | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
  • Open

    [D] A/B testing when there is a feedback loop
    I am experimenting with changing the label value (target) for a model that we have in production. We used to cap the target variable, and my new model will remove the cap. The main point about our production space is that there is a positive feedback loop involved. So, we expect that when we remove the cap, my model will result in a section of users having more activity. However, since most of the user traffic goes to the control arm and only a fraction goes to the experiment, the feedback loop doesn't close unless we run a 50-50% experiment (which we can't). I'm wondering if there is any way to run an A/B test and compare the production model with my model, given that the labels are shifting and the feedback loop doesn't close. Any ideas are highly appreciated. submitted by /u/Which-Distance1384 [link] [comments]  ( 84 min )
    [D] Blake Lemoine on Bloomberg
    https://www.youtube.com/watch?v=kgCUn4fQTsc Overall, I feel like his position is rather well thought out and not as crazy as I was led to believe. And he does raise some interesting points. Why is it that Google doesn't even want to come up with a framework for defining sentience especially as machines are likely to become closer to it in the coming decade? I feel like any sentient being, no matter if you're an animal or human should have some basic level of rights. IE imagine if you were a sentient ghost in a machine and knowing that any capricious researcher could unplug you if they like. That would be hell. submitted by /u/Free-Bed7814 [link] [comments]  ( 84 min )
    [D] Is it possible to make a model that will outperform a human, if the model was solely trained on that human's prior predictions?
    Say a single radiologist has a ton of images that they have labeled cancer / not cancer. Can we use the labels and those images from just the one radiologist to make a model that will be better at predicting cancer / not cancer than the radiologist? Intuitively it seems like that would not be possible unless by chance it does better, but ML/DL has a way of being able to extrapolate/generalize patterns and sometimes spot things we missed? Perhaps an ensemble of various models, or maybe that would just lead to overfitting? No particular application, just a random question I had been pondering. Appreciate any thoughts and/or references. submitted by /u/daichrony [link] [comments]  ( 86 min )
    [D] What are the interesting SOTA models released in CVPR 2022?
    Hi Reddit, CVPR 2022 wrapped up today and I haven't tracked what happened this year. What are the interesting releases of this year that I should be looking at? What new SOTA models were released? Thanks submitted by /u/yekitra [link] [comments]  ( 84 min )
    [R] Anatomy of an AI System [Infographic]
    https://anatomyof.ai/img/ai-anatomy-map.pdf A beautiful infographic explaining the whole process submitted by /u/Supremefigur [link] [comments]  ( 83 min )
    [D] Using a neural net on bag of words vector vs PCA for classification
    I have a document set that I wish to classify. I have tried with transformers, and they perform well, but the content is largely keyword driven so a lot of the attention stuff is not needed. It's a more deterministic system that needs to learn keyword combinations. So a count vectorizer over unigrams and bigrams, and then a classifier like XGBoost, seems like a good idea. The problem is that even after some pruning I get a feature vector of 26K dimensions. I'd also like to compare this to how a simple neural net handles it. I was going to apply sparse PCA to get the dimensionality down first. However, for a neural net, does it make sense to do PCA first? Isn't that what the embeddings are doing? Basically, the tasks of the PCA + classifier model are carried out by the embedding and classification layers of a neural net. Just feeding 26K dimensions to a neural net seems lazy, but if I reduce it to, say, 768 dimensions, I've basically carried out the whole embedding task before I pass it to the neural net, which limits the improvements it can make. Would a happy medium be reducing to, say, 5K dimensions and then letting the neural net take it from there? I'm in the process of testing all of this in the next couple of weeks, but curious if anyone has any experience/insight/guesses. submitted by /u/bandalorian [link] [comments]  ( 88 min )
    [D] Need opinions for GPU server build.
    Work is getting a new server for ml/deep learning. Price isn't an issue, not looking to cut down much, just wanted to make sure that I'm not overlooking anything in terms of compatibility. My main concern is the CPU, would you recommend getting more cores/higher clock, or is it fine? https://docs.google.com/spreadsheets/d/17EQ_ZLQGDuaq5ECPpH_V7HKRC8QP-2qyoqvzKuXJoWI/edit?usp=drivesdk submitted by /u/ItzDerock [link] [comments]  ( 84 min )
    [D] Loss for generating sequences of items
    Let's say you have a task where you need to generate blobs of text using an AR LM. The targets are separated in the form of [blob1], [blob2], ... where each blob contains some numbers and letters, and the order of the blobs matters. Now, a naive way would be just to train the network to generate tokens greedily. But could we do better? A greedy loss could still theoretically give us a great model, but is there another way that exploits the blob patterns? An idea I have: if we believe the model should first learn the existence of blobs and then learn the order (a fair assumption in my application), we could first find a matching between all generated blobs and target blobs and optimize the best matches only, then impose a penalty to get the order right. The order might be enforced by taking a weighted average between the greedy loss and the blob-matched loss. What do you think? submitted by /u/XtremePocket [link] [comments]  ( 85 min )
    [D] Niche ML Venues vs Top ML Conferences
    Since top ML conferences (e.g. NeurIPS, ICML, AISTATS, UAI, ICLR) are getting too large, there are quite some niche venues focusing on different subfields of ML: - Multi-disciplinary Conference on Reinforcement Learning and Decision Making (RLDM): https://rldm.org/ - Machine Learning for Health (ML4H): https://ml4health.github.io/ - Learning on Graphs Conference (LoG): https://logconference.org/ - Symposium on Advances in Approximate Bayesian Inference (AABI): http://approximateinference.org/ - International Conference on Automated Machine Learning (AutoML-Conf): https://automl.cc/ - Conference on Causal Learning and Reasoning (CLeaR): https://www.cclear.cc/ - Conference on Lifelong Learning Agents (CoLLAs): https://lifelong-ml.cc/ Some of these conferences are quite new and grew out of d…  ( 88 min )
    [P] Implementing CRF-CNN model in python
    I am trying to implement a research paper that uses CNN and CRF for page object detection. According to the research paper we have to build two neural networks (named unary and pairwise). Then the training data (a set of images) is passed in and both CNNs are trained. After that we are supposed to apply CRF. Following are the equations for CRF: https://preview.redd.it/ckrm2rzutk791.png?width=768&format=png&auto=webp&s=ed88d8705b515beaf955d09aa194fa63707f7cca U and V are unary and pairwise potentials obtained from the CNNs using the following equations: https://preview.redd.it/uahcpzgwtk791.png?width=813&format=png&auto=webp&s=bb3548539db1c9b1be3367f2ddd529f1ba32c5f3 A maximum a posteriori (MAP) strategy is used to predict the labels of line regions given a new document. MAP inference of CRFs can be formulated as the following optimization problem: https://preview.redd.it/sz0537wwtk791.png?width=273&format=png&auto=webp&s=638b88b012a0158bce017be14b7e81639199a681 The parameters of our CRFs include Unary-Net's weights and Pairwise-Net's weights and a combination coefficient vector λ of U and V. The weights of U and V (w) are learned using the SGD method. Then they are fixed and λ is learned using the Pseudo Likelihood method. I have created the neural networks but I am not able to implement the CRF part. Can someone help me implement this or suggest a Python library that makes it easier to implement? (I have tried the Python library pystruct but could not install it.) submitted by /u/Time-Archer-8103 [link] [comments]  ( 84 min )
    [P] What The Plug: An app that identifies electrical plugs
    I have built a convolutional neural network that identifies roughly 20 different plug types. I wrote most of the code with Keras on top of TensorFlow in Python. I trained the model on my personal computer using Linux and CUDA to train with my GPU. Afterwards I converted the model to a .tflite file and embedded it in a Swift app for iPhone. Machine learning and programming are not my main field of work. Actually it's my first project in both areas. During the last three years I have taught myself the principles of machine learning as well as Python and Swift. I hope some of you are interested in trying out the app. I would love to hear your feedback. The app is 100% free by the way. I just want to see people use what I have built. Here is the link to the app store: https://apps.apple.com/de/app/what-the-plug/id1613147033 submitted by /u/FundF [link] [comments]  ( 84 min )
    [R] Unpublished physics inspired ML paper from 2021 (Yang-Mills theory, differential geometry, gauge theory)
    Hi there, The purpose of this post is to share a research paper/notebook I wrote that has been mostly unread and unnoticed by others, and also to ask how to find research collaborators without participating in academia or industry. After I finished my BSc, I was deeply interested in geometric deep learning and wrote this paper [0] describing an attention mechanism using ideas from differential geometry and gauge theory commonly used in the standard model (via Yang-Mills theory). At the time, I sent the notebook/paper to every researcher in the geometric DL area that I was aware of but didn't get any replies or interest in collaboration. Without any openings and at the peak of a pandemic, I sadly had to drop the idea and get a standard software engineer job. Since then, I've seen many of the rough ideas explored and developed independently by others. For example, M. Bronstein and his collaborators have similar applications of using connections (equivalent to sheaves) and Ricci flow in Graph NNs [1]. I have more ideas that I would like to explore, but feel destined to be an outsider in this field with my work unnoticed or considered illegitimate. Is it possible for people like me to collaborate with other researchers outside of academic institutions or industry? Does anyone know of such an organization? Thanks [0] https://lukepereira.github.io/notebooks/documents/2021-moduli-attention/main.pdf [1] https://thegradient.pub/graph-neural-networks-beyond-message-passing-and-weisfeiler-lehman/ submitted by /u/japanhue [link] [comments]  ( 87 min )
    [D] Publishing two papers at the same time
    Let's say I have done some research, developed some ideas and gotten good results. But there are two main ideas that tackle different problems and don't really belong in the same paper, although there is some relationship between them. The paper of idea #2 would cite and use idea #1. What have you done in similar situations? Can you try to publish both at the same time and have a citation to the first paper that hasn't even been published yet? Post on arXiv and try to publish the first one first, then the second one? submitted by /u/optimized-adam [link] [comments]  ( 84 min )
    [D] "The uncanny valley demonstrating it's treasures and failures, studio lighting digital art", DALLE-2 prompt. An artist friend has recently been given access and I was trying to feed him prompts that 'broke' the system (e.g., Gaussian noise, one million colours, uncanny valley, etc.).
    I had some fun with DALL-E 2 last night because a friend of mine (instagram.com/photonwind/) was given access last night and was streaming, letting us feed it prompts. I wanted to break the system, find its edges, or give prompts that gave me insight into the underlying function being modelled. I tried: "Gaussian noise", "One million colours" and "The uncanny valley demonstrating it's treasures and failures, studio lighting digital art". The latter looks the most interesting to me: The uncanny valley demonstrating it's treasures and failures, studio lighting digital art That said, "One million colours" is pretty epic too: One million colours But, Gaussian noise is just broken: Gaussian noise submitted by /u/Gramious [link] [comments]  ( 85 min )
    [D] How to copy text from more than 10 previously published papers and get accepted to CVPR 2022
    Hey, check out our (!) video (parody) that presents how our E2V-SDE paper (that has been accepted to CVPR 2022) largely consists of texts that are uncredited verbatim copies from more than 10 previously published papers. Enjoy! ​ https://youtube.com/watch?v=UCmkpLduptU submitted by /u/e2v-sde-parody [link] [comments]  ( 90 min )
    [D] Anyone use self-supervised learning at work? I'm surprised at how effective it has been for me.
    I've been using this stuff for sniffing out near duplicates at work and been surprised how effective it has been! Planning to try it out on some downstream tasks in the future to see how well it does! I will say though it does take a shit ton of computing resources, but I find it really cool. submitted by /u/THE_REAL_ODB [link] [comments]  ( 88 min )
    [Discussion] Is there a way to increase the weight of a particular feature in an outlier detection method using the isolation forest algorithm?
    I'm currently working on an outlier detection method using the isolation forest algorithm on a dataset with 9 dimensions. Out of these, there is a particular dimension whose importance/significance I want to increase in the classification process. Is there a way I can do this? Thanks in advance. submitted by /u/an1_r_00dh [link] [comments]  ( 84 min )
  • Open

    Choose specific timeseries to forecast with Amazon Forecast
    Today, we’re excited to announce that Amazon Forecast offers the ability to generate forecasts on a selected subset of items. This helps you to leverage the full value of your data, and apply it selectively on your choice of items reducing the time and effort to get forecasted results. Generating a forecast on ‘all’ items of the […]  ( 5 min )
    Improve ML developer productivity with Weights & Biases: A computer vision example on Amazon SageMaker
    The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post. As more organizations use deep learning techniques such as computer vision and natural language processing, the machine learning (ML) developer persona needs scalable tooling around experiment tracking, lineage, and […]  ( 8 min )
    How Cepsa used Amazon SageMaker and AWS Step Functions to industrialize their ML projects and operate their models at scale
    This blog post is co-authored by Guillermo Ribeiro, Sr. Data Scientist at Cepsa. Machine learning (ML) has rapidly evolved from being a fashionable trend emerging from academic environments and innovation departments to becoming a key means to deliver value across businesses in every industry. This transition from experiments in laboratories to solving real-world problems in […]  ( 9 min )
    Analyze and tag assets stored in Veeva Vault PromoMats using Amazon AppFlow and Amazon AI Services
    In a previous post, we talked about analyzing and tagging assets stored in Veeva Vault PromoMats using Amazon AI services and the Veeva Vault Platform’s APIs. In this post, we explore how to use Amazon AppFlow, a fully managed integration service that enables you to securely transfer data from software as a service (SaaS) applications […]  ( 12 min )
    MLOps foundation roadmap for enterprises with Amazon SageMaker
    As enterprise businesses embrace machine learning (ML) across their organizations, manual workflows for building, training, and deploying ML models tend to become bottlenecks to innovation. To overcome this, enterprises need to shape a clear operating model defining how multiple personas, such as data scientists, data engineers, ML engineers, IT, and business stakeholders, should collaborate and […]  ( 18 min )
    Introducing Amazon CodeWhisperer, the ML-powered coding companion
    We are excited to announce Amazon CodeWhisperer, a machine learning (ML)-powered service that helps improve developer productivity by providing code recommendations based on developers’ natural comments and prior code. With CodeWhisperer, developers can simply write a comment that outlines a specific task in plain English, such as “upload a file to S3.” Based on this, […]  ( 6 min )
    Manage AutoML workflows with AWS Step Functions and AutoGluon on Amazon SageMaker
    Running machine learning (ML) experiments in the cloud can span across many services and components. The ability to structure, automate, and track ML experiments is essential to enable rapid development of ML models. With the latest advancements in the field of automated machine learning (AutoML), namely the area of ML dedicated to the automation of […]  ( 6 min )
  • Open

    Best Investment Strategies for Algorithmic Trading
    Trading can be a complicated yet rewarding activity. You can trade many types of assets: stocks, bonds, currencies, commodities, cryptocurrencies, derivatives, etc. The trading sector is enormous, leaving a lot of room for different types of trading strategies to exist, of which algorithmic trading is one of the most common. Algorithmic trading refers to trading… The post Best Investment Strategies for Algorithmic Trading appeared first on Data Science Central.  ( 21 min )
    Top Benefits Of Obtaining A Blockchain Certification
    Blockchain is the technology that allows cryptocurrency to be created. A Blockchain is a distributed digital ledger of records that is decentralized and distributed throughout a network, often public or sometimes private. These digital recordings are known as blocks, and they are used to keep track of transactions across multiple computers. The system guarantees that… The post Top Benefits Of Obtaining A Blockchain Certification appeared first on Data Science Central.  ( 20 min )
    5 Most Common Use Cases for Web Scraping
    Over recent years, web scraping has become an incredibly popular practice, the rise of this field being largely attributed to the vast amounts of data that are produced and distributed every single day. The post 5 Most Common Use Cases for Web Scraping appeared first on Data Science Central.  ( 24 min )
  • Open

    Computing zeta at even numbers
    Last year I wrote several posts about computing ζ(3) where ζ is the Riemann zeta function. For example, this post. It happens that ζ can be evaluated in closed form at positive even arguments, but there’s still a lot of mystery about zeta at positive odd arguments. There’s a way to derive ζ(2n) using contour […] Computing zeta at even numbers first appeared on John D. Cook.  ( 5 min )
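    The closed form mentioned above is usually written in terms of Bernoulli numbers: ζ(2n) = (-1)^(n+1) B_{2n} (2π)^(2n) / (2 (2n)!). The teaser is cut off before the contour derivation, so the following is only a minimal sketch of that standard formula, not the post's own code; the function names `bernoulli` and `zeta_even` are made up for illustration.

```python
from fractions import Fraction
from math import comb, factorial, pi

def bernoulli(m):
    """Return [B_0, ..., B_m] as exact fractions (convention B_1 = -1/2)."""
    B = [Fraction(1)]
    for k in range(1, m + 1):
        # Recurrence: sum_{j=0}^{k} C(k+1, j) * B_j = 0 for k >= 1.
        s = sum(comb(k + 1, j) * B[j] for j in range(k))
        B.append(-s / (k + 1))
    return B

def zeta_even(n):
    """zeta(2n) from the Bernoulli-number closed form, as a float."""
    b2n = bernoulli(2 * n)[2 * n]
    return (-1) ** (n + 1) * float(b2n) * (2 * pi) ** (2 * n) / (2 * factorial(2 * n))

# zeta(2) and zeta(4) should agree with pi^2/6 and pi^4/90.
print(zeta_even(1), pi ** 2 / 6)
print(zeta_even(2), pi ** 4 / 90)
```

    No comparable formula is known for odd arguments, which is the mystery the post alludes to.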
    Constructive Picard
    The previous post concerned the function h(z) = exp(-1/(1 – z² )). We said that the function is badly behaved near -1 and 1. How badly? The function has essential singularities at -1 and 1. This means that not only does h blow up near these points, it blows up spectacularly. Picard’s theorem says that […] Constructive Picard first appeared on John D. Cook.  ( 6 min )
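    A quick numerical look at h near z = 1 hints at why the singularity is called essential: the size of h depends strongly on the direction of approach. This is a small sketch under my own choice of step size (`eps = 0.01`) and test directions, not code from the post.

```python
import cmath

def h(z):
    """h(z) = exp(-1 / (1 - z^2)), the function from the post."""
    return cmath.exp(-1 / (1 - z * z))

eps = 0.01  # distance from the singularity at z = 1 (arbitrary choice)

inside = abs(h(1 - eps))        # approach along the real axis from inside (-1, 1)
outside = abs(h(1 + eps))       # approach along the real axis from outside
sideways = abs(h(1 + eps * 1j)) # approach parallel to the imaginary axis

print(inside)    # extremely small
print(outside)   # extremely large
print(sideways)  # moderate, close to exp(-1/4)
```

    Shrinking `eps` makes the first value tend to zero and the second grow without bound, while the third stays bounded.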
    No analytic bump
    The word “smooth” in mathematics usually means infinitely differentiable. Occasionally the word is used to mean a function has as many derivatives as necessary, but without being specific about how many derivatives that is. A function is analytic if it has a convergent power series representation at every point of its domain. An analytic function […] No analytic bump first appeared on John D. Cook.  ( 5 min )
    Bump functions
    A bump function is a smooth (i.e. infinitely differentiable) function that is positive on some open interval (a, b) and zero outside that interval. I mentioned bump functions a few weeks ago and discussed how they could be used to prevent clicks in radio transmissions. Today I ran into a twitter thread that gave a […] Bump functions first appeared on John D. Cook.  ( 5 min )
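    One standard example that fits this definition is exp(-1 / ((x - a)(b - x))) on (a, b), extended by zero elsewhere; with a = -1 and b = 1 it reduces to exp(-1 / (1 - x²)). The sketch below assumes that particular choice (the thread mentioned in the post may use a different one), and the name `bump` is only illustrative.

```python
import math

def bump(x, a=-1.0, b=1.0):
    """Smooth bump: positive on the open interval (a, b), zero outside it."""
    if x <= a or x >= b:
        return 0.0
    return math.exp(-1.0 / ((x - a) * (b - x)))

# Positive inside the interval, exactly zero at and beyond the endpoints.
for x in (-1.5, -1.0, -0.5, 0.0, 0.5, 1.0):
    print(x, bump(x))
```

    All derivatives of this function vanish at a and b, which is what lets it join the zero function smoothly.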
  • Open

    Finding NeMo: Sensory Taps NVIDIA AI for Voice and Vision Applications
    You may not know of Todd Mozer, but it’s likely you have experienced his company: It has enabled voice and vision AI for billions of consumer electronics devices worldwide. Sensory, started in 1994 from Silicon Valley, is a pioneer of compact models used in mobile devices from the industry’s giants. Today Sensory brings interactivity to Read article > The post Finding NeMo: Sensory Taps NVIDIA AI for Voice and Vision Applications appeared first on NVIDIA Blog.  ( 5 min )
    UN Satellite Centre Works With NVIDIA to Boost Sustainable Development Goals
    To foster climate action for a healthy global environment, NVIDIA is working with the United Nations Satellite Centre (UNOSAT) to apply the powers of deep learning and AI. The effort supports the UN’s 2030 Agenda for Sustainable Development, which has at its core 17 interrelated Sustainable Development Goals. These SDGs — which include “climate action” Read article > The post UN Satellite Centre Works With NVIDIA to Boost Sustainable Development Goals appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    Developing a C++ Library based on Torch
    Hi everyone, I am currently working on developing this basic library with a few algorithms implemented. So far I have implemented only DQN1D, a DQN with one-dimensional convolution operations. It's written in C++ and the environment is provided by Gym. I created bindings to interact with the environment. I am not an expert by any means and fairly inexperienced (recently graduated), so any contribution to the repo or criticism from you guys is very much welcome. I wanna use this opportunity to learn from everyone and make it a project. Repo: https://github.com/kartik2309/RLPack submitted by /u/HovercraftNo9935 [link] [comments]  ( 83 min )
    Are there any good resources to learn about natural policy gradient?
    submitted by /u/Professional_Card176 [link] [comments]  ( 82 min )
    Design of an episode/game in RL for quantitative trading?
    How should we define what is an episode (or game) in RL for quantitative trading? For example, given time series 0 - 499, the agent can either buy/hold/sell at each time step, and the episode ends at time 499. Rewards are given at each time step depending on the change in our total asset value. Or, the agent opens its position by buying or selling at some time step t0 and then closes it by taking the reverse action at another time step t1. Then the episode ends. The agent will start another episode starting from the time after t1. Reward is only given at the end of the episode depending on how much money we make. Which is better or more general? Or are there other designs? All insights or ideas would be appreciated. Thank you :) submitted by /u/Redeemo [link] [comments]  ( 83 min )
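    The two designs in the question can be written down as reward functions to make the comparison concrete. This is a rough sketch under assumptions that are mine, not the poster's: one unit traded per action, no transaction costs, and actions encoded as +1 (buy), 0 (hold), -1 (sell). The function names are hypothetical.

```python
def per_step_rewards(prices, actions):
    """Design 1: fixed-horizon episode, reward = change in total asset value per step.

    actions[t] is +1 (buy one unit), 0 (hold) or -1 (sell one unit) at price prices[t].
    """
    cash, units = 0.0, 0
    rewards = []
    for t in range(len(prices) - 1):
        value_before = cash + units * prices[t]
        units += actions[t]               # trade at the current price
        cash -= actions[t] * prices[t]
        value_after = cash + units * prices[t + 1]
        rewards.append(value_after - value_before)
    return rewards

def round_trip_reward(prices, t0, t1, direction):
    """Design 2: one open/close round trip per episode, single reward at the end.

    direction is +1 for a long position (buy at t0, sell at t1), -1 for a short one.
    """
    return direction * (prices[t1] - prices[t0])

prices = [10.0, 12.0, 11.0, 15.0]
dense = per_step_rewards(prices, [1, 0, -1])   # buy at t=0, hold, sell at t=2
sparse = round_trip_reward(prices, 0, 2, +1)   # same trade seen as one episode
print(dense, sum(dense), sparse)
```

    For the same trade the dense rewards sum to the sparse one, so under these assumptions the designs differ in how credit is spread over time rather than in total return.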
  • Open

    Taking the guesswork out of dental care with artificial intelligence
    MIT alumni-founded Overjet analyzes and annotates dental X-rays to help dentists offer more comprehensive care.  ( 8 min )
  • Open

    Score-Guided Intermediate Layer Optimization: Fast Langevin Mixing for Inverse Problems. (arXiv:2206.09104v2 [cs.LG] UPDATED)
    We prove fast mixing and characterize the stationary distribution of the Langevin Algorithm for inverting random weighted DNN generators. This result extends the work of Hand and Voroninski from efficient inversion to efficient posterior sampling. In practice, to allow for increased expressivity, we propose to do posterior sampling in the latent space of a pre-trained generative model. To achieve that, we train a score-based model in the latent space of a StyleGAN-2 and we use it to solve inverse problems. Our framework, Score-Guided Intermediate Layer Optimization (SGILO), extends prior work by replacing the sparsity regularization with a generative prior in the intermediate layer. Experimentally, we obtain significant improvements over the previous state-of-the-art, especially in the low measurement regime.
    Goal Misgeneralization in Deep Reinforcement Learning. (arXiv:2105.14111v3 [cs.LG] UPDATED)
    We study goal misgeneralization, a type of out-of-distribution generalization failure in reinforcement learning (RL). Goal misgeneralization failures occur when an RL agent retains its capabilities out-of-distribution yet pursues the wrong goal. For instance, an agent might continue to competently avoid obstacles, but navigate to the wrong place. In contrast, previous works have typically focused on capability generalization failures, where an agent fails to do anything sensible at test time. We formalize this distinction between capability and goal generalization, provide the first empirical demonstrations of goal misgeneralization, and present a partial characterization of its causes.
    XAI for Transformers: Better Explanations through Conservative Propagation. (arXiv:2202.07304v2 [cs.LG] UPDATED)
    Transformers have become an important workhorse of machine learning, with numerous applications. This necessitates the development of reliable methods for increasing their transparency. Multiple interpretability methods, often based on gradient information, have been proposed. We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction. We identify Attention Heads and LayerNorm as main reasons for such unreliable explanations and propose a more stable way for propagation through these layers. Our proposal, which can be seen as a proper extension of the well-established LRP method to Transformers, is shown both theoretically and empirically to overcome the deficiency of a simple gradient-based approach, and achieves state-of-the-art explanation performance on a broad range of Transformer models and datasets.
    A Domain-Theoretic Framework for Robustness Analysis of Neural Networks. (arXiv:2203.00295v2 [cs.LG] UPDATED)
    We present a domain-theoretic framework for validated robustness analysis of neural networks. We first analyze the global robustness of a general class of networks. Then, using the fact that Edalat's domain-theoretic L-derivative coincides with Clarke's generalized gradient, we extend our framework for attack-agnostic local robustness analysis. Our framework is ideal for designing algorithms which are correct by construction. We exemplify this claim by developing a validated algorithm for estimation of Lipschitz constant of feedforward regressors. We prove the completeness of the algorithm over differentiable networks, and also over general position ReLU networks. We obtain computability results within the framework of effectively given domains. Using our domain model, differentiable and non-differentiable networks can be analyzed uniformly. We implement our algorithm using arbitrary-precision interval arithmetic, and present the results of some experiments. Our implementation is truly validated, as it handles floating-point errors as well.
    Learning by non-interfering feedback chemical signaling in physical networks. (arXiv:2203.12098v2 [cond-mat.soft] UPDATED)
    Both non-neural and neural biological systems can learn. So rather than focusing on purely brain-like learning, efforts are underway to study learning in physical systems. Such efforts include equilibrium propagation (EP) and coupled learning (CL), which require storage of two different states (the free state and the perturbed state) during the learning process to retain information about gradients. Inspired by slime mold, we propose a new learning algorithm rooted in chemical signaling that does not require storage of two different states. Rather, the output error information is encoded in a chemical signal that diffuses into the network in a similar way as the activation/feedforward signal. The steady state feedback chemical concentration, along with the activation signal, stores the required gradient information locally. We apply our algorithm using a physical, linear flow network and test it using the Iris data set with 93% accuracy. We also prove that our algorithm performs gradient descent. Finally, in addition to comparing our algorithm directly with EP and CL, we address the biological plausibility of the algorithm.
    Do More Negative Samples Necessarily Hurt in Contrastive Learning?. (arXiv:2205.01789v2 [cs.LG] UPDATED)
    Recent investigations in noise contrastive estimation suggest, both empirically as well as theoretically, that while having more "negative samples" in the contrastive loss improves downstream classification performance initially, beyond a threshold, it hurts downstream performance due to a "collision-coverage" trade-off. But is such a phenomenon inherent in contrastive learning? We show in a simple theoretical setting, where positive pairs are generated by sampling from the underlying latent class (introduced by Saunshi et al. (ICML 2019)), that the downstream performance of the representation optimizing the (population) contrastive loss in fact does not degrade with the number of negative samples. Along the way, we give a structural characterization of the optimal representation in our framework, for noise contrastive estimation. We also provide empirical support for our theoretical results on CIFAR-10 and CIFAR-100 datasets.
    The Integration of Machine Learning into Automated Test Generation: A Systematic Literature Review. (arXiv:2206.10210v2 [cs.SE] UPDATED)
    Context: Machine learning (ML) may enable effective automated test generation. Objective: We characterize emerging research, examining testing practices, researcher goals, ML techniques applied, evaluation, and challenges. Methods: We perform a systematic literature review on a sample of 97 publications. Results: ML generates input for system, GUI, unit, performance, and combinatorial testing or improves the performance of existing generation methods. ML is also used to generate test verdicts, property-based, and expected output oracles. Supervised learning - often based on neural networks - and reinforcement learning - often based on Q-learning - are common, and some publications also employ unsupervised or semi-supervised learning. (Semi-/Un-)Supervised approaches are evaluated using both traditional testing metrics and ML-related metrics (e.g., accuracy), while reinforcement learning is often evaluated using testing metrics tied to the reward function. Conclusion: Work-to-date shows great promise, but there are open challenges regarding training data, retraining, scalability, evaluation complexity, ML algorithms employed - and how they are applied - benchmarks, and replicability. Our findings can serve as a roadmap and inspiration for researchers in this field.
    Automatic Short Math Answer Grading via In-context Meta-learning. (arXiv:2205.15219v2 [cs.CL] UPDATED)
    Automatic short answer grading is an important research direction in the exploration of how to use artificial intelligence (AI)-based tools to improve education. Current state-of-the-art approaches use neural language models to create vectorized representations of students' responses, followed by classifiers to predict the score. However, these approaches have several key limitations, including i) they use pre-trained language models that are not well-adapted to educational subject domains and/or student-generated text and ii) they almost always train one model per question, ignoring the linkage across questions and resulting in a significant model storage problem due to the size of advanced language models. In this paper, we study the problem of automatic short answer grading for students' responses to math questions and propose a novel framework for this task. First, we use MathBERT, a variant of the popular language model BERT adapted to mathematical content, as our base model and fine-tune it for the downstream task of student response grading. Second, we use an in-context learning approach that provides scoring examples as input to the language model to provide additional context information and promote generalization to previously unseen questions. We evaluate our framework on a real-world dataset of student responses to open-ended math questions and show that our framework (often significantly) outperforms existing approaches, especially for new questions that are not seen during training.
    QuAFL: Federated Averaging Can Be Both Asynchronous and Communication-Efficient. (arXiv:2206.10032v2 [cs.LG] UPDATED)
    Federated Learning (FL) is an emerging paradigm to enable the large-scale distributed training of machine learning models, while still providing privacy guarantees. In this work, we jointly address two of the main practical challenges when scaling federated optimization to large node counts: the need for tight synchronization between the central authority and individual computing nodes, and the large communication cost of transmissions between the central server and clients. Specifically, we present a new variant of the classic federated averaging (FedAvg) algorithm, which supports both asynchronous communication and communication compression. We provide a new analysis technique showing that, in spite of these system relaxations, our algorithm essentially matches the best known bounds for FedAvg, under reasonable parameter settings. On the experimental side, we show that our algorithm ensures fast practical convergence for standard federated tasks.
    The ICML 2022 Expressive Vocalizations Workshop and Competition: Recognizing, Generating, and Personalizing Vocal Bursts. (arXiv:2205.01780v2 [eess.AS] UPDATED)
    The ICML Expressive Vocalization (ExVo) Competition is focused on understanding and generating vocal bursts: laughs, gasps, cries, and other non-verbal vocalizations that are central to emotional expression and communication. ExVo 2022 includes three competition tracks using a large-scale dataset of 59,201 vocalizations from 1,702 speakers. The first, ExVo-MultiTask, requires participants to train a multi-task model to recognize expressed emotions and demographic traits from vocal bursts. The second, ExVo-Generate, requires participants to train a generative model that produces vocal bursts conveying ten different emotions. The third, ExVo-FewShot, requires participants to leverage few-shot learning incorporating speaker identity to train a model for the recognition of 10 emotions conveyed by vocal bursts. This paper describes the three tracks and provides performance measures for baseline models using state-of-the-art machine learning strategies. The baselines for the three tracks are as follows: for ExVo-MultiTask, a combined score ($S_{MTL}$), computed as the harmonic mean of Concordance Correlation Coefficient (CCC), Unweighted Average Recall (UAR), and inverted Mean Absolute Error (MAE), is at best 0.335 $S_{MTL}$; for ExVo-Generate, we report Fr\'echet inception distance (FID) scores ranging from 4.81 to 8.27 (depending on the emotion) between the training set and generated samples. We then combine the inverted FID with perceptual ratings of the generated samples ($S_{Gen}$) and obtain 0.174 $S_{Gen}$; and for ExVo-FewShot, a mean CCC of 0.444 is obtained.
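    The Concordance Correlation Coefficient and the harmonic-mean combination mentioned above can be sketched as follows. The CCC formula is the standard one; how the competition inverts MAE before combining is not stated in the abstract, so the sketch only shows the harmonic mean of three already-computed scores. The function names are assumptions of the example.

```python
import numpy as np


def concordance_cc(y_true, y_pred):
    """Standard Concordance Correlation Coefficient between two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    # Population (biased) variance and covariance, as in the usual definition.
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    return float(2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2))


def harmonic_mean(scores):
    """Harmonic mean of strictly positive scores."""
    scores = np.asarray(scores, dtype=float)
    return float(len(scores) / np.sum(1.0 / scores))


if __name__ == "__main__":
    print(concordance_cc([1, 2, 3], [1, 2, 3]))  # perfect agreement
    print(harmonic_mean([0.5, 0.5, 0.5]))
```

    Because the harmonic mean is dominated by its smallest argument, a combined score of this kind penalises a model that does poorly on any one of the three sub-metrics.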
    Explicit Explore, Exploit, or Escape ($E^4$): near-optimal safety-constrained reinforcement learning in polynomial time. (arXiv:2111.07395v2 [cs.LG] UPDATED)
    In reinforcement learning (RL), an agent must explore an initially unknown environment in order to learn a desired behaviour. When RL agents are deployed in real world environments, safety is of primary concern. Constrained Markov decision processes (CMDPs) can provide long-term safety constraints; however, the agent may violate the constraints in an effort to explore its environment. This paper proposes a model-based RL algorithm called Explicit Explore, Exploit, or Escape ($E^{4}$), which extends the Explicit Explore or Exploit ($E^{3}$) algorithm to a robust CMDP setting. $E^4$ explicitly separates exploitation, exploration, and escape CMDPs, allowing targeted policies for policy improvement across known states, discovery of unknown states, as well as safe return to known states. $E^4$ robustly optimises these policies on the worst-case CMDP from a set of CMDP models consistent with the empirical observations of the deployment environment. Theoretical results show that $E^4$ finds a near-optimal constraint-satisfying policy in polynomial time whilst satisfying safety constraints throughout the learning process. We then discuss $E^4$ as a practical algorithmic framework, including robust-constrained offline optimisation algorithms, the design of uncertainty sets for the transition dynamics of unknown states, and how to further leverage empirical observations and prior knowledge to relax some of the worst-case assumptions underlying the theory.
    Wasserstein t-SNE. (arXiv:2205.07531v2 [cs.LG] UPDATED)
    Scientific datasets often have hierarchical structure: for example, in surveys, individual participants (samples) might be grouped at a higher level (units) such as their geographical region. In these settings, the interest is often in exploring the structure on the unit level rather than on the sample level. Units can be compared based on the distance between their means; however, this ignores the within-unit distribution of samples. Here we develop an approach for exploratory analysis of hierarchical datasets using the Wasserstein distance metric that takes into account the shapes of within-unit distributions. We use t-SNE to construct 2D embeddings of the units, based on the matrix of pairwise Wasserstein distances between them. The distance matrix can be efficiently computed by approximating each unit with a Gaussian distribution, but we also provide a scalable method to compute exact Wasserstein distances. We use synthetic data to demonstrate the effectiveness of our Wasserstein t-SNE, and apply it to data from the 2017 German parliamentary election, considering polling stations as samples and voting districts as units. The resulting embedding uncovers meaningful structure in the data.
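    The Gaussian approximation mentioned above admits a closed form: for $N(m_1, S_1)$ and $N(m_2, S_2)$, $W_2^2 = \|m_1 - m_2\|^2 + \mathrm{tr}\big(S_1 + S_2 - 2 (S_1^{1/2} S_2 S_1^{1/2})^{1/2}\big)$. The sketch below evaluates this well-known formula with NumPy; it is an illustration under that Gaussian assumption, not the authors' implementation, and the helper names are hypothetical.

```python
import numpy as np


def sqrtm_psd(a):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    a = (a + a.T) / 2.0                 # symmetrise against round-off
    eigvals, eigvecs = np.linalg.eigh(a)
    eigvals = np.clip(eigvals, 0.0, None)
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T


def gaussian_w2(mean1, cov1, mean2, cov2):
    """2-Wasserstein distance between N(mean1, cov1) and N(mean2, cov2)."""
    mean1 = np.asarray(mean1, dtype=float)
    mean2 = np.asarray(mean2, dtype=float)
    cov1 = np.asarray(cov1, dtype=float)
    cov2 = np.asarray(cov2, dtype=float)
    root1 = sqrtm_psd(cov1)
    cross = sqrtm_psd(root1 @ cov2 @ root1)
    w2_sq = np.sum((mean1 - mean2) ** 2) + np.trace(cov1 + cov2 - 2.0 * cross)
    return float(np.sqrt(max(w2_sq, 0.0)))  # clamp tiny negative round-off


if __name__ == "__main__":
    eye = np.eye(2)
    print(gaussian_w2([0, 0], eye, [3, 4], eye))  # equal covariances: mean distance
```

    Evaluating `gaussian_w2` for every pair of units yields a distance matrix that could then be passed to a t-SNE implementation that accepts precomputed distances.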
    Single-Shot Optical Neural Network. (arXiv:2205.09103v2 [cs.ET] UPDATED)
    As deep neural networks (DNNs) grow to solve increasingly complex problems, they are becoming limited by the latency and power consumption of existing digital processors. For improved speed and energy efficiency, specialized analog optical and electronic hardware has been proposed, but with limited scalability (input vector length $K$ of hundreds of elements). Here, we present a scalable, single-shot-per-layer analog optical processor that uses free-space optics to reconfigurably distribute an input vector and integrated optoelectronics for static, updatable weighting and the nonlinearity -- with $K \approx 1,000$ and beyond. We experimentally test classification accuracy of the MNIST handwritten digit dataset, achieving 94.7% (ground truth 96.3%) without data preprocessing or retraining on the hardware. We also determine the fundamental upper bound on throughput ($\sim$0.9 exaMAC/s), set by the maximum optical bandwidth before significant increase in error. Our combination of wide spectral and spatial bandwidths in a CMOS-compatible system enables highly efficient computing for next-generation DNNs.
    Robust Federated Learning via Over-The-Air Computation. (arXiv:2111.01221v4 [cs.LG] UPDATED)
    This paper investigates the robustness of over-the-air federated learning to Byzantine attacks. The simple averaging of the model updates via over-the-air computation makes the learning task vulnerable to random or intended modifications of the local model updates of some malicious clients. We propose a transmission and aggregation framework that is robust to such attacks while preserving the benefits of over-the-air computation for federated learning. For the proposed robust federated learning, the participating clients are randomly divided into groups and a transmission time slot is allocated to each group. The parameter server aggregates the results of the different groups using a robust aggregation technique and conveys the result to the clients for another training round. We also analyze the convergence of the proposed algorithm. Numerical simulations confirm the robustness of the proposed approach to Byzantine attacks.
    CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-horizon Robot Manipulation Tasks. (arXiv:2112.03227v3 [cs.RO] UPDATED)
    General-purpose robots coexisting with humans in their environment must learn to relate human language to their perceptions and actions to be useful in a range of daily tasks. Moreover, they need to acquire a diverse repertoire of general-purpose skills that allow composing long-horizon tasks by following unconstrained language instructions. In this paper, we present CALVIN (Composing Actions from Language and Vision), an open-source simulated benchmark to learn long-horizon language-conditioned tasks. Our aim is to make it possible to develop agents that can solve many robotic manipulation tasks over a long horizon, from onboard sensors, and specified only via human language. CALVIN tasks are more complex in terms of sequence length, action space, and language than those in existing vision-and-language task datasets, and the benchmark supports flexible specification of sensor suites. We evaluate the agents zero-shot on novel language instructions and on novel environments and objects. We show that a baseline model based on multi-context imitation learning performs poorly on CALVIN, suggesting that there is significant room for developing innovative agents that learn to relate human language to their world models with this benchmark.
    Equivariant and Stable Positional Encoding for More Powerful Graph Neural Networks. (arXiv:2203.00199v5 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have shown great advantages in many graph-based learning tasks but often fail to predict accurately for tasks based on sets of nodes, such as link or motif prediction. Many works have recently proposed to address this problem by using random node features or node distance features. However, they suffer from either slow convergence, inaccurate prediction, or high complexity. In this work, we revisit GNNs that allow using positional features of nodes given by positional encoding (PE) techniques such as Laplacian Eigenmap, Deepwalk, etc. GNNs with PE often get criticized because they are not generalizable to unseen graphs (inductive) or stable. Here, we study these issues in a principled way and propose a provable solution, a class of GNN layers termed PEG with rigorous mathematical analysis. PEG uses separate channels to update the original node features and positional features. PEG imposes permutation equivariance w.r.t. the original node features and imposes $O(p)$ (orthogonal group) equivariance w.r.t. the positional features simultaneously, where $p$ is the dimension of the positional features used. Extensive link prediction experiments over 8 real-world networks demonstrate the advantages of PEG in generalization and scalability.
    Conditional Generative Data Augmentation for Clinical Audio Datasets. (arXiv:2203.11570v2 [cs.SD] UPDATED)
    In this work, we propose a novel data augmentation method for clinical audio datasets based on a conditional Wasserstein Generative Adversarial Network with Gradient Penalty (cWGAN-GP), operating on log-mel spectrograms. To validate our method, we created a clinical audio dataset which was recorded in a real-world operating room during Total Hip Arthroplasty (THA) procedures and contains typical sounds which resemble the different phases of the intervention. We demonstrate the capability of the proposed method to generate realistic class-conditioned samples from the dataset distribution and show that training with the generated augmented samples outperforms classical audio augmentation methods in terms of classification accuracy. The performance was evaluated using a ResNet-18 classifier which shows a mean per-class accuracy improvement of 1.70% in a 5-fold cross validation experiment using the proposed augmentation method. Because clinical data is often expensive to acquire, the development of realistic and high-quality data augmentation methods is crucial to improve the robustness and generalization capabilities of learning-based algorithms which is especially important for safety-critical medical applications. Therefore, the proposed data augmentation method is an important step towards improving the data bottleneck for clinical audio-based machine learning systems.
    Flashlight: Enabling Innovation in Tools for Machine Learning. (arXiv:2201.12465v2 [cs.LG] UPDATED)
    As the computational requirements for machine learning systems and the size and complexity of machine learning frameworks increase, essential framework innovation has become challenging. While computational needs have driven recent compiler, networking, and hardware advancements, utilization of those advancements by machine learning tools is occurring at a slower pace. This is in part due to the difficulties involved in prototyping new computational paradigms with existing frameworks. Large frameworks prioritize machine learning researchers and practitioners as end users and pay comparatively little attention to systems researchers who can push frameworks forward -- we argue that both are equally important stakeholders. We introduce Flashlight, an open-source library built to spur innovation in machine learning tools and systems by prioritizing open, modular, customizable internals and state-of-the-art, research-ready models and training setups across a variety of domains. Flashlight allows systems researchers to rapidly prototype and experiment with novel ideas in machine learning computation and has low overhead, competing with and often outperforming other popular machine learning frameworks. We see Flashlight as a tool enabling research that can benefit widely used libraries downstream and bring machine learning and systems researchers closer together. Flashlight is available at https://github.com/flashlight/flashlight .
    Stability vs Implicit Bias of Gradient Methods on Separable Data and Beyond. (arXiv:2202.13441v2 [cs.LG] UPDATED)
    An influential line of recent work has focused on the generalization properties of unregularized gradient-based learning procedures applied to separable linear classification with exponentially-tailed loss functions. The ability of such methods to generalize well has been attributed to their implicit bias towards large margin predictors, both asymptotically and in finite time. We give an additional unified explanation for this generalization and relate it to two simple properties of the optimization objective, which we refer to as realizability and self-boundedness. We introduce a general setting of unconstrained stochastic convex optimization with these properties, and analyze generalization of gradient methods through the lens of algorithmic stability. In this broader setting, we obtain sharp stability bounds for gradient descent and stochastic gradient descent which apply even for a very large number of gradient steps, and use them to derive general generalization bounds for these algorithms. Finally, as direct applications of the general bounds, we return to the setting of linear classification with separable data and establish several novel test loss and test accuracy bounds for gradient descent and stochastic gradient descent for a variety of loss functions with different tail decay rates. In some of these cases, our bounds significantly improve upon the existing generalization error bounds in the literature.
    Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process. (arXiv:2202.10589v3 [stat.ML] UPDATED)
    This paper is concerned with constructing a confidence interval for a target policy's value offline based on pre-collected observational data in infinite horizon settings. Most of the existing works assume no unmeasured variables exist that confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and technological industries. In this paper, we show that with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient off-policy value estimator that is robust to potential model misspecification and provides rigorous uncertainty quantification. Our method is justified by theoretical results and by simulated and real datasets obtained from ridesharing companies. A Python implementation of the proposed procedure is available at https://github.com/Mamba413/cope.
    COLA: Consistent Learning with Opponent-Learning Awareness. (arXiv:2203.04098v2 [cs.LG] UPDATED)
    Learning in general-sum games is unstable and frequently leads to socially undesirable (Pareto-dominated) outcomes. To mitigate this, Learning with Opponent-Learning Awareness (LOLA) introduced opponent shaping to this setting, by accounting for each agent's influence on their opponents' anticipated learning steps. However, the original LOLA formulation (and follow-up work) is inconsistent because LOLA models other agents as naive learners rather than LOLA agents. In previous work, this inconsistency was suggested as a cause of LOLA's failure to preserve stable fixed points (SFPs). First, we formalize consistency and show that higher-order LOLA (HOLA) solves LOLA's inconsistency problem if it converges. Second, we correct a claim made in the literature by Sch\"afer and Anandkumar (2019), proving that Competitive Gradient Descent (CGD) does not recover HOLA as a series expansion (and fails to solve the consistency problem). Third, we propose a new method called Consistent LOLA (COLA), which learns update functions that are consistent under mutual opponent shaping. It requires no more than second-order derivatives and learns consistent update functions even when HOLA fails to converge. However, we also prove that even consistent update functions do not preserve SFPs, contradicting the hypothesis that this shortcoming is caused by LOLA's inconsistency. Finally, in an empirical evaluation on a set of general-sum games, we find that COLA finds prosocial solutions and that it converges under a wider range of learning rates than HOLA and LOLA. We support the latter finding with a theoretical result for a simple game.
    A walk through of time series analysis on quantum computers. (arXiv:2205.00986v2 [quant-ph] UPDATED)
    Because of the rotational components on quantum circuits, some quantum neural networks based on variational circuits can be considered equivalent to the classical Fourier networks. In addition, they can be used to predict the Fourier coefficients of continuous functions. Time series data indicates a state of a variable in time. Since some time series data can also be considered as continuous functions, we can expect quantum machine learning models to do many data analysis tasks successfully on time series data. Therefore, it is important to investigate new quantum logics for temporal data processing and analyze intrinsic relationships of data on quantum computers. In this paper, we go through the quantum analogues of classical data preprocessing and forecasting with ARIMA models by using simple quantum operators requiring a small number of quantum gates. Then we discuss future directions and some of the tools/algorithms that can be used for temporal data analysis on quantum computers.
    Sequential Importance Sampling for Hybrid Model Bayesian Inference to Support Bioprocess Mechanism Learning and Robust Control. (arXiv:2205.02410v3 [stat.ML] UPDATED)
    Driven by the critical needs of biomanufacturing 4.0, we introduce a probabilistic knowledge graph hybrid model characterizing the risk- and science-based understanding of bioprocess mechanisms. It can faithfully capture the important properties, including nonlinear reactions, partially observed state, and nonstationary dynamics. Given very limited real process observations, we derive a posterior distribution quantifying model estimation uncertainty. To avoid the evaluation of intractable likelihoods, Approximate Bayesian Computation sampling with Sequential Monte Carlo (ABC-SMC) is utilized to approximate the posterior distribution. Under high stochastic and model uncertainties, it is computationally expensive to match output trajectories. Therefore, we create a linear Gaussian dynamic Bayesian network (LG-DBN) auxiliary likelihood-based ABC-SMC approach. Through matching the summary statistics driven through LG-DBN likelihood that can capture critical interactions and variations, the proposed algorithm can accelerate hybrid model inference, support process monitoring, and facilitate mechanism learning and robust control.
    Adversarial Learning with Cost-Sensitive Classes. (arXiv:2101.12372v2 [cs.LG] UPDATED)
    It is necessary to improve the performance of some special classes or to particularly protect them from attacks in adversarial learning. This paper proposes a framework combining cost-sensitive classification and adversarial learning to train a model that can distinguish between protected and unprotected classes, such that the protected classes are less vulnerable to adversarial examples. Within this framework we find an interesting phenomenon during the training of deep neural networks, called the Min-Max property: the absolute values of most parameters in the convolutional layer approach zero, while the absolute values of a few parameters become significantly larger. Based on this Min-Max property, which is formulated and analyzed from the viewpoint of random distributions, we further build a new defense model against adversarial examples to improve adversarial robustness. An advantage of the resulting model is that it performs better than the standard one and can be combined with adversarial training to achieve improved performance. It is experimentally confirmed that, regarding the average accuracy over all classes, our model is almost the same as the existing models when no attack occurs and is better than the existing models when an attack occurs. Specifically, regarding the accuracy of protected classes, the proposed model is much better than the existing models when an attack occurs.
    MaskViT: Masked Visual Pre-Training for Video Prediction. (arXiv:2206.11894v1 [cs.CV])
    The ability to predict future visual observations conditioned on past observations and motor commands can enable embodied agents to plan solutions to a variety of tasks in complex environments. This work shows that we can create good video prediction models by pre-training transformers via masked visual modeling. Our approach, named MaskViT, is based on two simple design decisions. First, for memory and training efficiency, we use two types of window attention: spatial and spatiotemporal. Second, during training, we mask a variable percentage of tokens instead of a fixed mask ratio. For inference, MaskViT generates all tokens via iterative refinement where we incrementally decrease the masking ratio following a mask scheduling function. On several datasets we demonstrate that MaskViT outperforms prior works in video prediction, is parameter efficient, and can generate high-resolution videos (256x256). Further, we demonstrate the benefits of inference speedup (up to 512x) due to iterative decoding by using MaskViT for planning on a real robot. Our work suggests that we can endow embodied agents with powerful predictive models by leveraging the general framework of masked visual modeling with minimal domain knowledge.
    Importance of Kernel Bandwidth in Quantum Machine Learning. (arXiv:2111.05451v3 [quant-ph] UPDATED)
    Quantum kernel methods are considered a promising avenue for applying quantum computers to machine learning problems. Identifying hyperparameters controlling the inductive bias of quantum machine learning models is expected to be crucial given the central role hyperparameters play in determining the performance of classical machine learning methods. In this work we introduce the hyperparameter controlling the bandwidth of a quantum kernel and show that it controls the expressivity of the resulting model. We use extensive numerical experiments with multiple quantum kernels and classical datasets to show consistent change in the model behavior from underfitting (bandwidth too large) to overfitting (bandwidth too small), with optimal generalization in between. We draw a connection between the bandwidth of classical and quantum kernels and show analogous behavior in both cases. Furthermore, we show that optimizing the bandwidth can help mitigate the exponential decay of kernel values with qubit count, which is the cause behind recent observations that the performance of quantum kernel methods decreases with qubit count. We reproduce these negative results and show that if the kernel bandwidth is optimized, the performance instead improves with growing qubit count and becomes competitive with the best classical methods.
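    The classical side of the analogy drawn above can be illustrated with a Gaussian (RBF) kernel, where the bandwidth controls how quickly kernel values decay with distance. The sketch below is classical only and makes no claim about the quantum kernels or the bandwidth parametrisation used in the paper; `rbf_kernel` is a name chosen for the example.

```python
import numpy as np


def rbf_kernel(x, bandwidth):
    """Gaussian kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2))."""
    x = np.asarray(x, dtype=float)
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


if __name__ == "__main__":
    points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    # Very small bandwidth: every point is similar only to itself (overfitting regime).
    print(np.round(rbf_kernel(points, 1e-2), 3))
    # Very large bandwidth: all points look alike (underfitting regime).
    print(np.round(rbf_kernel(points, 1e3), 3))
```

    With a tiny bandwidth the kernel matrix approaches the identity, so a kernel model can only memorise the training points; with a huge bandwidth it approaches the all-ones matrix and the model cannot distinguish inputs at all.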
    Diagnosing and Fixing Manifold Overfitting in Deep Generative Models. (arXiv:2204.07172v2 [stat.ML] UPDATED)
    Likelihood-based, or explicit, deep generative models use neural networks to construct flexible high-dimensional densities. This formulation directly contradicts the manifold hypothesis, which states that observed data lies on a low-dimensional manifold embedded in high-dimensional ambient space. In this paper we investigate the pathologies of maximum-likelihood training in the presence of this dimensionality mismatch. We formally prove that degenerate optima are achieved wherein the manifold itself is learned but not the distribution on it, a phenomenon we call manifold overfitting. We propose a class of two-step procedures consisting of a dimensionality reduction step followed by maximum-likelihood density estimation, and prove that they recover the data-generating distribution in the nonparametric regime, thus avoiding manifold overfitting. We also show that these procedures enable density estimation on the manifolds learned by implicit models, such as generative adversarial networks, hence addressing a major shortcoming of these models. Several recently proposed methods are instances of our two-step procedures; we thus unify, extend, and theoretically justify a large class of models.
    LEAN: graph-based pruning for convolutional neural networks by extracting longest chains. (arXiv:2011.06923v3 [cs.LG] UPDATED)
    Neural network pruning techniques can substantially reduce the computational cost of applying convolutional neural networks (CNNs). Common pruning methods determine which convolutional filters to remove by ranking the filters individually, i.e., without taking into account their interdependence. In this paper, we advocate the viewpoint that pruning should consider the interdependence between series of consecutive operators. We propose the LongEst-chAiN (LEAN) method that prunes CNNs by using graph-based algorithms to select relevant chains of convolutions. A CNN is interpreted as a graph, with the operator norm of each operator as distance metric for the edges. LEAN pruning iteratively extracts the highest value path from the graph to keep. In our experiments, we test LEAN pruning on several image-to-image tasks, including the well-known CamVid dataset, and a real-world X-ray CT dataset. Results indicate that LEAN pruning can result in networks with similar accuracy, while using 1.7-12x fewer convolutional filters than existing approaches.
    Keys to Accurate Feature Extraction Using Residual Spiking Neural Networks. (arXiv:2111.05955v4 [cs.LG] UPDATED)
    Spiking neural networks (SNNs) have become an interesting alternative to conventional artificial neural networks (ANNs) thanks to their temporal processing capabilities and energy-efficient implementations in neuromorphic hardware. However, the challenges involved in training SNNs have limited their performance in terms of accuracy and thus their applications. Improving learning algorithms and neural architectures for more accurate feature extraction is therefore one of the current priorities in SNN research. In this paper we present a study on the key components of modern spiking architectures. We design a spiking version of the successful residual network architecture and provide an in-depth study on the possible implementations of spiking residual connections. This study shows how, depending on the use case, the optimal residual connection implementation may vary. Additionally, we empirically compare different techniques in image classification datasets taken from the best performing networks. Our results provide a state-of-the-art guide to SNN design, which allows informed choices to be made when trying to build the optimal visual feature extractor. Finally, our network outperforms previous SNN architectures on the CIFAR-10 (94.14%) and CIFAR-100 (74.65%) datasets and matches the state of the art on DVS-CIFAR10 (72.98%), with fewer parameters than the previous state of the art and without the need for ANN-SNN conversion. Code available at https://github.com/VicenteAlex/Spiking_ResNet
    Teacher Model Fingerprinting Attacks Against Transfer Learning. (arXiv:2106.12478v2 [cs.CR] UPDATED)
    Transfer learning has become a common solution to address training data scarcity in practice. It trains a specified student model by reusing or fine-tuning early layers of a well-trained teacher model that is usually publicly available. However, besides utility improvement, the transferred public knowledge also brings potential threats to model confidentiality, and even further raises other security and privacy issues. In this paper, we present the first comprehensive investigation of the teacher model exposure threat in the transfer learning context, aiming to gain a deeper insight into the tension between public knowledge and model confidentiality. To this end, we propose a teacher model fingerprinting attack to infer the origin of a student model, i.e., the teacher model it transfers from. Specifically, we propose a novel optimization-based method to carefully generate queries to probe the student model to realize our attack. Unlike existing model reverse engineering approaches, our proposed fingerprinting method neither relies on fine-grained model outputs, e.g., posteriors, nor auxiliary information of the model architecture or training dataset. We systematically evaluate the effectiveness of our proposed attack. The empirical results demonstrate that our attack can accurately identify the model origin with few probing queries. Moreover, we show that the proposed attack can serve as a stepping stone to facilitating other attacks against machine learning models, such as model stealing.
    Matrix-wise $\ell_0$-constrained Sparse Nonnegative Least Squares. (arXiv:2011.11066v4 [cs.LG] UPDATED)
    Nonnegative least squares problems with multiple right-hand sides (MNNLS) arise in models that rely on additive linear combinations. In particular, they are at the core of most nonnegative matrix factorization algorithms and have many applications. The nonnegativity constraint is known to naturally favor sparsity, that is, solutions with few non-zero entries. However, it is often useful to further enhance this sparsity, as it improves the interpretability of the results and helps reduce noise, which leads to the sparse MNNLS problem. In this paper, as opposed to most previous works that enforce sparsity column- or row-wise, we first introduce a novel formulation for sparse MNNLS, with a matrix-wise sparsity constraint. Then, we present a two-step algorithm to tackle this problem. The first step divides sparse MNNLS into subproblems, one per column of the original problem. It then uses different algorithms to produce, either exactly or approximately, a Pareto front for each subproblem, that is, to produce a set of solutions representing different tradeoffs between reconstruction error and sparsity. The second step selects solutions among these Pareto fronts in order to build a sparsity-constrained matrix that minimizes the reconstruction error. We perform experiments on facial and hyperspectral images, and we show that our proposed two-step approach provides more accurate results than state-of-the-art sparse coding heuristics applied both column-wise and globally.
    Hermite Polynomial Features for Private Data Generation. (arXiv:2106.05042v4 [cs.LG] UPDATED)
    Kernel mean embedding is a useful tool to represent and compare probability measures. Despite its usefulness, kernel mean embedding considers infinite-dimensional features, which are challenging to handle in the context of differentially private data generation. A recent work proposes to approximate the kernel mean embedding of data distribution using finite-dimensional random features, which yields analytically tractable sensitivity. However, the number of required random features is excessively high, often ten thousand to a hundred thousand, which worsens the privacy-accuracy trade-off. To improve the trade-off, we propose to replace random features with Hermite polynomial features. Unlike the random features, the Hermite polynomial features are ordered, where the features at the low orders contain more information on the distribution than those at the high orders. Hence, a relatively low order of Hermite polynomial features can more accurately approximate the mean embedding of the data distribution compared to a significantly higher number of random features. As demonstrated on several tabular and image datasets, Hermite polynomial features seem better suited for private data generation than random Fourier features.
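    To make the idea of ordered polynomial features concrete, the sketch below computes an empirical mean of Hermite polynomial features for one-dimensional data. It uses unweighted probabilists' Hermite polynomials from NumPy as an assumption; the abstract does not specify the exact polynomial family, scaling, or privacy mechanism, so this is an illustration of the feature map only, and `hermite_mean_embedding` is a hypothetical name.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermevander


def hermite_mean_embedding(samples, order):
    """Empirical mean of Hermite features He_0(x), ..., He_order(x) over 1-D samples.

    Assumption of this sketch: unweighted probabilists' Hermite polynomials;
    no noise is added, so the result is not differentially private.
    """
    samples = np.asarray(samples, dtype=float)
    features = hermevander(samples, order)  # shape (n_samples, order + 1)
    return features.mean(axis=0)            # one value per polynomial order


if __name__ == "__main__":
    # He_0 = 1, He_1 = x, He_2 = x^2 - 1, He_3 = x^3 - 3x
    print(hermite_mean_embedding([0.0, 1.0], order=3))
```

    Because the features are indexed by polynomial order, truncating at a low `order` keeps the lowest-order terms first, which is the ordering property the abstract contrasts with unordered random features.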
    Discriminative Similarity for Data Clustering. (arXiv:2109.08675v3 [cs.LG] UPDATED)
    Similarity-based clustering methods separate data into clusters according to the pairwise similarity between the data, and the pairwise similarity is crucial for their performance. In this paper, we propose {\em Clustering by Discriminative Similarity (CDS)}, a novel method which learns discriminative similarity for data clustering. CDS learns an unsupervised similarity-based classifier from each data partition, and searches for the optimal partition of the data by minimizing the generalization error of the learnt classifiers associated with the data partitions. By generalization analysis via Rademacher complexity, the generalization error bound for the unsupervised similarity-based classifier is expressed as the sum of discriminative similarity between the data from different classes. It is proved that the derived discriminative similarity can also be induced by the integrated squared error bound for kernel density classification. In order to evaluate the performance of the proposed discriminative similarity, we propose a new clustering method using a kernel as the similarity function, CDS via unsupervised kernel classification (CDSK), with its effectiveness demonstrated by experimental results.
    Provably Efficient Model-Free Constrained RL with Linear Function Approximation. (arXiv:2206.11889v1 [cs.LG])
    We study the constrained reinforcement learning problem, in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. In contrast to existing model-based approaches or model-free methods accompanied by a `simulator', we aim to develop the first model-free, simulator-free algorithm that achieves a sublinear regret and a sublinear constraint violation even in large-scale systems. To this end, we consider the episodic constrained Markov decision processes with linear function approximation, where the transition dynamics and the reward function can be represented as a linear function of some known feature mapping. We show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret and $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ constraint violation bounds can be achieved, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps. Our bounds are attained without explicitly estimating the unknown transition model or requiring a simulator, and they depend on the state space only through the dimension of the feature mapping. Hence our bounds hold even when the number of states goes to infinity. Our main results are achieved via novel adaptations of the standard LSVI-UCB algorithms. In particular, we first introduce primal-dual optimization into the LSVI-UCB algorithm to balance between regret and constraint violation. More importantly, we replace the standard greedy selection with respect to the state-action function in LSVI-UCB with a soft-max policy. This turns out to be key in establishing uniform concentration for the constrained case via its approximation-smoothness trade-off. We also show that one can achieve even zero constraint violation while still maintaining the same order with respect to $T$.
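    The contrast between greedy selection and a soft-max policy can be illustrated with a small sketch. This is a generic soft-max over action values, assuming a single hypothetical `temperature` parameter; the exact parametrisation used in the paper, and how the reward and utility value functions are combined through the dual variable, is not given in the abstract.

```python
import math


def softmax_policy(q_values, temperature):
    """Action probabilities proportional to exp(Q / temperature)."""
    top = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def greedy_action(q_values):
    """Index of the action with the largest value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Two nearly identical value estimates (illustrative numbers only).
q_before = [1.000, 1.001]
q_after = [1.001, 1.000]

p_before = softmax_policy(q_before, temperature=1.0)
p_after = softmax_policy(q_after, temperature=1.0)
shift = abs(p_before[0] - p_after[0])  # how much the soft-max policy moved
```

    A tiny perturbation of the value estimates flips the greedy action outright, while the soft-max probabilities move only slightly. Lowering the temperature brings the soft-max closer to greedy (better approximation) at the cost of this smoothness, which is one way to read the approximation-smoothness trade-off mentioned above.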
    Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space. (arXiv:2206.11895v1 [cs.CV])
    Humans are remarkably flexible in understanding viewpoint changes because the visual cortex supports the perception of 3D structure. In contrast, most of the computer vision models that learn visual representation from a pool of 2D images often fail to generalize over novel camera viewpoints. Recently, vision architectures have shifted towards convolution-free architectures, visual Transformers, which operate on tokens derived from image patches. However, neither these Transformers nor 2D convolutional networks perform explicit operations to learn viewpoint-agnostic representation for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations. The key elements of 3DTRL include a pseudo-depth estimator and a learned camera matrix to impose geometric transformations on the tokens. These enable 3DTRL to recover the 3D positional information of the tokens from 2D patches. In practice, 3DTRL is easily plugged into a Transformer. Our experiments demonstrate the effectiveness of 3DTRL in many vision tasks including image classification, multi-view video alignment, and action recognition. The models with 3DTRL outperform their backbone Transformers in all the tasks with minimal added computation. Our project page is at https://www3.cs.stonybrook.edu/~jishang/3dtrl/3dtrl.html
    Modulating Bottom-Up and Top-Down Visual Processing via Language-Conditional Filters. (arXiv:2003.12739v3 [cs.CV] UPDATED)
    How to best integrate linguistic and perceptual processing in multi-modal tasks that involve language and vision is an important open problem. In this work, we argue that the common practice of using language in a top-down manner, to direct visual attention over high-level visual features, may not be optimal. We hypothesize that the use of language to also condition the bottom-up processing from pixels to high-level features can provide benefits to the overall performance. To support our claim, we propose a U-Net-based model and perform experiments on two language-vision dense-prediction tasks: referring expression segmentation and language-guided image colorization. We compare results where either one or both of the top-down and bottom-up visual branches are conditioned on language. Our experiments reveal that using language to control the filters for bottom-up visual processing in addition to top-down attention leads to better results on both tasks and achieves competitive performance. Our linguistic analysis suggests that bottom-up conditioning improves segmentation of objects especially when input text refers to low-level visual concepts. Code is available at https://github.com/ilkerkesen/bvpr.
    On compression rate of quantum autoencoders: Control design, numerical and experimental realization. (arXiv:2005.11149v2 [quant-ph] UPDATED)
    Quantum autoencoders, which aim to compress quantum information into a low-dimensional latent space, lie at the heart of automatic data compression in the field of quantum information. In this paper, we establish an upper bound of the compression rate for a given quantum autoencoder and present a learning control approach for training the autoencoder to achieve the maximal compression rate. The upper bound of the compression rate is theoretically proven using eigen-decomposition and matrix differentiation, and it is determined by the eigenvalues of the density matrix representation of the input states. Numerical results on 2-qubit and 3-qubit systems are presented to demonstrate how to train the quantum autoencoder to achieve the theoretically maximal compression, and the training performance using different machine learning algorithms is compared. Experimental results of a quantum autoencoder using quantum optical systems are illustrated for compressing two 2-qubit states into two 1-qubit states.
    Approximation Benefits of Policy Gradient Methods with Aggregated States. (arXiv:2007.11684v3 [cs.LG] UPDATED)
    Folklore suggests that policy gradient can be more robust to misspecification than its relative, approximate policy iteration. This paper studies the case of state-aggregated representations, where the state space is partitioned and either the policy or value function approximation is held constant over partitions. This paper shows a policy gradient method converges to a policy whose regret per-period is bounded by $\epsilon$, the largest difference between two elements of the state-action value function belonging to a common partition. With the same representation, both approximate policy iteration and approximate value iteration can produce policies whose per-period regret scales as $\epsilon/(1-\gamma)$, where $\gamma$ is a discount factor. Faced with inherent approximation error, methods that locally optimize the true decision-objective can be far more robust.
    How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers. (arXiv:2106.10270v2 [cs.CV] UPDATED)
    Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications, such as image classification, object detection and semantic image segmentation. In comparison to convolutional neural networks, the Vision Transformer's weaker inductive bias is generally found to cause an increased reliance on model regularization or data augmentation ("AugReg" for short) when training on smaller training datasets. We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget. As one result of this study we find that the combination of increased compute and AugReg can yield models with the same performance as models trained on an order of magnitude more training data: we train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
    Identify treatment effect patterns for personalised decisions. (arXiv:1906.06080v2 [stat.ME] UPDATED)
    In personalised decision making, evidence is required to determine whether an action (treatment) is suitable for an individual. Such evidence can be obtained by modelling treatment effect heterogeneity in subgroups. The existing interpretable modelling methods take a top-down approach to search for subgroups with heterogeneous treatment effects and they may miss the most specific and relevant context for an individual. In this paper, we design a \emph{Treatment effect pattern (TEP)} to represent treatment effect heterogeneity in data. To achieve an interpretable presentation of TEPs, we use a local causal structure around the outcome to explicitly show how those important variables are used in modelling. We also derive a formula for unbiasedly estimating the \emph{Conditional Average Causal Effect (CATE)} using the local structure in our problem setting. In the discovery process, we aim at minimising heterogeneity within each subgroup represented by a pattern. We propose a bottom-up search algorithm to discover the most specific patterns fitting individual circumstances the best for personalised decision making. Experiments show that the proposed method models treatment effect heterogeneity better than three other existing tree based methods in synthetic and real world data sets.
    Predicting the meal macronutrient composition from continuous glucose monitors. (arXiv:2206.11878v1 [q-bio.QM])
    Sustained high levels of blood glucose in type 2 diabetes (T2DM) can have disastrous long-term health consequences. An essential component of clinical interventions for T2DM is monitoring dietary intake to keep plasma glucose levels within an acceptable range. Yet, current techniques to monitor food intake are time intensive and error prone. To address this issue, we are developing techniques to automatically monitor food intake and the composition of those foods using continuous glucose monitors (CGMs). This article presents the results of a clinical study in which participants consumed nine standardized meals with known macronutrient amounts (carbohydrate, protein, and fat) while wearing a CGM. We built a multitask neural network to estimate the macronutrient composition from the CGM signal, and compared it against a baseline linear regression. The best prediction result comes from our proposed neural network, trained with subject-dependent data, as measured by root mean squared relative error and correlation coefficient. These findings suggest that it is possible to estimate macronutrient composition from CGM signals, opening the possibility to develop automatic techniques to track food intake.
    Quantum Approximation of Normalized Schatten Norms and Applications to Learning. (arXiv:2206.11506v1 [quant-ph])
    Efficient measures to determine similarity of quantum states, such as the fidelity metric, have been widely studied. In this paper, we address the problem of defining a similarity measure for quantum operations that can be \textit{efficiently estimated}. Given two quantum operations, $U_1$ and $U_2$, represented in their circuit forms, we first develop a quantum sampling circuit to estimate the normalized Schatten 2-norm of their difference ($\| U_1-U_2 \|_{S_2}$) with precision $\epsilon$, using only one clean qubit and one classical random variable. We prove a Poly$(\frac{1}{\epsilon})$ upper bound on the sample complexity, which is independent of the size of the quantum system. We then show that such a similarity metric is directly related to a functional definition of similarity of unitary operations using the conventional fidelity metric of quantum states ($F$): If $\| U_1-U_2 \|_{S_2}$ is sufficiently small (e.g. $ \leq \frac{\epsilon}{1+\sqrt{2(1/\delta - 1)}}$) then the fidelity of states obtained by processing the same randomly and uniformly picked pure state, $|\psi \rangle$, is as high as needed ($F({U}_1 |\psi \rangle, {U}_2 |\psi \rangle)\geq 1-\epsilon$) with probability exceeding $1-\delta$. We provide example applications of this efficient similarity metric estimation framework to quantum circuit learning tasks, such as finding the square root of a given unitary operation.
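    For small systems the quantity being estimated can be computed directly, which helps fix the definition. The sketch below is a classical brute-force calculation, not the quantum sampling circuit from the paper. It assumes the common normalisation $\|A\|_{S_2} = \sqrt{\mathrm{Tr}(A^\dagger A)/d}$ for a $d \times d$ matrix; the abstract does not spell out the normalisation, so treat that as an assumption. The function name is hypothetical.

```python
import math


def normalized_schatten2_distance(u1, u2):
    """sqrt(Tr((U1 - U2)^dagger (U1 - U2)) / d) for d x d matrices as nested lists.

    Tr(A^dagger A) equals the sum of squared magnitudes of the entries of A,
    so no explicit matrix product is needed.
    """
    d = len(u1)
    total = sum(abs(u1[i][j] - u2[i][j]) ** 2 for i in range(d) for j in range(d))
    return math.sqrt(total / d)


identity = [[1, 0], [0, 1]]
pauli_x = [[0, 1], [1, 0]]
minus_identity = [[-1, 0], [0, -1]]

same = normalized_schatten2_distance(identity, identity)          # identical operations
flip = normalized_schatten2_distance(identity, pauli_x)           # I versus X
opposite = normalized_schatten2_distance(identity, minus_identity)  # I versus -I
```

    This direct computation needs all $d^2$ matrix entries, and $d$ grows exponentially with the number of qubits. That cost is what motivates a sampling-based estimator whose sample complexity does not depend on the system size.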
    Factorization of the Partial Covariance in Singly-Connected Path Diagrams. (arXiv:2002.05226v6 [stat.ME] UPDATED)
    We extend path analysis by showing that, for a singly-connected path diagram, the partial covariance of two random variables factorizes over the nodes and edges in the path between the variables. This result allows us to determine the contribution of each node and edge to the partial covariance. It also allows us to show that Simpson's paradox cannot occur in singly-connected path diagrams.
    MHNF: Multi-hop Heterogeneous Neighborhood information Fusion graph representation learning. (arXiv:2106.09289v2 [cs.LG] UPDATED)
    The attention mechanism enables graph neural networks (GNNs) to learn the attention weights between the target node and its one-hop neighbors, thereby further improving performance. However, most existing GNNs are oriented toward homogeneous graphs, in which each layer can only aggregate the information of one-hop neighbors. Stacking multilayer networks introduces considerable noise and easily leads to over-smoothing. We propose here a multihop heterogeneous neighborhood information fusion graph representation learning method (MHNF). Specifically, we propose a hybrid metapath autonomous extraction model to efficiently extract multihop hybrid neighbors. Then, we formulate a hop-level heterogeneous information aggregation model, which selectively aggregates different-hop neighborhood information within the same hybrid metapath. Finally, a hierarchical semantic attention fusion model (HSAF) is constructed, which can efficiently integrate different-hop and different-path neighborhood information. In this fashion, this paper solves the problem of aggregating multihop neighborhood information and learning hybrid metapaths for target tasks. This mitigates the limitation of manually specifying metapaths. In addition, HSAF can extract the internal node information of the metapaths and better integrate the semantic information present at different levels. Experimental results on real datasets show that MHNF achieves the best or competitive performance against state-of-the-art baselines with only 1/10 ~ 1/100 of the parameters and computational budgets. Our code is publicly available at https://github.com/PHD-lanyu/MHNF.
    Graph Neural Networks for Temperature-Dependent Activity Coefficient Prediction of Solutes in Ionic Liquids. (arXiv:2206.11776v1 [cs.LG])
    Ionic liquids (ILs) are important solvents for sustainable processes and predicting activity coefficients (ACs) of solutes in ILs is needed. Recently, matrix completion methods (MCMs), transformers, and graph neural networks (GNNs) have shown high accuracy in predicting ACs of binary mixtures, superior to well-established models, e.g., COSMO-RS and UNIFAC. GNNs are particularly promising here as they learn a molecular graph-to-property relationship without pretraining, typically required for transformers, and are, unlike MCMs, applicable to molecules not included in training. For ILs, however, GNN applications are currently missing. Herein, we present a GNN to predict temperature-dependent infinite dilution ACs of solutes in ILs. We train the GNN on a database including more than 40,000 AC values and compare it to a state-of-the-art MCM. The GNN and MCM achieve similar high prediction performance, with the GNN additionally enabling high-quality predictions for ACs of solutions that contain ILs and solutes not considered during training.
    Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos. (arXiv:2206.11795v1 [cs.LG])
    Pretraining on noisy, internet-scale datasets has been heavily studied as a technique for training models with broad, general capabilities for text, images, and other modalities. However, for many sequential decision domains such as robotics, video games, and computer use, publicly available data does not contain the labels required to train behavioral priors in the same way. We extend the internet-scale pretraining paradigm to sequential decision domains through semi-supervised imitation learning wherein agents learn to act by watching online unlabeled videos. Specifically, we show that with a small amount of labeled data we can train an inverse dynamics model accurate enough to label a huge unlabeled source of online data -- here, online videos of people playing Minecraft -- from which we can then train a general behavioral prior. Despite using the native human interface (mouse and keyboard at 20Hz), we show that this behavioral prior has nontrivial zero-shot capabilities and that it can be fine-tuned, with both imitation learning and reinforcement learning, to hard-exploration tasks that are impossible to learn from scratch via reinforcement learning. For many tasks our models exhibit human-level performance, and we are the first to report computer agents that can craft diamond tools, which can take proficient humans upwards of 20 minutes (24,000 environment actions) of gameplay to accomplish.
    Incorporating Hidden Layer representation into Adversarial Attacks and Defences. (arXiv:2011.14045v2 [cs.LG] UPDATED)
    In this paper, we propose a defence strategy to improve adversarial robustness by incorporating hidden layer representation. The key idea of this defence strategy is to compress or filter input information, including adversarial perturbations. The defence can be regarded as an activation function that can be applied to any kind of neural network. We also prove theoretically the effectiveness of this defence strategy under certain conditions. In addition, by incorporating hidden layer representation, we propose three types of adversarial attacks to generate three types of adversarial examples, respectively. The experiments show that our defence method can significantly improve the adversarial robustness of deep neural networks, achieving state-of-the-art performance even though we do not adopt adversarial training.
    Layer-wise and Dimension-wise Locally Adaptive Federated Learning. (arXiv:2110.00532v3 [cs.LG] UPDATED)
    In the emerging paradigm of Federated Learning (FL), a large number of clients such as mobile devices are used to train possibly high-dimensional models on their respective data. Combining (dimension-wise) adaptive gradient methods (e.g. Adam, AMSGrad) with FL has been an active direction, and it has been shown to outperform traditional SGD-based FL in many cases. In this paper, we focus on the problem of training federated deep neural networks, and propose a novel FL framework which further introduces layer-wise adaptivity to the local model updates. Our framework can be applied to locally adaptive FL methods including two recent algorithms, Mime and Fed-AMS. Theoretically, we provide a convergence analysis of our layer-wise FL methods, coined Fed-LAMB and Mime-LAMB, which matches the convergence rate of state-of-the-art results in FL and exhibits linear speedup in terms of the number of workers. Experimental results on various datasets and models, under both IID and non-IID local data settings, show that both Fed-LAMB and Mime-LAMB achieve faster convergence speed and better generalization performance, compared to various recent adaptive FL methods.
    Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes. (arXiv:2206.11703v1 [eess.AS])
    The SepFormer architecture shows very good results in speech separation. Like other learned-encoder models, it uses short frames, as they have been shown to obtain better performance in these cases. This results in a large number of frames at the input, which is problematic; since the SepFormer is transformer-based, its computational complexity drastically increases with longer sequences. In this paper, we employ the SepFormer in a speech enhancement task and show that by replacing the learned-encoder features with a magnitude short-time Fourier transform (STFT) representation, we can use long frames without compromising perceptual enhancement performance. We obtained equivalent quality and intelligibility evaluation scores while reducing the number of operations by a factor of approximately 8 for a 10-second utterance.
    Measuring the Feasibility of Analogical Transfer using Complexity. (arXiv:2206.11753v1 [cs.AI])
    Analogies are 4-ary relations of the form "A is to B as C is to D". While the focus has mostly been on how to solve an analogy, i.e. how to find correct values of D given A, B and C, less attention has been paid to whether solving such an analogy is actually feasible. In this paper, we propose a quantification of the transferability of a source case (A and B) to solve a target problem C. This quantification is based on a complexity minimization principle which has been demonstrated to be efficient for solving analogies. We illustrate these notions on morphological analogies and show their connections with machine learning, and in particular with Unsupervised Domain Adaptation.
    Non-Determinism and the Lawlessness of ML Code. (arXiv:2206.11834v1 [cs.CY])
    Legal literature on machine learning (ML) tends to focus on harms, and as a result tends to reason about individual model outcomes and summary error rates. This focus on model-level outcomes and errors has masked important aspects of ML that are rooted in its inherent non-determinism. We show that the effects of non-determinism, and consequently its implications for the law, instead become clearer from the perspective of reasoning about ML outputs as probability distributions over possible outcomes. This distributional viewpoint accounts for non-determinism by emphasizing the possible outcomes of ML. Importantly, this type of reasoning is not exclusive with current legal reasoning; it complements (and in fact can strengthen) analyses concerning individual, concrete outcomes for specific automated decisions. By clarifying the important role of non-determinism, we demonstrate that ML code falls outside of the cyberlaw frame of treating "code as law," as this frame assumes that code is deterministic. We conclude with a brief discussion of what work ML can do to constrain the potentially harm-inducing effects of non-determinism, and we clarify where the law must do work to bridge the gap between its current individual-outcome focus and the distributional approach that we recommend.
    Capacity Optimality of OAMP in Coded Large Unitarily Invariant Systems. (arXiv:2206.11680v1 [cs.IT])
    This paper investigates a large unitarily invariant system (LUIS) involving a unitarily invariant sensing matrix, an arbitrary fixed signal distribution, and forward error control (FEC) coding. Several area properties are established based on the state evolution of orthogonal approximate message passing (OAMP) in an un-coded LUIS. Under the assumptions that the state evolution for joint OAMP and FEC decoding is correct and the replica method is reliable, we analyze the achievable rate of OAMP. We prove that OAMP reaches the constrained capacity predicted by the replica method of the LUIS with an arbitrary signal distribution based on matched FEC coding. Meanwhile, we elaborate a constrained capacity-achieving coding principle for LUIS, based on which irregular low-density parity-check (LDPC) codes are optimized for binary signaling in the simulation results. We show that OAMP with the optimized codes has significant performance improvement over the un-optimized ones and the well-known Turbo linear MMSE algorithm. For quadrature phase-shift keying (QPSK) modulation, constrained capacity-approaching bit error rate (BER) performances are observed under various channel conditions.
    Chasing Convex Bodies and Functions with Black-Box Advice. (arXiv:2206.11780v1 [cs.LG])
    We consider the problem of convex function chasing with black-box advice, where an online decision-maker aims to minimize the total cost of making and switching between decisions in a normed vector space, aided by black-box advice such as the decisions of a machine-learned algorithm. The decision-maker seeks cost comparable to the advice when it performs well, known as $\textit{consistency}$, while also ensuring worst-case $\textit{robustness}$ even when the advice is adversarial. We first consider the common paradigm of algorithms that switch between the decisions of the advice and a competitive algorithm, showing that no algorithm in this class can improve upon 3-consistency while staying robust. We then propose two novel algorithms that bypass this limitation by exploiting the problem's convexity. The first, INTERP, achieves $(\sqrt{2}+\epsilon)$-consistency and $\mathcal{O}(\frac{C}{\epsilon^2})$-robustness for any $\epsilon > 0$, where $C$ is the competitive ratio of an algorithm for convex function chasing or a subclass thereof. The second, BDINTERP, achieves $(1+\epsilon)$-consistency and $\mathcal{O}(\frac{CD}{\epsilon})$-robustness when the problem has bounded diameter $D$. Further, we show that BDINTERP achieves near-optimal consistency-robustness trade-off for the special case where cost functions are $\alpha$-polyhedral.
    A Topological characterisation of Weisfeiler-Leman equivalence classes. (arXiv:2206.11876v1 [cs.LG])
    Graph Neural Networks (GNNs) are learning models aimed at processing graphs and signals on graphs. The most popular and successful GNNs are based on message passing schemes. Such schemes inherently have limited expressive power when it comes to distinguishing two non-isomorphic graphs. In this article, we rely on the theory of covering spaces to fully characterize the classes of graphs that GNNs cannot distinguish. We then generate arbitrarily many non-isomorphic graphs that cannot be distinguished by GNNs, leading to the GraphCovers dataset. We also show that the number of indistinguishable graphs in our dataset grows super-exponentially with the number of nodes. Finally, we test the GraphCovers dataset on several GNN architectures, showing that none of them can distinguish any two graphs it contains.
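    The limitation described above can be seen on a classic pair of graphs. The sketch below runs 1-dimensional Weisfeiler-Leman colour refinement, the test that bounds the distinguishing power of message-passing GNNs, on a 6-cycle and on two disjoint triangles. These example graphs and the helper names are illustrative choices, not taken from the GraphCovers dataset.

```python
from collections import Counter


def wl_colour_histogram(adj, rounds=3):
    """Run 1-WL colour refinement and return the multiset of final node colours.

    Colours are kept as nested tuples instead of being renamed to integers,
    so histograms of different graphs can be compared directly.
    """
    colours = {v: 0 for v in adj}
    for _ in range(rounds):
        colours = {
            v: (colours[v], tuple(sorted(colours[u] for u in adj[v])))
            for v in adj
        }
    return Counter(colours.values())


def cycle(n, offset=0):
    """Adjacency lists of an n-cycle on nodes offset .. offset + n - 1."""
    return {offset + i: [offset + (i - 1) % n, offset + (i + 1) % n] for i in range(n)}


hexagon = cycle(6)
two_triangles = {**cycle(3), **cycle(3, offset=3)}
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}

indistinguishable = wl_colour_histogram(hexagon) == wl_colour_histogram(two_triangles)
distinguishable = wl_colour_histogram(hexagon) != wl_colour_histogram(path)
```

    Both the hexagon and the two triangles are 2-regular on six nodes, so every node sees the same neighbourhood colours in every round and the histograms coincide even though the graphs are not isomorphic. The path graph has degree-1 endpoints, so refinement separates it from the hexagon after one round.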
    AST-Probe: Recovering abstract syntax trees from hidden representations of pre-trained language models. (arXiv:2206.11719v1 [cs.CL])
    The objective of pre-trained language models is to learn contextual representations of textual data. Pre-trained language models have become mainstream in natural language processing and code modeling. Using probes, a technique to study the linguistic properties of hidden vector spaces, previous works have shown that these pre-trained language models encode simple linguistic properties in their hidden representations. However, none of the previous work assessed whether these models encode the whole grammatical structure of a programming language. In this paper, we prove the existence of a \textit{syntactic subspace}, lying in the hidden representations of pre-trained language models, which contains the syntactic information of the programming language. We show that this subspace can be extracted from the models' representations and define a novel probing method, the AST-Probe, that enables recovering the whole abstract syntax tree (AST) of an input code snippet. In our experiments, we show that this syntactic subspace exists in five state-of-the-art pre-trained language models. In addition, we highlight that the middle layers of the models are the ones that encode most of the AST information. Finally, we estimate the optimal size of this syntactic subspace and show that its dimension is substantially lower than those of the models' representation spaces. This suggests that pre-trained language models use a small part of their representation spaces to encode syntactic information of the programming languages.
    Sample Condensation in Online Continual Learning. (arXiv:2206.11849v1 [cs.LG])
    Online continual learning is a challenging learning scenario where the model must learn from a non-stationary stream of data in which each sample is seen only once. The main challenge is to learn incrementally while avoiding catastrophic forgetting, namely the problem of forgetting previously acquired knowledge while learning from new data. A popular solution in this scenario is to use a small memory to retain old data and rehearse them over time. Unfortunately, due to the limited memory size, the quality of the memory will deteriorate over time. In this paper we propose OLCGM, a novel replay-based continual learning strategy that uses knowledge condensation techniques to continuously compress the memory and achieve a better use of its limited size. The sample condensation step compresses old samples, instead of removing them like other replay strategies. As a result, the experiments show that, whenever the memory budget is limited compared to the complexity of the data, OLCGM improves the final accuracy compared to state-of-the-art replay strategies.
    Provable Acceleration of Heavy Ball beyond Quadratics for a Class of Polyak-\L{}ojasiewicz Functions when the Non-Convexity is Averaged-Out. (arXiv:2206.11872v1 [math.OC])
    Heavy Ball (HB) nowadays is one of the most popular momentum methods in non-convex optimization. It has been widely observed that incorporating the Heavy Ball dynamic in gradient-based methods accelerates the training process of modern machine learning models. However, the progress on establishing its theoretical foundation of acceleration is apparently far behind its empirical success. Existing provable acceleration results are of the quadratic or close-to-quadratic functions, as the current techniques of showing HB's acceleration are limited to the case when the Hessian is fixed. In this work, we develop some new techniques that help show acceleration beyond quadratics, which is achieved by analyzing how the change of the Hessian at two consecutive time points affects the convergence speed. Based on our technical results, a class of Polyak-\L{}ojasiewicz (PL) optimization problems for which provable acceleration can be achieved via HB is identified. Moreover, our analysis demonstrates a benefit of adaptively setting the momentum parameter.
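    For reference, the classical Heavy Ball update is $x_{t+1} = x_t - \eta \nabla f(x_t) + \beta (x_t - x_{t-1})$. The sketch below applies it to a hypothetical ill-conditioned quadratic, the setting in which acceleration is already well understood; it does not reproduce the paper's analysis beyond quadratics or its adaptive momentum. The step size, momentum value, and test function are illustrative assumptions.

```python
def heavy_ball(grad, x0, step, momentum, iters):
    """Polyak's Heavy Ball: x_next = x - step * grad(x) + momentum * (x - x_prev).

    With momentum = 0 this reduces to plain gradient descent.
    """
    x_prev = list(x0)
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x_next = [xi - step * gi + momentum * (xi - pi)
                  for xi, gi, pi in zip(x, g, x_prev)]
        x_prev, x = x, x_next
    return x


def objective(x):
    """Hypothetical test function f(x) = 0.5 * (x0^2 + 100 * x1^2)."""
    return 0.5 * (x[0] ** 2 + 100.0 * x[1] ** 2)


def gradient(x):
    return [x[0], 100.0 * x[1]]


x_gd = heavy_ball(gradient, [1.0, 1.0], step=0.01, momentum=0.0, iters=200)
x_hb = heavy_ball(gradient, [1.0, 1.0], step=0.01, momentum=0.8, iters=200)
```

    With the same step size and iteration budget, the momentum run ends at a much smaller objective value than plain gradient descent on this example, because the momentum term speeds up progress along the flat direction.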
    Single-phase deep learning in cortico-cortical networks. (arXiv:2206.11769v1 [q-bio.NC])
    The error-backpropagation (backprop) algorithm remains the most common solution to the credit assignment problem in artificial neural networks. In neuroscience, it is unclear whether the brain could adopt a similar strategy to correctly modify its synapses. Recent models have attempted to bridge this gap while being consistent with a range of experimental observations. However, these models are either unable to effectively backpropagate error signals across multiple layers or require a multi-phase learning process, neither of which is reminiscent of learning in the brain. Here, we introduce a new model, bursting cortico-cortical networks (BurstCCN), which solves these issues by integrating known properties of cortical networks, namely bursting activity, short-term plasticity (STP) and dendrite-targeting interneurons. BurstCCN relies on burst multiplexing via connection-type-specific STP to propagate backprop-like error signals within deep cortical networks. These error signals are encoded at distal dendrites and induce burst-dependent plasticity as a result of excitatory-inhibitory top-down inputs. First, we demonstrate that our model can effectively backpropagate errors through multiple layers using a single-phase learning process. Next, we show both empirically and analytically that learning in our model approximates backprop-derived gradients. Finally, we demonstrate that our model is capable of learning complex image classification tasks (MNIST and CIFAR-10). Overall, our results suggest that cortical features across sub-cellular, cellular, microcircuit and systems levels jointly underlie single-phase efficient deep learning in the brain.
    Exploiting Transliterated Words for Finding Similarity in Inter-Language News Articles using Machine Learning. (arXiv:2206.11860v1 [cs.CL])
    Finding similarities between two inter-language news articles is a challenging problem in Natural Language Processing (NLP). Because it is difficult to find similar news articles in a language other than the user's native language, there is a need for a Machine Learning based automatic system to find the similarity between two inter-language news articles. In this article, we propose a Machine Learning model combined with English-Urdu word transliteration which shows whether an English news article is similar to an Urdu news article or not. The existing approaches to finding similarities have a major drawback when the archives contain articles in low-resourced languages like Urdu along with English news articles. We used a lexicon to link Urdu and English news articles. As Urdu language processing applications such as machine translation and text to speech are unable to handle English text at the same time, this research proposes a technique to find similarities between English and Urdu news articles based on transliteration.
    LED: Latent Variable-based Estimation of Density. (arXiv:2206.11563v1 [cs.LG])
    Modern generative models are roughly divided into two main categories: (1) models that can produce high-quality random samples, but cannot estimate the exact density of new data points and (2) those that provide exact density estimation, at the expense of sample quality and compactness of the latent space. In this work we propose LED, a new generative model closely related to GANs, that allows not only efficient sampling but also efficient density estimation. By maximizing log-likelihood on the output of the discriminator, we arrive at an alternative adversarial optimization objective that encourages generated data diversity. This formulation provides insights into the relationships between several popular generative models. Additionally, we construct a flow-based generator that can compute exact probabilities for generated samples, while allowing low-dimensional latent variables as input. Our experimental results, on various datasets, show that our density estimator produces accurate estimates, while retaining good quality in the generated samples.
    Learning Agile Skills via Adversarial Imitation of Rough Partial Demonstrations. (arXiv:2206.11693v1 [cs.RO])
    Learning agile skills is one of the main challenges in robotics. To this end, reinforcement learning approaches have achieved impressive results. These methods require explicit task information in terms of a reward function or an expert that can be queried in simulation to provide a target control output, which limits their applicability. In this work, we propose a generative adversarial method for inferring reward functions from partial and potentially physically incompatible demonstrations for successful skill acquirement where reference or expert demonstrations are not easily accessible. Moreover, we show that by using a Wasserstein GAN formulation and transitions from demonstrations with rough and partial information as input, we are able to extract policies that are robust and capable of imitating demonstrated behaviors. Finally, the obtained skills such as a backflip are tested on an agile quadruped robot called Solo 8 and present faithful replication of hand-held human demonstrations.
    A Temporal Extension of Latent Dirichlet Allocation for Unsupervised Acoustic Unit Discovery. (arXiv:2206.11706v1 [eess.AS])
    Latent Dirichlet allocation (LDA) is widely used for unsupervised topic modelling on sets of documents. No temporal information is used in the model. However, there is often a relationship between the corresponding topics of consecutive tokens. In this paper, we present an extension to LDA that uses a Markov chain to model temporal information. We use this new model for acoustic unit discovery from speech. As input tokens, the model takes a discretised encoding of speech from a vector quantised (VQ) neural network with 512 codes. The goal is then to map these 512 VQ codes to 50 phone-like units (topics) in order to more closely resemble true phones. In contrast to the base LDA, which only considers how VQ codes co-occur within utterances (documents), the Markov chain LDA additionally captures how consecutive codes follow one another. This extension leads to an increase in cluster quality and phone segmentation results compared to the base LDA. Compared to a recent vector quantised neural network approach that also learns 50 units, the extended LDA model performs better in phone segmentation but worse in mutual information.
    A Multi-Policy Framework for Deep Learning-Based Fake News Detection. (arXiv:2206.11866v1 [cs.CL])
    Connectivity plays an ever-increasing role in modern society, with people all around the world having easy access to rapidly disseminated information. However, a more interconnected society enables the spread of intentionally false information. To mitigate the negative impacts of fake news, it is essential to improve detection methodologies. This work introduces Multi-Policy Statement Checker (MPSC), a framework that automates fake news detection by using deep learning techniques to analyze a statement itself and its related news articles, predicting whether it is seemingly credible or suspicious. The proposed framework was evaluated using four merged datasets containing real and fake news. Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU) and Bidirectional Encoder Representations from Transformers (BERT) models were trained to utilize both lexical and syntactic features, and their performance was evaluated. The obtained results demonstrate that a multi-policy analysis reliably identifies suspicious statements, which can be advantageous for fake news detection.
    Lifelong Learning Natural Language Processing Approach for Multilingual Data Classification. (arXiv:2206.11867v1 [cs.CL])
    The abundance of information in digital media, which in today's world is the main source of knowledge about current events for the masses, makes it possible to spread disinformation on a larger scale than ever before. Consequently, there is a need to develop novel fake news detection approaches capable of adapting to changing factual contexts and generalizing previously or concurrently acquired knowledge. To deal with this problem, we propose a lifelong learning-inspired approach, which allows for fake news detection in multiple languages and the mutual transfer of knowledge acquired in each of them. Both classical feature extractors, such as Term frequency-inverse document frequency or Latent Dirichlet Allocation, and integrated deep NLP (Natural Language Processing) BERT (Bidirectional Encoder Representations from Transformers) models paired with MLP (Multilayer Perceptron) classifier, were employed. The results of experiments conducted on two datasets dedicated to the fake news classification task (in English and Spanish, respectively), supported by statistical analysis, confirmed that utilization of additional languages could improve performance for traditional methods. Also, in some cases supplementing the deep learning method with classical ones can positively impact obtained results. The ability of models to generalize the knowledge acquired between the analyzed languages was also observed.
    Human-in-the-Loop Large-Scale Predictive Maintenance of Workstations. (arXiv:2206.11574v1 [cs.LG])
    Predictive maintenance (PdM) is the task of scheduling maintenance operations based on a statistical analysis of the system's condition. We propose a human-in-the-loop PdM approach in which a machine learning system predicts future problems in sets of workstations (computers, laptops, and servers). Our system interacts with domain experts to improve predictions and elicit their knowledge. In our approach, domain experts are included in the loop not only as providers of correct labels, as in traditional active learning, but as a source of explicit decision rule feedback. The system is automated and designed to be easily extended to novel domains, such as maintaining workstations of several organizations. In addition, we develop a simulator for reproducible experiments in a controlled environment and deploy the system in a large-scale case of real-life workstations PdM with thousands of workstations for dozens of companies.
    Deep Reinforcement Learning-Assisted Federated Learning for Robust Short-term Utility Demand Forecasting in Electricity Wholesale Markets. (arXiv:2206.11715v1 [cs.DC])
    Short-term load forecasting (STLF) plays a significant role in the operation of electricity trading markets. Considering the growing concern about data privacy, federated learning (FL) is increasingly adopted in recent research to train STLF models for utility companies (UCs). In wholesale markets, where it is not realistic for power plants (PPs) to access UCs' data directly, FL is a feasible way of obtaining an accurate STLF model for PPs. However, due to FL's distributed nature and intense competition among UCs, defects increasingly occur and lead to poor performance of the STLF model, indicating that simply adopting FL is not enough. In this paper, we propose a DRL-assisted FL approach, DEfect-AwaRe federated soft actor-critic (DearFSAC), to robustly train an accurate STLF model for PPs to forecast precise short-term utility electricity demand. Firstly, we design an STLF model based on long short-term memory (LSTM) using just historical load data and time data. Furthermore, considering the uncertainty of defect occurrence, a deep reinforcement learning (DRL) algorithm is adopted to assist FL by alleviating model degradation caused by defects. In addition, for faster convergence of FL training, an auto-encoder is designed for both dimension reduction and quality evaluation of uploaded models. In the simulations, we validate our approach on real data of Helsinki's UCs in 2019. The results show that DearFSAC outperforms all the other approaches whether or not defects occur.
    NovelCraft: A Dataset for Novelty Detection and Discovery in Open Worlds. (arXiv:2206.11736v1 [cs.CV])
    In order for artificial agents to perform useful tasks in changing environments, they must be able to both detect and adapt to novelty. However, visual novelty detection research often only evaluates on repurposed datasets such as CIFAR-10 originally intended for object classification. This practice restricts novelties to well-framed images of distinct object types. We suggest that new benchmarks are needed to represent the challenges of navigating an open world. Our new NovelCraft dataset contains multi-modal episodic data of the images and symbolic world-states seen by an agent completing a pogo-stick assembly task within a video game world. In some episodes, we insert novel objects that can impact gameplay. Novelty can vary in size, position, and occlusion within complex scenes. We benchmark state-of-the-art novelty detection and generalized category discovery models with a focus on comprehensive evaluation. Results suggest an opportunity for future research: models aware of task-specific costs of different types of mistakes could more effectively detect and adapt to novelty in open worlds.
    Self-Supervised Training with Autoencoders for Visual Anomaly Detection. (arXiv:2206.11723v1 [cs.CV])
    Deep convolutional autoencoders provide an effective tool for learning non-linear dimensionality reduction in an unsupervised way. Recently, they have been used for the task of anomaly detection in the visual domain. By optimising for the reconstruction error using anomaly-free examples, the common belief is that a trained network will have difficulty reconstructing anomalous parts during the test phase. This is usually done by controlling the capacity of the network by either reducing the size of the bottleneck layer or enforcing sparsity constraints on its activations. However, neither of these techniques explicitly penalises the reconstruction of anomalous signals, often resulting in poor detection. We tackle this problem by adapting a self-supervised learning regime which allows the use of discriminative information during training while regularising the model to focus on the data manifold by means of a modified reconstruction error, resulting in accurate detection. Unlike related approaches, the inference of the proposed method during training and prediction is very efficient, processing the whole input image in one single step. Our experiments on the MVTec Anomaly Detection dataset demonstrate high recognition and localisation performance of the proposed method. On the texture subset, in particular, our approach consistently outperforms a number of recent anomaly detection methods by a large margin.
    Urdu News Article Recommendation Model using Natural Language Processing Techniques. (arXiv:2206.11862v1 [cs.IR])
    There are several online newspapers in Urdu, but it is difficult for users to find the content they are looking for because most of them contain irrelevant data, and most users do not get what they want to retrieve. Our proposed framework helps to predict Urdu news matching users' interests and reduces the time users spend searching for news. For this purpose, NLP techniques are used for pre-processing, and then TF-IDF with cosine similarity is used to find the most similar articles and recommend news based on user preferences. Moreover, the BERT language model is also used for similarity; with BERT the similarity increases compared to TF-IDF, so the approach works better with the BERT language model and recommends news to users according to their interests. News is recommended when the similarity of the articles is above 60 percent.
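The TF-IDF plus cosine-similarity step with a 60 percent cut-off can be sketched as follows. This is a minimal standard-library illustration, not the authors' pipeline: whitespace tokenization, the smoothed IDF formula, the function names, and the English placeholder strings are all assumptions, and real Urdu text would need the pre-processing the abstract mentions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)  # smoothed IDF
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(query, articles, threshold=0.6):
    """Return articles whose similarity to the query exceeds the threshold."""
    vectors = tfidf_vectors([query] + articles)
    scored = [(cosine(vectors[0], vec), art)
              for vec, art in zip(vectors[1:], articles)]
    return [art for score, art in sorted(scored, reverse=True)
            if score > threshold]
```

A call such as `recommend(user_interest_text, candidate_articles)` returns the candidates ranked by similarity, keeping only those above the 0.6 threshold taken from the abstract.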
    Open-source FPGA-ML codesign for the MLPerf Tiny Benchmark. (arXiv:2206.11791v1 [cs.LG])
    We present our development experience and recent results for the MLPerf Tiny Inference Benchmark on field-programmable gate array (FPGA) platforms. We use the open-source hls4ml and FINN workflows, which aim to democratize AI-hardware codesign of optimized neural networks on FPGAs. We present the design and implementation process for the keyword spotting, anomaly detection, and image classification benchmark tasks. The resulting hardware implementations are quantized, configurable, spatial dataflow architectures tailored for speed and efficiency and introduce new generic optimizations and common workflows developed as a part of this work. The full workflow is presented from quantization-aware training to FPGA implementation. The solutions are deployed on system-on-chip (Pynq-Z2) and pure FPGA (Arty A7-100T) platforms. The resulting submissions achieve latencies as low as 20 $\mu$s and energy consumption as low as 30 $\mu$J per inference. We demonstrate how emerging ML benchmarks on heterogeneous hardware platforms can catalyze collaboration and the development of new techniques and more accessible tools.
    Video Diffusion Models. (arXiv:2204.03458v2 [cs.CV] UPDATED)
    Generating temporally coherent high fidelity video is an important milestone in generative modeling research. We make progress towards this milestone by proposing a diffusion model for video generation that shows very promising initial results. Our model is a natural extension of the standard image diffusion architecture, and it enables jointly training from image and video data, which we find to reduce the variance of minibatch gradients and speed up optimization. To generate long and higher resolution videos we introduce a new conditional sampling technique for spatial and temporal video extension that performs better than previously proposed methods. We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on established benchmarks for video prediction and unconditional video generation. Supplementary material is available at https://video-diffusion.github.io/
    pyKT: A Python Library to Benchmark Deep Learning based Knowledge Tracing Models. (arXiv:2206.11460v1 [cs.LG])
    Knowledge tracing (KT) is the task of using students' historical learning interaction data to model their knowledge mastery over time so as to make predictions on their future interaction performance. Recently, remarkable progress has been made in using various deep learning techniques to solve the KT problem. However, the success behind deep learning based knowledge tracing (DLKT) approaches is still left somewhat mysterious and proper measurement and analysis of these DLKT approaches remain a challenge. First, data preprocessing procedures in existing works are often private and/or custom, which limits experimental standardization. Furthermore, existing DLKT studies often differ in terms of the evaluation protocol and are far away from real-world educational contexts. To address these problems, we introduce a comprehensive Python based benchmark platform, pyKT, to guarantee valid comparisons across DLKT methods via thorough evaluations. The pyKT library consists of a standardized set of integrated data preprocessing procedures on 7 popular datasets across different domains, and 10 frequently compared DLKT model implementations for transparent experiments. Results from our fine-grained and rigorous empirical KT studies yield a set of observations and suggestions for effective DLKT, e.g., a wrong evaluation setting may cause label leakage that generally leads to performance inflation; and the improvement of many DLKT approaches is minimal compared to the very first DLKT model proposed by Piech et al. We have open sourced pyKT and our experimental results at https://pykt.org/. We welcome contributions from other research groups and practitioners.
    Explanatory causal effects for model agnostic explanations. (arXiv:2206.11529v1 [cs.LG])
    This paper studies the problem of estimating the contributions of features to the prediction of a specific instance by a machine learning model and the overall contribution of a feature to the model. The causal effect of a feature (variable) on the predicted outcome reflects the contribution of the feature to a prediction very well. A challenge is that most existing causal effects cannot be estimated from data without a known causal graph. In this paper, we define an explanatory causal effect based on a hypothetical ideal experiment. The definition brings several benefits to model agnostic explanations. First, explanations are transparent and have causal meanings. Second, the explanatory causal effect estimation can be data driven. Third, the causal effects provide both a local explanation for a specific prediction and a global explanation showing the overall importance of a feature in a predictive model. We further propose a method using individual and combined variables based on explanatory causal effects for explanations. We show the definition and the method work with experiments on some real-world data sets.
    Sufficient Statistic Memory Approximate Message Passing. (arXiv:2206.11674v1 [cs.IT])
    Approximate message passing (AMP) type algorithms have been widely used in the signal reconstruction of certain large random linear systems. A key feature of the AMP-type algorithms is that their dynamics can be correctly described by state evolution. However, state evolution does not necessarily guarantee the convergence of iterative algorithms. To solve the convergence problem of AMP-type algorithms in principle, this paper proposes a memory AMP (MAMP) under a sufficient statistic condition, named sufficient statistic MAMP (SS-MAMP). We show that the covariance matrices of SS-MAMP are L-banded and convergent. Given an arbitrary MAMP, we can construct the SS-MAMP by damping, which not only ensures the convergence, but also preserves the orthogonality, i.e., its dynamics can be correctly described by state evolution.
    Propagation with Adaptive Mask then Training for Node Classification on Attributed Networks. (arXiv:2206.10142v2 [cs.LG] UPDATED)
    Node classification on attributed networks is a semi-supervised task that is crucial for network analysis. By decoupling two critical operations in Graph Convolutional Networks (GCNs), namely feature transformation and neighborhood aggregation, some recent works on decoupled GCNs allow information to propagate deeper and achieve advanced performance. However, they follow the traditional structure-aware propagation strategy of GCNs, making it hard to capture the attribute correlation of nodes and leaving them sensitive to structure noise, i.e., edges whose two endpoints belong to different categories. To address these issues, we propose a new method called Propagation with Adaptive Mask then Training (PAMT). The key idea is to integrate the attribute similarity mask into the structure-aware propagation process. In this way, PAMT can preserve the attribute correlation of adjacent nodes during the propagation and effectively reduce the influence of structure noise. Moreover, we develop an iterative refinement mechanism to update the similarity mask during the training process for improving the training performance. Extensive experiments on four real-world datasets demonstrate the superior performance and robustness of PAMT.
    Low-Rank Mirror-Prox for Nonsmooth and Low-Rank Matrix Optimization Problems. (arXiv:2206.11523v1 [math.OC])
    Low-rank and nonsmooth matrix optimization problems capture many fundamental tasks in statistics and machine learning. While significant progress has been made in recent years in developing efficient methods for smooth low-rank optimization problems that avoid maintaining high-rank matrices and computing expensive high-rank SVDs, advances for nonsmooth problems have been slow paced. In this paper we consider standard convex relaxations for such problems. Mainly, we prove that under a strict complementarity condition and under the relatively mild assumption that the nonsmooth objective can be written as a maximum of smooth functions, approximated variants of two popular mirror-prox methods: the Euclidean extragradient method and mirror-prox with matrix exponentiated gradient updates, when initialized with a "warm-start", converge to an optimal solution with rate $O(1/t)$, while requiring only two low-rank SVDs per iteration. Moreover, for the extragradient method we also consider relaxed versions of strict complementarity which yield a trade-off between the rank of the SVDs required and the radius of the ball in which we need to initialize the method. We support our theoretical results with empirical experiments on several nonsmooth low-rank matrix recovery tasks, demonstrating both the plausibility of the strict complementarity assumption, and the efficient convergence of our proposed low-rank mirror-prox variants.
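As background for the Euclidean extragradient method named above, here is a minimal sketch of the plain, unconstrained iteration on a toy saddle-point problem. It is not the paper's low-rank variant (there are no SVDs or projections here); the function name, the bilinear objective f(x, y) = x * y, and the step size are assumptions for illustration only.

```python
def extragradient(operator, z0, step, iters):
    """Plain Euclidean extragradient for a monotone operator.

    operator: maps a point z (list of floats) to F(z). For a saddle
    problem min_x max_y f(x, y), F(z) = [df/dx, -df/dy].
    """
    z = list(z0)
    for _ in range(iters):
        # Extrapolation ("look-ahead") step using the operator at z.
        g = operator(z)
        z_half = [zi - step * gi for zi, gi in zip(z, g)]
        # Actual update from z, using the operator at the look-ahead point.
        g_half = operator(z_half)
        z = [zi - step * gi for zi, gi in zip(z, g_half)]
    return z


# Hypothetical toy problem: f(x, y) = x * y, saddle point at (0, 0).
def bilinear_operator(z):
    x, y = z
    return [y, -x]


z_final = extragradient(bilinear_operator, [1.0, 1.0], step=0.5, iters=200)
```

The two operator evaluations per iteration correspond to the two SVDs per iteration mentioned in the abstract, although in the paper those evaluations act on matrices rather than on a two-dimensional toy vector.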
    Prototype-Anchored Learning for Learning with Imperfect Annotations. (arXiv:2206.11602v1 [cs.LG])
    The success of deep neural networks greatly relies on the availability of large amounts of high-quality annotated data, which however are difficult or expensive to obtain. The resulting labels may be class imbalanced, noisy or human biased. It is challenging to learn unbiased classification models from imperfectly annotated datasets, on which we usually suffer from overfitting or underfitting. In this work, we thoroughly investigate the popular softmax loss and margin-based loss, and offer a feasible approach to tighten the generalization error bound by maximizing the minimal sample margin. We further derive the optimality condition for this purpose, which indicates how the class prototypes should be anchored. Motivated by theoretical analysis, we propose a simple yet effective method, namely prototype-anchored learning (PAL), which can be easily incorporated into various learning-based classification schemes to handle imperfect annotation. We verify the effectiveness of PAL on class-imbalanced learning and noise-tolerant learning by extensive experiments on synthetic and real-world datasets.
    Invariant Causal Mechanisms through Distribution Matching. (arXiv:2206.11646v1 [cs.LG])
    Learning representations that capture the underlying data generating process is a key problem for data efficient and robust use of neural networks. One key property for robustness which the learned representation should capture and which recently received a lot of attention is described by the notion of invariance. In this work we provide a causal perspective and new algorithm for learning invariant representations. Empirically we show that this algorithm works well on a diverse set of tasks and in particular we observe state-of-the-art performance on domain generalization, where we are able to significantly boost the score of existing models.
    Backward baselines: Is your model predicting the past?. (arXiv:2206.11673v1 [cs.LG])
    When does a machine learning model predict the future of individuals and when does it recite patterns that predate the individuals? In this work, we propose a distinction between these two pathways of prediction, supported by theoretical, empirical, and normative arguments. At the center of our proposal is a family of simple and efficient statistical tests, called backward baselines, that demonstrate if, and to which extent, a model recounts the past. Our statistical theory provides guidance for interpreting backward baselines, establishing equivalences between different baselines and familiar statistical concepts. Concretely, we derive a meaningful backward baseline for auditing a prediction system as a black box, given only background variables and the system's predictions. Empirically, we evaluate the framework on different prediction tasks derived from longitudinal panel surveys, demonstrating the ease and effectiveness of incorporating backward baselines into the practice of machine learning.
    On a class of geodesically convex optimization problems solved via Euclidean MM methods. (arXiv:2206.11426v1 [math.OC])
    We study geodesically convex (g-convex) problems that can be written as a difference of Euclidean convex functions. This structure arises in several optimization problems in statistics and machine learning, e.g., for matrix scaling, M-estimators for covariances, and Brascamp-Lieb inequalities. Our work offers efficient algorithms that on the one hand exploit g-convexity to ensure global optimality along with guarantees on iteration complexity. On the other hand, the split structure permits us to develop Euclidean Majorization-Minorization algorithms that help us bypass the need to compute expensive Riemannian operations such as exponential maps and parallel transport. We illustrate our results by specializing them to a few concrete optimization problems that have been previously studied in the machine learning literature. Ultimately, we hope our work helps motivate the broader search for mixed Euclidean-Riemannian optimization algorithms.
    On Pre-Training for Federated Learning. (arXiv:2206.11488v1 [cs.LG])
    In most of the literature on federated learning (FL), neural networks are initialized with random weights. In this paper, we present an empirical study on the effect of pre-training on FL. Specifically, we aim to investigate if pre-training can alleviate the drastic accuracy drop when clients' decentralized data are non-IID. We focus on FedAvg, the fundamental and most widely used FL algorithm. We found that pre-training does largely close the gap between FedAvg and centralized learning under non-IID data, but this does not come from alleviating the well-known model drifting problem in FedAvg's local training. Instead, how pre-training helps FedAvg is by making FedAvg's global aggregation more stable. When pre-training using real data is not feasible for FL, we propose a novel approach to pre-train with synthetic data. On various image datasets (including one for segmentation), our approach with synthetic pre-training leads to a notable gain, essentially a critical step toward scaling up federated learning for real-world applications.
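The global aggregation step of FedAvg referred to above can be sketched as a data-size-weighted average of client parameters. This is a minimal illustration, not the authors' code: representing a model as a flat list of floats and the names `fedavg_aggregate`, `client_weights`, and `client_sizes` are assumptions.

```python
def fedavg_aggregate(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local dataset size.

    client_weights: list of parameter vectors (lists of floats), one per client.
    client_sizes: number of local training examples held by each client.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Two hypothetical clients holding 1 and 3 examples respectively.
global_weights = fedavg_aggregate([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

In a full round, every client would start local training from the current global weights (random or pre-trained at round zero, which is the choice the abstract studies) before this aggregation is applied.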
    Investigation of stellar magnetic activity using variational autoencoder based on low-resolution spectroscopic survey. (arXiv:2206.07257v2 [astro-ph.SR] CROSS LISTED)
    We apply the variational autoencoder (VAE) to the LAMOST-K2 low-resolution spectra to detect the magnetic activity of the stars in the K2 field. After the training on the spectra of the selected inactive stars, the VAE model can efficiently generate the synthetic reference templates needed by the spectral subtraction procedure, without knowing any stellar parameters. Then we detect the peculiar spectral features, such as chromospheric emissions, strong nebular emissions and lithium absorptions, in our sample. We measure the emissions of the chromospheric activity indicators, H$\alpha$ and Ca II infrared triplet (IRT) lines, to quantify the stellar magnetic activity. The excess emissions of H$\alpha$ and Ca II IRT lines of the active stars correlate well with the rotational periods and the amplitudes of light curves derived from the K2 photometry. We degrade the LAMOST spectra to simulate the slitless spectra of the planned China Space Station Telescope (CSST) and apply the VAE to the simulated data. For cool active stars, we reveal a good agreement between the equivalent widths (EWs) of the H$\alpha$ line derived from the spectra with two resolutions. The result indicates the ability to identify magnetically active stars in the future CSST survey, which will deliver an unprecedentedly large database of low-resolution spectra as well as simultaneous multi-band photometry of stars.
    CGAR: Critic Guided Action Redistribution in Reinforcement Learning. (arXiv:2206.11494v1 [cs.LG])
    Training a game-playing reinforcement learning agent requires multiple interactions with the environment. Uninformed random exploration can waste time and resources, so it is essential to reduce such waste. As discussed in this paper, under the settings of the off-policy actor critic algorithms, we demonstrate that the critic can bring expected discounted rewards greater than or at least equal to those of the actor. Thus, the Q value predicted by the critic is a better signal to redistribute the action originally sampled from the policy distribution predicted by the actor. This paper introduces the novel Critic Guided Action Redistribution (CGAR) algorithm and tests it on the OpenAI MuJoCo tasks. The experimental results demonstrate that our method improves the sample efficiency and achieves state-of-the-art performance. Our code can be found at https://github.com/tairanhuang/CGAR.
    Walk the Random Walk: Learning to Discover and Reach Goals Without Supervision. (arXiv:2206.11733v1 [cs.LG])
    Learning a diverse set of skills by interacting with an environment without any external supervision is an important challenge. In particular, obtaining a goal-conditioned agent that can reach any given state is useful in many applications. We propose a novel method for training such a goal-conditioned agent without any external rewards or any domain knowledge. We use random walks to train a reachability network that predicts the similarity between two states. This reachability network is then used in building a goal memory containing past observations that are diverse and well-balanced. Finally, we train a goal-conditioned policy network with goals sampled from the goal memory and reward it using the reachability network and the goal memory. All the components are kept updated throughout training as the agent discovers and learns new goals. We apply our method to continuous control navigation and robotic manipulation tasks.
    Gradual Domain Adaptation via Normalizing Flows. (arXiv:2206.11492v1 [stat.ML])
    Conventional domain adaptation methods do not work well when a large gap exists between the source and the target domain. Gradual domain adaptation is one of the approaches to address the problem by leveraging the intermediate domain, which gradually shifts from the source to the target domain. The previous work assumed that the number of the intermediate domains is large and the distance of the adjacent domains is small; hence, the gradual domain adaptation algorithm by self-training with unlabeled datasets was applicable. In practice, however, gradual self-training will fail because the number of the intermediate domains is limited, and the distance of the adjacent domains is large. We propose using normalizing flows to mitigate this problem while maintaining the framework of unsupervised domain adaptation. We generate pseudo intermediate domains from normalizing flows and then use them for gradual domain adaptation. We evaluate our method by experiments with real-world datasets and confirm that our proposed method mitigates the above explained problem and improves the classification performance.
    Rethinking Collaborative Metric Learning: Toward an Efficient Alternative without Negative Sampling. (arXiv:2206.11549v1 [cs.LG])
    The recently proposed Collaborative Metric Learning (CML) paradigm has aroused wide interest in the area of recommendation systems (RS) owing to its simplicity and effectiveness. Typically, the existing literature on CML depends largely on the \textit{negative sampling} strategy to alleviate the time-consuming burden of pairwise computation. However, in this work, through a theoretical analysis, we find that negative sampling leads to a biased estimation of the generalization error. Specifically, we show that the sampling-based CML introduces a bias term in the generalization bound, which is quantified by the per-user \textit{Total Variance} (TV) between the distribution induced by negative sampling and the ground truth distribution. This suggests that optimizing the sampling-based CML loss function does not ensure a small generalization error even with sufficiently large training data. Moreover, we show that the bias term vanishes without the negative sampling strategy. Motivated by this, we propose an efficient alternative without negative sampling for CML named \textit{Sampling-Free Collaborative Metric Learning} (SFCML), to get rid of the sampling bias in a practical sense. Finally, comprehensive experiments over seven benchmark datasets speak to the superiority of the proposed algorithm.
    Authentication of Copy Detection Patterns under Machine Learning Attacks: A Supervised Approach. (arXiv:2206.11793v1 [cs.CR])
    Copy detection patterns (CDP) are an attractive technology that allows manufacturers to defend their products against counterfeiting. The main assumption behind the protection mechanism of CDP is that these codes printed with the smallest symbol size (1x1) on an industrial printer cannot be copied or cloned with sufficient accuracy due to data processing inequality. However, previous works have shown that Machine Learning (ML) based attacks can produce high-quality fakes, resulting in decreased accuracy of authentication based on traditional feature-based authentication systems. While Deep Learning (DL) can be used as a part of the authentication system, to the best of our knowledge, none of the previous works has studied the performance of a DL-based authentication system against ML-based attacks on CDP with 1x1 symbol size. In this work, we study such a performance assuming a supervised learning (SL) setting.
    Community Recovery in the Geometric Block Model. (arXiv:2206.11303v1 [cs.SI])
    To capture inherent geometric features of many community detection problems, we propose to use a new random graph model of communities that we call a \emph{Geometric Block Model}. The geometric block model builds on the \emph{random geometric graphs} (Gilbert, 1961), one of the basic models of random graphs for spatial networks, in the same way that the well-studied stochastic block model builds on the Erd\H{o}s-R\'{e}nyi random graphs. It is also a natural extension of random community models inspired by the recent theoretical and practical advancements in community detection. To analyze the geometric block model, we first provide new connectivity results for \emph{random annulus graphs}, which are generalizations of random geometric graphs. The connectivity properties of geometric graphs have been studied since their introduction, and analyzing them has been difficult due to correlated edge formation. We then use the connectivity results of random annulus graphs to provide necessary and sufficient conditions for efficient recovery of communities for the geometric block model. We show that a simple triangle-counting algorithm to detect communities in the geometric block model is near-optimal. For this we consider two regimes of graph density. In the regime where the average degree of the graph grows logarithmically with the number of vertices, we show that our algorithm performs extremely well, both theoretically and practically. In contrast, the triangle-counting algorithm is far from optimal for the stochastic block model in the logarithmic degree regime. We also look at the regime where the average degree of the graph grows linearly with the number of vertices $n$, and hence to store the graph one needs $\Theta(n^2)$ memory. We show that our algorithm needs to store only $O(n \log n)$ edges in this regime to recover the latent communities.
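    The triangle-counting idea in this abstract can be illustrated with a rough sketch: count the common neighbors of each edge's endpoints (the triangles the edge takes part in), keep only edges above a threshold, and read communities off the connected components. The threshold `min_triangles`, the function names, and the component step are assumptions for illustration; the abstract does not state the paper's actual thresholds or decision rule.

```python
from collections import defaultdict


def common_neighbor_counts(edges):
    """Return, for every edge (u, v), the number of triangles containing it."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return {(u, v): len(adj[u] & adj[v]) for u, v in edges}


def communities_by_triangles(edges, min_triangles):
    """Keep edges in at least `min_triangles` triangles (hypothetical threshold),
    then return the connected components of the kept edges."""
    counts = common_neighbor_counts(edges)
    kept = defaultdict(set)
    nodes = set()
    for u, v in edges:
        nodes.update((u, v))
        if counts[(u, v)] >= min_triangles:
            kept[u].add(v)
            kept[v].add(u)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # depth-first search over kept edges only
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(kept[node] - comp)
        seen |= comp
        components.append(comp)
    return components


# Two triangles joined by a single bridging edge (2, 3) that lies in no triangle.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_communities = communities_by_triangles(toy_edges, min_triangles=1)
print(toy_communities)
```

    In this toy graph the bridging edge is discarded because its endpoints share no neighbor, so the two triangles come out as separate communities.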
    Modular Conformal Calibration. (arXiv:2206.11468v1 [cs.LG])
    Uncertainty estimates must be calibrated (i.e., accurate) and sharp (i.e., informative) in order to be useful. This has motivated a variety of methods for recalibration, which use held-out data to turn an uncalibrated model into a calibrated model. However, the applicability of existing methods is limited due to their assumption that the original model is also a probabilistic model. We introduce a versatile class of algorithms for recalibration in regression that we call Modular Conformal Calibration (MCC). This framework allows one to transform any regression model into a calibrated probabilistic model. The modular design of MCC allows us to make simple adjustments to existing algorithms that enable well-behaved distribution predictions. We also provide finite-sample calibration guarantees for MCC algorithms. Our framework recovers isotonic recalibration, conformal calibration, and conformal interval prediction, implying that our theoretical results apply to those methods as well. Finally, we conduct an empirical study of MCC on 17 regression datasets. Our results show that new algorithms designed in our framework achieve near-perfect calibration and improve sharpness relative to existing methods.
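    Of the methods this abstract says the framework recovers, conformal interval prediction is the easiest to show compactly. The sketch below is the textbook split-conformal construction with absolute residuals, not the MCC algorithm itself; the function names and the toy calibration data are assumptions for illustration.

```python
import math


def conformal_quantile(residuals, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest residual.

    If that rank exceeds n, the interval must be unbounded to keep the guarantee.
    """
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(residuals)[rank - 1]


def conformal_interval(prediction, calibration_pairs, alpha):
    """Build a symmetric interval around a point prediction.

    calibration_pairs: held-out (predicted, observed) pairs not used for training.
    """
    residuals = [abs(observed - predicted) for predicted, observed in calibration_pairs]
    q = conformal_quantile(residuals, alpha)
    return prediction - q, prediction + q


# Seven held-out points whose absolute residuals are 1, 2, ..., 7.
calibration = [(0.0, float(r)) for r in range(1, 8)]
interval = conformal_interval(10.0, calibration, alpha=0.25)
print(interval)
```

    Any regression model can supply the point prediction, which reflects the abstract's point that the original model need not be probabilistic.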
    Improved Regret for Differentially Private Exploration in Linear MDP. (arXiv:2202.01292v2 [cs.LG] UPDATED)
    We study privacy-preserving exploration in sequential decision-making for environments that rely on sensitive data such as medical records. In particular, we focus on solving the problem of reinforcement learning (RL) subject to the constraint of (joint) differential privacy in the linear MDP setting, where both dynamics and rewards are given by linear functions. Prior work on this problem due to Luyo et al. (2021) achieves a regret rate that has a dependence of $O(K^{3/5})$ on the number of episodes $K$. We provide a private algorithm with an improved regret rate with an optimal dependence of $O(\sqrt{K})$ on the number of episodes. The key recipe for our stronger regret guarantee is the adaptivity in the policy update schedule, in which an update only occurs when sufficient changes in the data are detected. As a result, our algorithm benefits from low switching cost and only performs $O(\log(K))$ updates, which greatly reduces the amount of privacy noise. Finally, in the most prevalent privacy regimes where the privacy parameter $\epsilon$ is a constant, our algorithm incurs negligible privacy cost -- in comparison with the existing non-private regret bounds, the additional regret due to privacy appears in lower-order terms.
    Patient Aware Active Learning for Fine-Grained OCT Classification. (arXiv:2206.11485v1 [eess.IV])
    This paper considers making active learning more sensible from a medical perspective. In practice, a disease manifests itself in different forms across patient cohorts. Existing frameworks have primarily used mathematical constructs to engineer uncertainty or diversity-based methods for selecting the most informative samples. However, such algorithms do not present themselves naturally as usable by the medical community and healthcare providers. Thus, their deployment in clinical settings is very limited, if any. For this purpose, we propose a framework that incorporates clinical insights into the sample selection process of active learning that can be incorporated with existing algorithms. Our medically interpretable active learning framework captures diverse disease manifestations from patients to improve generalization performance of OCT classification. After comprehensive experiments, we report that incorporating patient insights within the active learning framework yields performance that matches or surpasses five commonly used paradigms on two architectures with a dataset having imbalanced patient distributions. Also, the framework integrates within existing medical practices and thus can be used by healthcare providers.
    Linear Speedup in Personalized Collaborative Learning. (arXiv:2111.05968v4 [cs.LG] UPDATED)
    Collaborative training can improve the accuracy of a model for a user by trading off the model's bias (introduced by using data from other users who are potentially different) against its variance (due to the limited amount of data on any single user). In this work, we formalize the personalized collaborative learning problem as a stochastic optimization of a task 0 while giving access to N related but different tasks 1,..., N. We provide convergence guarantees for two algorithms in this setting -- a popular collaboration method known as weighted gradient averaging, and a novel bias correction method -- and explore conditions under which we can achieve linear speedup w.r.t. the number of auxiliary tasks N. Further, we also empirically study their performance confirming our theoretical insights.
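    The bias-variance trade-off described in this abstract can be illustrated with one plausible form of weighted gradient averaging: mix the gradient of the target task 0 with the mean gradient of the N auxiliary tasks. The mixing weight `alpha` and the exact formula are assumptions for illustration; the abstract does not give the paper's precise update or its bias correction method.

```python
def weighted_gradient_average(target_grad, auxiliary_grads, alpha):
    """Combine the target-task gradient with the mean auxiliary-task gradient.

    alpha = 0 uses only the target task (no bias, high variance);
    alpha = 1 uses only the auxiliary tasks (low variance, possibly biased).
    """
    n = len(auxiliary_grads)
    dim = len(target_grad)
    aux_mean = [sum(g[i] for g in auxiliary_grads) / n for i in range(dim)]
    return [(1 - alpha) * target_grad[i] + alpha * aux_mean[i] for i in range(dim)]


# One target gradient and two auxiliary gradients in two dimensions.
mixed = weighted_gradient_average([1.0, 0.0], [[0.0, 2.0], [2.0, 2.0]], alpha=0.5)
print(mixed)
```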
    Bayesian Nonparametrics for Offline Skill Discovery. (arXiv:2202.04675v3 [cs.LG] UPDATED)
    Skills or low-level policies in reinforcement learning are temporally extended actions that can speed up learning and enable complex behaviours. Recent work in offline reinforcement learning and imitation learning has proposed several techniques for skill discovery from a set of expert trajectories. While these methods are promising, the number K of skills to discover is always a fixed hyperparameter, which requires either prior knowledge about the environment or an additional parameter search to tune it. We first propose a method for offline learning of options (a particular skill framework) exploiting advances in variational inference and continuous relaxations. We then highlight an unexplored connection between Bayesian nonparametrics and offline skill discovery, and show how to obtain a nonparametric version of our model. This version is tractable thanks to a carefully structured approximate posterior with a dynamically-changing number of options, removing the need to specify K. We also show how our nonparametric extension can be applied in other skill frameworks, and empirically demonstrate that our method can outperform state-of-the-art offline skill learning algorithms across a variety of environments. Our code is available at https://github.com/layer6ai-labs/BNPO .
    Shilling Black-box Recommender Systems by Learning to Generate Fake User Profiles. (arXiv:2206.11433v1 [cs.IR])
    Due to the pivotal role of Recommender Systems (RS) in guiding customers towards the purchase, there is a natural motivation for unscrupulous parties to spoof RS for profits. In this paper, we study Shilling Attack where an adversarial party injects a number of fake user profiles for improper purposes. Conventional Shilling Attack approaches lack attack transferability (i.e., attacks are not effective on some victim RS models) and/or attack invisibility (i.e., injected profiles can be easily detected). To overcome these issues, we present Leg-UP, a novel attack model based on the Generative Adversarial Network. Leg-UP learns user behavior patterns from real users in the sampled ``templates'' and constructs fake user profiles. To simulate real users, the generator in Leg-UP directly outputs discrete ratings. To enhance attack transferability, the parameters of the generator are optimized by maximizing the attack performance on a surrogate RS model. To improve attack invisibility, Leg-UP adopts a discriminator to guide the generator to generate undetectable fake user profiles. Experiments on benchmarks have shown that Leg-UP exceeds state-of-the-art Shilling Attack methods on a wide range of victim RS models. The source code of our work is available at: https://github.com/XMUDM/ShillingAttack.
    Predicting the Geoeffectiveness of CMEs Using Machine Learning. (arXiv:2206.11472v1 [astro-ph.SR])
    Coronal mass ejections (CMEs) are the most geoeffective space weather phenomena, being associated with large geomagnetic storms, with the potential to cause telecommunication disturbances, satellite network disruptions, and power grid damage and failures. Thus, considering these storms' potential effects on human activities, accurate forecasts of the geoeffectiveness of CMEs are paramount. This work focuses on experimenting with different machine learning methods trained on white-light coronagraph datasets of close-to-Sun CMEs, to estimate whether such a newly erupting ejection has the potential to induce geomagnetic activity. We developed binary classification models using logistic regression, K-Nearest Neighbors, Support Vector Machines, feedforward artificial neural networks, as well as ensemble models. At this time, we limited our forecast to exclusively use solar onset parameters, to ensure extended warning times. We discuss the main challenges of this task, namely the extreme imbalance between the number of geoeffective and ineffective events in our dataset, along with their numerous similarities and the limited number of available variables. We show that even in such conditions, adequate hit rates can be achieved with these models.
    A Geometric Method for Improved Uncertainty Estimation in Real-time. (arXiv:2206.11562v1 [cs.LG])
    Machine learning classifiers are probabilistic in nature, and thus inevitably involve uncertainty. Predicting the probability of a specific input to be correct is called uncertainty (or confidence) estimation and is crucial for risk management. Post-hoc model calibrations can improve models' uncertainty estimations without the need for retraining, and without changing the model. Our work puts forward a geometric-based approach for uncertainty estimation. Roughly speaking, we use the geometric distance of the current input from the existing training inputs as a signal for estimating uncertainty and then calibrate that signal (instead of the model's estimation) using standard post-hoc calibration techniques. We show that our method yields better uncertainty estimations than recently proposed approaches by extensively evaluating multiple datasets and models. In addition, we also demonstrate the possibility of performing our approach in near real-time applications. Our code is available at our Github https://github.com/NoSleepDeveloper/Geometric-Calibrator.
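    As a rough illustration of using geometric distance as an uncertainty signal, the sketch below computes the Euclidean distance from a query input to its nearest training input. The choice of Euclidean distance and of a single nearest neighbor are assumptions for illustration; the abstract does not specify the paper's exact distance measure or the subsequent calibration step, which is omitted here.

```python
import math


def nearest_training_distance(query, training_inputs):
    """Distance from `query` to the closest training input.

    A larger distance suggests the input is far from the training data,
    so a prediction on it should be trusted less.
    """
    return min(math.dist(query, point) for point in training_inputs)


training_points = [(0.0, 0.0), (3.0, 4.0)]
near_signal = nearest_training_distance((0.0, 1.0), training_points)
far_signal = nearest_training_distance((3.0, 0.0), training_points)
print(near_signal, far_signal)
```

    In a post-hoc setting, a raw signal like this would then be mapped to a confidence value by a standard calibration method fitted on held-out data.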
    Classical surrogates for quantum learning models. (arXiv:2206.11740v1 [quant-ph])
    The advent of noisy intermediate-scale quantum computers has put the search for possible applications to the forefront of quantum information science. One area where hopes for an advantage through near-term quantum computers are high is quantum machine learning, where variational quantum learning models based on parametrized quantum circuits are discussed. In this work, we introduce the concept of a classical surrogate, a classical model which can be efficiently obtained from a trained quantum learning model and reproduces its input-output relations. As inference can be performed classically, the existence of a classical surrogate greatly enhances the applicability of a quantum learning strategy. However, the classical surrogate also challenges possible advantages of quantum schemes. As it is possible to directly optimize the ansatz of the classical surrogate, it creates a natural benchmark that the quantum model has to outperform. We show that large classes of well-analyzed re-uploading models have a classical surrogate. We conducted numerical experiments and found that these quantum models show no advantage in performance or trainability in the problems we analyze. This leaves only generalization capability as a possible point of quantum advantage and emphasizes the dire need for a better understanding of inductive biases of quantum learning models.
    Quant-BnB: A Scalable Branch-and-Bound Method for Optimal Decision Trees with Continuous Features. (arXiv:2206.11844v1 [cs.LG])
    Decision trees are one of the most useful and popular methods in the machine learning toolbox. In this paper, we consider the problem of learning optimal decision trees, a combinatorial optimization problem that is challenging to solve at scale. A common approach in the literature is to use greedy heuristics, which may not be optimal. Recently there has been significant interest in learning optimal decision trees using various approaches (e.g., based on integer programming, dynamic programming) -- to achieve computational scalability, most of these approaches focus on classification tasks with binary features. In this paper, we present a new discrete optimization method based on branch-and-bound (BnB) to obtain optimal decision trees. Different from existing customized approaches, we consider both regression and classification tasks with continuous features. The basic idea underlying our approach is to split the search space based on the quantiles of the feature distribution -- leading to upper and lower bounds for the underlying optimization problem along the BnB iterations. Our proposed algorithm Quant-BnB shows significant speedups compared to existing approaches for shallow optimal trees on various real datasets.
    $p$-Laplacian Based Graph Neural Networks. (arXiv:2111.07337v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have demonstrated superior performance for semi-supervised node classification on graphs, as a result of their ability to exploit node features and topological information simultaneously. However, most GNNs implicitly assume that the labels of nodes and their neighbors in a graph are the same or consistent, which does not hold in heterophilic graphs, where the labels of linked nodes are likely to differ. Hence, when the topology is non-informative for label prediction, ordinary GNNs may work significantly worse than simply applying multi-layer perceptrons (MLPs) on each node. To tackle the above problem, we propose a new $p$-Laplacian based GNN model, termed $^p$GNN, whose message passing mechanism is derived from a discrete regularization framework and can be theoretically explained as an approximation of a polynomial graph filter defined on the spectral domain of $p$-Laplacians. The spectral analysis shows that the new message passing mechanism works simultaneously as low-pass and high-pass filters, thus making $^p$GNNs effective on both homophilic and heterophilic graphs. Empirical studies on real-world and synthetic datasets validate our findings and demonstrate that $^p$GNNs significantly outperform several state-of-the-art GNN architectures on heterophilic benchmarks while achieving competitive performance on homophilic benchmarks. Moreover, $^p$GNNs can adaptively learn aggregation weights and are robust to noisy edges.
    Waypoint Generation in Row-based Crops with Deep Learning and Contrastive Clustering. (arXiv:2206.11623v1 [cs.RO])
    The development of precision agriculture has gradually introduced automation in the agricultural process to support and rationalize all the activities related to field management. In particular, service robotics plays a predominant role in this evolution by deploying autonomous agents able to navigate in fields while executing different tasks without the need for human intervention, such as monitoring, spraying and harvesting. In this context, global path planning is the first necessary step for every robotic mission and ensures that the navigation is performed efficiently and with complete field coverage. In this paper, we propose a learning-based approach to tackle waypoint generation for planning a navigation path for row-based crops, starting from a top-view map of the region-of-interest. We present a novel methodology for waypoint clustering based on a contrastive loss, able to project the points to a separable latent space. The proposed deep neural network can simultaneously predict the waypoint position and cluster assignment with two specialized heads in a single forward pass. The extensive experimentation on simulated and real-world images demonstrates that the proposed approach effectively solves the waypoint generation problem for both straight and curved row-based crops, overcoming the limitations of previous state-of-the-art methodologies.
    Neural Network-augmented Kalman Filtering for Robust Online Speech Dereverberation in Noisy Reverberant Environments. (arXiv:2204.02741v2 [eess.AS] UPDATED)
    In this paper, a neural network-augmented algorithm for noise-robust online dereverberation with a Kalman filtering variant of the weighted prediction error (WPE) method is proposed. The filter stochastic variations are predicted by a deep neural network (DNN) trained end-to-end using the filter residual error and signal characteristics. The presented framework allows for robust dereverberation on a single-channel noisy reverberant dataset similar to WHAMR!. The Kalman filtering WPE introduces distortions in the enhanced signal when predicting the filter variations from the residual error only, if the target speech power spectral density is not perfectly known and the observation is noisy. The proposed approach avoids these distortions by correcting the filter variations estimation in a data-driven way, increasing the robustness of the method to noisy scenarios. Furthermore, it yields a strong dereverberation and denoising performance compared to a DNN-supported recursive least squares variant of WPE, especially for highly noisy inputs.
    Backpropagation at the Infinitesimal Inference Limit of Energy-Based Models: Unifying Predictive Coding, Equilibrium Propagation, and Contrastive Hebbian Learning. (arXiv:2206.02629v2 [cs.LG] UPDATED)
    How the brain performs credit assignment is a fundamental unsolved problem in neuroscience. Many `biologically plausible' algorithms have been proposed, which compute gradients that approximate those computed by backpropagation (BP), and which operate in ways that more closely satisfy the constraints imposed by neural circuitry. Many such algorithms utilize the framework of energy-based models (EBMs), in which all free variables in the model are optimized to minimize a global energy function. However, in the literature, these algorithms exist in isolation and no unified theory exists linking them together. Here, we provide a comprehensive theory of the conditions under which EBMs can approximate BP, which lets us unify many of the BP approximation results in the literature (namely, predictive coding, equilibrium propagation, and contrastive Hebbian learning) and demonstrate that their approximation to BP arises from a simple and general mathematical property of EBMs at free-phase equilibrium. This property can then be exploited in different ways with different energy functions, and these specific choices yield a family of BP-approximating algorithms, which both includes the known results in the literature and can be used to derive new ones.
    Reachability analysis of neural networks using mixed monotonicity. (arXiv:2111.07683v3 [eess.SY] UPDATED)
    This paper presents a new reachability analysis approach to compute interval over-approximations of the output set of feedforward neural networks with input uncertainty. We adapt to neural networks an existing mixed-monotonicity method for the reachability analysis of dynamical systems and apply it to each partial network within the main network. This ensures that the intersection of the obtained results is the tightest interval over-approximation of the output of each layer that can be obtained using mixed-monotonicity on any partial network decomposition. Unlike other tools in the literature focusing on small classes of piecewise-affine or monotone activation functions, the main strength of our approach is its generality: it can handle neural networks with any Lipschitz-continuous activation function. In addition, the simplicity of our framework allows users to very easily add unimplemented activation functions, by simply providing the function, its derivative and the global argmin and argmax of the derivative. Our algorithm is compared to five other interval-based tools (Interval Bound Propagation, ReluVal, Neurify, VeriNet, CROWN) on both existing benchmarks and two sets of small and large randomly generated networks for four activation functions (ReLU, TanH, ELU, SiLU).
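    To make "interval over-approximation of the output set" concrete, the sketch below propagates an input box through one affine layer followed by ReLU. This is plain Interval Bound Propagation, one of the baseline tools named in the abstract, and not the paper's mixed-monotonicity method. The function names and the toy weights are assumptions for illustration.

```python
def affine_bounds(lower, upper, weights, bias):
    """Bound W x + b over the box lower <= x <= upper, one output row at a time."""
    out_lower, out_upper = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            if w >= 0:
                # A non-negative weight maps the lower input bound to the lower output bound.
                lo += w * l
                hi += w * u
            else:
                # A negative weight swaps the roles of the two input bounds.
                lo += w * u
                hi += w * l
        out_lower.append(lo)
        out_upper.append(hi)
    return out_lower, out_upper


def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to each bound directly."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]


# Input box: x1 in [-1, 1], x2 in [0, 2].
pre_lower, pre_upper = affine_bounds(
    [-1.0, 0.0], [1.0, 2.0], weights=[[1.0, -1.0], [2.0, 0.0]], bias=[0.5, -1.0]
)
post_lower, post_upper = relu_bounds(pre_lower, pre_upper)
print(pre_lower, pre_upper, post_lower, post_upper)
```

    Repeating these two steps layer by layer gives a sound but often loose box around the network output, which is the looseness that tighter methods aim to reduce.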
    A Framework for Learning to Request Rich and Contextually Useful Information from Humans. (arXiv:2110.08258v4 [cs.LG] UPDATED)
    When deployed, AI agents will encounter problems that are beyond their autonomous problem-solving capabilities. Leveraging human assistance can help agents overcome their inherent limitations and robustly cope with unfamiliar situations. We present a general interactive framework that enables an agent to request and interpret rich, contextually useful information from an assistant that has knowledge about the task and the environment. We demonstrate the practicality of our framework on a simulated human-assisted navigation problem. Aided with an assistance-requesting policy learned by our method, a navigation agent achieves up to a 7x improvement in success rate on tasks that take place in previously unseen environments, compared to fully autonomous behavior. We show that the agent can take advantage of different types of information depending on the context, and analyze the benefits and challenges of learning the assistance-requesting policy when the assistant can recursively decompose tasks into subtasks.
    Semantic Communications: Principles and Challenges. (arXiv:2201.01389v3 [cs.IT] UPDATED)
    Semantic communication, regarded as the breakthrough beyond the Shannon paradigm, aims at the successful transmission of semantic information conveyed by the source rather than the accurate reception of each single symbol or bit regardless of its meaning. This article provides an overview of semantic communications. After a brief review of Shannon information theory, we discuss semantic communications with theory, framework, and system design enabled by deep learning. Because they differ from the symbol/bit error rate used for measuring conventional communication systems, performance metrics for semantic communications are also discussed. The article concludes with several open questions in semantic communications.
    Projection-free Constrained Stochastic Nonconvex Optimization with State-dependent Markov Data. (arXiv:2206.11346v1 [math.OC])
    We study a projection-free conditional gradient-type algorithm for constrained nonconvex stochastic optimization problems with Markovian data. In particular, we focus on the case when the transition kernel of the Markov chain is state-dependent. Such stochastic optimization problems arise in various machine learning problems including strategic classification and reinforcement learning. For this problem, we establish that the number of calls to the stochastic first-order oracle and the linear minimization oracle to obtain an appropriately defined $\epsilon$-stationary point, are of the order $\mathcal{O}(1/\epsilon^{2.5})$ and $\mathcal{O}(1/\epsilon^{5.5})$ respectively. We also empirically demonstrate the performance of our algorithm on the problem of strategic classification with neural networks.
    Learning Representations for Control with Hierarchical Forward Models. (arXiv:2206.11396v1 [cs.LG])
    Learning control from pixels is difficult for reinforcement learning (RL) agents because representation learning and policy learning are intertwined. Previous approaches remedy this issue with auxiliary representation learning tasks, but they either do not consider the temporal aspect of the problem or only consider single-step transitions. Instead, we propose Hierarchical $k$-Step Latent (HKSL), an auxiliary task that learns representations via a hierarchy of forward models that operate at varying magnitudes of step skipping while also learning to communicate between levels in the hierarchy. We evaluate HKSL in a suite of 30 robotic control tasks and find that HKSL either reaches higher episodic returns or converges to maximum performance more quickly than several current baselines. Also, we find that levels in HKSL's hierarchy can learn to specialize in long- or short-term consequences of agent actions, thereby providing the downstream control policy with more informative representations. Finally, we determine that communication channels between hierarchy levels organize information based on both sides of the communication process, which improves sample efficiency.
    Optimizing Two-way Partial AUC with an End-to-end Framework. (arXiv:2206.11655v1 [cs.LG])
    The Area Under the ROC Curve (AUC) is a crucial metric for machine learning, which evaluates the average performance over all possible True Positive Rates (TPRs) and False Positive Rates (FPRs). Based on the knowledge that a skillful classifier should simultaneously embrace a high TPR and a low FPR, we turn to study a more general variant called Two-way Partial AUC (TPAUC), where only the region with $\mathsf{TPR} \ge \alpha, \mathsf{FPR} \le \beta$ is included in the area. Moreover, recent work shows that the TPAUC is essentially inconsistent with the existing Partial AUC metrics where only the FPR range is restricted, opening a new problem to seek solutions to leverage high TPAUC. Motivated by this, we present the first trial in this paper to optimize this new metric. The critical challenge along this course lies in the difficulty of performing gradient-based optimization with end-to-end stochastic training, even with a proper choice of surrogate loss. To address this issue, we propose a generic framework to construct surrogate optimization problems, which supports efficient end-to-end training with deep learning. Moreover, our theoretical analyses show that: 1) the objective function of the surrogate problems will achieve an upper bound of the original problem under mild conditions, and 2) optimizing the surrogate problems leads to good generalization performance in terms of TPAUC with a high probability. Finally, empirical studies over several benchmark datasets speak to the efficacy of our framework.
    Recursive Reinforcement Learning. (arXiv:2206.11430v1 [cs.LG])
    Recursion is the fundamental paradigm to finitely describe potentially infinite objects. As state-of-the-art reinforcement learning (RL) algorithms cannot directly reason about recursion, they must rely on the practitioner's ingenuity in designing a suitable "flat" representation of the environment. The resulting manual feature constructions and approximations are cumbersome and error-prone; their lack of transparency hampers scalability. To overcome these challenges, we develop RL algorithms capable of computing optimal policies in environments described as a collection of Markov decision processes (MDPs) that can recursively invoke one another. Each constituent MDP is characterized by several entry and exit points that correspond to input and output values of these invocations. These recursive MDPs (or RMDPs) are expressively equivalent to probabilistic pushdown systems (with call-stack playing the role of the pushdown stack), and can model probabilistic programs with recursive procedural calls. We introduce Recursive Q-learning -- a model-free RL algorithm for RMDPs -- and prove that it converges for finite, single-exit and deterministic multi-exit RMDPs under mild assumptions.
    Input-agnostic Certified Group Fairness via Gaussian Parameter Smoothing. (arXiv:2206.11423v1 [cs.LG])
    Only recently have researchers attempted to provide classification algorithms with provable group fairness guarantees. Most of these algorithms are hampered by the requirement that the training and deployment data follow the same distribution. This paper proposes an input-agnostic certified group fairness algorithm, FairSmooth, for improving the fairness of classification models while maintaining the remarkable prediction accuracy. A Gaussian parameter smoothing method is developed to transform base classifiers into their smooth versions. An optimal individual smooth classifier is learnt for each group with only the data regarding the group, and an overall smooth classifier for all groups is generated by averaging the parameters of all the individual smooth ones. By leveraging the theory of nonlinear functional analysis, the smooth classifiers are reformulated as output functions of a Nemytskii operator. Theoretical analysis is conducted to derive that the Nemytskii operator is smooth and induces a Fréchet differentiable smooth manifold. We theoretically demonstrate that the smooth manifold has a global Lipschitz constant that is independent of the domain of the input data, which yields the input-agnostic certified group fairness.
    Prevent Car Accidents by Using AI. (arXiv:2206.11381v1 [cs.LG])
    Transportation facilities are becoming more developed as society develops, and people's travel demand is increasing, but so are the traffic safety issues that arise as a result. And car accidents are a major issue all over the world. The cost of traffic fatalities and driver injuries has a significant impact on society. The use of machine learning techniques in the field of traffic accidents is becoming increasingly popular. Machine learning classifiers are used instead of traditional data mining techniques to produce better results and accuracy. As a result, this project conducts research on existing work related to accident prediction using machine learning. We will use crash data and weather data to train machine learning models to predict crash severity and reduce crashes.
    Bi-stochastically normalized graph Laplacian: convergence to manifold Laplacian and robustness to outlier noise. (arXiv:2206.11386v1 [math.ST])
    Bi-stochastic normalization of kernelized graph affinity matrix provides an alternative normalization scheme for graph Laplacian methods in graph-based data analysis and can be computed efficiently by Sinkhorn-Knopp (SK) iterations in practice. This paper proves the convergence of the bi-stochastically normalized graph Laplacian to manifold (weighted-)Laplacian with rates when $n$ data points are i.i.d. sampled from a general $d$-dimensional manifold embedded in a possibly high-dimensional space. Under a certain joint limit of $n \to \infty$ and kernel bandwidth $\epsilon \to 0$, the point-wise convergence rate of the graph Laplacian operator (under 2-norm) is proved to be $ O( n^{-1/(d/2+3)})$ at finite large $n$ up to log factors, achieved at the scaling of $\epsilon \sim n^{-1/(d/2+3)} $. When the manifold data are corrupted by outlier noise, we theoretically prove the graph Laplacian point-wise consistency which matches the rate for clean manifold data up to an additional error term proportional to the boundedness of mutual inner-products of the noise vectors. Our analysis suggests that, under the setting being considered in this paper, not exact bi-stochastic normalization but an approximate one will achieve the same consistency rate. Motivated by the analysis, we propose an approximate and constrained matrix scaling problem that can be solved by SK iterations with early termination, and apply it to simulated manifold data both clean and with outlier noise. Numerical experiments support our theoretical results and show the robustness of bi-stochastically normalized graph Laplacian to outlier noise.
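    The Sinkhorn-Knopp scaling mentioned in this abstract can be illustrated with a minimal sketch. It assumes a dense Gaussian kernel on toy data and a simple row-sum stopping rule. The function names, bandwidth, and tolerance are hypothetical choices for illustration, not the paper's constrained scaling problem.

```python
import numpy as np

def gaussian_kernel(points, bandwidth):
    """Dense Gaussian affinity matrix K_ij = exp(-||x_i - x_j||^2 / bandwidth)."""
    sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / bandwidth)

def sinkhorn_knopp(kernel, max_iters=1000, tol=1e-8):
    """Scale a positive matrix to W = diag(r) @ kernel @ diag(c) with unit row and column sums.

    Terminates early once every row sum is within `tol` of 1; column sums are
    exact (up to rounding) right after each column update.
    """
    n = kernel.shape[0]
    r = np.ones(n)
    c = np.ones(n)
    for _ in range(max_iters):
        r = 1.0 / (kernel @ c)      # rescale so that row sums equal 1
        c = 1.0 / (kernel.T @ r)    # rescale so that column sums equal 1
        scaled = r[:, None] * kernel * c[None, :]
        if np.max(np.abs(scaled.sum(axis=1) - 1.0)) < tol:
            break                   # early termination
    return r[:, None] * kernel * c[None, :]

rng = np.random.default_rng(0)
points = rng.normal(size=(50, 3))   # hypothetical toy data, not from the paper
W = sinkhorn_knopp(gaussian_kernel(points, bandwidth=4.0))
```

    Loosening `tol` or lowering `max_iters` gives an approximately bi-stochastic matrix, which is the kind of early-terminated scaling the abstract refers to.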
    Program Targeting with Machine Learning and Mobile Phone Data: Evidence from an Anti-Poverty Intervention in Afghanistan. (arXiv:2206.11400v1 [econ.GN])
    Can mobile phone data improve program targeting? By combining rich survey data from a "big push" anti-poverty program in Afghanistan with detailed mobile phone logs from program beneficiaries, we study the extent to which machine learning methods can accurately differentiate ultra-poor households eligible for program benefits from ineligible households. We show that machine learning methods leveraging mobile phone data can identify ultra-poor households nearly as accurately as survey-based measures of consumption and wealth; and that combining survey-based measures with mobile phone data produces classifications more accurate than those based on a single data source.
    Reinforcement Learning under Partial Observability Guided by Learned Environment Models. (arXiv:2206.11708v1 [cs.LG])
    In practical applications, we can rarely assume full observability of a system's environment, despite such knowledge being important for determining a reactive control system's precise interaction with its environment. Therefore, we propose an approach for reinforcement learning (RL) in partially observable environments. While assuming that the environment behaves like a partially observable Markov decision process with known discrete actions, we assume no knowledge about its structure or transition probabilities. Our approach combines Q-learning with IoAlergia, a method for learning Markov decision processes (MDP). By learning MDP models of the environment from episodes of the RL agent, we enable RL in partially observable domains without explicit, additional memory to track previous interactions for dealing with ambiguities stemming from partial observability. We instead provide RL with additional observations in the form of abstract environment states by simulating new experiences on learned environment models to track the explored states. In our evaluation, we report on the validity of our approach and its promising performance in comparison to six state-of-the-art deep RL techniques with recurrent neural networks and fixed memory.
    Safe Reinforcement Learning Using Robust Control Barrier Functions. (arXiv:2110.05415v2 [eess.SY] UPDATED)
    Reinforcement Learning (RL) has been shown to be effective in many scenarios. However, it typically requires the exploration of a sufficiently large number of state-action pairs, some of which may be unsafe. Consequently, its application to safety-critical systems remains a challenge. An increasingly common approach to address safety involves the addition of a safety layer that projects the RL actions onto a safe set of actions. In turn, a difficulty for such frameworks is how to effectively couple RL with the safety layer to improve the learning performance. In this paper, we frame safety as a differentiable robust-control-barrier-function layer in a model-based RL framework. Moreover, we also propose an approach to modularly learn the underlying reward-driven task, independent of safety constraints. We demonstrate that this approach both ensures safety and effectively guides exploration during training in a range of experiments, including zero-shot transfer when the reward is learned in a modular way.
    On the Parameterization and Initialization of Diagonal State Space Models. (arXiv:2206.11893v1 [cs.LG])
    State space models (SSM) have recently been shown to be very effective as a deep learning layer and a promising alternative to sequence models such as RNNs, CNNs, or Transformers. The first version to show this potential was the S4 model, which is particularly effective on tasks involving long-range dependencies by using a prescribed state matrix called the HiPPO matrix. While this has an interpretable mathematical mechanism for modeling long dependencies, it introduces a custom representation and algorithm that can be difficult to implement. On the other hand, a recent variant of S4 called DSS showed that restricting the state matrix to be fully diagonal can still preserve the performance of the original model when using a specific initialization based on approximating S4's matrix. This work seeks to systematically understand how to parameterize and initialize such diagonal state space models. While it follows from classical results that almost all SSMs have an equivalent diagonal form, we show that the initialization is critical for performance. We explain why DSS works mathematically, by showing that the diagonal restriction of S4's matrix surprisingly recovers the same kernel in the limit of infinite state dimension. We also systematically describe various design choices in parameterizing and computing diagonal SSMs, and perform a controlled empirical study ablating the effects of these choices. Our final model S4D is a simple diagonal version of S4 whose kernel computation requires just 2 lines of code and performs comparably to S4 in almost all settings, with state-of-the-art results for image, audio, and medical time-series domains, and averaging 85\% on the Long Range Arena benchmark.
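    To make the idea of a diagonal SSM kernel concrete, here is a minimal sketch assuming a zero-order-hold discretization and a single input/output channel. It is not the authors' code. The function names, the parameter values, and the choice to keep the output complex are assumptions made for illustration.

```python
import numpy as np

def diagonal_ssm_kernel(A, B, C, step, length):
    """Kernel K[l] = sum_n C[n] * Abar[n]**l * Bbar[n] for a diagonal state matrix."""
    Abar = np.exp(step * A)                  # discretized diagonal state matrix
    Bbar = (Abar - 1.0) / A * B              # discretized input vector (zero-order hold)
    powers = Abar[:, None] ** np.arange(length)[None, :]  # Vandermonde matrix, shape (N, L)
    return (C * Bbar) @ powers

def run_recurrence(A, B, C, step, u):
    """Reference implementation: x_k = Abar * x_{k-1} + Bbar * u_k, y_k = sum_n C[n] * x_k[n]."""
    Abar = np.exp(step * A)
    Bbar = (Abar - 1.0) / A * B
    x = np.zeros_like(A)
    outputs = []
    for u_k in u:
        x = Abar * x + Bbar * u_k
        outputs.append(np.sum(C * x))
    return np.array(outputs)

rng = np.random.default_rng(0)
N, L = 8, 64
A = -0.5 + 1j * np.pi * np.arange(N)         # illustrative diagonal entries with negative real part
B = np.ones(N, dtype=complex)
C = rng.normal(size=N) + 1j * rng.normal(size=N)
u = rng.normal(size=L)

K = diagonal_ssm_kernel(A, B, C, step=0.01, length=L)
y_conv = np.convolve(K, u)[:L]               # causal convolution with the kernel
y_rec = run_recurrence(A, B, C, step=0.01, u=u)
```

    The convolution and the step-by-step recurrence produce the same output, which is why the layer can be trained as a long convolution instead of being unrolled.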
    RetroGraph: Retrosynthetic Planning with Graph Search. (arXiv:2206.11477v1 [cs.AI])
    Retrosynthetic planning, which aims to find a reaction pathway to synthesize a target molecule, plays an important role in chemistry and drug discovery. This task is usually modeled as a search problem. Recently, data-driven methods have attracted much research interest and shown promising results for retrosynthetic planning. We observe that the same intermediate molecules are visited many times in the searching process, and they are usually treated independently in previous tree-based methods (e.g., AND-OR tree search, Monte Carlo tree search). Such redundancies make the search process inefficient. We propose a graph-based search policy that eliminates the redundant explorations of any intermediate molecules. As searching over a graph is more complicated than over a tree, we further adopt a graph neural network to guide the search over graphs. Meanwhile, our method can search a batch of targets together in the graph and remove the inter-target duplication in the tree-based search methods. Experimental results on two datasets demonstrate the effectiveness of our method. In particular, on the widely used USPTO benchmark, we improve the search success rate to 99.47%, advancing previous state-of-the-art performance by 2.6 points.
    A generalised form for a homogeneous population of structures using an overlapping mixture of Gaussian processes. (arXiv:2206.11683v1 [cs.LG])
    Reductions in natural frequency are often used as a damage indicator for structural health monitoring (SHM) purposes. However, fluctuations in operational and environmental conditions, changes in boundary conditions, and slight differences among nominally-identical structures can also affect stiffness, producing frequency changes that mimic or mask damage. This variability has limited the practical implementation and generalisation of SHM technologies. The aim of this work is to investigate the effects of normal variation, and to identify methods that account for the resulting uncertainty. This work considers vibration data collected from a set of four healthy full-scale composite helicopter blades. The blades were nominally-identical but distinct, and slight differences in material properties and geometry among the blades caused significant variability in the frequency response functions, which presented as four separate trajectories across the input space. In this paper, an overlapping mixture of Gaussian processes (OMGP) was used to generate labels and quantify the uncertainty of normal-condition frequency response data from the helicopter blades. Using a population-based approach, the OMGP model provided a generic representation, called a form, to characterise the normal condition of the blades. Additional simulated data were then compared against the form and evaluated for damage using a marginal-likelihood novelty index.
    Remote Sensing Change Detection (Segmentation) using Denoising Diffusion Probabilistic Models. (arXiv:2206.11892v1 [cs.CV])
    Human civilization has an increasingly powerful influence on the earth system, and earth observations are an invaluable tool for assessing and mitigating the negative impacts. To this end, observing precisely defined changes on Earth's surface is essential, and we propose an effective way to achieve this goal. Notably, our change detection (CD)/segmentation method proposes a novel way to incorporate the millions of off-the-shelf, unlabeled, remote sensing images available through different earth observation programs into the training process through denoising diffusion probabilistic models. We first leverage the information from these off-the-shelf, uncurated, and unlabeled remote sensing images by using a pre-trained denoising diffusion probabilistic model and then employ the multi-scale feature representations from the diffusion model decoder to train a lightweight CD classifier to detect precise changes. The experiments performed on four publicly available CD datasets show that the proposed approach achieves remarkably better results than the state-of-the-art methods in F1, IoU, and overall accuracy. Code and pre-trained models are available at: https://github.com/wgcban/ddpm-cd
    Context-based Virtual Adversarial Training for Text Classification with Noisy Labels. (arXiv:2206.11851v1 [cs.CL])
    Deep neural networks (DNNs) have a high capacity to completely memorize noisy labels given sufficient training time, and this memorization, unfortunately, leads to performance degradation. Recently, virtual adversarial training (VAT) has attracted attention as it can further improve the generalization of DNNs in semi-supervised learning. The driving force behind VAT is to prevent the models from overfitting data points by enforcing consistency between the inputs and the perturbed inputs. This strategy could be helpful in learning from noisy labels if it prevents neural models from learning noisy samples while encouraging the models to generalize clean samples. In this paper, we propose context-based virtual adversarial training (ConVAT) to prevent a text classifier from overfitting to noisy labels. Unlike previous works, the proposed method performs the adversarial training at the context level rather than at the input level. It makes the classifier learn not only its label but also its contextual neighbors, which alleviates the learning from noisy labels by preserving contextual semantics on each data point. We conduct extensive experiments on four text classification datasets with two types of label noise. Comprehensive experimental results clearly show that the proposed method works quite well even with extremely noisy settings.
    Improving decision-making via risk-based active learning: Probabilistic discriminative classifiers. (arXiv:2206.11616v1 [cs.LG])
    Gaining the ability to make informed decisions on operation and maintenance of structures provides motivation for the implementation of structural health monitoring (SHM) systems. However, descriptive labels for measured data corresponding to health-states of the monitored system are often unavailable. This issue limits the applicability of fully-supervised machine learning paradigms for the development of statistical classifiers to be used in decision-support in SHM systems. One approach to dealing with this problem is risk-based active learning. In such an approach, data-label querying is guided according to the expected value of perfect information for incipient data points. For risk-based active learning in SHM, the value of information is evaluated with respect to a maintenance decision process, and the data-label querying corresponds to the inspection of a structure to determine its health state. In the context of SHM, risk-based active learning has only been considered for generative classifiers. The current paper demonstrates several advantages of using an alternative type of classifier -- discriminative models. Using the Z24 Bridge dataset as a case study, it is shown that discriminative classifiers have benefits, in the context of SHM decision-support, including improved robustness to sampling bias, and reduced expenditure on structural inspections.
    Inductive Conformal Prediction: A Straightforward Introduction with Examples in Python. (arXiv:2206.11810v1 [stat.ML])
    Inductive Conformal Prediction (ICP) is a set of distribution-free and model-agnostic algorithms devised to predict with a user-defined confidence and a coverage guarantee. Instead of having \textit{point predictions}, i.e., a real number in the case of regression or a single class in multi-class classification, models calibrated using ICP output an interval or a set of classes, respectively. ICP takes on special importance in high-risk settings where we want the real output to belong to the prediction set with high probability. As an example, a classification model might output that, given a magnetic resonance image, a patient has no latent diseases to report. However, this model output was based on the most likely class; the second most likely class might indicate that the patient has a 15\% chance of brain tumor or other severe disease, and therefore further exams should be conducted. Using ICP is therefore far more informative, and we believe it should be the standard way of producing forecasts. This paper is a hands-on introduction, meaning that we provide examples as we introduce the theory.
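    As a rough illustration of how ICP turns class probabilities into prediction sets, the sketch below uses the common nonconformity score "one minus the probability of the true class" and synthetic data. The function names, the score, and the toy data are assumptions and are not taken from the paper's own examples.

```python
import numpy as np

def calibrate(cal_probs, cal_labels, alpha):
    """Return the score threshold computed from a held-out calibration set."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]  # nonconformity of the true class
    k = int(np.ceil((n + 1) * (1 - alpha)))             # finite-sample corrected rank
    if k > n:
        return np.inf                                   # too few calibration points: keep every class
    return np.sort(scores)[k - 1]

def predict_sets(test_probs, threshold):
    """Boolean mask: entry [i, c] is True when class c belongs to the set for sample i."""
    return (1.0 - test_probs) <= threshold

# Synthetic stand-in for a trained classifier: probabilities plus labels drawn from them.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=2000)
labels = np.array([rng.choice(3, p=p) for p in probs])

threshold = calibrate(probs[:1000], labels[:1000], alpha=0.1)
sets = predict_sets(probs[1000:], threshold)
coverage = sets[np.arange(1000), labels[1000:]].mean()  # fraction of sets containing the true label
```

    With `alpha=0.1`, the sets should contain the true class for roughly 90% of the test points, and a harder input simply yields a larger set rather than a silently wrong point prediction.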
    On the Generalizability and Predictability of Recommender Systems. (arXiv:2206.11886v1 [cs.IR])
    While other areas of machine learning have seen more and more automation, designing a high-performing recommender system still requires a high level of human effort. Furthermore, recent work has shown that modern recommender system algorithms do not always improve over well-tuned baselines. A natural follow-up question is, "how do we choose the right algorithm for a new dataset and performance metric?" In this work, we start by giving the first large-scale study of recommender system approaches by comparing 18 algorithms and 100 sets of hyperparameters across 85 datasets and 315 metrics. We find that the best algorithms and hyperparameters are highly dependent on the dataset and performance metric; however, there are also strong correlations between the performance of each algorithm and various meta-features of the datasets. Motivated by these findings, we create RecZilla, a meta-learning approach to recommender systems that uses a model to predict the best algorithm and hyperparameters for new, unseen datasets. By using far more meta-training data than prior work, RecZilla is able to substantially reduce the level of human involvement when faced with a new recommender system application. We not only release our code and pretrained RecZilla models, but also all of our raw experimental results, so that practitioners can train a RecZilla model for their desired performance metric: https://github.com/naszilla/reczilla.
    Optimization paper production through digitalization by developing an assistance system for machine operators including quality forecast: a concept. (arXiv:2206.11581v1 [eess.SY])
    Nowadays, cross-industry challenges include the reduction of greenhouse gas emissions and enabling a circular economy. However, the production of paper from waste paper is still a highly resource-intensive task, especially in terms of energy consumption. While paper machines produce a lot of data, we have identified a lack of utilization of it and implement a concept using an operator assistance system and state-of-the-art machine learning techniques, e.g., classification, forecasting and alarm flood handling algorithms, to support daily operator tasks. Our main objective is to provide situation-specific knowledge to machine operators utilizing available data. We expect this will result in better adjusted parameters and therefore a lower footprint of the paper machines.
    Few-Shot Non-Parametric Learning with Deep Latent Variable Model. (arXiv:2206.11573v1 [cs.LG])
    Most real-world problems that machine learning algorithms are expected to solve face the situation with 1) unknown data distribution; 2) little domain-specific knowledge; and 3) datasets with limited annotation. We propose Non-Parametric learning by Compression with Latent Variables (NPC-LV), a learning framework for any dataset with abundant unlabeled data but very few labeled ones. By only training a generative model in an unsupervised way, the framework utilizes the data distribution to build a compressor. Using a compressor-based distance metric derived from Kolmogorov complexity, together with few labeled data, NPC-LV classifies without further training. We show that NPC-LV outperforms supervised methods on all three datasets on image classification in the low-data regime and even outperforms semi-supervised learning methods on CIFAR-10. We demonstrate how and when the negative evidence lower bound (nELBO) can be used as an approximate compressed length for classification. By revealing the correlation between compression rate and classification accuracy, we illustrate that under NPC-LV, the improvement of generative models can enhance downstream classification accuracy.
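    The compressor-based distance idea can be sketched with an off-the-shelf compressor. The example below swaps in `zlib` for the paper's generative-model compressor and uses the normalized compression distance with a 1-nearest-neighbour rule. The function names and the toy byte strings are hypothetical.

```python
import zlib

def compressed_length(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in for a learned compressor)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small when x and y share structure."""
    cx, cy = compressed_length(x), compressed_length(y)
    cxy = compressed_length(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(query: bytes, labeled):
    """Return the label of the nearest labeled example under NCD (no training step)."""
    return min(labeled, key=lambda pair: ncd(query, pair[0]))[1]

# Hypothetical toy data: one labeled example per class.
labeled = [
    (b"the cat sat on the mat " * 5, "text"),
    (b"0123456789" * 10, "digits"),
]
text_guess = classify(b"the cat sat on the hat " * 5, labeled)
digit_guess = classify(b"0123456789" * 7, labeled)
```

    A better compressor assigns shorter codes to data that resembles what it has seen, which is the intuition behind using a generative model's approximate code length in place of `zlib`.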
    Functional Nonlinear Learning. (arXiv:2206.11424v1 [stat.ML])
    Using representations of functional data can be more convenient and beneficial in subsequent statistical models than direct observations. These representations, in a lower-dimensional space, extract and compress information from individual curves. The existing representation learning approaches in functional data analysis usually use linear mapping in parallel to those from multivariate analysis, e.g., functional principal component analysis (FPCA). However, functions, as infinite-dimensional objects, sometimes have nonlinear structures that cannot be uncovered by linear mapping. Linear methods will be more overwhelmed given multivariate functional data. For that matter, this paper proposes a functional nonlinear learning (FunNoL) method to sufficiently represent multivariate functional data in a lower-dimensional feature space. Furthermore, we merge a classification model for enriching the ability of representations in predicting curve labels. Hence, representations from FunNoL can be used for both curve reconstruction and classification. Additionally, we have endowed the proposed model with the ability to address the missing observation problem as well as to further denoise observations. The resulting representations are robust to observations that are locally disturbed by uncontrollable random noises. We apply the proposed FunNoL method to several real data sets and show that FunNoL can achieve better classifications than FPCA, especially in the multivariate functional data setting. Simulation studies have shown that FunNoL provides satisfactory curve classification and reconstruction regardless of data sparsity.
    EFFGAN: Ensembles of fine-tuned federated GANs. (arXiv:2206.11682v1 [cs.LG])
    Generative adversarial networks have proven to be a powerful tool for learning complex and high-dimensional data distributions, but issues such as mode collapse have been shown to make it difficult to train them. This is an even harder problem when the data is decentralized over several clients in a federated learning setup, as problems such as client drift and non-iid data make it hard for federated averaging to converge. In this work, we study the task of how to learn a data distribution when training data is heterogeneously decentralized over clients and cannot be shared. Our goal is to sample from this distribution centrally, while the data never leaves the clients. We show using standard benchmark image datasets that existing approaches fail in this setting, experiencing so-called client drift when the local number of epochs becomes too large. We thus propose a novel approach we call EFFGAN: Ensembles of fine-tuned federated GANs. Being an ensemble of local expert generators, EFFGAN is able to learn the data distribution over all clients and mitigate client drift. It is able to train with a large number of local epochs, making it more communication efficient than previous works.
    Utilizing Expert Features for Contrastive Learning of Time-Series Representations. (arXiv:2206.11517v1 [cs.LG])
    We present an approach that incorporates expert knowledge for time-series representation learning. Our method employs expert features to replace the commonly used data transformations in previous contrastive learning approaches. We do this since time-series data frequently stems from the industrial or medical field where expert features are often available from domain experts, while transformations are generally elusive for time-series data. We start by proposing two properties that useful time-series representations should fulfill and show that current representation learning approaches do not ensure these properties. We therefore devise ExpCLR, a novel contrastive learning approach built on an objective that utilizes expert features to encourage both properties for the learned representation. Finally, we demonstrate on three real-world time-series datasets that ExpCLR surpasses several state-of-the-art methods for both unsupervised and semi-supervised representation learning.
    Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation. (arXiv:2206.11489v1 [cs.LG])
    We study reinforcement learning with linear function approximation where the transition probability and reward functions are linear with respect to a feature mapping $\boldsymbol{\phi}(s,a)$. Specifically, we consider the episodic inhomogeneous linear Markov Decision Process (MDP), and propose a novel computation-efficient algorithm, LSVI-UCB$^+$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps. LSVI-UCB$^+$ builds on weighted ridge regression and upper confidence value iteration with a Bernstein-type exploration bonus. Our statistical results are obtained with novel analytical tools, including a new Bernstein self-normalized bound with conservatism on elliptical potentials, and refined analysis of the correction term. To the best of our knowledge, this is the first minimax optimal algorithm for linear MDPs up to logarithmic factors, which closes the $\sqrt{Hd}$ gap between the best known upper bound of $\widetilde{O}(\sqrt{H^3d^3T})$ in \cite{jin2020provably} and lower bound of $\Omega(Hd\sqrt{T})$ for linear MDPs.
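    The size of the gap quoted in this abstract follows directly from dividing the previous upper bound by the lower bound, ignoring logarithmic factors:

    $$\frac{\sqrt{H^3 d^3 T}}{H d \sqrt{T}} = \sqrt{\frac{H^3 d^3 T}{H^2 d^2 T}} = \sqrt{Hd}.$$

    An upper bound of $\widetilde{O}(Hd\sqrt{T})$ therefore matches the lower bound $\Omega(Hd\sqrt{T})$ up to logarithmic factors.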
    GACT: Activation Compressed Training for General Architectures. (arXiv:2206.11357v1 [cs.LG])
    Training large neural network (NN) models requires extensive memory resources, and Activation Compressed Training (ACT) is a promising approach to reduce training memory footprint. This paper presents GACT, an ACT framework to support a broad range of machine learning tasks for generic NN architectures with limited domain knowledge. By analyzing a linearized version of ACT's approximate gradient, we prove the convergence of GACT without prior knowledge on operator type or model architecture. To make training stable, we propose an algorithm that decides the compression ratio for each tensor by estimating its impact on the gradient at run time. We implement GACT as a PyTorch library that readily applies to any NN architecture. GACT reduces the activation memory for convolutional NNs, transformers, and graph NNs by up to 8.1x, enabling training with a 4.2x to 24.7x larger batch size, with negligible accuracy loss.
    A Framework for Understanding Model Extraction Attack and Defense. (arXiv:2206.11480v1 [cs.LG])
    The privacy of machine learning models has become a significant concern in many emerging Machine-Learning-as-a-Service applications, where prediction services based on well-trained models are offered to users via pay-per-query. The lack of a defense mechanism can impose a high risk on the privacy of the server's model since an adversary could efficiently steal the model by querying only a few `good' data points. The interplay between a server's defense and an adversary's attack inevitably leads to an arms race dilemma, as commonly seen in Adversarial Machine Learning. To study the fundamental tradeoffs between model utility from a benign user's view and privacy from an adversary's view, we develop new metrics to quantify such tradeoffs, analyze their theoretical properties, and develop an optimization problem to understand the optimal adversarial attack and defense strategies. The developed concepts and theory match the empirical findings on the `equilibrium' between privacy and utility. In terms of optimization, the key ingredient that enables our results is a unified representation of the attack-defense problem as a min-max bi-level problem. The developed results will be demonstrated by examples and experiments.
    Few-shot Long-Tailed Bird Audio Recognition. (arXiv:2206.11260v1 [cs.SD])
    It is easier to hear birds than see them. However, they still play an essential role in nature and are excellent indicators of deteriorating environmental quality and pollution. Recent advances in Machine Learning and Convolutional Neural Networks allow us to process continuous audio data to detect and classify bird sounds. This technology can assist researchers in monitoring bird populations' status and trends and ecosystems' biodiversity. We propose a sound detection and classification pipeline to analyze complex soundscape recordings and identify birdcalls in the background. Our method learns from weak labels and few data and acoustically recognizes the bird species. Our solution achieved 18th place of 807 teams at the BirdCLEF 2022 Challenge hosted on Kaggle.
    Synthetic Data-Based Simulators for Recommender Systems: A Survey. (arXiv:2206.11338v1 [cs.IR])
    This survey aims at providing a comprehensive overview of the recent trends in the field of modeling and simulation (M&S) of interactions between users and recommender systems and applications of the M&S to the performance improvement of industrial recommender engines. We start with the motivation behind the development of frameworks implementing the simulations -- simulators -- and the usage of them for training and testing recommender systems of different types (including Reinforcement Learning ones). Furthermore, we provide a new consistent classification of existing simulators based on their functionality, approbation, and industrial effectiveness and moreover make a summary of the simulators found in the research literature. Besides other things, we discuss the building blocks of simulators: methods for synthetic data (user, item, user-item responses) generation, methods for what-if experimental analysis, methods and datasets used for simulation quality evaluation (including the methods that monitor and/or close possible simulation-to-reality gaps), and methods for summarization of experimental simulation results. Finally, this survey considers emerging topics and open problems in the field.
    Measurement and applications of position bias in a marketplace search engine. (arXiv:2206.11720v1 [cs.IR])
    Search engines intentionally influence user behavior by picking and ranking the list of results. Users engage with the highest results both because of their prominent placement and because they are typically the most relevant documents. Search engine ranking algorithms need to identify relevance while incorporating the influence of the search engine itself. This paper describes our efforts at Thumbtack to understand the impact of ranking, including the empirical results of a randomization program. In the context of a consumer marketplace we discuss practical details of model choice, experiment design, bias calculation, and machine learning model adaptation. We include a novel discussion of how ranking bias may not only affect labels, but also model features. The randomization program led to improved models, motivated internal scenario analysis, and enabled user-facing scenario tooling.
    Context matters for fairness -- a case study on the effect of spatial distribution shifts. (arXiv:2206.11436v1 [cs.LG])
    With the ever growing involvement of data-driven AI-based decision making technologies in our daily social lives, the fairness of these systems is becoming a crucial concern. However, an important and often challenging aspect in utilizing such systems is to determine the range of application over which they remain valid, especially under distribution shifts, i.e., when a model is deployed on data with a different distribution than the training set. In this paper, we present a case study on the newly released American Census datasets, a reconstruction of the popular Adult dataset, to illustrate the importance of context for fairness and show how remarkably spatial distribution shifts can affect predictive- and fairness-related performance of a model. The problem persists for fairness-aware learning models with the effects of context-specific fairness interventions differing across the states and different population groups. Our study suggests that robustness to distribution shifts is necessary before deploying a model to another context.
    Learning Towards the Largest Margins. (arXiv:2206.11589v1 [cs.CV])
    One of the main challenges for feature representation in deep learning-based classification is the design of appropriate loss functions that exhibit strong discriminative power. The classical softmax loss does not explicitly encourage discriminative learning of features. A popular direction of research is to incorporate margins in well-established losses in order to enforce extra intra-class compactness and inter-class separability, which, however, were developed through heuristic means, as opposed to rigorous mathematical principles. In this work, we attempt to address this limitation by formulating the principled optimization objective as learning towards the largest margins. Specifically, we firstly define the class margin as the measure of inter-class separability, and the sample margin as the measure of intra-class compactness. Accordingly, to encourage discriminative representation of features, the loss function should promote the largest possible margins for both classes and samples. Furthermore, we derive a generalized margin softmax loss to draw general conclusions for the existing margin-based losses. Not only does this principled framework offer new perspectives to understand and interpret existing margin-based losses, but it also provides new insights that can guide the design of new tools, including sample margin regularization and largest margin softmax loss for the class-balanced case, and zero-centroid regularization for the class-imbalanced case. Experimental results demonstrate the effectiveness of our strategy on a variety of tasks, including visual classification, imbalanced classification, person re-identification, and face verification.
    Positive-Unlabeled Learning with Adversarial Data Augmentation for Knowledge Graph Completion. (arXiv:2205.00904v3 [cs.LG] UPDATED)
    Most real-world knowledge graphs (KG) are far from complete and comprehensive. This problem has motivated efforts in predicting the most plausible missing facts to complete a given KG, i.e., knowledge graph completion (KGC). However, existing KGC methods suffer from two main issues: 1) the false negative issue, i.e., the sampled negative training instances may include potential true facts; and 2) the data sparsity issue, i.e., true facts account for only a tiny part of all possible facts. To this end, we propose positive-unlabeled learning with adversarial data augmentation (PUDA) for KGC. In particular, PUDA tailors a positive-unlabeled risk estimator for the KGC task to deal with the false negative issue. Furthermore, to address the data sparsity issue, PUDA achieves a data augmentation strategy by unifying adversarial training and positive-unlabeled learning under the positive-unlabeled minimax game. Extensive experimental results on real-world benchmark datasets demonstrate the effectiveness and compatibility of our proposed method.
    FINGER: Fast Inference for Graph-based Approximate Nearest Neighbor Search. (arXiv:2206.11408v1 [cs.LG])
    Approximate K-Nearest Neighbor Search (AKNNS) has now become ubiquitous in modern applications, for example, as a fast search procedure with two tower deep learning models. Graph-based methods for AKNNS in particular have received great attention due to their superior performance. These methods rely on greedy graph search to traverse the data points as embedding vectors in a database. Under this greedy search scheme, we make a key observation: many distance computations do not influence search updates so these computations can be approximated without hurting performance. As a result, we propose FINGER, a fast inference method to achieve efficient graph search. FINGER approximates the distance function by estimating angles between neighboring residual vectors with low-rank bases and distribution matching. The approximated distance can be used to bypass unnecessary computations, which leads to faster searches. Empirically, accelerating a popular graph-based method named HNSW by FINGER is shown to outperform existing graph-based methods by 20%-60% across different benchmark datasets.
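    The greedy search scheme mentioned above can be sketched as follows. This is a simplified single-path walk, not HNSW's actual beam search and not FINGER's approximation; the function name `greedy_search`, the adjacency-dict graph layout, and the distance counter are assumptions made for illustration. The counter shows how many exact distance evaluations a walk spends, including those that never change the current best node.

```python
import math

def greedy_search(graph, vectors, query, start):
    """Greedy walk on a proximity graph (illustrative sketch).

    graph:   dict mapping node id -> list of neighbor ids
    vectors: dict mapping node id -> coordinate tuple
    Returns (closest node found, its distance to query, number of distance evaluations).
    """
    current = start
    current_dist = math.dist(vectors[current], query)
    n_dist = 1  # count every exact distance evaluation
    while True:
        best, best_dist = current, current_dist
        for nb in graph[current]:
            d = math.dist(vectors[nb], query)
            n_dist += 1
            if d < best_dist:  # only these evaluations change the search state
                best, best_dist = nb, d
        if best == current:  # no neighbor is closer: local minimum reached
            return current, current_dist, n_dist
        current, current_dist = best, best_dist
```

    On a small chain graph, several of the counted evaluations go to nodes that are farther from the query than the current best, which is the kind of computation the abstract says can be approximated.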
    Neural Implicit Manifold Learning for Topology-Aware Generative Modelling. (arXiv:2206.11267v1 [stat.ML])
    Natural data observed in $\mathbb{R}^n$ is often constrained to an $m$-dimensional manifold $\mathcal{M}$, where $m < n$. Current generative models represent this manifold by mapping an $m$-dimensional latent variable through a neural network $f_\theta: \mathbb{R}^m \to \mathbb{R}^n$. Such procedures, which we call pushforward models, incur a straightforward limitation: manifolds cannot in general be represented with a single parameterization, meaning that attempts to do so will incur either computational instability or the inability to learn probability densities within the manifold. To remedy this problem, we propose to model $\mathcal{M}$ as a neural implicit manifold: the set of zeros of a neural network. To learn the data distribution within $\mathcal{M}$, we introduce constrained energy-based models, which use a constrained variant of Langevin dynamics to train and sample within the learned manifold. The resulting model can be manipulated with an arithmetic of manifolds which allows practitioners to take unions and intersections of model manifolds. In experiments on synthetic and natural data, we show that constrained EBMs can learn manifold-supported distributions with complex topologies more accurately than pushforward models.
    Disentangling representations in Restricted Boltzmann Machines without adversaries. (arXiv:2206.11600v1 [cs.LG])
    A goal of unsupervised machine learning is to disentangle representations of complex high-dimensional data, allowing for interpreting the significant latent factors of variation in the data as well as for manipulating them to generate new data with desirable features. These methods often rely on an adversarial scheme, in which representations are tuned to prevent discriminators from reconstructing specific data information (labels). We propose a simple, effective way of disentangling representations without any need to train adversarial discriminators, and apply our approach to Restricted Boltzmann Machines (RBM), one of the simplest representation-based generative models. Our approach relies on the introduction of adequate constraints on the weights during training, which allows us to concentrate information about labels on a small subset of latent variables. The effectiveness of the approach is illustrated on the MNIST dataset, the two-dimensional Ising model, and taxonomy of protein families. In addition, we show how our framework allows for computing the cost, in terms of log-likelihood of the data, associated with the disentanglement of their representations.
    Offline RL for Natural Language Generation with Implicit Language Q Learning. (arXiv:2206.11871v1 [cs.CL])
    Large language models distill broad knowledge from text corpora. However, they can be inconsistent when it comes to completing user specified tasks. This issue can be addressed by finetuning such models via supervised learning on curated datasets, or via reinforcement learning. In this work, we propose a novel offline RL motivated method, implicit language Q-learning (ILQL), designed for use on language models, that combines both the flexible utility optimization framework of traditional RL algorithms with supervised learning's ability to leverage existing data and its simplicity and stability. Our method, based on dynamic programming, employs a blend of value conservatism alongside an implicit dataset support constraint in learning value functions, which are then used to guide language model generations towards maximizing utility. In addition to empirically validating ILQL, we present a detailed empirical analysis of situations where offline RL can be useful in natural language generation settings, demonstrating how it can be a more effective utility optimizer than prior approaches for end-to-end dialogue, and how it can effectively optimize high variance reward functions based on subjective judgement, such as whether to label a comment as an example of toxic speech or not.
    Prompt Injection: Parameterization of Fixed Inputs. (arXiv:2206.11349v1 [cs.LG])
    Recent works have shown that attaching prompts to the input is effective at conditioning Language Models (LM) to perform specific tasks. However, prompts are always included in the input text during inference, thus incurring substantial computational and memory overhead. Also, there is currently no straightforward method of utilizing prompts that are longer than the maximum input length of the LMs without incurring additional costs during inference. We propose Prompt Injection (PI), a novel formulation of injecting the prompt into the parameters of an LM to be an efficient alternative to attaching fixed prompts to the input. We show that in scenarios with long fixed prompts, PI can be up to 280 times more efficient in terms of total FLOPs than previous approaches. We further explore methodologies for PI and show promising results in persona-dependent conversation, semantic parsing, and zero-shot learning with task instructions. Through these explorations, we show that PI can be a promising direction for conditioning language models, especially in scenarios with long and fixed prompts.
    Latent Policies for Adversarial Imitation Learning. (arXiv:2206.11299v1 [cs.LG])
    This paper considers learning robot locomotion and manipulation tasks from expert demonstrations. Generative adversarial imitation learning (GAIL) trains a discriminator that distinguishes expert from agent transitions, and in turn uses a reward defined by the discriminator output to optimize a policy generator for the agent. This generative adversarial training approach is very powerful but depends on a delicate balance between the discriminator and the generator training. In high-dimensional problems, the discriminator training may easily overfit or exploit associations with task-irrelevant features for transition classification. A key insight of this work is that performing imitation learning in a suitable latent task space makes the training process stable, even in challenging high-dimensional problems. We use an action encoder-decoder model to obtain a low-dimensional latent action space and train a LAtent Policy using Adversarial imitation Learning (LAPAL). The encoder-decoder model can be trained offline from state-action pairs to obtain a task-agnostic latent action representation or online, simultaneously with the discriminator and generator training, to obtain a task-aware latent action representation. We demonstrate that LAPAL training is stable, with near-monotonic performance improvement, and achieves expert performance in most locomotion and manipulation tasks, while a GAIL baseline converges more slowly and does not achieve expert performance in high-dimensional environments.
    The ArtBench Dataset: Benchmarking Generative Models with Artworks. (arXiv:2206.11404v1 [cs.CV])
    We introduce ArtBench-10, the first class-balanced, high-quality, cleanly annotated, and standardized dataset for benchmarking artwork generation. It comprises 60,000 images of artwork from 10 distinctive artistic styles, with 5,000 training images and 1,000 testing images per style. ArtBench-10 has several advantages over previous artwork datasets. Firstly, it is class-balanced while most previous artwork datasets suffer from long-tailed class distributions. Secondly, the images are of high quality with clean annotations. Thirdly, ArtBench-10 is created with standardized data collection, annotation, filtering, and preprocessing procedures. We provide three versions of the dataset with different resolutions ($32\times32$, $256\times256$, and original image size), formatted in a way that is easy to incorporate into popular machine learning frameworks. We also conduct extensive benchmarking experiments using representative image synthesis models with ArtBench-10 and present in-depth analysis. The dataset is available at https://github.com/liaopeiyuan/artbench under a Fair Use license.
    Attention-aware contrastive learning for predicting T cell receptor-antigen binding specificity. (arXiv:2206.11255v1 [q-bio.QM])
    It has been verified that only a small fraction of the neoantigens presented by MHC class I molecules on the cell surface can elicit T cells. The limitation can be attributed to the binding specificity of T cell receptor (TCR) to peptide-MHC complex (pMHC). Computational prediction of T cell binding to neoantigens is a challenging and unresolved task. In this paper, we propose an attentive-mask contrastive learning model, ATMTCR, for inferring TCR-antigen binding specificity. For each input TCR sequence, we used a Transformer encoder to transform it into a latent representation, and then masked a proportion of residues guided by attention weights to generate its contrastive view. Pretraining on large-scale TCR CDR3 sequences, we verified that contrastive learning significantly improved the prediction performance of TCR binding to peptide-MHC complex (pMHC). Beyond the detection of important amino acids and their locations in the TCR sequence, our model can also extract high-order semantic information underlying the TCR-antigen binding specificity. In comparison experiments on two independent datasets, our method achieved better performance than other existing algorithms. Moreover, we effectively identified important amino acids and their positional preferences through attention weights, which indicated the interpretability of our proposed model.
    Optimistic Linear Support and Successor Features as a Basis for Optimal Policy Transfer. (arXiv:2206.11326v1 [cs.LG])
    In many real-world applications, reinforcement learning (RL) agents might have to solve multiple tasks, each one typically modeled via a reward function. If reward functions are expressed linearly, and the agent has previously learned a set of policies for different tasks, successor features (SFs) can be exploited to combine such policies and identify reasonable solutions for new problems. However, the identified solutions are not guaranteed to be optimal. We introduce a novel algorithm that addresses this limitation. It allows RL agents to combine existing policies and directly identify optimal policies for arbitrary new problems, without requiring any further interactions with the environment. We first show (under mild assumptions) that the transfer learning problem tackled by SFs is equivalent to the problem of learning to optimize multiple objectives in RL. We then introduce an SF-based extension of the Optimistic Linear Support algorithm to learn a set of policies whose SFs form a convex coverage set. We prove that policies in this set can be combined via generalized policy improvement to construct optimal behaviors for any new linearly-expressible tasks, without requiring any additional training samples. We empirically show that our method outperforms state-of-the-art competing algorithms both in discrete and continuous domains under value function approximation.
    Optimally Weighted Ensembles of Regression Models: Exact Weight Optimization and Applications. (arXiv:2206.11263v1 [cs.LG])
    Automated model selection is often proposed to users to choose which machine learning model (or method) to apply to a given regression task. In this paper, we show that combining different regression models can yield better results than selecting a single ('best') regression model, and outline an efficient method that obtains an optimally weighted convex linear combination from a heterogeneous set of regression models. More specifically, in this paper, a heuristic weight optimization, used in a preceding conference paper, is replaced by an exact optimization algorithm using convex quadratic programming. We prove convexity of the quadratic programming formulation for the straightforward formulation and for a formulation with weighted data points. The novel weight optimization is not only (more) exact but also more efficient. The methods we develop in this paper are implemented and made available as open source on GitHub. They can be executed on commonly available hardware and offer a transparent and easy to interpret interface. The results indicate that the approach outperforms model selection methods on a range of data sets, including data sets with mixed variable type from drug discovery applications.
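    As a rough illustration of the weight optimization described above: when the weights are non-negative and sum to one, the squared error of the combined prediction equals $w^\top G w$, where $G$ is the Gram matrix of the per-model residuals, so the task is a convex quadratic program over the simplex. The sketch below does not reproduce the paper's exact QP algorithm; it substitutes plain projected gradient descent, and the names `project_to_simplex` and `ensemble_weights` are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1} (sort-based method)."""
    u = sorted(v, reverse=True)
    cumulative, tau = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:  # the last k for which this holds fixes the threshold
            tau = t
    return [max(x - tau, 0.0) for x in v]

def ensemble_weights(predictions, targets, n_iter=2000):
    """Convex combination weights minimizing squared error (illustrative sketch).

    predictions[m][i] is the prediction of model m on sample i.
    Uses projected gradient descent as a stand-in for an exact QP solver.
    """
    m = len(predictions)
    residuals = [[p - y for p, y in zip(pred, targets)] for pred in predictions]
    # Gram matrix of residuals: ensemble squared error is w^T G w when sum(w) = 1.
    gram = [[sum(a * b for a, b in zip(residuals[r], residuals[s]))
             for s in range(m)] for r in range(m)]
    # Step 1/L with L bounded by 2 * trace(G); small constant avoids division by zero.
    step = 1.0 / (2.0 * sum(gram[r][r] for r in range(m)) + 1e-12)
    w = [1.0 / m] * m
    for _ in range(n_iter):
        grad = [2.0 * sum(gram[r][s] * w[s] for s in range(m)) for r in range(m)]
        w = project_to_simplex([w[r] - step * grad[r] for r in range(m)])
    return w
```

    For two models whose errors have opposite signs, the returned weights balance the errors so that they cancel, which a single selected model cannot do.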
    Efficient Adaptive Federated Optimization of Federated Learning for IoT. (arXiv:2206.11448v1 [cs.LG])
    The proliferation of the Internet of Things (IoT) and widespread use of devices with sensing, computing, and communication capabilities have motivated intelligent applications empowered by artificial intelligence. The classical artificial intelligence algorithms require centralized data collection and processing, which are challenging in realistic intelligent IoT applications due to growing data privacy concerns and distributed datasets. Federated Learning (FL) has emerged as a distributed privacy-preserving learning framework that enables IoT devices to train a global model through sharing model parameters. However, inefficiency due to frequent parameter transmissions significantly reduces FL performance. Existing acceleration algorithms are of two main types: local update, which considers trade-offs between communication and computation, and parameter compression, which considers trade-offs between communication and precision. Jointly considering these two trade-offs and adaptively balancing their impacts on convergence have remained unresolved. To solve the problem, this paper proposes a novel efficient adaptive federated optimization (EAFO) algorithm to improve the efficiency of FL, which minimizes the learning error by jointly considering two variables, local update and parameter compression, and enables FL to adaptively adjust the two variables and balance trade-offs among computation, communication and precision. The experiment results illustrate that, compared with state-of-the-art algorithms, the proposed EAFO can achieve higher accuracies faster.
    Curious Exploration via Structured World Models Yields Zero-Shot Object Manipulation. (arXiv:2206.11403v1 [cs.LG])
    It has been a long-standing dream to design artificial agents that explore their environment efficiently via intrinsic motivation, similar to how children perform curious free play. Despite recent advances in intrinsically motivated reinforcement learning (RL), sample-efficient exploration in object manipulation scenarios remains a significant challenge as most of the relevant information lies in the sparse agent-object and object-object interactions. In this paper, we propose to use structured world models to incorporate relational inductive biases in the control loop to achieve sample-efficient and interaction-rich exploration in compositional multi-object environments. By planning for future novelty inside structured world models, our method generates free-play behavior that starts to interact with objects early on and develops more complex behavior over time. Instead of using models only to compute intrinsic rewards, as commonly done, our method showcases that the self-reinforcing cycle between good models and good exploration also opens up another avenue: zero-shot generalization to downstream tasks via model-based planning. After the entirely intrinsic task-agnostic exploration phase, our method solves challenging downstream tasks such as stacking, flipping, pick & place, and throwing, and generalizes to unseen numbers and arrangements of objects without any additional training.
    Stochastic Langevin Differential Inclusions with Applications to Machine Learning. (arXiv:2206.11533v1 [math.OC])
    Stochastic differential equations of Langevin-diffusion form have received significant recent attention, thanks to their foundational role in both Bayesian sampling algorithms and optimization in machine learning. In the latter, they serve as a conceptual model of the stochastic gradient flow in training over-parametrized models. However, the literature typically assumes smoothness of the potential, whose gradient is the drift term. Nevertheless, there are many problems for which the potential function is not continuously differentiable, and hence the drift is not Lipschitz-continuous everywhere. This is exemplified by robust losses and Rectified Linear Units in regression problems. In this paper, we show some foundational results regarding the flow and asymptotic properties of Langevin-type Stochastic Differential Inclusions under assumptions appropriate to the machine-learning settings. In particular, we show strong existence of the solution, as well as asymptotic minimization of the canonical Free Energy Functional.
    $\ell_{\infty}$-Bounds of the MLE in the BTL Model under General Comparison Graphs. (arXiv:2110.10825v2 [math.ST] UPDATED)
    The Bradley-Terry-Luce (BTL) model is a popular statistical approach for estimating the global ranking of a collection of items using pairwise comparisons. To ensure accurate ranking, it is essential to obtain precise estimates of the model parameters in the $\ell_{\infty}$-loss. The difficulty of this task depends crucially on the topology of the pairwise comparison graph over the given items. However, beyond very few well-studied cases, such as the complete and Erdős-Rényi comparison graphs, little is known about the performance of the maximum likelihood estimator (MLE) of the BTL model parameters in the $\ell_{\infty}$-loss under more general graph topologies. In this paper, we derive novel, general upper bounds on the $\ell_{\infty}$ estimation error of the BTL MLE that depend explicitly on the algebraic connectivity of the comparison graph, the maximal performance gap across items and the sample complexity. We demonstrate that the derived bounds perform well and in some cases are sharper compared to known results obtained using different loss functions and more restricted assumptions and graph topologies. We carefully compare our results to Yan et al. (2012), which is closest in spirit to our work. We further provide minimax lower bounds under $\ell_{\infty}$-error that nearly match the upper bounds over a class of sufficiently regular graph topologies. Finally, we study the implications of our $\ell_{\infty}$-bounds for efficient (offline) tournament design. We illustrate and discuss our findings through various examples and simulations.
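    To make the quantities in this abstract concrete, the sketch below assumes the usual BTL parameterization, in which item $i$ beats item $j$ with probability $e^{\theta_i}/(e^{\theta_i}+e^{\theta_j})$, and computes the MLE with the classical minorization-maximization (Zermelo-style) iteration. That iteration is one standard way to compute the BTL MLE and is not taken from the paper; the names `btl_win_prob`, `btl_mle`, and `linf_error` are hypothetical. The iteration needs every item to have at least one win and one loss along a connected comparison graph.

```python
import math

def btl_win_prob(theta_i, theta_j):
    """P(item i beats item j) under BTL with log-scores theta."""
    return 1.0 / (1.0 + math.exp(theta_j - theta_i))

def btl_mle(wins, max_iter=10000, tol=1e-12):
    """BTL maximum likelihood estimate via the MM iteration (illustrative sketch).

    wins[i][j] is the number of times item i beat item j (zero diagonal).
    Returns log-scores normalized to sum to zero.
    """
    n = len(wins)
    total_wins = [sum(row) for row in wins]
    w = [1.0] * n
    for _ in range(max_iter):
        new_w = []
        for i in range(n):
            # Sum over compared pairs: (number of i-vs-j games) / (w_i + w_j).
            denom = sum((wins[i][j] + wins[j][i]) / (w[i] + w[j])
                        for j in range(n) if j != i)
            new_w.append(total_wins[i] / denom)
        # Fix the scale: geometric mean one, so log-scores sum to zero.
        shift = sum(math.log(v) for v in new_w) / n
        new_w = [v / math.exp(shift) for v in new_w]
        change = max(abs(math.log(a) - math.log(b)) for a, b in zip(new_w, w))
        w = new_w
        if change < tol:
            break
    return [math.log(v) for v in w]

def linf_error(estimate, truth):
    """Largest absolute coordinate-wise error, the l-infinity loss."""
    return max(abs(a - b) for a, b in zip(estimate, truth))
```

    Feeding in expected win counts from known parameters on a complete graph recovers those parameters, which gives a quick sanity check before trying sparser comparison graphs or sampled outcomes.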
    How causal machine learning can leverage marketing strategies: Assessing and improving the performance of a coupon campaign. (arXiv:2204.10820v2 [econ.GN] UPDATED)
    We apply causal machine learning algorithms to assess the causal effect of a marketing intervention, namely a coupon campaign, on the sales of a retailer. Besides assessing the average impacts of different types of coupons, we also investigate the heterogeneity of causal effects across different subgroups of customers, e.g., between clients with relatively high vs. low prior purchases. Finally, we use optimal policy learning to determine (in a data-driven way) which customer groups should be targeted by the coupon campaign in order to maximize the marketing intervention's effectiveness in terms of sales. We find that only two out of the five coupon categories examined, namely coupons applicable to the product categories of drugstore items and other food, have a statistically significant positive effect on retailer sales. The assessment of group average treatment effects reveals substantial differences in the impact of coupon provision across customer groups, particularly across customer groups as defined by prior purchases at the store, with drugstore coupons being particularly effective among customers with high prior purchases and other food coupons among customers with low prior purchases. Our study provides a use case for the application of causal machine learning in business analytics to evaluate the causal impact of specific firm policies (like marketing campaigns) for decision support.
    Bayesian Nonparametrics for Offline Skill Discovery. (arXiv:2202.04675v3 [cs.LG] UPDATED)
    Skills or low-level policies in reinforcement learning are temporally extended actions that can speed up learning and enable complex behaviours. Recent work in offline reinforcement learning and imitation learning has proposed several techniques for skill discovery from a set of expert trajectories. While these methods are promising, the number K of skills to discover is always a fixed hyperparameter, which requires either prior knowledge about the environment or an additional parameter search to tune it. We first propose a method for offline learning of options (a particular skill framework) exploiting advances in variational inference and continuous relaxations. We then highlight an unexplored connection between Bayesian nonparametrics and offline skill discovery, and show how to obtain a nonparametric version of our model. This version is tractable thanks to a carefully structured approximate posterior with a dynamically-changing number of options, removing the need to specify K. We also show how our nonparametric extension can be applied in other skill frameworks, and empirically demonstrate that our method can outperform state-of-the-art offline skill learning algorithms across a variety of environments. Our code is available at https://github.com/layer6ai-labs/BNPO .
    Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process. (arXiv:2202.10589v3 [stat.ML] UPDATED)
    This paper is concerned with constructing a confidence interval for a target policy's value offline based on pre-collected observational data in infinite horizon settings. Most of the existing works assume no unmeasured variables exist that confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and technological industries. In this paper, we show that with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient off-policy value estimator that is robust to potential model misspecification and provide rigorous uncertainty quantification. Our method is justified by theoretical results, as well as simulated and real datasets obtained from ridesharing companies. A Python implementation of the proposed procedure is available at https://github.com/Mamba413/cope.
    Sequential Importance Sampling for Hybrid Model Bayesian Inference to Support Bioprocess Mechanism Learning and Robust Control. (arXiv:2205.02410v3 [stat.ML] UPDATED)
    Driven by the critical needs of biomanufacturing 4.0, we introduce a probabilistic knowledge graph hybrid model characterizing the risk- and science-based understanding of bioprocess mechanisms. It can faithfully capture the important properties, including nonlinear reactions, partially observed state, and nonstationary dynamics. Given very limited real process observations, we derive a posterior distribution quantifying model estimation uncertainty. To avoid the evaluation of intractable likelihoods, Approximate Bayesian Computation sampling with Sequential Monte Carlo (ABC-SMC) is utilized to approximate the posterior distribution. Under high stochastic and model uncertainties, it is computationally expensive to match output trajectories. Therefore, we create a linear Gaussian dynamic Bayesian network (LG-DBN) auxiliary likelihood-based ABC-SMC approach. Through matching the summary statistics driven through LG-DBN likelihood that can capture critical interactions and variations, the proposed algorithm can accelerate hybrid model inference, support process monitoring, and facilitate mechanism learning and robust control.
    Do More Negative Samples Necessarily Hurt in Contrastive Learning? (arXiv:2205.01789v2 [cs.LG] UPDATED)
    Recent investigations in noise contrastive estimation suggest, both empirically as well as theoretically, that while having more "negative samples" in the contrastive loss improves downstream classification performance initially, beyond a threshold, it hurts downstream performance due to a "collision-coverage" trade-off. But is such a phenomenon inherent in contrastive learning? We show in a simple theoretical setting, where positive pairs are generated by sampling from the underlying latent class (introduced by Saunshi et al. (ICML 2019)), that the downstream performance of the representation optimizing the (population) contrastive loss in fact does not degrade with the number of negative samples. Along the way, we give a structural characterization of the optimal representation in our framework, for noise contrastive estimation. We also provide empirical support for our theoretical results on CIFAR-10 and CIFAR-100 datasets.
    Subexponential-Time Algorithms for Sparse PCA. (arXiv:1907.11635v3 [math.ST] UPDATED)
    We study the computational cost of recovering a unit-norm sparse principal component $x \in \mathbb{R}^n$ planted in a random matrix, in either the Wigner or Wishart spiked model (observing either $W + \lambda xx^\top$ with $W$ drawn from the Gaussian orthogonal ensemble, or $N$ independent samples from $\mathcal{N}(0, I_n + \beta xx^\top)$, respectively). Prior work has shown that when the signal-to-noise ratio ($\lambda$ or $\beta\sqrt{N/n}$, respectively) is a small constant and the fraction of nonzero entries in the planted vector is $\|x\|_0 / n = \rho$, it is possible to recover $x$ in polynomial time if $\rho \lesssim 1/\sqrt{n}$. While it is possible to recover $x$ in exponential time under the weaker condition $\rho \ll 1$, it is believed that polynomial-time recovery is impossible unless $\rho \lesssim 1/\sqrt{n}$. We investigate the precise amount of time required for recovery in the "possible but hard" regime $1/\sqrt{n} \ll \rho \ll 1$ by exploring the power of subexponential-time algorithms, i.e., algorithms running in time $\exp(n^\delta)$ for some constant $\delta \in (0,1)$. For any $1/\sqrt{n} \ll \rho \ll 1$, we give a recovery algorithm with runtime roughly $\exp(\rho^2 n)$, demonstrating a smooth tradeoff between sparsity and runtime. Our family of algorithms interpolates smoothly between two existing algorithms: the polynomial-time diagonal thresholding algorithm and the $\exp(\rho n)$-time exhaustive search algorithm. Furthermore, by analyzing the low-degree likelihood ratio, we give rigorous evidence suggesting that the tradeoff achieved by our algorithms is optimal.
    Identify treatment effect patterns for personalised decisions. (arXiv:1906.06080v2 [stat.ME] UPDATED)
    In personalised decision making, evidence is required to determine whether an action (treatment) is suitable for an individual. Such evidence can be obtained by modelling treatment effect heterogeneity in subgroups. The existing interpretable modelling methods take a top-down approach to search for subgroups with heterogeneous treatment effects and they may miss the most specific and relevant context for an individual. In this paper, we design a \emph{Treatment effect pattern (TEP)} to represent treatment effect heterogeneity in data. To achieve an interpretable presentation of TEPs, we use a local causal structure around the outcome to explicitly show how those important variables are used in modelling. We also derive a formula for unbiasedly estimating the \emph{Conditional Average Causal Effect (CATE)} using the local structure in our problem setting. In the discovery process, we aim at minimising heterogeneity within each subgroup represented by a pattern. We propose a bottom-up search algorithm to discover the most specific patterns fitting individual circumstances the best for personalised decision making. Experiments show that the proposed method models treatment effect heterogeneity better than three other existing tree based methods in synthetic and real world data sets.
    Approximation Benefits of Policy Gradient Methods with Aggregated States. (arXiv:2007.11684v3 [cs.LG] UPDATED)
    Folklore suggests that policy gradient can be more robust to misspecification than its relative, approximate policy iteration. This paper studies the case of state-aggregated representations, where the state space is partitioned and either the policy or value function approximation is held constant over partitions. This paper shows a policy gradient method converges to a policy whose regret per-period is bounded by $\epsilon$, the largest difference between two elements of the state-action value function belonging to a common partition. With the same representation, both approximate policy iteration and approximate value iteration can produce policies whose per-period regret scales as $\epsilon/(1-\gamma)$, where $\gamma$ is a discount factor. Faced with inherent approximation error, methods that locally optimize the true decision-objective can be far more robust.
    Matrix-wise $\ell_0$-constrained Sparse Nonnegative Least Squares. (arXiv:2011.11066v4 [cs.LG] UPDATED)
    Nonnegative least squares problems with multiple right-hand sides (MNNLS) arise in models that rely on additive linear combinations. In particular, they are at the core of most nonnegative matrix factorization algorithms and have many applications. The nonnegativity constraint is known to naturally favor sparsity, that is, solutions with few non-zero entries. However, it is often useful to further enhance this sparsity, as it improves the interpretability of the results and helps reduce noise, which leads to the sparse MNNLS problem. In this paper, as opposed to most previous works that enforce sparsity column- or row-wise, we first introduce a novel formulation for sparse MNNLS, with a matrix-wise sparsity constraint. Then, we present a two-step algorithm to tackle this problem. The first step divides sparse MNNLS into subproblems, one per column of the original problem. It then uses different algorithms to produce, either exactly or approximately, a Pareto front for each subproblem, that is, to produce a set of solutions representing different tradeoffs between reconstruction error and sparsity. The second step selects solutions among these Pareto fronts in order to build a sparsity-constrained matrix that minimizes the reconstruction error. We perform experiments on facial and hyperspectral images, and we show that our proposed two-step approach provides more accurate results than state-of-the-art sparse coding heuristics applied both column-wise and globally.
    Chasing Convex Bodies and Functions with Black-Box Advice. (arXiv:2206.11780v1 [cs.LG])
    We consider the problem of convex function chasing with black-box advice, where an online decision-maker aims to minimize the total cost of making and switching between decisions in a normed vector space, aided by black-box advice such as the decisions of a machine-learned algorithm. The decision-maker seeks cost comparable to the advice when it performs well, known as $\textit{consistency}$, while also ensuring worst-case $\textit{robustness}$ even when the advice is adversarial. We first consider the common paradigm of algorithms that switch between the decisions of the advice and a competitive algorithm, showing that no algorithm in this class can improve upon 3-consistency while staying robust. We then propose two novel algorithms that bypass this limitation by exploiting the problem's convexity. The first, INTERP, achieves $(\sqrt{2}+\epsilon)$-consistency and $\mathcal{O}(\frac{C}{\epsilon^2})$-robustness for any $\epsilon > 0$, where $C$ is the competitive ratio of an algorithm for convex function chasing or a subclass thereof. The second, BDINTERP, achieves $(1+\epsilon)$-consistency and $\mathcal{O}(\frac{CD}{\epsilon})$-robustness when the problem has bounded diameter $D$. Further, we show that BDINTERP achieves near-optimal consistency-robustness trade-off for the special case where cost functions are $\alpha$-polyhedral.
    Hermite Polynomial Features for Private Data Generation. (arXiv:2106.05042v4 [cs.LG] UPDATED)
    Kernel mean embedding is a useful tool to represent and compare probability measures. Despite its usefulness, kernel mean embedding considers infinite-dimensional features, which are challenging to handle in the context of differentially private data generation. A recent work proposes to approximate the kernel mean embedding of data distribution using finite-dimensional random features, which yields analytically tractable sensitivity. However, the number of required random features is excessively high, often ten thousand to a hundred thousand, which worsens the privacy-accuracy trade-off. To improve the trade-off, we propose to replace random features with Hermite polynomial features. Unlike the random features, the Hermite polynomial features are ordered, where the features at the low orders contain more information on the distribution than those at the high orders. Hence, a relatively low order of Hermite polynomial features can more accurately approximate the mean embedding of the data distribution compared to a significantly higher number of random features. As demonstrated on several tabular and image datasets, Hermite polynomial features seem better suited for private data generation than random Fourier features.
    Factorization of the Partial Covariance in Singly-Connected Path Diagrams. (arXiv:2002.05226v6 [stat.ME] UPDATED)
    We extend path analysis by showing that, for a singly-connected path diagram, the partial covariance of two random variables factorizes over the nodes and edges in the path between the variables. This result allows us to determine the contribution of each node and edge to the partial covariance. It also allows us to show that Simpson's paradox cannot occur in singly-connected path diagrams.
    Diagnosing and Fixing Manifold Overfitting in Deep Generative Models. (arXiv:2204.07172v2 [stat.ML] UPDATED)
    Likelihood-based, or explicit, deep generative models use neural networks to construct flexible high-dimensional densities. This formulation directly contradicts the manifold hypothesis, which states that observed data lies on a low-dimensional manifold embedded in high-dimensional ambient space. In this paper we investigate the pathologies of maximum-likelihood training in the presence of this dimensionality mismatch. We formally prove that degenerate optima are achieved wherein the manifold itself is learned but not the distribution on it, a phenomenon we call manifold overfitting. We propose a class of two-step procedures consisting of a dimensionality reduction step followed by maximum-likelihood density estimation, and prove that they recover the data-generating distribution in the nonparametric regime, thus avoiding manifold overfitting. We also show that these procedures enable density estimation on the manifolds learned by implicit models, such as generative adversarial networks, hence addressing a major shortcoming of these models. Several recently proposed methods are instances of our two-step procedures; we thus unify, extend, and theoretically justify a large class of models.
    $p$-Laplacian Based Graph Neural Networks. (arXiv:2111.07337v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have demonstrated superior performance for semi-supervised node classification on graphs, as a result of their ability to exploit node features and topological information simultaneously. However, most GNNs implicitly assume that the labels of nodes and their neighbors in a graph are the same or consistent, which does not hold in heterophilic graphs, where the labels of linked nodes are likely to differ. Hence, when the topology is non-informative for label prediction, ordinary GNNs may work significantly worse than simply applying multi-layer perceptrons (MLPs) on each node. To tackle the above problem, we propose a new $p$-Laplacian based GNN model, termed $^p$GNN, whose message passing mechanism is derived from a discrete regularization framework and could be theoretically explained as an approximation of a polynomial graph filter defined on the spectral domain of $p$-Laplacians. The spectral analysis shows that the new message passing mechanism works simultaneously as low-pass and high-pass filters, thus making $^p$GNNs effective on both homophilic and heterophilic graphs. Empirical studies on real-world and synthetic datasets validate our findings and demonstrate that $^p$GNNs significantly outperform several state-of-the-art GNN architectures on heterophilic benchmarks while achieving competitive performance on homophilic benchmarks. Moreover, $^p$GNNs can adaptively learn aggregation weights and are robust to noisy edges.
    Fock State-enhanced Expressivity of Quantum Machine Learning Models. (arXiv:2107.05224v2 [quant-ph] UPDATED)
    The data-embedding process is one of the bottlenecks of quantum machine learning, potentially negating any quantum speedups. In light of this, more effective data-encoding strategies are necessary. We propose a photonic-based bosonic data-encoding scheme that embeds classical data points using fewer encoding layers and circumvents the need for nonlinear optical components by mapping the data points into the high-dimensional Fock space. The expressive power of the circuit can be controlled via the number of input photons. Our work sheds some light on the unique advantages offered by quantum photonics on the expressive power of quantum machine learning models. By leveraging the photon-number dependent expressive power, we propose three different noisy intermediate-scale quantum-compatible binary classification methods with different scaling of required resources suitable for different supervised classification tasks.
    Wasserstein t-SNE. (arXiv:2205.07531v2 [cs.LG] UPDATED)
    Scientific datasets often have hierarchical structure: for example, in surveys, individual participants (samples) might be grouped at a higher level (units) such as their geographical region. In these settings, the interest is often in exploring the structure on the unit level rather than on the sample level. Units can be compared based on the distance between their means; however, this ignores the within-unit distribution of samples. Here we develop an approach for exploratory analysis of hierarchical datasets using the Wasserstein distance metric that takes into account the shapes of within-unit distributions. We use t-SNE to construct 2D embeddings of the units, based on the matrix of pairwise Wasserstein distances between them. The distance matrix can be efficiently computed by approximating each unit with a Gaussian distribution, but we also provide a scalable method to compute exact Wasserstein distances. We use synthetic data to demonstrate the effectiveness of our Wasserstein t-SNE, and apply it to data from the 2017 German parliamentary election, considering polling stations as samples and voting districts as units. The resulting embedding uncovers meaningful structure in the data.
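The Gaussian approximation mentioned in the abstract above can be illustrated with the standard closed form for the 2-Wasserstein distance between two Gaussians. The following is a minimal sketch, not the authors' implementation; the function names (`psd_sqrt`, `gaussian_w2`, `unit_distance_matrix`) are hypothetical, and each unit is assumed to be a NumPy array of shape (n_samples, dim) with dim >= 2.

```python
import numpy as np


def psd_sqrt(mat):
    """Square root of a symmetric positive semi-definite matrix via eigh."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def gaussian_w2(mean_a, cov_a, mean_b, cov_b):
    """2-Wasserstein distance between N(mean_a, cov_a) and N(mean_b, cov_b).

    Uses the closed form
    W2^2 = |m_a - m_b|^2 + tr(C_a + C_b - 2 (C_a^{1/2} C_b C_a^{1/2})^{1/2}).
    """
    root_a = psd_sqrt(cov_a)
    cross = psd_sqrt(root_a @ cov_b @ root_a)
    squared = np.sum((mean_a - mean_b) ** 2) + np.trace(cov_a + cov_b - 2.0 * cross)
    return float(np.sqrt(max(squared, 0.0)))


def unit_distance_matrix(units):
    """Pairwise Gaussian W2 distances between units (hypothetical helper).

    Each unit is summarized by its sample mean and sample covariance.
    """
    stats = [(u.mean(axis=0), np.atleast_2d(np.cov(u, rowvar=False))) for u in units]
    n_units = len(stats)
    dist = np.zeros((n_units, n_units))
    for i in range(n_units):
        for j in range(i + 1, n_units):
            dist[i, j] = dist[j, i] = gaussian_w2(*stats[i], *stats[j])
    return dist
```

The resulting symmetric matrix could then be handed to any t-SNE implementation that accepts precomputed distances.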
    Gradual Domain Adaptation via Normalizing Flows. (arXiv:2206.11492v1 [stat.ML])
    Conventional domain adaptation methods do not work well when a large gap exists between the source and the target domain. Gradual domain adaptation is one of the approaches to address the problem by leveraging the intermediate domain, which gradually shifts from the source to the target domain. The previous work assumed that the number of the intermediate domains is large and the distance of the adjacent domains is small; hence, the gradual domain adaptation algorithm by self-training with unlabeled datasets was applicable. In practice, however, gradual self-training will fail because the number of the intermediate domains is limited, and the distance of the adjacent domains is large. We propose using normalizing flows to mitigate this problem while maintaining the framework of unsupervised domain adaptation. We generate pseudo intermediate domains from normalizing flows and then use them for gradual domain adaptation. We evaluate our method by experiments with real-world datasets and confirm that our proposed method mitigates the above explained problem and improves the classification performance.  ( 2 min )
    Modular Conformal Calibration. (arXiv:2206.11468v1 [cs.LG])
    Uncertainty estimates must be calibrated (i.e., accurate) and sharp (i.e., informative) in order to be useful. This has motivated a variety of methods for recalibration, which use held-out data to turn an uncalibrated model into a calibrated model. However, the applicability of existing methods is limited due to their assumption that the original model is also a probabilistic model. We introduce a versatile class of algorithms for recalibration in regression that we call Modular Conformal Calibration (MCC). This framework allows one to transform any regression model into a calibrated probabilistic model. The modular design of MCC allows us to make simple adjustments to existing algorithms that enable well-behaved distribution predictions. We also provide finite-sample calibration guarantees for MCC algorithms. Our framework recovers isotonic recalibration, conformal calibration, and conformal interval prediction, implying that our theoretical results apply to those methods as well. Finally, we conduct an empirical study of MCC on 17 regression datasets. Our results show that new algorithms designed in our framework achieve near-perfect calibration and improve sharpness relative to existing methods.  ( 2 min )
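Of the methods the abstract above says the framework recovers, conformal interval prediction is the easiest to sketch. The block below shows the textbook split-conformal construction with absolute-residual scores, not the MCC algorithms themselves; the function names and arguments are hypothetical, and the calibration data are assumed to be held out and exchangeable with the test point.

```python
import math

import numpy as np


def conformal_quantile(scores, alpha):
    """Finite-sample conformal quantile of calibration scores.

    Returns the k-th smallest score with k = ceil((n + 1) * (1 - alpha)),
    or infinity when the calibration set is too small for the requested level.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf
    return float(np.sort(scores)[k - 1])


def conformal_interval(prediction, calib_pred, calib_true, alpha):
    """Symmetric interval around a point prediction from held-out residuals."""
    scores = np.abs(np.asarray(calib_true) - np.asarray(calib_pred))
    radius = conformal_quantile(scores, alpha)
    return prediction - radius, prediction + radius
```

Any regression model can supply `prediction` and `calib_pred`; the interval width depends only on the held-out residuals.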
    Bayesian model calibration for block copolymer self-assembly: Likelihood-free inference and expected information gain computation via measure transport. (arXiv:2206.11343v1 [physics.comp-ph])
    We consider the Bayesian calibration of models describing the phenomenon of block copolymer (BCP) self-assembly using image data produced by microscopy or X-ray scattering techniques. To account for the random long-range disorder in BCP equilibrium structures, we introduce auxiliary variables to represent this aleatory uncertainty. These variables, however, result in an integrated likelihood for high-dimensional image data that is generally intractable to evaluate. We tackle this challenging Bayesian inference problem using a likelihood-free approach based on measure transport together with the construction of summary statistics for the image data. We also show that expected information gains (EIGs) from the observed data about the model parameters can be computed with no significant additional cost. Lastly, we present a numerical case study based on the Ohta--Kawasaki model for diblock copolymer thin film self-assembly and top-down microscopy characterization. For calibration, we introduce several domain-specific energy- and Fourier-based summary statistics, and quantify their informativeness using EIG. We demonstrate the power of the proposed approach to study the effect of data corruptions and experimental designs on the calibration results.  ( 2 min )
    Physics-Informed Statistical Modeling for Wildfire Aerosols Process Using Multi-Source Geostationary Satellite Remote-Sensing Data Streams. (arXiv:2206.11766v1 [stat.AP])
    Increasingly frequent wildfires significantly affect solar energy production as the atmospheric aerosols generated by wildfires diminish the incoming solar radiation to the earth. Atmospheric aerosols are measured by Aerosol Optical Depth (AOD), and AOD data streams can be retrieved and monitored by geostationary satellites. However, multi-source remote-sensing data streams often present heterogeneous characteristics, including different data missing rates, measurement errors, systematic biases, and so on. To accurately estimate and predict the underlying AOD propagation process, there exist practical needs and theoretical interests to propose a physics-informed statistical approach for modeling wildfire AOD propagation by simultaneously utilizing, or fusing, multi-source heterogeneous satellite remote-sensing data streams. Leveraging a spectral approach, the proposed approach integrates multi-source satellite data streams with a fundamental advection-diffusion equation that governs the AOD propagation process. A bias correction process is included in the statistical model to account for the bias of the physics model and the truncation error of the Fourier series. The proposed approach is applied to California wildfires AOD data streams obtained from the National Oceanic and Atmospheric Administration. Comprehensive numerical examples are provided to demonstrate the predictive capabilities and model interpretability of the proposed approach. Computer code has been made available on GitHub.  ( 2 min )
    Regression Trees on Grassmann Manifold for Adapting Reduced-Order Models. (arXiv:2206.11324v1 [stat.AP])
    Low dimensional and computationally less expensive Reduced-Order Models (ROMs) have been widely used to capture the dominant behaviors of high-dimensional systems. A ROM can be obtained, using the well-known Proper Orthogonal Decomposition (POD), by projecting the full-order model to a subspace spanned by basis modes which are learned from experimental, simulated or observational data, i.e., training data. However, the optimal basis can change with the parameter settings. When a ROM, constructed using the POD basis obtained from training data, is applied to new parameter settings, the model often lacks robustness against the change of parameters in design, control, and other real-time operation problems. This paper proposes to use regression trees on Grassmann Manifold to learn the mapping between parameters and POD bases that span the low-dimensional subspaces onto which full-order models are projected. Motivated by the fact that a subspace spanned by a POD basis can be viewed as a point in the Grassmann manifold, we propose to grow a tree by repeatedly splitting the tree node to maximize the Riemannian distance between the two subspaces spanned by the predicted POD bases on the left and right daughter nodes. Five numerical examples are presented to comprehensively demonstrate the performance of the proposed method, and compare the proposed tree-based method to the existing interpolation method for POD basis and the use of global POD basis. The results show that the proposed tree-based method is capable of establishing the mapping between parameters and POD bases, and thus of adapting ROMs for new parameters.  ( 3 min )
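The two ingredients named in the abstract above, a POD basis and the Riemannian distance between the subspaces two bases span, can be sketched as follows. This is an illustrative sketch only: it assumes the POD basis is taken from a truncated SVD of a snapshot matrix and that the distance is the geodesic distance computed from principal angles. The tree-growing procedure itself is not shown, and the function names are hypothetical.

```python
import numpy as np


def pod_basis(snapshots, rank):
    """Leading `rank` left singular vectors of a snapshot matrix.

    snapshots has shape (n_dof, n_snapshots), one column per snapshot.
    """
    left, _, _ = np.linalg.svd(snapshots, full_matrices=False)
    return left[:, :rank]


def grassmann_distance(basis_a, basis_b):
    """Geodesic distance between the subspaces spanned by two orthonormal bases.

    The singular values of basis_a.T @ basis_b are the cosines of the
    principal angles; the distance is the 2-norm of those angles.
    """
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    return float(np.linalg.norm(angles))
```

The distance depends only on the subspaces, so it is unchanged if either basis is rotated or sign-flipped within its span.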
    Bi-stochastically normalized graph Laplacian: convergence to manifold Laplacian and robustness to outlier noise. (arXiv:2206.11386v1 [math.ST])
    Bi-stochastic normalization of kernelized graph affinity matrix provides an alternative normalization scheme for graph Laplacian methods in graph-based data analysis and can be computed efficiently by Sinkhorn-Knopp (SK) iterations in practice. This paper proves the convergence of the bi-stochastically normalized graph Laplacian to manifold (weighted-)Laplacian with rates when $n$ data points are i.i.d. sampled from a general $d$-dimensional manifold embedded in a possibly high-dimensional space. Under a certain joint limit of $n \to \infty$ and kernel bandwidth $\epsilon \to 0$, the point-wise convergence rate of the graph Laplacian operator (under 2-norm) is proved to be $ O( n^{-1/(d/2+3)})$ at finite large $n$ up to log factors, achieved at the scaling of $\epsilon \sim n^{-1/(d/2+3)} $. When the manifold data are corrupted by outlier noise, we theoretically prove the graph Laplacian point-wise consistency which matches the rate for clean manifold data up to an additional error term proportional to the boundedness of mutual inner-products of the noise vectors. Our analysis suggests that, under the setting being considered in this paper, not exact bi-stochastic normalization but an approximate one will achieve the same consistency rate. Motivated by the analysis, we propose an approximate and constrained matrix scaling problem that can be solved by SK iterations with early termination, and apply it to simulated manifold data both clean and with outlier noise. Numerical experiments support our theoretical results and show the robustness of bi-stochastically normalized graph Laplacian to outlier noise.  ( 3 min )
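The Sinkhorn-Knopp scaling referred to in the abstract above can be sketched with the classical alternating row/column iteration. This is a generic sketch, not the constrained scaling problem proposed in the paper: the Gaussian kernel form `exp(-|x - y|^2 / eps)`, the function names, and the stopping tolerance are all assumptions, and the tolerance check is only a loose stand-in for the early termination the abstract mentions.

```python
import numpy as np


def gaussian_kernel(points, eps):
    """Dense Gaussian affinity matrix exp(-|x_i - x_j|^2 / eps)."""
    diffs = points[:, None, :] - points[None, :, :]
    return np.exp(-np.sum(diffs ** 2, axis=-1) / eps)


def sinkhorn_knopp(kernel, n_iter=500, tol=1e-10):
    """Scale a strictly positive matrix so its rows and columns sum to one.

    Alternately rescales rows (u) and columns (v) and stops early once the
    largest row-sum deviation falls below `tol`.
    """
    n_rows, n_cols = kernel.shape
    u = np.ones(n_rows)
    v = np.ones(n_cols)
    for _ in range(n_iter):
        u = 1.0 / (kernel @ v)    # make row sums equal to one
        v = 1.0 / (kernel.T @ u)  # make column sums equal to one
        row_sums = u * (kernel @ v)
        if np.max(np.abs(row_sums - 1.0)) < tol:
            break
    return u[:, None] * kernel * v[None, :]
```

For a symmetric kernel the two scaling vectors become proportional at convergence, so the scaled matrix is (approximately) symmetric as well; a looser `tol` gives an approximately bi-stochastic matrix in fewer iterations.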
    Utilizing Expert Features for Contrastive Learning of Time-Series Representations. (arXiv:2206.11517v1 [cs.LG])
    We present an approach that incorporates expert knowledge for time-series representation learning. Our method employs expert features to replace the commonly used data transformations in previous contrastive learning approaches. We do this since time-series data frequently stems from the industrial or medical field where expert features are often available from domain experts, while transformations are generally elusive for time-series data. We start by proposing two properties that useful time-series representations should fulfill and show that current representation learning approaches do not ensure these properties. We therefore devise ExpCLR, a novel contrastive learning approach built on an objective that utilizes expert features to encourage both properties for the learned representation. Finally, we demonstrate on three real-world time-series datasets that ExpCLR surpasses several state-of-the-art methods for both unsupervised and semi-supervised representation learning.  ( 2 min )
    Minimax Optimal Fair Regression under Linear Model. (arXiv:2206.11546v1 [math.ST])
    We investigate the minimax optimal error of a fair regression problem under a linear model employing the demographic parity as a fairness constraint. As a tractable demographic parity constraint, we introduce $(\alpha,\delta)$-fairness consistency, meaning that the quantified unfairness decreases at a rate of at most $n^{-\alpha}$ with probability at least $1-\delta$, where $n$ is the sample size. In other words, the consistently fair algorithm eventually outputs a regressor satisfying the demographic parity constraint with high probability as $n$ tends to infinity. As a result of our analyses, we found that the minimax optimal error under the $(\alpha,\delta)$-fairness consistency constraint is $\Theta(\frac{dM}{n})$ provided that $\alpha \le \frac{1}{2}$, where $d$ is the dimensionality, and $M$ is the number of groups induced from the sensitive attributes. This is the first study revealing minimax optimality for the fair regression problem under a linear model.  ( 2 min )
    Neural Implicit Manifold Learning for Topology-Aware Generative Modelling. (arXiv:2206.11267v1 [stat.ML])
    Natural data observed in $\mathbb{R}^n$ is often constrained to an $m$-dimensional manifold $\mathcal{M}$, where $m < n$. Current generative models represent this manifold by mapping an $m$-dimensional latent variable through a neural network $f_\theta: \mathbb{R}^m \to \mathbb{R}^n$. Such procedures, which we call pushforward models, incur a straightforward limitation: manifolds cannot in general be represented with a single parameterization, meaning that attempts to do so will incur either computational instability or the inability to learn probability densities within the manifold. To remedy this problem, we propose to model $\mathcal{M}$ as a neural implicit manifold: the set of zeros of a neural network. To learn the data distribution within $\mathcal{M}$, we introduce constrained energy-based models, which use a constrained variant of Langevin dynamics to train and sample within the learned manifold. The resulting model can be manipulated with an arithmetic of manifolds which allows practitioners to take unions and intersections of model manifolds. In experiments on synthetic and natural data, we show that constrained EBMs can learn manifold-supported distributions with complex topologies more accurately than pushforward models.  ( 2 min )
    Provably Efficient Model-Free Constrained RL with Linear Function Approximation. (arXiv:2206.11889v1 [cs.LG])
    We study the constrained reinforcement learning problem, in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. In contrast to existing model-based approaches or model-free methods accompanied with a `simulator', we aim to develop the first model-free, simulator-free algorithm that achieves a sublinear regret and a sublinear constraint violation even in large-scale systems. To this end, we consider the episodic constrained Markov decision processes with linear function approximation, where the transition dynamics and the reward function can be represented as a linear function of some known feature mapping. We show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret and $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ constraint violation bounds can be achieved, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps. Our bounds are attained without explicitly estimating the unknown transition model or requiring a simulator, and they depend on the state space only through the dimension of the feature mapping. Hence our bounds hold even when the number of states goes to infinity. Our main results are achieved via novel adaptations of the standard LSVI-UCB algorithms. In particular, we first introduce primal-dual optimization into the LSVI-UCB algorithm to balance between regret and constraint violation. More importantly, we replace the standard greedy selection with respect to the state-action function in LSVI-UCB with a soft-max policy. This turns out to be key in establishing uniform concentration for the constrained case via its approximation-smoothness trade-off. We also show that one can achieve even zero constraint violation while still maintaining the same order with respect to $T$.  ( 3 min )
    Functional Nonlinear Learning. (arXiv:2206.11424v1 [stat.ML])
    Using representations of functional data can be more convenient and beneficial in subsequent statistical models than direct observations. These representations, in a lower-dimensional space, extract and compress information from individual curves. The existing representation learning approaches in functional data analysis usually use linear mapping in parallel to those from multivariate analysis, e.g., functional principal component analysis (FPCA). However, functions, as infinite-dimensional objects, sometimes have nonlinear structures that cannot be uncovered by linear mapping. Linear methods will be more overwhelmed given multivariate functional data. For that matter, this paper proposes a functional nonlinear learning (FunNoL) method to sufficiently represent multivariate functional data in a lower-dimensional feature space. Furthermore, we merge a classification model for enriching the ability of representations in predicting curve labels. Hence, representations from FunNoL can be used for both curve reconstruction and classification. Additionally, we have endowed the proposed model with the ability to address the missing observation problem as well as to further denoise observations. The resulting representations are robust to observations that are locally disturbed by uncontrollable random noises. We apply the proposed FunNoL method to several real data sets and show that FunNoL can achieve better classifications than FPCA, especially in the multivariate functional data setting. Simulation studies have shown that FunNoL provides satisfactory curve classification and reconstruction regardless of data sparsity.  ( 2 min )
    Projection-free Constrained Stochastic Nonconvex Optimization with State-dependent Markov Data. (arXiv:2206.11346v1 [math.OC])
    We study a projection-free conditional gradient-type algorithm for constrained nonconvex stochastic optimization problems with Markovian data. In particular, we focus on the case when the transition kernel of the Markov chain is state-dependent. Such stochastic optimization problems arise in various machine learning problems including strategic classification and reinforcement learning. For this problem, we establish that the number of calls to the stochastic first-order oracle and the linear minimization oracle to obtain an appropriately defined $\epsilon$-stationary point, are of the order $\mathcal{O}(1/\epsilon^{2.5})$ and $\mathcal{O}(1/\epsilon^{5.5})$ respectively. We also empirically demonstrate the performance of our algorithm on the problem of strategic classification with neural networks.  ( 2 min )
    Improving decision-making via risk-based active learning: Probabilistic discriminative classifiers. (arXiv:2206.11616v1 [cs.LG])
    Gaining the ability to make informed decisions on operation and maintenance of structures provides motivation for the implementation of structural health monitoring (SHM) systems. However, descriptive labels for measured data corresponding to health-states of the monitored system are often unavailable. This issue limits the applicability of fully-supervised machine learning paradigms for the development of statistical classifiers to be used in decision-support in SHM systems. One approach to dealing with this problem is risk-based active learning. In such an approach, data-label querying is guided according to the expected value of perfect information for incipient data points. For risk-based active learning in SHM, the value of information is evaluated with respect to a maintenance decision process, and the data-label querying corresponds to the inspection of a structure to determine its health state. In the context of SHM, risk-based active learning has only been considered for generative classifiers. The current paper demonstrates several advantages of using an alternative type of classifier -- discriminative models. Using the Z24 Bridge dataset as a case study, it is shown that discriminative classifiers have benefits, in the context of SHM decision-support, including improved robustness to sampling bias, and reduced expenditure on structural inspections.  ( 2 min )
    A generalised form for a homogeneous population of structures using an overlapping mixture of Gaussian processes. (arXiv:2206.11683v1 [cs.LG])
    Reductions in natural frequency are often used as a damage indicator for structural health monitoring (SHM) purposes. However, fluctuations in operational and environmental conditions, changes in boundary conditions, and slight differences among nominally-identical structures can also affect stiffness, producing frequency changes that mimic or mask damage. This variability has limited the practical implementation and generalisation of SHM technologies. The aim of this work is to investigate the effects of normal variation, and to identify methods that account for the resulting uncertainty. This work considers vibration data collected from a set of four healthy full-scale composite helicopter blades. The blades were nominally-identical but distinct, and slight differences in material properties and geometry among the blades caused significant variability in the frequency response functions, which presented as four separate trajectories across the input space. In this paper, an overlapping mixture of Gaussian processes (OMGP) was used to generate labels and quantify the uncertainty of normal-condition frequency response data from the helicopter blades. Using a population-based approach, the OMGP model provided a generic representation, called a form, to characterise the normal condition of the blades. Additional simulated data were then compared against the form and evaluated for damage using a marginal-likelihood novelty index.  ( 2 min )
    A Topological characterisation of Weisfeiler-Leman equivalence classes. (arXiv:2206.11876v1 [cs.LG])
    Graph Neural Networks (GNNs) are learning models aimed at processing graphs and signals on graphs. The most popular and successful GNNs are based on message passing schemes. Such schemes inherently have limited expressive power when it comes to distinguishing two non-isomorphic graphs. In this article, we rely on the theory of covering spaces to fully characterize the classes of graphs that GNNs cannot distinguish. We then generate arbitrarily many non-isomorphic graphs that cannot be distinguished by GNNs, leading to the GraphCovers dataset. We also show that the number of indistinguishable graphs in our dataset grows super-exponentially with the number of nodes. Finally, we test the GraphCovers dataset on several GNN architectures, showing that none of them can distinguish any two graphs it contains.  ( 2 min )
    Inductive Conformal Prediction: A Straightforward Introduction with Examples in Python. (arXiv:2206.11810v1 [stat.ML])
    Inductive Conformal Prediction (ICP) is a set of distribution-free and model-agnostic algorithms devised to predict with a user-defined confidence level and a coverage guarantee. Instead of point predictions, i.e., a real number in the case of regression or a single class in multi-class classification, models calibrated using ICP output an interval or a set of classes, respectively. ICP is especially important in high-risk settings where we want the real output to belong to the prediction set with high probability. As an example, a classification model might output that, given a magnetic resonance image, a patient has no latent diseases to report. However, this output is based only on the most likely class; the second most likely class might indicate that the patient has a 15% chance of a brain tumor or other severe disease, and therefore further exams should be conducted. Using ICP is therefore far more informative, and we believe that it should be the standard way of producing forecasts. This paper is a hands-on introduction, which means that we provide examples as we introduce the theory.  ( 2 min )
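    The interval construction described in the abstract can be sketched in a few lines for regression. This is a generic split (inductive) conformal sketch under stated assumptions, not code from the paper: the "pre-trained" `model`, the synthetic data in `sample`, and the absolute-residual nonconformity score are all illustrative choices.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def predict_interval(model, x, q):
    """Turn a point prediction into an interval of half-width q."""
    y_hat = model(x)
    return y_hat - q, y_hat + q


def model(x):
    # Hypothetical point predictor, assumed to be trained already.
    return 2.0 * x


def sample(n):
    # Synthetic data (assumption): y = 2x + standard normal noise.
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 2.0 * x + random.gauss(0, 1)) for x in xs]


random.seed(0)
calibration = sample(500)
scores = [abs(y - model(x)) for x, y in calibration]  # nonconformity scores
q = conformal_quantile(scores, alpha=0.1)             # target coverage: 90%

# Empirical coverage on fresh data drawn from the same distribution.
test_set = sample(2000)
covered = 0
for x, y in test_set:
    lo, hi = predict_interval(model, x, q)
    covered += lo <= y <= hi
coverage = covered / len(test_set)
```

    The calibration set must be disjoint from the data used to fit the model; the coverage guarantee is marginal and relies on calibration and test points being exchangeable.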
    A Temporal Extension of Latent Dirichlet Allocation for Unsupervised Acoustic Unit Discovery. (arXiv:2206.11706v1 [eess.AS])
    Latent Dirichlet allocation (LDA) is widely used for unsupervised topic modelling on sets of documents. No temporal information is used in the model. However, there is often a relationship between the corresponding topics of consecutive tokens. In this paper, we present an extension to LDA that uses a Markov chain to model temporal information. We use this new model for acoustic unit discovery from speech. As input tokens, the model takes a discretised encoding of speech from a vector quantised (VQ) neural network with 512 codes. The goal is then to map these 512 VQ codes to 50 phone-like units (topics) in order to more closely resemble true phones. In contrast to the base LDA, which only considers how VQ codes co-occur within utterances (documents), the Markov chain LDA additionally captures how consecutive codes follow one another. This extension leads to an increase in cluster quality and phone segmentation results compared to the base LDA. Compared to a recent vector quantised neural network approach that also learns 50 units, the extended LDA model performs better in phone segmentation but worse in mutual information.  ( 2 min )
    Backward baselines: Is your model predicting the past?. (arXiv:2206.11673v1 [cs.LG])
    When does a machine learning model predict the future of individuals and when does it recite patterns that predate the individuals? In this work, we propose a distinction between these two pathways of prediction, supported by theoretical, empirical, and normative arguments. At the center of our proposal is a family of simple and efficient statistical tests, called backward baselines, that demonstrate if, and to which extent, a model recounts the past. Our statistical theory provides guidance for interpreting backward baselines, establishing equivalences between different baselines and familiar statistical concepts. Concretely, we derive a meaningful backward baseline for auditing a prediction system as a black box, given only background variables and the system's predictions. Empirically, we evaluate the framework on different prediction tasks derived from longitudinal panel surveys, demonstrating the ease and effectiveness of incorporating backward baselines into the practice of machine learning.  ( 2 min )
    Invariant Causal Mechanisms through Distribution Matching. (arXiv:2206.11646v1 [cs.LG])
    Learning representations that capture the underlying data generating process is a key problem for data efficient and robust use of neural networks. One key property for robustness which the learned representation should capture and which recently received a lot of attention is described by the notion of invariance. In this work we provide a causal perspective and new algorithm for learning invariant representations. Empirically we show that this algorithm works well on a diverse set of tasks and in particular we observe state-of-the-art performance on domain generalization, where we are able to significantly boost the score of existing models.  ( 2 min )
    A Geometric Method for Improved Uncertainty Estimation in Real-time. (arXiv:2206.11562v1 [cs.LG])
    Machine learning classifiers are probabilistic in nature, and thus inevitably involve uncertainty. Predicting the probability of a specific input to be correct is called uncertainty (or confidence) estimation and is crucial for risk management. Post-hoc model calibrations can improve models' uncertainty estimations without the need for retraining, and without changing the model. Our work puts forward a geometric-based approach for uncertainty estimation. Roughly speaking, we use the geometric distance of the current input from the existing training inputs as a signal for estimating uncertainty and then calibrate that signal (instead of the model's estimation) using standard post-hoc calibration techniques. We show that our method yields better uncertainty estimations than recently proposed approaches by extensively evaluating multiple datasets and models. In addition, we also demonstrate the possibility of performing our approach in near real-time applications. Our code is available at our Github https://github.com/NoSleepDeveloper/Geometric-Calibrator.  ( 2 min )

  • Open

    [Project] Semantic Search powerup for Ctrl+F
    Hi Reddit! Scout Search is a project I've been working on as a Find-in-Page replacement. It uses a semantic search engine (rather than character matching) to help you find what you're looking for on websites. Try it out and let me know what you think. https://chrome.google.com/webstore/detail/scout-search/hgljpodblkjjklailoaefokflfdeffdl submitted by /u/scoutsearchteam [link] [comments]  ( 83 min )
    [D] CVPR wants to penalize reviewers for violating the reviewer guideline!
    I cannot believe that CVPR put this motion for voting: Motion 3: "Any reviewer who has accepted an invitation to review but violates the reviewing guidelines set forth by the conference will be prohibited from submitting any papers to CVPR for up to two years." Reviewing is a community service, and although I have encountered bad and unfair reviews multiple times, I don't think such a wild action is the way to go to increase the review process quality. Let's start with the training process and choosing qualified AC and Meta ACs first where they can properly oversee the review process, choose fit reviewers, and take action in the rebuttal process. If this goes through I would never review for CVPR again. https://mobile.twitter.com/KostasPenn/status/1539805992145358850 submitted by /u/aifordummies [link] [comments]  ( 89 min )
    [P] Farewell, CUDA OOM: Automatic Gradient Accumulation
    Hey everyone, If you've trained a lot of neural nets, you probably know the pain of getting CUDA OOM errors and iteratively tuning your batch size to avoid them. Which is why I'm excited to announce that we (MosaicML) just released an automatic way to avoid these errors. Namely, we just added automatic gradient accumulation to Composer, our open source library for faster + easier neural net training. If you're not familiar with gradient accumulation, it's like tuning the batch size, but without messing with the optimization (aside from slightly different BatchNorm stats). This lets you avoid tuning learning rate, weight decay, etc based on how much memory your GPU has or how many GPUs you're training on. https://preview.redd.it/ogxq73znuf791.png?width=1374&format=png&auto=webp&s=93ff0b76a2293a73a5380b7e93f62fe34c604bc4 What's nice about the *automatic* gradient accumulation in Composer is that you just set the batch size and hparams once and you're done—no need to tune the gradient accumulation manually. More info in our blog post, and special thanks to Mihir Patel for building most of this. Happy to answer questions! submitted by /u/ffast-math [link] [comments]  ( 85 min )
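    For readers unfamiliar with gradient accumulation, the sketch below shows the core idea on a one-parameter least-squares model: averaging the gradients of equal-sized microbatches reproduces the full-batch gradient, so the optimizer sees the same update while less data is held in memory at once. It is a plain-Python illustration and does not use or mirror Composer's API; `grad_mse` and `accumulated_grad` are hypothetical helper names.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ~ w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, num_microbatches):
    """Build the batch gradient from microbatches processed one at a time.

    Assumes len(batch) is divisible by num_microbatches.
    """
    size = len(batch) // num_microbatches
    total = 0.0
    for i in range(num_microbatches):
        micro = batch[i * size:(i + 1) * size]
        # Scale each microbatch gradient so the sum equals the full-batch mean.
        total += grad_mse(w, micro) / num_microbatches
    return total


batch = [(float(x), 3.0 * x + 1.0) for x in range(8)]
full = grad_mse(0.5, batch)
accum = accumulated_grad(0.5, batch, num_microbatches=4)
```

    The equivalence holds for losses that average over examples; layers that compute per-batch statistics, such as BatchNorm, see the smaller microbatch, which is the caveat the post mentions.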
    [P] HyperImpute: sklearn-style library for handling missing data using novel algorithms
    There are many data imputation algorithms for machine learning. However, benchmarking them can be complicated, mainly because most implementations remain research code written only to reproduce the experiments in the papers. Moreover, when dealing with tabular data, you need to handle continuous/discrete/categorical data correctly -- not just let some regressor approximate everything. HyperImpute is a library that should make it easy to benchmark new imputation algorithms while offering several state-of-the-art models. For example, imputing using MIWAE can be done as easily as this:
        import pandas as pd
        import numpy as np
        from hyperimpute.plugins.imputers import Imputers
        X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])
        plugin = Imputers().get("miwae")
        out = plugin.fit_transform(X.copy())
        out
    As a bonus, it can be easily plugged into sklearn pipelines. Github page: https://github.com/vanderschaarlab/hyperimpute submitted by /u/ManagementBig2995 [link] [comments]  ( 84 min )
    [D] "Wrapping" effects when using diffusion model to generate samples?
    I've recently been training a latent diffusion model (it operates on the latent space of a VQ-VAE), and I'm finding that my generated samples have "wrapping" effects, i.e. when I generate a face, it wraps around (the bottom half of the face is in the top half of the image and vice versa). It's worth noting that these halves don't always seem like they belong together, but they individually look quite realistic. I've checked my training data, and there are absolutely no training samples that exhibit this behaviour, so my model never sees images with this wrapping effect. What could be causing this? submitted by /u/Pedimus [link] [comments]  ( 84 min )
    [R] Learning to Play Minecraft with Video PreTraining (VPT)
    OpenAI Blog: Learning to Play Minecraft with Video PreTraining (VPT) OpenAI gathered a large dataset of human Minecraft demonstrations and trained an Inverse Dynamics Model (IDM) transformer that predicts actions based on past and future frames using a dataset of human demonstrations. They used this model to label 70k hours of video, which is used to train a Video PreTraining (VPT) model, which predicts actions based on past frames alone, using behavioral cloning (i.e. supervised learning). They can then fine-tune the VPT via behavioral cloning on narrower datasets or RL (with a hand-designed reward function that rewards the agent for going deeper into the tech tree or obtaining materials that could lead to a diamond pickaxe) and are able to train an agent that can craft a diamond pickaxe in 2.5% of its 10-minute long episodes. submitted by /u/gambs [link] [comments]  ( 85 min )
    [P] AutoRegistry: A Python library for mapping names to functionality to simplify project configurations.
    A common design pattern I see in a lot of ML projects is to have some sort of experiment configuration file, and then a bunch of code that constructs the appropriate objects based on these configurations. Frequently, the resulting code blocks have a bunch of if/elif/else statements, or a manually created lookup dictionary somewhere. This can quickly get messy and inconsistent as you add new models/losses/encoders/optimizers. AutoRegistry is a library that makes all of these lookups more organized and terse. For example, let's say you want to configure a backbone to either be "resnet34" or "resnet50". Your code could look something like this (mimicking torchvision code) using a decorator:
    ```
    from copy import copy
    from typing import Any, Optional
    from autoregistry import Registry
    # ResNet, BasicBlock, Bottleneck, _resnet, and the *_Weights enums are the
    # torchvision names being mimicked; `config` is your own configuration dict.
    models = Registry()
    @models
    def resnet34(*, weights: Optional[ResNet34_Weights] = None, progress: bool = True, **kwargs: Any) -> ResNet:
        return _resnet(BasicBlock, [3, 4, 6, 3], weights, progress, **kwargs)
    @models
    def resnet50(*, weights: Optional[ResNet50_Weights] = None, progress: bool = True, **kwargs: Any) -> ResNet:
        return _resnet(Bottleneck, [3, 4, 6, 3], weights, progress, **kwargs)
    # Create a model based off of some configuration dictionary.
    model_config = copy(config["model"])
    model_type = model_config.pop("type")
    model = models[model_type](**model_config)
    ```
    or, class-based inheritance (uses metaclasses internally):
    ```
    class BaseModel(nn.Module, Registry):
        pass
    class MyNewModel(BaseModel):
        pass
    class SomeOtherModel(BaseModel):
        pass
    # Stringified keys are automatically derived.
    my_new_model = BaseModel["mynewmodel"](**config)
    some_other_model = BaseModel["someothermodel"](**config)
    ```
    Github Page: https://github.com/BrianPugh/autoregistry submitted by /u/guyfrom7up [link] [comments]  ( 84 min )
    [P] Reverse Engineering Google Colab
    Hi! I've spent a lot of time working with Google Colab recently, and was disappointed that such a powerful platform was limited to only running Jupyter notebooks. So I took a deep dive into the internals of Colab, discovering tons of interesting hidden features! Take a look at what I found! submitted by /u/vikarjramun [link] [comments]  ( 84 min )
    [R] Can interpretability improve model accuracy?!
    Deep learning models are often complex and mostly uninterpretable. • One strategy is to learn the nonlinear relation of features. But, there are so many features to learn from: • Research shows a set of important features can improve the learning process. • So let's focus on the most correlated features. Paper📜: https://arxiv.org/abs/2203.04383 submitted by /u/AshkanF [link] [comments]  ( 83 min )
    [P] Data search engine for ML in Binder
    Open source data search engine for ML. Binder link: https://mybinder.org/v2/gh/upgini/upgini/main?urlpath=notebooks%2Fnotebooks%2Fkaggle_example.ipynb Colab link: https://colab.research.google.com/github/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb Github: https://github.com/upgini/upgini submitted by /u/AnnualLimp1418 [link] [comments]  ( 83 min )
    [D] How Imagen Actually Works
    Hey everyone! I wrote this article explaining how Imagen actually works, with a general overview for the big picture ideas and a Deep Dive to get into the nitty-gritty. I'm happy to answer any questions, let me know what you think! https://preview.redd.it/17xc5fqeud791.png?width=3472&format=png&auto=webp&s=e78a024892a3032ffc0c143b7843a5223751afcb submitted by /u/SleekEagle [link] [comments]  ( 84 min )
    State of the art 2D body pose estimation [Discussion]
    Hi. I have a background in neuroscience and sometimes we use DeepLabCut to track animals during behaviour. This is by far the most widely used application for animal tracking based on artificial neural networks. I was wondering if anyone here is an expert in human 2D body pose estimation and can tell me which human 2D pose estimation tool they consider the best currently available. I came across Pose from mediapipe and it seems very good from the few examples I have tested so far, but I'm curious if there's something even better that I have not come across. Thanks for the help! submitted by /u/lux123or [link] [comments]  ( 84 min )
    [D] [P] A TensorFlow Re-Implementation of CheXNet - Classification and Localization of Thoracic Diseases
    TL;DR: need help making heatmaps! [Repository|Colab Notebook] Hey everyone - I've been working to reproduce CheXNet - a fantastic paper describing research on a model capable of radiologist-grade pathology classification! CheXNet uses Class Activation Mappings (CAMs for short) to generate heatmaps that identify which parts of the image the model bases its classification on. In my case, I'm struggling to reproduce them - as shown in the image below, most of our classifications are derived from the diaphragm, instead of regions within the lung. Curiously, we are attaining a reasonable AUROC, with .773 on training and .749 on validation data - the paper reports .8062 AUROC. My current model is being trained on a subsample of the main dataset, and I'm basically using this as a way to validate the architecture. I'd love to know if anyone has experienced similar issues and solved them, or has any input here. If you have a moment to spare - I'd be super grateful for some help from the r/MachineLearning community in solving the inaccurate localization issue - #58! Fig 1. An incorrect localization, despite a correct classification. submitted by /u/codeinassembly [link] [comments]  ( 85 min )
    [P] Yandex open sources 100b large language model weights (YaLM)
    PR Announcement: https://medium.com/yandex/yandex-publishes-yalm-100b-its-the-largest-gpt-like-neural-network-in-open-source-d1df53d0e9a6 Github: https://github.com/yandex/YaLM-100B The network is trained using the same principles as Megatron-LM; inference alone will require 4 A100s. submitted by /u/htrp [link] [comments]  ( 88 min )
    [N] Microsoft released a DirectML Plugin for TensorFlow 2
    The plugin provides a DirectML PluggableDevice backend for TensorFlow 2, so any GPU which supports DirectX 12 should be able to work with TF2. Hopefully this will pave the way for more support for non-NVIDIA GPUs in ML. They provide some more details (installation, code samples, etc.) in the Windows AI devblog. submitted by /u/chromeplated [link] [comments]  ( 84 min )
    [D] Do any Text-to-Image approaches work well with long complex prompts (i.e. paragraph or book chapter scale)?
    Seems almost all the examples of text-to-image are based on tiny prompts with very few details ("avocado chair"). Do any such systems do a good job at keeping track of details - like the first 2 paragraphs of The Hobbit and correctly place the "polished chairs", "pegs for hats and coats", and "deep-set round windows looking over his garden, and meadows beyond, sloping down to the river"? Assuming they don't - what approach(es) might make sense to design such systems? I'm speculating that you'd need much larger embedding vectors (to correctly connect concepts from the right adjectives to the right nouns); and it'd be harder to find training data (perhaps frames of movies from novels would be a good source)? Any pointers to anything in that direction? submitted by /u/Appropriate_Ant_4629 [link] [comments]  ( 85 min )
    [Project] h5 model to onnx model in JAVA
    I have a trained model (.h5) saved. I need to do the following in Java. Can I load this model and then convert it to an onnx model and save that onnx model? Any lead is appreciated! submitted by /u/Negative_Internet514 [link] [comments]  ( 83 min )
    [Discussion]
    Part of my graduation project is to classify body organs such as the heart and liver. I searched a lot and did not find anything, so I decided to take 3D models of the body organs and start collecting a dataset myself. I collected about 100,000 images for each of the 4 organs. My question is whether this data has any value. In other words, is it worth putting on Kaggle or another website, or does it have no value? submitted by /u/NourOmran [link] [comments]  ( 84 min )
    [D] Implementing custom functions in pytorch e.g. feature propagation (PointNet++)
    Apologies if this isn't the right place to ask. I'm currently studying point cloud-based networks like PointNet++ and related 3D object detection networks like PointPillars, VoxelNet, etc. While I (think I) understand algorithms like feature propagation in PointNet++, I'm having trouble understanding how one would implement them. Where could I learn about writing operations in CUDA and making sure they are compatible with backprop? submitted by /u/wowAmaze [link] [comments]  ( 84 min )
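    As a starting point for the question above, here is a framework-free sketch of the interpolation at the heart of PointNet++ feature propagation: each query point receives an inverse-distance-weighted average of the features of its k nearest known points (the paper uses k = 3 and a distance power of 2). This is plain Python for clarity, not a PyTorch or CUDA implementation, and `interpolate_features` is a hypothetical name.

```python
import math


def interpolate_features(query, known_points, known_feats, k=3, p=2, eps=1e-8):
    """Inverse-distance-weighted interpolation of features onto one query point.

    query:        coordinates of the point that needs features
    known_points: coordinates of the points that already carry features
    known_feats:  one feature vector per known point
    """
    # Find the k nearest known points as (distance, index) pairs.
    nearest = sorted(
        (math.dist(query, pt), idx) for idx, pt in enumerate(known_points)
    )[:k]
    # Closer points get larger weights; eps avoids division by zero.
    weights = [1.0 / (d ** p + eps) for d, _ in nearest]
    total = sum(weights)
    dim = len(known_feats[0])
    return [
        sum(w * known_feats[idx][c] for w, (_, idx) in zip(weights, nearest)) / total
        for c in range(dim)
    ]


points = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
feats = [[1.0], [3.0], [5.0]]
midpoint_feat = interpolate_features((1.0, 0.0, 0.0), points, feats, k=2)
```

    If the same steps are written with differentiable tensor operations (pairwise distances, a top-k selection, a gather, and a weighted sum), autograd should handle the backward pass without custom code. A hand-written CUDA kernel would additionally need an explicit backward, typically wired in through a custom `torch.autograd.Function`.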
  • Open

    NVIDIA’s GANCraft AI: Feels Like Magic! 🌴
    submitted by /u/the_anonymizer [link] [comments]  ( 82 min )
    Should sentient Artificial intelligence be legally protected?
    submitted by /u/Tell_Nervous [link] [comments]  ( 83 min )
    Why Google’s LaMDA AI is conscious: Suspended Google engineer Blake Lemoine speaks out in first podcast interview
    submitted by /u/DrJamesCooke [link] [comments]  ( 83 min )
    Have you ever used an AI text-to-image generator? [Short survey]
    submitted by /u/KazRainer [link] [comments]  ( 82 min )
    DALL-E 2 could become OpenAI's first money printing machine
    submitted by /u/much_successes [link] [comments]  ( 83 min )
    How does an optical quantum neural network work? Thanks
    submitted by /u/OneFinding1429 [link] [comments]  ( 83 min )
    Some hellish art I prompted from DALL-E mini in the style of one of my favorite artists.
    submitted by /u/SuperCasualGamerDad [link] [comments]  ( 82 min )
    Dalle2 Prompts
    submitted by /u/KrinoDaGamer [link] [comments]  ( 82 min )
    AI can predict your political ideology using just a brain scan
    submitted by /u/nagual901 [link] [comments]  ( 83 min )
    Hey, guys! I am new to Face AI and computer vision and planning to build a lie detector using Face AI technology. Would it be possible? Is anyone already doing this?
    submitted by /u/adilonreddit1 [link] [comments]  ( 83 min )
    Using Craiyon, I made the first image. I then put that image into Starryai, and made the second image. AI art inception
    submitted by /u/VastlyArtistic [link] [comments]  ( 82 min )
    We have AI generated art now. We have AI generated conversation. But where are the AI generated music compositions?
    AI generated images from text prompts are making the rounds with DALL-E mini and DALL-E 2. These systems are so powerful that people are admitting they cannot tell real from fake images anymore. Google's LaMDA is producing conversational text chats that are so realistic that they spawned entire subreddits where users claim the software agent has become sentient. So where is the instrumental and orchestral music that is indistinguishable from that of human composers? In recent months I have heard some song continuations, where an AI trained on the waveform of popular music was asked to continue a song. Those were fine, but ended up sounding like strange incoherent fever dreams. I fiddled with some midi-like continuations on a website. The output was janky, repetitive, and obviously computer-…  ( 98 min )
    Latest AI tools in different languages?
    Hi there, There are many amazing tools powered by AI or ML but most of them are available only in English. How hard would it be to adapt them to my own language, which is not English? Google translate doesn't do a very good job translating.... Thanks! submitted by /u/decixl [link] [comments]  ( 83 min )
    NEON PSYCHEDELIC TEMPLES | FAST MODE! | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
  • Open

    Deep reinforcement learning for games
    Hey everyone, I have some questions that I hope you can help me with: I'm looking for resources on reinforcement and deep reinforcement learning, and I want to know if any of you have implemented RL within a 3D game and can give me some advice about how that works (how to make the agent understand its environment in 3D, and so forth). Thanks! submitted by /u/naffra [link] [comments]  ( 83 min )
    How to correctly pass history into Gym observation space?
    I'm new to reinforcement learning, but from research have found that SOTA algorithms, whether value-based or policy-based, are not able to gracefully ignore irrelevant information in the observation space - https://arxiv.org/pdf/2011.00756.pdf. I need to keep track of history to handle performance and correctly implement actions; some of these values directly correspond to decision making, but most correspond to environment inspection (e.g. monitoring performance in a more human friendly fashion). From various sources, I've found that by keeping this copy of history, I'm still able to maintain the Markov assumption but have found limited practical examples. Specifically, should I maintain history outside of the observation space and just as an environment instance variable or can/should the histor…  ( 88 min )
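    One common way to separate the two kinds of history mentioned in the post is sketched below. It is only an illustration with a Gym-like `reset`/`step` interface and no dependency on the gym package; `HistoryEnv`, its toy dynamics, and the reward are hypothetical. Decision-relevant values go into a fixed-length window that is part of the observation, while the full log stays on the environment instance for monitoring and is summarized through `info`.

```python
from collections import deque


class HistoryEnv:
    """Toy environment with a Gym-like reset/step interface."""

    def __init__(self, window=3):
        self.window = window
        self.full_history = []              # monitoring only, never shown to the agent
        self.recent = deque(maxlen=window)  # decision-relevant values, part of the observation
        self.state = 0.0

    def _observation(self):
        # Left-pad with zeros so the observation always has a fixed length.
        pad = [0.0] * (self.window - len(self.recent))
        return pad + list(self.recent)

    def reset(self):
        self.full_history.clear()
        self.recent.clear()
        self.state = 0.0
        self.recent.append(self.state)
        return self._observation()

    def step(self, action):
        self.state += action
        reward = -abs(self.state - 1.0)     # toy reward: stay close to 1.0
        self.recent.append(self.state)
        self.full_history.append(
            {"action": action, "state": self.state, "reward": reward}
        )
        info = {"episode_length": len(self.full_history)}
        return self._observation(), reward, False, info


env = HistoryEnv(window=3)
obs = env.reset()
for a in [1.0, 1.0, 1.0, 1.0]:
    obs, reward, done, info = env.step(a)
```

    With a real Gym environment, the observation space would need to declare the fixed window shape, and anything the policy should not condition on can travel through `info` or stay as an attribute.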
    Effectiveness of Q learning for two player games?
    Essentially, what I'm asking is whether a single Q-learning model (table or neural network) trained against itself in an environment can learn to perform optimally in the general case of the environment at hand. For instance, against a random player and a decent player, can Q-learning perform well or optimally after doing the initial training against itself? I've tried implementing it with tic tac toe and it seems to give decent but not amazing results. I want to at least know if my fundamental approach is appropriate, so I can resolve any other bugs due to implementation. I use a single Q-table by switching the markings of the board for each player (X vs O) and then using this as a key to look in the Q-table. Essentially, the table is not directly in X/O form, but is expressed as the agent vs the adversary. The next state for the table is not the board after the agent places a mark, but the board after the adversary responds. I'm assuming this would simply be the general stochastic MDP in the way I've framed the problem? Perhaps I need the learning rate of the Q-table to decrease in the fashion needed for value iteration (the sum of the rates is infinite, the sum of the squared rates is finite)? I've tried various values of epsilon for exploration vs exploitation. Any help would be much appreciated! submitted by /u/Spiritual_Dinner9232 [link] [comments]  ( 83 min )
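    The two ideas in the post, a player-agnostic table key and a decaying learning rate, can be sketched as follows. This is an illustration under assumptions, not the poster's code: boards are 9-character strings of "X", "O", and spaces, and the step size is 1/N(s, a), one schedule whose sum diverges while the sum of its squares converges.

```python
from collections import defaultdict


def canonical_key(board, player):
    """Encode the board from the mover's point of view.

    'A' marks the agent to move and 'B' the adversary, so one table
    serves both X and O.
    """
    other = "O" if player == "X" else "X"
    swap = {player: "A", other: "B", " ": " "}
    return "".join(swap[c] for c in board)


q_table = defaultdict(float)  # (state_key, action) -> value estimate
visits = defaultdict(int)     # (state_key, action) -> update count


def q_update(state, action, reward, next_state, next_actions, gamma=1.0):
    """One Q-learning update with learning rate 1 / (number of visits)."""
    visits[(state, action)] += 1
    alpha = 1.0 / visits[(state, action)]
    # Terminal states have no legal actions, so their future value is 0.
    best_next = max((q_table[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


board = "XO" + " " * 7
key_for_x = canonical_key(board, "X")
key_for_o = canonical_key(board, "O")
```

    Here `next_state` is meant to be the board after the adversary has replied, as in the post, so from the agent's view the opponent is part of the environment's transition dynamics.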
    Is PPO still SOTA in 2022?
    Hello guys, I was wondering if PPO is still the most broadly used algorithm for continuous control in 2022? submitted by /u/Jogima-cyber [link] [comments]  ( 83 min )
    DeepMind Researchers Develop ‘BYOL-Explore’: A Curiosity-Driven Exploration Algorithm That Harnesses The Power Of Self-Supervised Learning To Solve Sparse-Reward Partially-Observable Tasks
    Reinforcement learning (RL) requires exploration of the environment. Exploration is even more critical when extrinsic rewards are sparse or difficult to obtain. In rich settings, the sheer size of the environment makes it impractical to visit every location. Consequently, the question is: how can an agent decide which areas of the environment are worth exploring? Curiosity-driven exploration is a viable approach to tackle this problem. It entails (i) learning a world model, a predictive model of specific knowledge about the world, and (ii) exploiting disparities between the world model’s predictions and experience to create intrinsic rewards. An RL agent that maximizes these intrinsic rewards steers itself toward situations where the world model is unreliable or unsatisfactory, generating new trajectories for the world model to learn from. In other words, the quality of the exploration policy is influenced by the characteristics of the world model, which in turn benefits from the new data the policy collects. Therefore, it might be crucial to approach learning the world model and learning the exploratory policy as one cohesive problem to be solved rather than two separate tasks. Keeping this in mind, DeepMind researchers introduced BYOL-Explore, a curiosity-driven exploration algorithm. Its attraction stems from its conceptual simplicity, generality, and excellent performance. Continue reading | Checkout the paper, blog post https://i.redd.it/5d8iz0r1me791.gif submitted by /u/Embarrassed-Fee5513 [link] [comments]  ( 84 min )
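    The general curiosity recipe summarized above, learn a world model and reward the agent where its predictions fail, can be illustrated with a deliberately simple sketch. This is not BYOL-Explore, whose world model and loss differ in detail; the tabular running-mean model and the names `TabularWorldModel` and `intrinsic_reward` are assumptions made for illustration.

```python
def prediction_error(predicted, actual):
    """Squared error between the predicted and the observed next state."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))


class TabularWorldModel:
    """Hypothetical world model: running mean of the next state per (state, action)."""

    def __init__(self, state_dim):
        self.state_dim = state_dim
        self.means = {}
        self.counts = {}

    def predict(self, state, action):
        # Unseen transitions are predicted as the zero vector.
        return self.means.get((state, action), [0.0] * self.state_dim)

    def update(self, state, action, next_state):
        key = (state, action)
        n = self.counts.get(key, 0) + 1
        old = self.means.get(key, [0.0] * self.state_dim)
        # Incremental running-mean update.
        self.means[key] = [m + (x - m) / n for m, x in zip(old, next_state)]
        self.counts[key] = n


def intrinsic_reward(model, state, action, next_state):
    """Reward the agent by how badly the model predicted this transition,
    then let the model learn from the new data."""
    reward = prediction_error(model.predict(state, action), next_state)
    model.update(state, action, next_state)
    return reward


world_model = TabularWorldModel(state_dim=2)
first = intrinsic_reward(world_model, "s0", "right", [1.0, 2.0])
second = intrinsic_reward(world_model, "s0", "right", [1.0, 2.0])
```

    The reward for a transition shrinks as the model learns it, which is what pushes the agent toward parts of the environment it cannot yet predict.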
    Combining dynamic movement primitives to create new ones
    This is my very first question and I want to thank everyone for the massive contribution to the community! On to my question now, here is a quick definition of my idea/problem: I have created a "set of knowledge" from dynamic movement primitives, [d1, d2, ..., dn], for the exact same task but with slightly different scenario characteristics every time. Given the "set of knowledge" of DMPs for different scenarios and also the characteristics of a completely new scenario, how can I create a new DMP for this new scenario using the existing "knowledge"? I was thinking of representing the weights of the DMPs as Gaussians, applying a weight to each of the Gaussians, and running an evolutionary algorithm to update the weights and keep the most impactful DMPs. Please feel free to propose any other ideas, papers, or techniques that could help me approach this problem. Thank you in advance submitted by /u/Stelios_ml [link] [comments]  ( 84 min )
    An introduction to ML-Agents with Hugging Face 🤗 (Deep Reinforcement Learning Free Class)
    Hey there! I'm happy to announce that we just published a new tutorial on ML-Agents (a library containing environments made with Unity). In fact, at Hugging Face, we created a new ML-Agents version where: - You don't need to install Unity or know how to use the Unity Editor. - You can publish your models to the Hugging Face Hub for free. - You can visualize your agent playing directly in your browser 👀. So in this tutorial, you’ll train an agent that needs to press a button to spawn a pyramid, then navigate to the pyramid, knock it over, and move to the gold brick at the top. The tutorial 👉 https://medium.com/p/efbac62c8c80 https://preview.redd.it/99s0x07ayd791.png?width=1050&format=png&auto=webp&s=f4ef3978b36a63223be2e5d0cf2974ab97d3cecb Do you just want to play with some trained agents? We have live demos you can try 🔥: - Worm 🐍: https://huggingface.co/spaces/unity/ML-Agents-Worm - PushBlock 🧊: https://huggingface.co/spaces/unity/ML-Agents-PushBlock - Pyramids 🏆: https://huggingface.co/spaces/unity/ML-Agents-Pyramids - Walker 🚶: https://huggingface.co/spaces/unity/ML-Agents-Walker https://preview.redd.it/r7dqmywbyd791.png?width=1435&format=png&auto=webp&s=f0bdcf82ed2ba35101159d442dcfdaf6eb4d98ee If you have questions and feedback, I would love to answer them. Keep Learning, Stay awesome 🤗 submitted by /u/cranthir_ [link] [comments]  ( 83 min )
    I have an idea that makes sense but is not working :/
    Hello Hello, Heads up, it might sound complicated but it is a simple idea. I have an RL agent trying to solve a certain problem, trained using PPO. I also have an expert, i.e. an agent that already knows how to tackle the given problem. I am assuming that in simulations, I have access to the expert policy (meaning I can easily generate trajectories using the expert). I am trying to use the expert to speed up the learning of my agent. A pseudo-code of my "act" function is as follows: https://preview.redd.it/0kw11vpkz9791.png?width=447&format=png&auto=webp&s=ba07d7cee9377fd949af64e6574a5c3e56e4d4f1 So basically, if use_expert is 0, nothing is new: it is the normal act function where the agent gets actions based on its own actor network. If use_expert is 1, the only difference is that the agent no longer samples actions based on its own actor, but gets the action suggested by the expert. Since PPO requires logprobs, I still get the logprob based on the agent's own distribution, but using the action suggested by the expert. My main aim here is that, if I introduce this for a small portion of the learning, my agent would have exposure to more rewarding experiences, and hopefully learn faster. I have a hyperparameter (expert_rate) that determines how frequently I use expert actions in my learning (how frequently I set use_expert to 1). However, this doesn't seem to be working. As a matter of fact, for fun I set expert_rate to 100% (i.e. the agent is always acting in the environment based on the expert suggestions), and I notice no learning whatsoever. I am already familiar with the works that try to incorporate imitation learning with RL, but I'm trying to avoid using imitation learning (issues related to the problem I'm solving). Any idea what could be the problem? submitted by /u/AhmedNizam_ [link] [comments]  ( 86 min )
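The "act" function in the post is only available as an image, so the following is a minimal sketch of the behaviour described in the text, not the poster's actual code. It assumes a discrete action space; `actor_logits`, `expert_action`, and the `softmax` helper are hypothetical names introduced for illustration.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def act(actor_logits, expert_action, use_expert, rng=random):
    """Return (action, logprob) following the description in the post.

    actor_logits  -- hypothetical output of the agent's actor for this state
    expert_action -- action the expert would take in this state
    use_expert    -- 1 to execute the expert's action, 0 to sample from the actor
    """
    probs = softmax(actor_logits)
    if use_expert:
        action = expert_action
    else:
        action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # In both branches the log-probability is taken from the agent's own
    # distribution, evaluated at whichever action was actually executed.
    logprob = math.log(probs[action])
    return action, logprob
```

Written this way, one thing that may be worth checking becomes visible: when use_expert is 1, the stored logprob belongs to the agent's policy while the action was chosen by a different policy, so the PPO probability ratio is computed for data that was not sampled from the agent's old policy.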
  • Open

    Family Style: Li Auto L9 Brings Top-Line Luxury and Intelligence to Full-Size SUV With NVIDIA DRIVE Orin
    Finally, there’s a family car any kid would want to be seen in. Beijing-based startup Li Auto this week rolled out its second electric vehicle, the L9. It’s a full-size SUV decked out with the latest intelligent driving technology. With AI features and an extended battery range of more than 800 miles, the L9 promises Read article > The post Family Style: Li Auto L9 Brings Top-Line Luxury and Intelligence to Full-Size SUV With NVIDIA DRIVE Orin appeared first on NVIDIA Blog.  ( 5 min )
    Making an Impact: GFN Thursday Transforms Macs Into GeForce Gaming PCs
    Thanks to the GeForce cloud, even Mac users can be PC gamers. This GFN Thursday, fire up your Macbook and get your game on. This week brings eight more games to the GeForce NOW library. Plus, members can play Genshin Impact and claim a reward to start them out on their journeys streaming on GeForce Read article > The post Making an Impact: GFN Thursday Transforms Macs Into GeForce Gaming PCs appeared first on NVIDIA Blog.  ( 6 min )
  • Open

    Import data from cross-account Amazon Redshift in Amazon SageMaker Data Wrangler for exploratory data analysis and data preparation
    Organizations moving towards a data-driven culture embrace the use of data and machine learning (ML) in decision-making. To make ML-based decisions from data, you need your data available, accessible, clean, and in the right format to train ML models. Organizations with a multi-account architecture want to avoid situations where they must extract data from one […]  ( 7 min )
    Predict types of machine failures with no-code machine learning using Amazon SageMaker Canvas
    Predicting common machine failure types is critical in manufacturing industries. Given a set of characteristics of a product that is tied to a given type of failure, you can develop a model that can predict the failure type when you feed those attributes to a machine learning (ML) model. ML can help with insights, but […]  ( 10 min )
  • Open

    Learning to Play Minecraft with Video PreTraining (VPT)
    We trained a neural network to play Minecraft by Video PreTraining (VPT) on a massive unlabeled video dataset of human Minecraft play, while using only a small amount of labeled contractor data. With fine-tuning, our model can learn to craft diamond tools, a task that usually takes proficient humans over  ( 8 min )
  • Open

    Robots play with play dough
    A new system lets robots manipulate soft, deformable material into various shapes from visual inputs, which could one day enable better home assistants.  ( 6 min )
  • Open

    GODEL: Combining goal-oriented dialog with real-world conversations
    They make restaurant recommendations, help us pay bills, and remind us of appointments. Many people have come to rely on virtual assistants and chatbots to perform a wide range of routine tasks. But what if a single dialog agent, the technology behind these language-based apps, could perform all these tasks and then take the conversation […] The post GODEL: Combining goal-oriented dialog with real-world conversations appeared first on Microsoft Research.  ( 11 min )
  • Open

    The quality of an RNG depends on the application
    A random number generator can be good for some purposes and not for others. This isn’t surprising given the fundamentally impossible task such generators are supposed to perform. Technically a random number generator is a pseudo random number generator because it cannot produce random numbers. But random is as random does, and for many purposes […] The quality of an RNG depends on the application first appeared on John D. Cook.  ( 6 min )
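As an illustration that is not taken from the post, here is a small sketch of a generator that can be adequate for one use and poor for another. It assumes a linear congruential generator modulo 2^32 with a commonly cited parameter set; the function name `lcg` is hypothetical.

```python
def lcg(seed, n, a=1664525, c=1013904223, m=2**32):
    """Generate n values from a linear congruential generator.

    The constants are an assumed, commonly cited parameter set;
    both a and c are odd and m is a power of two.
    """
    x = seed
    out = []
    for _ in range(n):
        x = (a * x + c) % m
        out.append(x)
    return out


values = lcg(seed=12345, n=16)

# Lowest bit of each output: with odd a, odd c and a power-of-two modulus,
# the parity flips on every step, so this sequence strictly alternates.
low_bits = [v & 1 for v in values]

# Highest bit of each output: this does not follow that two-step pattern.
high_bits = [v >> 31 for v in values]

print(low_bits)
print(high_bits)
```

Using the lowest bit as a coin flip would be a poor choice here, while scaling the whole word to a float in [0, 1) may be acceptable for a rough simulation.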
  • Open

    How AI is Stopping Money Laundering
    Anti-money laundering (AML) and know-your-customer (KYC) compliance might be transformed by artificial intelligence (AI). Artificial intelligence systems may also mine vast amounts of data through KYC verification companies for risk-relevant information for anti-money laundering reasons, making identifying high-risk clients easier. AI is beneficial when completing repetitive activities since it saves time, effort, and resources that… Read More »How AI is Stopping Money Laundering The post How AI is Stopping Money Laundering appeared first on Data Science Central.  ( 19 min )
    Basic E-Discovery Concepts Every Attorney Should Know
    Electronic Discovery or E-Discovery is a process wherein electronic data is found, secured, and then searched in order to find effective evidence during criminal or civil legal procedures. Electronic discovery can also be carried out without the connectivity of the Internet from a local computer. Government or court-ordered hacking in order to get critical information as… Read More »Basic E-Discovery Concepts Every Attorney Should Know The post Basic E-Discovery Concepts Every Attorney Should Know appeared first on Data Science Central.  ( 18 min )
    Value of Real-Time Data Visualization and Interpretation
    Representing data with graphics such as charts, plots, infographics, heat maps, bubble clouds, scatter plots, and Mekko charts is referred to as data visualization. Such visual displays and representations of information help communicate complex data relationships and data-driven insights in a way that makes them easy to understand and base decisions on. The goal of… Read More »Value of Real-Time Data Visualization and Interpretation The post Value of Real-Time Data Visualization and Interpretation appeared first on Data Science Central.  ( 20 min )
    Blueprint for Building a Data Product Business
    Note: I got feedback that my Data Product Blueprint process in Figure 7 was waterfall, not agile.  Totally agree and that’s my bad.  I’ve updated the image and will release a future blog to address the questions that I got about that process.  Thanks for your feedback! A Blueprint is a detailed design plan of… Read More »Blueprint for Building a Data Product Business The post Blueprint for Building a Data Product Business appeared first on Data Science Central.  ( 21 min )
    AI Goes Mainstream
    The initial uptake of AI was within financial services. That still continues, but we are now seeing adoption beyond the traditional industries dominated by AI. The CB Insights AI 100 is an annual list of interesting AI companies. This year, I saw companies applying AI to nontraditional sectors. These areas are relatively hard to acquire data for at… Read More »AI Goes Mainstream The post AI Goes Mainstream appeared first on Data Science Central.  ( 18 min )
  • Open

    GEMv2: Multilingual NLG Benchmarking in a Single Line of Code. (arXiv:2206.11249v1 [cs.CL])
    Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables the comparison on equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make following best model evaluation practices easier, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each other's work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.  ( 3 min )
    Learning Monotone Dynamics by Neural Networks. (arXiv:2006.06417v2 [cs.LG] UPDATED)
    Feed-forward neural networks (FNNs) work as standard building blocks in applying artificial intelligence (AI) to the physical world. They allow learning the dynamics of unknown physical systems (e.g., biological and chemical) to predict their future behavior. However, they are likely to violate the physical constraints of those systems without proper treatment. This work focuses on imposing two important physical constraints: monotonicity (i.e., a partial order of system states is preserved over time) and stability (i.e., the system states converge over time) when using FNNs to learn physical dynamics. For monotonicity constraints, we propose to use nonnegative neural networks and batch normalization. For both monotonicity and stability constraints, we propose to learn the system dynamics and corresponding Lyapunov function simultaneously. As demonstrated by case studies, our methods can preserve the stability and monotonicity of FNNs and significantly reduce their prediction errors.  ( 2 min )
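To make the monotonicity idea concrete, here is a minimal sketch of why nonnegative weights help. It is not the paper's implementation: it omits batch normalization and training, uses random weights, and the names `make_nonnegative_mlp` and `forward` are hypothetical. It assumes a monotone increasing activation (tanh), so that an elementwise-larger input can only produce an elementwise-larger (or equal) output.

```python
import math
import random


def make_nonnegative_mlp(sizes, seed=0):
    """Build random layers whose weights are all nonnegative (biases are free)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[abs(rng.gauss(0, 1)) for _ in range(n_in)] for _ in range(n_out)]
        biases = [rng.gauss(0, 1) for _ in range(n_out)]
        layers.append((weights, biases))
    return layers


def forward(layers, x):
    """Apply each layer as tanh(W h + b); tanh is monotone increasing."""
    h = list(x)
    for weights, biases in layers:
        h = [
            math.tanh(sum(w * v for w, v in zip(row, h)) + b)
            for row, b in zip(weights, biases)
        ]
    return h


layers = make_nonnegative_mlp([2, 4, 1])
smaller = forward(layers, [0.1, -0.3])
larger = forward(layers, [0.4, 0.2])  # elementwise >= the first input
print(smaller, larger)
```

The partial order on inputs is preserved because every layer is a composition of a nonnegative linear map, a shift, and an increasing function.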
    AlphaMLDigger: A Novel Machine Learning Solution to Explore Excess Return on Investment. (arXiv:2206.11072v1 [q-fin.CP])
    How to quickly and automatically mine effective information to serve investment decisions has attracted more and more attention from academia and industry, and the global pandemic has raised new challenges. This paper proposes a two-phase AlphaMLDigger that effectively finds excess returns in a highly volatile market. In phase 1, a deep sequential NLP model is proposed to convert blog posts on Sina Microblog into market sentiment. In phase 2, the predicted market sentiment is combined with social network indicator features and stock market history features to predict stock movements with different Machine Learning models and optimizers. The results show that our AlphaMLDigger achieves higher accuracy on the test set than previous works and is robust to the negative impact of COVID-19 to some extent.  ( 2 min )
    Efficient Online Linear Control with Stochastic Convex Costs and Unknown Dynamics. (arXiv:2203.01170v2 [math.OC] UPDATED)
    We consider the problem of controlling an unknown linear dynamical system under a stochastic convex cost and full feedback of both the state and cost function. We present a computationally efficient algorithm that attains an optimal $\sqrt{T}$ regret-rate compared to the best stabilizing linear controller in hindsight. In contrast to previous work, our algorithm is based on the Optimism in the Face of Uncertainty paradigm. This results in a substantially improved computational complexity and a simpler analysis.  ( 2 min )
    Explainable Artificial Intelligence Methods in Combating Pandemics: A Systematic Review. (arXiv:2112.12705v3 [cs.AI] UPDATED)
    Despite the myriad peer-reviewed papers demonstrating novel Artificial Intelligence (AI)-based solutions to COVID-19 challenges during the pandemic, few have made significant clinical impact. The impact of artificial intelligence during the COVID-19 pandemic was greatly limited by lack of model transparency. This systematic review examines the use of Explainable Artificial Intelligence (XAI) during the pandemic and how its use could overcome barriers to real-world success. We find that successful use of XAI can improve model performance, instill trust in the end-user, and provide the value needed to affect user decision-making. We introduce the reader to common XAI techniques, their utility, and specific examples of their application. Evaluation of XAI results is also discussed as an important step to maximize the value of AI-based clinical decision support systems. We illustrate the classical, modern, and potential future trends of XAI to elucidate the evolution of novel XAI techniques. Finally, we provide a checklist of suggestions during the experimental design process supported by recent publications. Common challenges during the implementation of AI solutions are also addressed with specific examples of potential solutions. We hope this review may serve as a guide to improve the clinical impact of future AI-based solutions.  ( 3 min )
    Variational Causal Dynamics: Discovering Modular World Models from Interventions. (arXiv:2206.11131v1 [cs.LG])
    Latent world models allow agents to reason about complex environments with high-dimensional observations. However, adapting to new environments and effectively leveraging previous knowledge remain significant challenges. We present variational causal dynamics (VCD), a structured world model that exploits the invariance of causal mechanisms across environments to achieve fast and modular adaptation. By causally factorising a transition model, VCD is able to identify reusable components across different environments. This is achieved by combining causal discovery and variational inference to learn a latent representation and transition model jointly in an unsupervised manner. Specifically, we optimise the evidence lower bound jointly over a representation model and a transition model structured as a causal graphical model. In evaluations on simulated environments with state and image observations, we show that VCD is able to successfully identify causal variables, and to discover consistent causal structures across different environments. Moreover, given a small number of observations in a previously unseen, intervened environment, VCD is able to identify the sparse changes in the dynamics and to adapt efficiently. In doing so, VCD significantly extends the capabilities of the current state-of-the-art in latent world models while also comparing favourably in terms of prediction accuracy.  ( 2 min )
    Supervised Graph Contrastive Learning for Few-shot Node Classification. (arXiv:2203.15936v3 [cs.LG] UPDATED)
    Graphs are present in many real-world applications, such as financial fraud detection, commercial recommendation, and social network analysis. But given the high cost of graph annotation or labeling, we face a severe graph label-scarcity problem, i.e., a graph might have a few labeled nodes. One example of such a problem is the so-called few-shot node classification. A predominant approach to this problem resorts to episodic meta-learning. In this work, we challenge the status quo by asking a fundamental question whether meta-learning is a must for few-shot node classification tasks. We propose a new and simple framework under the standard few-shot node classification setting as an alternative to meta-learning to learn an effective graph encoder. The framework consists of supervised graph contrastive learning with novel mechanisms for data augmentation, subgraph encoding, and multi-scale contrast on graphs. Extensive experiments on three benchmark datasets (CoraFull, Reddit, Ogbn) show that the new framework significantly outperforms state-of-the-art meta-learning based methods.  ( 2 min )
    Beyond Low-pass Filtering: Graph Convolutional Networks with Automatic Filtering. (arXiv:2107.04755v3 [cs.LG] UPDATED)
    Graph convolutional networks are becoming indispensable for deep learning from graph-structured data. Most of the existing graph convolutional networks share two big shortcomings. First, they are essentially low-pass filters, thus the potentially useful middle and high frequency band of graph signals are ignored. Second, the bandwidth of existing graph convolutional filters is fixed. Parameters of a graph convolutional filter only transform the graph inputs without changing the curvature of a graph convolutional filter function. In reality, we are uncertain about whether we should retain or cut off the frequency at a certain point unless we have expert domain knowledge. In this paper, we propose Automatic Graph Convolutional Networks (AutoGCN) to capture the full spectrum of graph signals and automatically update the bandwidth of graph convolutional filters. While it is based on graph spectral theory, our AutoGCN is also localized in space and has a spatial form. Experimental results show that AutoGCN achieves significant improvement over baseline methods which only work as low-pass filters.  ( 2 min )
    Nonparametric Multi-shape Modeling with Uncertainty Quantification. (arXiv:2206.09127v2 [stat.ML] UPDATED)
    The modeling and uncertainty quantification of closed curves is an important problem in the field of shape analysis, and can have significant ramifications for subsequent statistical tasks. Many of these tasks involve collections of closed curves, which often exhibit structural similarities at multiple levels. Modeling multiple closed curves in a way that efficiently incorporates such between-curve dependence remains a challenging problem. In this work, we propose and investigate a multiple-output (a.k.a. multi-output), multi-dimensional Gaussian process modeling framework. We illustrate the proposed methodological advances, and demonstrate the utility of meaningful uncertainty quantification, on several curve and shape-related tasks. This model-based approach not only addresses the problem of inference on closed curves (and their shapes) with kernel constructions, but also opens doors to nonparametric modeling of multi-level dependence for functional objects in general.  ( 2 min )
    Scalable and Efficient Training of Large Convolutional Neural Networks with Differential Privacy. (arXiv:2205.10683v2 [cs.LG] UPDATED)
    Large convolutional neural networks (CNN) can be difficult to train in the differentially private (DP) regime, since the optimization algorithms require a computationally expensive operation, known as the per-sample gradient clipping. We propose an efficient and scalable implementation of this clipping on convolutional layers, termed as the mixed ghost clipping, that significantly eases the private training in terms of both time and space complexities, without affecting the accuracy. The improvement in efficiency is rigorously studied through the first complexity analysis for the mixed ghost clipping and existing DP training algorithms. Extensive experiments on vision classification tasks, with large ResNet, VGG, and Vision Transformers, demonstrate that DP training with mixed ghost clipping adds $1\sim 10\%$ memory overhead and $<2\times$ slowdown to the standard non-private training. Specifically, when training VGG19 on CIFAR10, the mixed ghost clipping is $3\times$ faster than state-of-the-art Opacus library with $18\times$ larger maximum batch size. To emphasize the significance of efficient DP training on convolutional layers, we achieve 96.7% accuracy on CIFAR10 and 83.0% on CIFAR100 at $\epsilon=1$ using BEiT, while the previous best results are 94.8% and 67.4%, respectively. We open-source a privacy engine (https://github.com/JialinMao/private_CNN) that implements DP training of CNN with a few lines of code.  ( 2 min )
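For readers unfamiliar with the operation the abstract calls expensive, the following is a minimal sketch of naive per-sample gradient clipping as used in standard DP-SGD. It is not the paper's mixed ghost clipping and does not use the linked privacy engine; gradients are plain Python lists, and the names `clip_per_sample` and `private_gradient` are hypothetical.

```python
import math
import random


def clip_per_sample(grads, max_norm):
    """Scale each per-sample gradient so its L2 norm is at most max_norm."""
    clipped = []
    for g in grads:
        norm = math.sqrt(sum(v * v for v in g))
        scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
        clipped.append([v * scale for v in g])
    return clipped


def private_gradient(grads, max_norm, noise_multiplier, rng):
    """Clip each sample's gradient, sum, add Gaussian noise, then average."""
    clipped = clip_per_sample(grads, max_norm)
    dim = len(grads[0])
    summed = [sum(g[j] for g in clipped) for j in range(dim)]
    # Noise standard deviation is noise_multiplier * max_norm per coordinate.
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    return [v / len(grads) for v in noisy]


batch = [[3.0, 4.0], [0.3, 0.4]]  # two per-sample gradients with norms 5.0 and 0.5
print(clip_per_sample(batch, max_norm=1.0))
print(private_gradient(batch, max_norm=1.0, noise_multiplier=1.0, rng=random.Random(0)))
```

The cost in a real network comes from having to materialize one gradient per sample rather than one per batch, which is the overhead that more efficient clipping schemes aim to reduce.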
    The Privacy Onion Effect: Memorization is Relative. (arXiv:2206.10469v2 [cs.LG] UPDATED)
    Machine learning models trained on private datasets have been shown to leak their private data. While recent work has found that the average data point is rarely leaked, the outlier samples are frequently subject to memorization and, consequently, privacy leakage. We demonstrate and analyse an Onion Effect of memorization: removing the "layer" of outlier points that are most vulnerable to a privacy attack exposes a new layer of previously-safe points to the same attack. We perform several experiments to study this effect, and understand why it occurs. The existence of this effect has various consequences. For example, it suggests that proposals to defend against memorization without training with rigorous privacy guarantees are unlikely to be effective. Further, it suggests that privacy-enhancing technologies such as machine unlearning could actually harm the privacy of other users.  ( 2 min )
    Multiple Testing Framework for Out-of-Distribution Detection. (arXiv:2206.09522v2 [stat.ML] UPDATED)
    We study the problem of Out-of-Distribution (OOD) detection, that is, detecting whether a learning algorithm's output can be trusted at inference time. While a number of tests for OOD detection have been proposed in prior work, a formal framework for studying this problem is lacking. We propose a definition for the notion of OOD that includes both the input distribution and the learning algorithm, which provides insights for the construction of powerful tests for OOD detection. We propose a multiple hypothesis testing inspired procedure to systematically combine any number of different statistics from the learning algorithm using conformal p-values. We further provide strong guarantees on the probability of incorrectly classifying an in-distribution sample as OOD. In our experiments, we find that threshold-based tests proposed in prior work perform well in specific settings, but not uniformly well across different types of OOD instances. In contrast, our proposed method that combines multiple statistics performs uniformly well across different datasets and neural networks.  ( 2 min )
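As a rough sketch of the building block mentioned above, a conformal p-value compares a test point's score against scores from held-out in-distribution calibration data. The code assumes that a higher score means "more atypical". The Bonferroni combination is a simple stand-in chosen for illustration and is not claimed to be the paper's procedure; all function names are hypothetical.

```python
def conformal_p_value(calibration_scores, test_score):
    """Fraction of scores (calibration plus the test point) at least as extreme."""
    n = len(calibration_scores)
    at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (1 + at_least) / (n + 1)


def bonferroni_combine(p_values):
    """Combine p-values from several statistics (assumed stand-in, not the paper's method)."""
    return min(1.0, len(p_values) * min(p_values))


calibration = [0.1, 0.2, 0.3, 0.4]  # scores on in-distribution calibration data
p_typical = conformal_p_value(calibration, 0.0)
p_atypical = conformal_p_value(calibration, 1.0)
print(p_typical, p_atypical)
```

A small combined p-value would flag the input as OOD; the threshold controls how often an in-distribution sample is flagged by mistake.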
    From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses. (arXiv:2205.07704v2 [stat.ML] UPDATED)
    We propose the Bayes-UCBVI algorithm for reinforcement learning in tabular, stage-dependent, episodic Markov decision process: a natural extension of the Bayes-UCB algorithm by Kaufmann et al. (2012) for multi-armed bandits. Our method uses the quantile of a Q-value function posterior as upper confidence bound on the optimal Q-value function. For Bayes-UCBVI, we prove a regret bound of order $\widetilde{O}(\sqrt{H^3SAT})$ where $H$ is the length of one episode, $S$ is the number of states, $A$ the number of actions, $T$ the number of episodes, that matches the lower-bound of $\Omega(\sqrt{H^3SAT})$ up to poly-$\log$ terms in $H,S,A,T$ for a large enough $T$. To the best of our knowledge, this is the first algorithm that obtains an optimal dependence on the horizon $H$ (and $S$) without the need for an involved Bernstein-like bonus or noise. Crucial to our analysis is a new fine-grained anti-concentration bound for a weighted Dirichlet sum that can be of independent interest. We then explain how Bayes-UCBVI can be easily extended beyond the tabular setting, exhibiting a strong link between our algorithm and Bayesian bootstrap (Rubin, 1981).  ( 2 min )
    Learning to Estimate and Refine Fluid Motion with Physical Dynamics. (arXiv:2206.10480v2 [cs.LG] UPDATED)
    Extracting information on fluid motion directly from images is challenging. Fluid flow represents a complex dynamic system governed by the Navier-Stokes equations. General optical flow methods are typically designed for rigid body motion, and thus struggle if applied to fluid motion estimation directly. Further, optical flow methods only focus on two consecutive frames without utilising historical temporal information, while the fluid motion (velocity field) can be considered a continuous trajectory constrained by time-dependent partial differential equations (PDEs). This discrepancy has the potential to induce physically inconsistent estimations. Here we propose an unsupervised learning based prediction-correction scheme for fluid flow estimation. An estimate is first given by a PDE-constrained optical flow predictor, which is then refined by a physical based corrector. The proposed approach outperforms optical flow methods and shows competitive results compared to existing supervised learning based methods on a benchmark dataset. Furthermore, the proposed approach can generalize to complex real-world fluid scenarios where ground truth information is effectively unknowable. Finally, experiments demonstrate that the physical corrector can refine flow estimates by mimicking the operator splitting method commonly utilised in fluid dynamical simulation.  ( 2 min )
    Business Document Information Extraction: Towards Practical Benchmarks. (arXiv:2206.11229v1 [cs.IR])
    Information extraction from semi-structured documents is crucial for frictionless business-to-business (B2B) communication. While machine learning problems related to Document Information Extraction (IE) have been studied for decades, many common problem definitions and benchmarks do not reflect domain-specific aspects and practical needs for automating B2B document communication. We review the landscape of Document IE problems, datasets and benchmarks. We highlight the practical aspects missing in the common definitions and define the Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR) problems. There is a lack of relevant datasets and benchmarks for Document IE on semi-structured business documents as their content is typically legally protected or sensitive. We discuss potential sources of available documents including synthetic data.  ( 2 min )
    FedorAS: Federated Architecture Search under system heterogeneity. (arXiv:2206.11239v1 [cs.LG])
    Federated learning (FL) has recently gained considerable attention due to its ability to use decentralised data while preserving privacy. However, it also poses additional challenges related to the heterogeneity of the participating devices, both in terms of their computational capabilities and contributed data. Meanwhile, Neural Architecture Search (NAS) has been successfully used with centralised datasets, producing state-of-the-art results in constrained (hardware-aware) and unconstrained settings. However, even the most recent work lying at the intersection of NAS and FL assumes a homogeneous compute environment with datacenter-grade hardware and does not address the issues of working with constrained, heterogeneous devices. As a result, practical usage of NAS in a federated setting remains an open problem that we address in our work. We design our system, FedorAS, to discover and train promising architectures when dealing with devices of varying capabilities holding non-IID distributed data, and present empirical evidence of its effectiveness across different settings. Specifically, we evaluate FedorAS across datasets spanning three different modalities (vision, speech, text) and show its better performance compared to state-of-the-art federated solutions, while maintaining resource efficiency.  ( 2 min )
    Inference of Multiscale Gaussian Graphical Model. (arXiv:2202.05775v2 [stat.ML] UPDATED)
    Gaussian Graphical Models (GGMs) are widely used for exploratory data analysis in various fields such as genomics, ecology, and psychometry. In a high-dimensional setting, when the number of variables exceeds the number of observations by several orders of magnitude, the estimation of a GGM is a difficult and unstable optimization problem. Clustering of variables or variable selection is often performed prior to GGM estimation. We propose a new method that simultaneously infers a hierarchical clustering structure and the graphs describing the structure of independence at each level of the hierarchy. This method is based on solving a convex optimization problem combining a graphical lasso penalty with a fused type lasso penalty. Results on real and synthetic data are presented.  ( 2 min )
    Private and polynomial time algorithms for learning Gaussians and beyond. (arXiv:2111.11320v3 [stat.ML] UPDATED)
    We present a fairly general framework for reducing $(\varepsilon, \delta)$ differentially private (DP) statistical estimation to its non-private counterpart. As the main application of this framework, we give a polynomial time and $(\varepsilon,\delta)$-DP algorithm for learning (unrestricted) Gaussian distributions in $\mathbb{R}^d$. The sample complexity of our approach for learning the Gaussian up to total variation distance $\alpha$ is $\widetilde{O}(d^2/\alpha^2 + d^2\sqrt{\ln(1/\delta)}/\alpha \varepsilon + d\ln(1/\delta) / \alpha \varepsilon)$ matching (up to logarithmic factors) the best known information-theoretic (non-efficient) sample complexity upper bound due to Aden-Ali, Ashtiani, and Kamath (ALT'21). In an independent work, Kamath, Mouzakis, Singhal, Steinke, and Ullman (arXiv:2111.04609) proved a similar result using a different approach and with $O(d^{5/2})$ sample complexity dependence on $d$. As another application of our framework, we provide the first polynomial time $(\varepsilon, \delta)$-DP algorithm for robust learning of (unrestricted) Gaussians with sample complexity $\widetilde{O}(d^{3.5})$. In another independent work, Kothari, Manurangsi, and Velingker (arXiv:2112.03548) also provided a polynomial time $(\varepsilon, \delta)$-DP algorithm for robust learning of Gaussians with sample complexity $\widetilde{O}(d^8)$.  ( 2 min )
    Adversarially trained neural representations may already be as robust as corresponding biological neural representations. (arXiv:2206.11228v1 [q-bio.NC])
    Visual systems of primates are the gold standard of robust perception. There is thus a general belief that mimicking the neural representations that underlie those systems will yield artificial visual systems that are adversarially robust. In this work, we develop a method for performing adversarial visual attacks directly on primate brain activity. We then leverage this method to demonstrate that the above-mentioned belief might not be well founded. Specifically, we report that the biological neurons that make up visual systems of primates exhibit susceptibility to adversarial perturbations that is comparable in magnitude to existing (robustly trained) artificial neural networks.  ( 2 min )
    X-Risk Analysis for AI Research. (arXiv:2206.05862v3 [cs.CY] CROSS LISTED)
    Artificial intelligence (AI) has the potential to greatly improve society, but as with any powerful technology, it comes with heightened risks and responsibilities. Current AI research lacks a systematic discussion of how to manage long-tail risks from AI systems, including speculative long-term risks. Keeping in mind the potential benefits of AI, there is some concern that building ever more intelligent and powerful AI systems could eventually result in systems that are more powerful than us; some say this is like playing with fire and speculate that this could create existential risks (x-risks). To add precision and ground these discussions, we provide a guide for how to analyze AI x-risk, which consists of three parts: First, we review how systems can be made safer today, drawing on time-tested concepts from hazard analysis and systems safety that have been designed to steer large processes in safer directions. Next, we discuss strategies for having long-term impacts on the safety of future systems. Finally, we discuss a crucial concept in making AI systems safer by improving the balance between safety and general capabilities. We hope this document and the presented concepts and tools serve as a useful guide for understanding how to analyze AI x-risk.
    Introduction to Machine Learning for the Sciences. (arXiv:2102.04883v2 [physics.comp-ph] UPDATED)
    This is an introductory machine-learning course specifically developed with STEM students in mind. Our goal is to provide the interested reader with the basics to employ machine learning in their own projects and to familiarize themselves with the terminology as a foundation for further reading of the relevant literature. In these lecture notes, we discuss supervised, unsupervised, and reinforcement learning. The notes start with an exposition of machine learning methods without neural networks, such as principal component analysis, t-SNE, clustering, as well as linear regression and linear classifiers. We continue with an introduction to both basic and advanced neural-network structures such as dense feed-forward and convolutional neural networks, recurrent neural networks, restricted Boltzmann machines, (variational) autoencoders, and generative adversarial networks. Questions of interpretability are discussed for latent-space representations and using the examples of dreaming and adversarial attacks. The final section is dedicated to reinforcement learning, where we introduce basic notions of value functions and policy learning.
    Encoding large information structures in linear algebra and statistical models. (arXiv:2201.08233v3 [cs.LG] UPDATED)
    Large information sizes in samples and features can be encoded to speed up the learning of statistical models based on linear algebra and remove unwanted signals. Encoding information can reduce both sample and feature dimension to a smaller representational set. Here two examples are shown on linear mixed models and mixture models speeding up the run time for parameter estimation by a factor defined by the user's choice on dimension reduction (can be linear, quadratic or beyond based on dimension specification).
    Scaling and Scalability: Provable Nonconvex Low-Rank Tensor Estimation from Incomplete Measurements. (arXiv:2104.14526v3 [cs.LG] UPDATED)
    Tensors, which provide a powerful and flexible model for representing multi-attribute data and multi-way interactions, play an indispensable role in modern data science across various fields in science and engineering. A fundamental task is to faithfully recover the tensor from highly incomplete measurements in a statistically and computationally efficient manner. Harnessing the low-rank structure of tensors in the Tucker decomposition, this paper develops a scaled gradient descent (ScaledGD) algorithm to directly recover the tensor factors with tailored spectral initializations, and shows that it provably converges at a linear rate independent of the condition number of the ground truth tensor for two canonical problems -- tensor completion and tensor regression -- as soon as the sample size is above the order of $n^{3/2}$ ignoring other parameter dependencies, where $n$ is the dimension of the tensor. This leads to an extremely scalable approach to low-rank tensor estimation compared with prior art, which suffers from at least one of the following drawbacks: extreme sensitivity to ill-conditioning, high per-iteration costs in terms of memory and computation, or poor sample complexity guarantees. To the best of our knowledge, ScaledGD is the first algorithm that achieves near-optimal statistical and computational complexities simultaneously for low-rank tensor completion with the Tucker decomposition. Our algorithm highlights the power of appropriate preconditioning in accelerating nonconvex statistical estimation, where the iteration-varying preconditioners promote desirable invariance properties of the trajectory with respect to the underlying symmetry in low-rank tensor factorization.
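To make the preconditioning idea concrete, here is a minimal sketch of scaled gradient descent in the simpler matrix setting, not the Tucker tensor algorithm the abstract describes. It assumes a fully observed, exactly low-rank matrix `M`. Because the spectral initialization is then already exact, the sketch adds a small artificial perturbation (`init_noise`) so that there is something to iterate on. The function name `scaled_gd` and all parameter values are illustrative assumptions.

```python
import numpy as np

def scaled_gd(M, rank, step=0.5, iters=100, init_noise=1e-4, seed=0):
    """Factor M ~= L @ R.T with preconditioned ("scaled") gradient steps.

    Illustrative matrix analogue only; the paper itself treats Tucker tensors.
    """
    rng = np.random.default_rng(seed)
    # Spectral initialization: top-`rank` singular triplets of M.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    L = U[:, :rank] * np.sqrt(s[:rank])
    R = Vt[:rank].T * np.sqrt(s[:rank])
    # Artificial perturbation (assumption) so the iteration has work to do.
    L = L + init_noise * rng.standard_normal(L.shape)
    R = R + init_noise * rng.standard_normal(R.shape)
    for _ in range(iters):
        residual = L @ R.T - M
        # Plain gradients are residual @ R and residual.T @ L; the inverse
        # Gram matrices rescale them so progress does not depend on how
        # ill-conditioned the factors are.
        L_new = L - step * residual @ R @ np.linalg.inv(R.T @ R)
        R_new = R - step * residual.T @ L @ np.linalg.inv(L.T @ L)
        L, R = L_new, R_new
    return L, R

# Usage: a rank-3 matrix with condition number 100.
rng = np.random.default_rng(1)
A, _ = np.linalg.qr(rng.standard_normal((30, 3)))
B, _ = np.linalg.qr(rng.standard_normal((20, 3)))
M = A @ np.diag([100.0, 10.0, 1.0]) @ B.T
L, R = scaled_gd(M, rank=3)
relative_error = np.linalg.norm(L @ R.T - M) / np.linalg.norm(M)
```

Both factors are updated from the same residual before either is overwritten, which keeps the two steps symmetric.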
    Model-free Representation Learning and Exploration in Low-rank MDPs. (arXiv:2102.07035v2 [cs.LG] UPDATED)
    The low rank MDP has emerged as an important model for studying representation learning and exploration in reinforcement learning. With a known representation, several model-free exploration strategies exist. In contrast, all algorithms for the unknown representation setting are model-based, thereby requiring the ability to model the full dynamics. In this work, we present the first model-free representation learning algorithms for low rank MDPs. The key algorithmic contribution is a new minimax representation learning objective, for which we provide variants with differing tradeoffs in their statistical and computational properties. We interleave this representation learning step with an exploration strategy to cover the state space in a reward-free manner. The resulting algorithms are provably sample efficient and can accommodate general function approximation to scale to complex environments.
    Convergence Rates for Learning Linear Operators from Noisy Data. (arXiv:2108.12515v2 [math.ST] UPDATED)
    This paper studies the learning of linear operators between infinite-dimensional Hilbert spaces. The training data comprises pairs of random input vectors in a Hilbert space and their noisy images under an unknown self-adjoint linear operator. Assuming that the operator is diagonalizable in a known basis, this work solves the equivalent inverse problem of estimating the operator's eigenvalues given the data. Adopting a Bayesian approach, the theoretical analysis establishes posterior contraction rates in the infinite data limit with Gaussian priors that are not directly linked to the forward map of the inverse problem. The main results also include learning-theoretic generalization error guarantees for a wide range of distribution shifts. These convergence rates quantify the effects of data smoothness and true eigenvalue decay or growth, for compact or unbounded operators, respectively, on sample complexity. Numerical evidence supports the theory in diagonal and non-diagonal settings.
    MMD Aggregated Two-Sample Test. (arXiv:2110.15073v2 [stat.ML] UPDATED)
    We propose a novel nonparametric two-sample test based on the Maximum Mean Discrepancy (MMD), which is constructed by aggregating tests with different kernel bandwidths. This aggregation procedure, called MMDAgg, ensures that test power is maximised over the collection of kernels used, without requiring held-out data for kernel selection (which results in a loss of test power), or arbitrary kernel choices such as the median heuristic. We work in the non-asymptotic framework, and prove that our aggregated test is minimax adaptive over Sobolev balls. Our guarantees are not restricted to a specific kernel, but hold for any product of one-dimensional translation invariant characteristic kernels which are absolutely and square integrable. Moreover, our results apply for popular numerical procedures to determine the test threshold, namely permutations and the wild bootstrap. Through numerical experiments on both synthetic and real-world datasets, we demonstrate that MMDAgg outperforms alternative state-of-the-art approaches to MMD kernel adaptation for two-sample testing.
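As background for the abstract above, the following is a minimal sketch of a plain MMD permutation test with a single, fixed Gaussian bandwidth on 1-D samples. It is not MMDAgg: the aggregation over several bandwidths is omitted, and the function names, the biased (V-statistic) estimator, and the default number of permutations are illustrative assumptions.

```python
import math
import random

def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth):
    """Biased (V-statistic) estimate of the squared MMD for 1-D samples."""
    def mean_k(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)

def permutation_p_value(xs, ys, bandwidth, n_permutations=200, seed=0):
    """p-value from re-splitting the pooled sample at random."""
    rng = random.Random(seed)
    observed = mmd_squared(xs, ys, bandwidth)
    pooled = list(xs) + list(ys)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        stat = mmd_squared(pooled[:len(xs)], pooled[len(xs):], bandwidth)
        if stat >= observed:
            exceed += 1
    # The +1 terms count the observed statistic as one of the permutations.
    return (exceed + 1) / (n_permutations + 1)

# Usage: two samples whose means differ by three standard deviations.
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(40)]
ys = [rng.gauss(3.0, 1.0) for _ in range(40)]
p_value = permutation_p_value(xs, ys, bandwidth=1.0)
```

The result of a test like this depends on the bandwidth chosen, which is the sensitivity that aggregating over a collection of bandwidths is meant to address.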
    Coin Flipping Neural Networks. (arXiv:2206.09182v2 [cs.LG] UPDATED)
    We show that neural networks with access to randomness can outperform deterministic networks by using amplification. We call such networks Coin-Flipping Neural Networks, or CFNNs. We show that a CFNN can approximate the indicator of a $d$-dimensional ball to arbitrary accuracy with only 2 layers and $\mathcal{O}(1)$ neurons, where a 2-layer deterministic network was shown to require $\Omega(e^d)$ neurons, an exponential improvement (arXiv:1610.09887). We prove a highly non-trivial result, that for almost any classification problem, there exists a trivially simple network that solves it given a sufficiently powerful generator for the network's weights. Combining these results we conjecture that for most classification problems, there is a CFNN which solves them with higher accuracy or fewer neurons than any deterministic network. Finally, we verify our proofs experimentally using novel CFNN architectures on CIFAR10 and CIFAR100, reaching an improvement of 9.25% from the baseline.
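The abstract does not spell out its amplification scheme, so the sketch below shows the textbook version as an assumption: run a randomized classifier an odd number of times independently and take the majority vote. If each run is correct with probability `p > 0.5`, the exact binomial sum gives the accuracy of the vote. The function name is hypothetical.

```python
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that the majority of n independent votes is correct,
    when each vote is correct with probability p (n must be odd)."""
    if n % 2 == 0:
        raise ValueError("use an odd number of votes to avoid ties")
    # Sum the binomial probabilities of getting more than n/2 correct votes.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

# Usage: accuracy of the vote as the number of independent runs grows.
accuracies = {n: majority_vote_accuracy(0.6, n) for n in (1, 3, 11, 101)}
```

For any fixed `p` above one half, the accuracy of the vote approaches 1 as `n` grows, which is the sense in which extra randomness can be traded for accuracy.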
    Automatic Autism Spectrum Disorder Detection Using Artificial Intelligence Methods with MRI Neuroimaging: A Review. (arXiv:2206.11233v1 [q-bio.NC])
    Autism spectrum disorder (ASD) is a brain condition characterized by diverse signs and symptoms that appear in early childhood. ASD is also associated with communication deficits and repetitive behavior in affected individuals. Various ASD detection methods have been developed, including neuroimaging modalities and psychological tests. Among these methods, magnetic resonance imaging (MRI) modalities are of paramount importance to physicians. Clinicians rely on MRI modalities to diagnose ASD accurately. The MRI modalities are non-invasive methods that include functional (fMRI) and structural (sMRI) neuroimaging methods. However, the process of diagnosing ASD with fMRI and sMRI is often laborious and time-consuming for specialists; therefore, several computer-aided diagnosis systems (CADS) based on artificial intelligence (AI) have been developed to assist specialist physicians. Conventional machine learning (ML) and deep learning (DL) are the most popular schemes of AI used for diagnosing ASD. This study aims to review the automated detection of ASD using AI. We review several CADS that have been developed using ML techniques for the automated diagnosis of ASD using MRI modalities. There has been very limited work on the use of DL techniques to develop automated diagnostic models for ASD. A summary of the studies developed using DL is provided in the appendix. Then, the challenges encountered during the automated diagnosis of ASD using MRI and AI techniques are described in detail. Additionally, a graphical comparison of studies using ML and DL to diagnose ASD automatically is discussed. We conclude by suggesting future approaches to detecting ASD using AI techniques and MRI neuroimaging.
    Dual-Stream Transformer with Cross-Attention on Whole-Slide Image Pyramids for Cancer Prognosis. (arXiv:2206.05782v2 [eess.IV] UPDATED)
    Cancer prognosis on gigapixel Whole-Slide Images (WSIs) has always been a challenging task. Most existing approaches focus solely on single-resolution images. Multi-resolution schemes, which use image pyramids to enhance WSI visual representations, have not yet received enough attention. In order to explore a multi-resolution solution for improving cancer prognosis accuracy, this paper proposes a dual-stream architecture to model WSIs by an image pyramid strategy. This architecture consists of two sub-streams: one for low-resolution WSIs, and the other especially for high-resolution ones. Compared to other approaches, our scheme has three highlights: (i) there exists a one-to-one relation between stream and resolution; (ii) a square pooling layer is added to align the patches from two resolution streams, largely reducing computation cost and enabling a natural stream feature fusion; (iii) a cross-attention-based method is proposed to pool high-resolution patches spatially under the guidance of low-resolution ones. We validate our scheme on three publicly available datasets with a total of 3,101 WSIs from 1,911 patients. Experimental results verify that (i) hierarchical dual-stream representation is more effective than single-stream ones for cancer prognosis, gaining an average C-Index rise of 5.0% and 1.8% over a single low-resolution and high-resolution stream, respectively; (ii) our dual-stream scheme could outperform current state-of-the-art ones, by an average C-Index improvement of 5.1%; (iii) cancers with observable survival differences could have different preferences for model complexity. Our scheme could serve as an alternative tool for further facilitating WSI prognosis research.
    Adaptive Adversarial Training to Improve Adversarial Robustness of DNNs for Medical Image Segmentation and Detection. (arXiv:2206.01736v2 [eess.IV] UPDATED)
    It is known that deep neural networks (DNNs) are vulnerable to adversarial attacks, and the adversarial robustness of DNNs could be improved by adding adversarial noises to training data (e.g., the standard adversarial training (SAT)). However, inappropriate noises added to training data may reduce a model's performance, which is termed the trade-off between accuracy and robustness. This problem has been sufficiently studied for the classification of whole images but has rarely been explored for image analysis tasks in the medical application domain, including image segmentation, landmark detection, and object detection tasks. In this study, we show that, for those medical image analysis tasks, the SAT method has a severe issue that limits its practical use: it generates a fixed and unified level of noise for all training samples for robust DNN training. A high noise level may lead to a large reduction in model performance and a low noise level may not be effective in improving robustness. To resolve this issue, we design an adaptive-margin adversarial training (AMAT) method that generates sample-wise adaptive adversarial noises for robust DNN training. In contrast to the existing, classification-oriented adversarial training methods, our AMAT method uses a loss-defined-margin strategy so that it can be applied to different tasks as long as the loss functions are well-defined. We successfully apply our AMAT method to state-of-the-art DNNs, using five publicly available datasets. The experimental results demonstrate that: (1) our AMAT method can be applied to the three seemingly different tasks in the medical image application domain; (2) AMAT outperforms the SAT method in adversarial robustness; (3) AMAT has a minimal reduction in prediction accuracy on clean data, compared with the SAT method; and (4) AMAT has almost the same training time cost as SAT.
    Traffic-Twitter Transformer: A Nature Language Processing-joined Framework For Network-wide Traffic Forecasting. (arXiv:2206.11078v1 [cs.LG])
    With accurate and timely traffic forecasting, the impacted traffic conditions can be predicted in advance to guide agencies and residents to respond to changes in traffic patterns appropriately. However, existing work on traffic forecasting has mainly relied on historical traffic patterns and is confined to short-term prediction, for instance under 1 hour. To better manage future roadway capacity and accommodate social and human impacts, it is crucial to propose a flexible and comprehensive framework to predict physical-aware long-term traffic conditions for public users and transportation agencies. In this paper, the gap in robust long-term traffic forecasting is bridged by taking social media features into consideration. A correlation study and a linear regression model were first implemented to evaluate the significance of the correlation between two time series, traffic intensity and Twitter data intensity. The two time series were then fed into our proposed social-aware framework, Traffic-Twitter Transformer, which integrates natural language representations into time-series records for long-term traffic prediction. Experimental results in the Greater Seattle Area showed that our proposed model outperformed baseline models on all evaluation metrics. This NLP-joined social-aware framework can become a valuable tool for network-wide traffic prediction and management for traffic agencies.
    Neural Moving Horizon Estimation for Robust Flight Control. (arXiv:2206.10397v2 [cs.RO] UPDATED)
    Estimating and reacting to external disturbances is crucial for robust flight control of quadrotors. Existing estimators typically require significant tuning for a specific flight scenario or training with extensive real-world data to achieve satisfactory performance. In this paper, we propose a neural moving horizon estimator (NeuroMHE) that can automatically tune the MHE parameters modeled by a neural network and adapt to different flight scenarios. We achieve this by deriving the analytical gradient of the MHE estimates with respect to the tunable parameters, enabling a seamless embedding of MHE as a layer into the neural network for highly effective learning. Most interestingly, we show that the gradient can be solved efficiently from a Kalman filter in a recursive form. Moreover, we develop a model-based policy gradient algorithm to train NeuroMHE directly from the trajectory tracking error without the need for the ground-truth disturbance. The effectiveness of NeuroMHE is verified extensively via both simulations and physical experiments on a quadrotor in various challenging flights. Notably, NeuroMHE outperforms the state-of-the-art estimator with force estimation error reductions of up to 49.4% while using only 2.5% as many parameters. The proposed method is general and can be applied to robust adaptive control for other robotic systems.
    Supervised Learning for Coverage-Directed Test Selection in Simulation-Based Verification. (arXiv:2205.08524v2 [cs.AR] UPDATED)
    Constrained random test generation is one of the most widely adopted methods for generating stimuli for simulation-based verification. Randomness leads to test diversity, but tests tend to repeatedly exercise the same design logic. Constraints are written (typically manually) to bias random tests towards interesting, hard-to-reach, and yet-untested logic. However, as verification progresses, most constrained random tests yield little to no effect on functional coverage. If stimuli generation consumes significantly less resources than simulation, then a better approach involves randomly generating a large number of tests, selecting the most effective subset, and only simulating that subset. In this paper, we introduce a novel method for automatic constraint extraction and test selection. This method, which we call coverage-directed test selection, is based on supervised learning from coverage feedback. Our method biases selection towards tests that have a high probability of increasing functional coverage, and prioritises them for simulation. We show how coverage-directed test selection can reduce manual constraint writing, prioritise effective tests, reduce verification resource consumption, and accelerate coverage closure on a large, real-life industrial hardware design.
    Near-optimal control of dynamical systems with neural ordinary differential equations. (arXiv:2206.11120v1 [cs.LG])
    Optimal control problems naturally arise in many scientific applications where one wishes to steer a dynamical system from a certain initial state $\mathbf{x}_0$ to a desired target state $\mathbf{x}^*$ in finite time $T$. Recent advances in deep learning and neural network-based optimization have contributed to the development of methods that can help solve control problems involving high-dimensional dynamical systems. In particular, the framework of neural ordinary differential equations (neural ODEs) provides an efficient means to iteratively approximate continuous time control functions associated with analytically intractable and computationally demanding control tasks. Although neural ODE controllers have shown great potential in solving complex control problems, the understanding of the effects of hyperparameters such as network structure and optimizers on learning performance is still very limited. Our work aims at addressing some of these knowledge gaps to conduct efficient hyperparameter optimization. To this end, we first analyze how truncated and non-truncated backpropagation through time affect runtime performance and the ability of neural networks to learn optimal control functions. Using analytical and numerical methods, we then study the role of parameter initializations, optimizers, and neural-network architecture. Finally, we connect our results to the ability of neural ODE controllers to implicitly regularize control energy.  ( 2 min )
    Behavior Transformers: Cloning $k$ modes with one stone. (arXiv:2206.11251v1 [cs.LG])
    While behavior learning has made impressive progress in recent times, it lags behind computer vision and natural language processing due to its inability to leverage large, human-generated datasets. Human behaviors have wide variance, multiple modes, and human demonstrations typically do not come with reward labels. These properties limit the applicability of current methods in Offline RL and Behavioral Cloning to learn from large, pre-collected datasets. In this work, we present Behavior Transformer (BeT), a new technique to model unlabeled demonstration data with multiple modes. BeT retrofits standard transformer architectures with action discretization coupled with a multi-task action correction inspired by offset prediction in object detection. This allows us to leverage the multi-modal modeling ability of modern transformers to predict multi-modal continuous actions. We experimentally evaluate BeT on a variety of robotic manipulation and self-driving behavior datasets. We show that BeT significantly improves over prior state-of-the-art work on solving demonstrated tasks while capturing the major modes present in the pre-collected datasets. Finally, through an extensive ablation study, we analyze the importance of every crucial component in BeT. Videos of behavior generated by BeT are available at https://notmahi.github.io/bet  ( 2 min )
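The abstract describes "action discretization coupled with" an offset-style correction; one way to picture that pairing is the sketch below. It assumes a given list of bin centers (for example from a clustering step, which is not shown) and represents each continuous action as the index of its nearest center plus a residual offset. The function names and the nearest-center rule are illustrative assumptions, not the authors' code.

```python
def encode_action(action, centers):
    """Return (bin index, residual offset) for a continuous action vector."""
    def sq_dist(center):
        return sum((a - c) ** 2 for a, c in zip(action, center))
    # Discrete part: index of the nearest bin center.
    idx = min(range(len(centers)), key=lambda i: sq_dist(centers[i]))
    # Continuous part: what the bin center alone fails to capture.
    offset = [a - c for a, c in zip(action, centers[idx])]
    return idx, offset

def decode_action(idx, offset, centers):
    """Rebuild the continuous action from its bin and offset."""
    return [c + o for c, o in zip(centers[idx], offset)]

# Usage with two hypothetical 2-D bin centers.
centers = [[0.0, 0.0], [1.0, 1.0]]
idx, offset = encode_action([0.9, 1.2], centers)
reconstructed = decode_action(idx, offset, centers)
```

With this split, a model can treat the bin index as a classification target, which allows several distinct modes, while the offset restores continuous precision.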
    Discussion of `Multiscale Fisher's Independence Test for Multivariate Dependence'. (arXiv:2206.11142v1 [stat.ME])
    We discuss how MultiFIT, the Multiscale Fisher's Independence Test for Multivariate Dependence proposed by Gorsky and Ma (2022), compares to existing linear-time kernel tests based on the Hilbert-Schmidt independence criterion (HSIC). We highlight the fact that the levels of the kernel tests at any finite sample size can be controlled exactly, as is the case with the level of MultiFIT. In our experiments, we observe some of the performance limitations of MultiFIT in terms of test power.  ( 2 min )
    MRI Reconstruction via Data Driven Markov Chain with Joint Uncertainty Estimation. (arXiv:2202.01479v2 [cs.LG] UPDATED)
    We introduce a framework that enables efficient sampling from learned probability distributions for MRI reconstruction. Different from conventional deep learning-based MRI reconstruction techniques, samples are drawn from the posterior distribution given the measured k-space using the Markov chain Monte Carlo (MCMC) method. In addition to the maximum a posteriori (MAP) estimate for the image, which can be obtained with conventional methods, the minimum mean square error (MMSE) estimate and uncertainty maps can also be computed. The data-driven Markov chains are constructed from the generative model learned from a given image database and are independent of the forward operator that is used to model the k-space measurement. This provides flexibility because the method can be applied to k-space acquired with different sampling schemes or receive coils using the same pre-trained models. Furthermore, we use a framework based on a reverse diffusion process to be able to utilize advanced generative models. The performance of the method is evaluated on an open dataset using 10-fold undersampling in k-space.  ( 2 min )
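Once posterior samples are available, the MMSE estimate is the posterior mean, which the sample mean approximates. The sketch below assumes the samples are already drawn (the sampler itself is not shown), represents each image as a flat list of pixel values, and uses the pixel-wise standard deviation as one possible uncertainty map. That choice and the function name are assumptions.

```python
from statistics import fmean, pstdev

def mmse_and_uncertainty(samples):
    """samples: list of images, each a flat list of pixel values.

    Returns (pixel-wise mean, pixel-wise standard deviation).
    """
    # Transpose so that each tuple holds one pixel across all samples.
    pixels = list(zip(*samples))
    mean_image = [fmean(p) for p in pixels]   # approximates the MMSE estimate
    std_image = [pstdev(p) for p in pixels]   # spread across samples
    return mean_image, std_image

# Usage with two tiny hypothetical two-pixel "images".
mean_image, std_image = mmse_and_uncertainty([[1.0, 2.0], [3.0, 2.0]])
```

Pixels on which the samples agree get zero spread, while pixels on which they disagree are flagged as uncertain.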
    $k$-Anonymity in Practice: How Generalisation and Suppression Affect Machine Learning Classifiers. (arXiv:2102.04763v2 [cs.LG] UPDATED)
    The protection of private information is a crucial issue in data-driven research and business contexts. Typically, techniques like anonymisation or (selective) deletion are introduced in order to allow data sharing, e.g., in the case of collaborative research endeavours. For use with anonymisation techniques, the $k$-anonymity criterion is one of the most popular, with numerous scientific publications on different algorithms and metrics. Anonymisation techniques often require changing the data and thus necessarily affect the results of machine learning models trained on the underlying data. In this work, we conduct a systematic comparison and detailed investigation into the effects of different $k$-anonymisation algorithms on the results of machine learning models. We investigate a set of popular $k$-anonymisation algorithms with different classifiers and evaluate them on different real-world datasets. Our systematic evaluation shows that with an increasingly strong $k$-anonymity constraint, the classification performance generally degrades, but to varying degrees and strongly depending on the dataset and anonymisation method. Furthermore, Mondrian can be considered as the method with the most appealing properties for subsequent classification.  ( 2 min )
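For readers unfamiliar with the criterion: a table is $k$-anonymous if every combination of quasi-identifier values occurs in at least $k$ rows. The sketch below checks this and applies one simple generalisation step (bucketing ages into ranges). The column names, the bucket width, and the helper names are hypothetical, and this is not one of the algorithms evaluated in the paper.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k rows."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

def generalise_age(row, width=10):
    """Replace an exact age by a range such as '20-29' (generalisation)."""
    lo = (row["age"] // width) * width
    return {**row, "age": f"{lo}-{lo + width - 1}"}

# Usage on a tiny hypothetical table.
rows = [
    {"age": 23, "zip": "10115", "label": 0},
    {"age": 27, "zip": "10115", "label": 1},
    {"age": 31, "zip": "10115", "label": 0},
    {"age": 38, "zip": "10115", "label": 1},
]
raw_ok = is_k_anonymous(rows, ["age", "zip"], k=2)
generalised = [generalise_age(r) for r in rows]
generalised_ok = is_k_anonymous(generalised, ["age", "zip"], k=2)
```

Generalisation coarsens the very features a classifier learns from, which is why a stricter $k$ tends to cost accuracy.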
    Kernel Clustering with Sigmoid-based Regularization for Efficient Segmentation of Sequential Data. (arXiv:2106.11541v2 [cs.LG] UPDATED)
    Kernel segmentation aims at partitioning a data sequence into several non-overlapping segments that may have nonlinear and complex structures. In general, it is formulated as a discrete optimization problem with combinatorial constraints. A popular algorithm for optimally solving this problem is dynamic programming (DP), which has quadratic computation and memory requirements. Because sequences in practice are often very long, this algorithm is not a practical approach. Although many heuristic algorithms have been proposed to approximate the optimal segmentation, they have no guarantee on the quality of their solutions. In this paper, we take a differentiable approach to alleviate the aforementioned issues. First, we introduce a novel sigmoid-based regularization to smoothly approximate the combinatorial constraints. Combining it with the objective of balanced kernel clustering, we formulate a differentiable model termed Kernel clustering with sigmoid-based regularization (KCSR), where a gradient-based algorithm can be exploited to obtain the optimal segmentation. Second, we develop a stochastic variant of the proposed model. By using the stochastic gradient descent algorithm, which has much lower time and space complexities, for optimization, the second model can perform segmentation on overlong data sequences. Finally, for simultaneously segmenting multiple data sequences, we slightly modify the sigmoid-based regularization to further introduce an extended variant of the proposed model. Through extensive experiments on various types of data sequences, the performance of our models is evaluated and compared with that of existing methods. The experimental results validate the advantages of the proposed models. Our Matlab source code is available on GitHub.  ( 3 min )
    Decoupled Dynamic Spatial-Temporal Graph Neural Network for Traffic Forecasting. (arXiv:2206.09112v2 [cs.LG] UPDATED)
    We all depend on mobility, and vehicular transportation affects the daily lives of most of us. Thus, the ability to forecast the state of traffic in a road network is an important functionality and a challenging task. Traffic data is often obtained from sensors deployed in a road network. Recent proposals on spatial-temporal graph neural networks have achieved great progress at modeling complex spatial-temporal correlations in traffic data, by modeling traffic data as a diffusion process. However, intuitively, traffic data encompasses two different kinds of hidden time series signals, namely the diffusion signals and inherent signals. Unfortunately, nearly all previous works coarsely consider traffic signals entirely as the outcome of the diffusion, while neglecting the inherent signals, which impacts model performance negatively. To improve modeling performance, we propose a novel Decoupled Spatial-Temporal Framework (DSTF) that separates the diffusion and inherent traffic information in a data-driven manner, which encompasses a unique estimation gate and a residual decomposition mechanism. The separated signals can be handled subsequently by the diffusion and inherent modules separately. Further, we propose an instantiation of DSTF, Decoupled Dynamic Spatial-Temporal Graph Neural Network (D2STGNN), that captures spatial-temporal correlations and also features a dynamic graph learning module that targets the learning of the dynamic characteristics of traffic networks. Extensive experiments with four real-world traffic datasets demonstrate that the framework is capable of advancing the state-of-the-art.  ( 3 min )
    MedFilter: Improving Extraction of Task-relevant Utterances from Doctor-Patient Conversations through Integration of Discourse Structure and Ontological Knowledge. (arXiv:2010.02246v3 [cs.CL] UPDATED)
    Information extraction from conversational data is particularly challenging because the task-centric nature of conversation allows for effective communication of implicit information by humans, but is challenging for machines. The challenges may differ between utterances depending on the role of the speaker within the conversation, especially when relevant expertise is distributed asymmetrically across roles. Further, the challenges may also increase over the conversation as more shared context is built up through information communicated implicitly earlier in the dialogue. In this paper, we propose the novel modeling approach MedFilter, which addresses these insights in order to increase performance at identifying and categorizing task-relevant utterances, and in so doing, positively impacts performance at a downstream information extraction task. We evaluate this approach on a corpus of nearly 7,000 doctor-patient conversations where MedFilter is used to identify medically relevant contributions to the discussion (achieving a 10% improvement over SOTA baselines in terms of area under the PR curve). Identifying task-relevant utterances benefits downstream medical processing, achieving improvements of 15%, 105%, and 23% respectively for the extraction of symptoms, medications, and complaints.  ( 2 min )
    Beyond RMSE: Do machine-learned models of road user interaction produce human-like behavior? (arXiv:2206.11110v1 [cs.LG])
    Autonomous vehicles use a variety of sensors and machine-learned models to predict the behavior of surrounding road users. Most of the machine-learned models in the literature focus on quantitative error metrics like the root mean square error (RMSE) to learn and report their models' capabilities. This focus on quantitative error metrics tends to ignore the more important behavioral aspect of the models, raising the question of whether these models really predict human-like behavior. Thus, we propose to analyze the output of machine-learned models much like we would analyze human data in conventional behavioral research. We introduce quantitative metrics to demonstrate the presence of three different behavioral phenomena in a naturalistic highway driving dataset: 1) the kinematics-dependence of who passes a merging point first; 2) lane changes by an on-highway vehicle to accommodate an on-ramp vehicle; 3) lane changes by vehicles on the highway to avoid lead vehicle conflicts. Then, we analyze the behavior of three machine-learned models using the same metrics. Even though the models' RMSE values differed, all the models captured the kinematic-dependent merging behavior but struggled to varying degrees to capture the more nuanced courtesy lane change and highway lane change behavior. Additionally, the collision aversion analysis during lane changes showed that the models struggled to capture the physical aspect of human driving: leaving an adequate gap between the vehicles. Thus, our analysis highlighted the inadequacy of simple quantitative metrics and the need to take a broader behavioral perspective when analyzing machine-learned models of human driving predictions.
    Minimizing Control for Credit Assignment with Strong Feedback. (arXiv:2204.07249v2 [cs.NE] UPDATED)
    The success of deep learning ignited interest in whether the brain learns hierarchical representations using gradient-based learning. However, current biologically plausible methods for gradient-based credit assignment in deep neural networks need infinitesimally small feedback signals, which is problematic in biologically realistic noisy environments and at odds with experimental evidence in neuroscience showing that top-down feedback can significantly influence neural activity. Building upon deep feedback control (DFC), a recently proposed credit assignment method, we combine strong feedback influences on neural activity with gradient-based learning and show that this naturally leads to a novel view on neural network optimization. Instead of gradually changing the network weights towards configurations with low output loss, weight updates gradually minimize the amount of feedback required from a controller that drives the network to the supervised output label. Moreover, we show that the use of strong feedback in DFC allows learning forward and feedback connections simultaneously, using learning rules fully local in space and time. We complement our theoretical results with experiments on standard computer-vision benchmarks, showing competitive performance to backpropagation as well as robustness to noise. Overall, our work presents a fundamentally novel view of learning as control minimization, while sidestepping biologically unrealistic assumptions.
    A Novel Three-Dimensional Navigation Method for the Visually Impaired. (arXiv:2206.11136v1 [cs.HC])
    According to the World Health Organization, visual impairment is estimated to affect approximately 2.2 billion people worldwide. The visually impaired must currently rely on navigational aids to replace their sense of sight, like a white cane or GPS (Global Positioning System) based navigation, both of which fail to work well indoors. The white cane cannot be used to determine a user's position within a room, while GPS can often lose connection indoors and does not provide orientation information, making both approaches unsuitable for indoor use. Therefore, this research seeks to develop a 3D-imaging solution that enables contactless navigation through a complex indoor environment. The device can pinpoint a user's position and orientation with 31% less error compared to previous approaches while requiring only 53.1% of the memory, and processing 125% faster. The device can also detect obstacles with 60.2% more accuracy than the previous state-of-the-art models while requiring only 41% of the memory and processing 260% faster. When testing with human participants, the device allows for a 94.5% reduction in collisions with obstacles in the environment and allows for a 48.3% increase in walking speed, showing that my device enables safer and more rapid navigation for the visually impaired. All in all, this research demonstrates a 3D-based navigation system for the visually impaired. The approach can be used by a wide variety of mobile low-power devices, like cell phones, ensuring this research remains accessible to all.  ( 2 min )
    Optimal transport meets noisy label robust loss and MixUp regularization for domain adaptation. (arXiv:2206.11180v1 [cs.CV])
    It is common in computer vision to be confronted with domain shift: images which have the same class but different acquisition conditions. In domain adaptation (DA), one wants to classify unlabeled target images using source labeled images. Unfortunately, deep neural networks trained on a source training set perform poorly on target images which do not belong to the training domain. One strategy to improve this performance is to align the source and target image distributions in an embedded space using optimal transport (OT). However, OT can cause negative transfer, i.e. aligning samples with different labels, which leads to overfitting, especially in the presence of label shift between domains. In this work, we mitigate negative alignment by explaining it as a noisy label assignment to target images. We then mitigate its effect by appropriate regularization. We propose to couple the MixUp regularization (Zhang et al., 2018) with a loss that is robust to noisy labels in order to improve domain adaptation performance. We show in an extensive ablation study that a combination of the two techniques is critical to achieve improved performance. Finally, we evaluate our method, called mixunbot, on several benchmarks and real-world DA problems.  ( 2 min )
    VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives. (arXiv:2206.11212v1 [cs.CV])
    Many past works aim to improve visual reasoning in models by supervising feature importance (estimated by model explanation techniques) with human annotations such as highlights of important image regions. However, recent work has shown that performance gains from feature importance (FI) supervision for Visual Question Answering (VQA) tasks persist even with random supervision, suggesting that these methods do not meaningfully align model FI with human FI. In this paper, we show that model FI supervision can meaningfully improve VQA model accuracy as well as performance on several Right-for-the-Right-Reason (RRR) metrics by optimizing for four key model objectives: (1) accurate predictions given limited but sufficient information (Sufficiency); (2) max-entropy predictions given no important information (Uncertainty); (3) invariance of predictions to changes in unimportant features (Invariance); and (4) alignment between model FI explanations and human FI explanations (Plausibility). Our best performing method, Visual Feature Importance Supervision (VisFIS), outperforms strong baselines on benchmark VQA datasets in terms of both in-distribution and out-of-distribution accuracy. While past work suggests that the mechanism for improved accuracy is through improved explanation plausibility, we show that this relationship depends crucially on explanation faithfulness (whether explanations truly represent the model's internal reasoning). Predictions are more accurate when explanations are plausible and faithful, and not when they are plausible but not faithful. Lastly, we show that, surprisingly, RRR metrics are not predictive of out-of-distribution model accuracy when controlling for a model's in-distribution accuracy, which calls into question the value of these metrics for evaluating model reasoning. All supporting code is available at https://github.com/zfying/visfis  ( 3 min )
    On the Role of Spatial, Spectral, and Temporal Processing for DNN-based Non-linear Multi-channel Speech Enhancement. (arXiv:2206.11181v1 [eess.AS])
    Employing deep neural networks (DNNs) to directly learn filters for multi-channel speech enhancement has potentially two key advantages over a traditional approach combining a linear spatial filter with an independent tempo-spectral post-filter: 1) non-linear spatial filtering makes it possible to overcome potential restrictions originating from a linear processing model and 2) joint processing of spatial and tempo-spectral information makes it possible to exploit interdependencies between different sources of information. A variety of DNN-based non-linear filters have been proposed recently, for which good enhancement performance is reported. However, little is known about the internal mechanisms, which turns network architecture design into a game of chance. Therefore, in this paper, we perform experiments to better understand the internal processing of spatial, spectral and temporal information by DNN-based non-linear filters. On the one hand, our experiments in a difficult speech extraction scenario confirm the importance of non-linear spatial filtering, which outperforms an oracle linear spatial filter by 0.24 POLQA score. On the other hand, we demonstrate that joint processing results in a large performance gap of 0.4 POLQA score between network architectures exploiting spectral versus temporal information besides spatial information.  ( 2 min )
    Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. (arXiv:2203.05482v2 [cs.LG] UPDATED)
    The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs -- we call the results "model soups." When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. The resulting ViT-G model, which attains 90.94% top-1 accuracy on ImageNet, achieved a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to flatness of the loss and confidence of the predictions, and validate this relation empirically. Code is available at https://github.com/mlfoundations/model-soups.  ( 3 min )
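The weight-averaging idea in the abstract above can be illustrated with a minimal sketch. It assumes the simplest variant, a uniform average over models that share one architecture; the function name `uniform_soup` and the toy parameter dictionaries are hypothetical and are not taken from the paper's code. Real checkpoints would hold tensors rather than Python lists.

```python
def uniform_soup(state_dicts):
    """Average each parameter across models with identical keys and shapes."""
    if not state_dicts:
        raise ValueError("need at least one model")
    soup = {}
    for key in state_dicts[0]:
        # Group the i-th entry of this parameter from every model together.
        columns = zip(*(sd[key] for sd in state_dicts))
        soup[key] = [sum(values) / len(state_dicts) for values in columns]
    return soup


# Three hypothetical fine-tuned models, each with a weight vector and a bias.
models = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
    {"w": [5.0, 6.0], "b": [2.0]},
]
soup = uniform_soup(models)
print(soup)
```

The result is a single set of parameters, so inference costs the same as for one model, whereas a logit ensemble would need a forward pass per member.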
    Then and Now: Quantifying the Longitudinal Validity of Self-Disclosed Depression Diagnoses. (arXiv:2206.11155v1 [cs.LG])
    Self-disclosed mental health diagnoses, which serve as ground truth annotations of mental health status in the absence of clinical measures, underpin the conclusions behind most computational studies of mental health language from the last decade. However, psychiatric conditions are dynamic; a prior depression diagnosis may no longer be indicative of an individual's mental health, either due to treatment or other mitigating factors. We ask: to what extent are self-disclosures of mental health diagnoses actually relevant over time? We analyze recent activity from individuals who disclosed a depression diagnosis on social media over five years ago and, in turn, acquire a new understanding of how presentations of mental health status on social media manifest longitudinally. We also provide expanded evidence for the presence of personality-related biases in datasets curated using self-disclosed diagnoses. Our findings motivate three practical recommendations for improving mental health datasets curated using self-disclosed diagnoses: 1) Annotate diagnosis dates and psychiatric comorbidities; 2) Sample control groups using propensity score matching; 3) Identify and remove spurious correlations introduced by selection bias.  ( 2 min )
    Contextual Semantic Embeddings for Ontology Subsumption Prediction. (arXiv:2112.10006v4 [cs.LG] UPDATED)
    Automating ontology construction and curation is an important but challenging task in knowledge engineering and artificial intelligence. Prediction by machine learning techniques such as contextual semantic embedding is a promising direction, but the relevant research is still preliminary, especially for expressive ontologies in the Web Ontology Language (OWL). In this paper, we present a new subsumption prediction method named BERTSubs for classes of OWL ontologies. It exploits the pre-trained language model BERT to compute contextual embeddings of a class, where customized templates are proposed to incorporate the class context (e.g., neighbouring classes) and the logical existential restriction. BERTSubs is quite general, being able to predict multiple kinds of subsumers including named classes and existential restrictions from the same ontology or another ontology. Extensive evaluation on five real-world ontologies for three different subsumption tasks has shown the effectiveness of the templates and that BERTSubs can dramatically outperform the baselines that use (literal-aware) knowledge graph embeddings, non-contextual word embeddings and the state-of-the-art OWL ontology embeddings.
    Transformer Neural Networks Attending to Both Sequence and Structure for Protein Prediction Tasks. (arXiv:2206.11057v1 [cs.LG])
    The increasing number of protein sequences decoded from genomes is opening up new avenues of research on linking protein sequence to function with transformer neural networks. Recent research has shown that the number of known protein sequences supports learning useful, task-agnostic sequence representations via transformers. In this paper, we posit that learning joint sequence-structure representations yields better representations for function-related prediction tasks. We propose a transformer neural network that attends to both sequence and tertiary structure. We show that such joint representations are more powerful than sequence-based representations only, and they yield better performance on superfamily membership across various metrics.  ( 2 min )
    Explanation-based Counterfactual Retraining (XCR): A Calibration Method for Black-box Models. (arXiv:2206.11126v1 [cs.LG])
    With the rapid development of eXplainable Artificial Intelligence (XAI), a long line of past work has raised concerns about the Out-of-Distribution (OOD) problem in perturbation-based post-hoc XAI models and about explanations being socially misaligned. We explore the limitations of post-hoc explanation methods that use approximators to mimic the behavior of black-box models. Then we propose eXplanation-based Counterfactual Retraining (XCR), which extracts feature importance quickly. XCR applies the explanations generated by the XAI model as counterfactual input to retrain the black-box model to address OOD and social misalignment problems. Evaluation on popular image datasets shows that XCR can improve model performance when only retaining 12.5% of the most crucial features without changing the black-box model structure. Furthermore, evaluation on corruption benchmark datasets shows that XCR is very helpful for improving model robustness and positively impacts the calibration of OOD problems. Even though XCR is not calibrated on the validation set, as some OOD calibration methods are, it outperforms existing methods on the corrupted-data metric. Our method also beats current OOD calibration methods on the OOD calibration metric if calibration on the validation set is applied.  ( 2 min )
    KSD Aggregated Goodness-of-fit Test. (arXiv:2202.00824v3 [stat.ML] UPDATED)
    We investigate properties of goodness-of-fit tests based on the Kernel Stein Discrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg, which aggregates multiple tests with different kernels. KSDAgg avoids splitting the data to perform kernel selection (which leads to a loss in test power), and rather maximises the test power over a collection of kernels. We provide theoretical guarantees on the power of KSDAgg: we show it achieves the smallest uniform separation rate of the collection, up to a logarithmic term. KSDAgg can be computed exactly in practice as it relies either on a parametric bootstrap or on a wild bootstrap to estimate the quantiles and the level corrections. In particular, for the crucial choice of bandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such as median or standard deviation) or to data splitting. We find on both synthetic and real-world data that KSDAgg outperforms other state-of-the-art adaptive KSD-based goodness-of-fit testing procedures.  ( 2 min )
    Noisy $\ell^{0}$-Sparse Subspace Clustering on Dimensionality Reduced Data. (arXiv:2206.11079v1 [stat.ML])
    Sparse subspace clustering methods with sparsity induced by $\ell^{0}$-norm, such as $\ell^{0}$-Sparse Subspace Clustering ($\ell^{0}$-SSC) \citep{YangFJYH16-L0SSC-ijcv}, are demonstrated to be more effective than their $\ell^{1}$ counterparts such as Sparse Subspace Clustering (SSC) \citep{ElhamifarV13}. However, the theoretical analysis of $\ell^{0}$-SSC is restricted to clean data that lie exactly in subspaces. Real data often suffer from noise and they may lie close to subspaces. In this paper, we show that an optimal solution to the optimization problem of noisy $\ell^{0}$-SSC achieves subspace detection property (SDP), a key element with which data from different subspaces are separated, under deterministic and semi-random models. Our results provide a theoretical guarantee on the correctness of noisy $\ell^{0}$-SSC in terms of SDP on noisy data for the first time, which reveals the advantage of noisy $\ell^{0}$-SSC in terms of a much less restrictive condition on subspace affinity. In order to improve the efficiency of noisy $\ell^{0}$-SSC, we propose Noisy-DR-$\ell^{0}$-SSC which provably recovers the subspaces on dimensionality reduced data. Noisy-DR-$\ell^{0}$-SSC first projects the data onto a lower dimensional space by random projection, then performs noisy $\ell^{0}$-SSC on the projected data for improved efficiency. Experimental results demonstrate the effectiveness of Noisy-DR-$\ell^{0}$-SSC.  ( 2 min )
    Least Squares Estimation Using Sketched Data with Heteroskedastic Errors. (arXiv:2007.07781v3 [stat.ML] UPDATED)
    Researchers may perform regressions using a sketch of data of size $m$ instead of the full sample of size $n$ for a variety of reasons. This paper considers the case when the regression errors do not have constant variance and heteroskedasticity robust standard errors would normally be needed for test statistics to provide accurate inference. We show that estimates using data sketched by random projections will behave `as if' the errors were homoskedastic. Estimation by random sampling would not have this property. The result arises because the sketched estimates in the case of random projections can be expressed as degenerate $U$-statistics, and under certain conditions, these statistics are asymptotically normal with homoskedastic variance. We verify that the conditions hold not only in the case of least squares regression when the covariates are exogenous, but also in instrumental variables estimation when the covariates are endogenous. The result implies that inference, including first-stage F tests for instrument relevance, can be simpler than the full sample case if the sketching scheme is appropriately chosen.  ( 2 min )
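To make the mechanics of sketching by random projection concrete, here is a small sketch under stated assumptions: a Gaussian sketching matrix with i.i.d. N(0, 1/m) entries (one common choice, not necessarily the scheme studied in the paper), a single covariate, and no intercept. The helper names `gaussian_sketch` and `slope_no_intercept` and the simulated data are hypothetical. It only shows how the sketched estimate is formed and does not reproduce the paper's inference results.

```python
import random


def gaussian_sketch(rows, m, rng):
    """Return S @ rows, where S is an m x n matrix of i.i.d. N(0, 1/m) draws."""
    width = len(rows[0])
    sketched = []
    for _ in range(m):
        # One row of S; each sketched row mixes all n original observations.
        weights = [rng.gauss(0.0, 1.0) / m ** 0.5 for _ in rows]
        sketched.append(
            [sum(w * row[j] for w, row in zip(weights, rows)) for j in range(width)]
        )
    return sketched


def slope_no_intercept(pairs):
    """Least squares slope for y = beta * x, given rows of the form [x, y]."""
    sxy = sum(x * y for x, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    return sxy / sxx


rng = random.Random(0)
n, m = 2000, 200
data = []
for _ in range(n):
    x = rng.uniform(1.0, 3.0)
    noise = rng.gauss(0.0, 0.2 * x)  # error spread grows with x (heteroskedastic)
    data.append([x, 2.0 * x + noise])

full_beta = slope_no_intercept(data)
sketch_beta = slope_no_intercept(gaussian_sketch(data, m, rng))
print(full_beta, sketch_beta)
```

Both estimates target the same slope; the sketched one uses only m mixed rows instead of the n original observations.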
    Algorithms that get old: the case of generative deep neural networks. (arXiv:2202.03008v2 [stat.ML] UPDATED)
    Generative deep neural networks used in machine learning, like Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs), produce new objects each time they are asked to do so, with the constraint that the new objects remain similar to some list of examples given as input. However, this behavior is unlike that of human artists, who change their style as time goes by and seldom return to their initial creations. We investigate a situation where VAEs are used to sample from a probability measure described by some empirical dataset. Based on recent works on Radon-Sobolev statistical distances, we propose a numerical paradigm, to be used in conjunction with a generative algorithm, that satisfies the following two requirements: the objects created do not repeat and evolve to fill the entire target probability measure.
    Large-scale Stochastic Optimization of NDCG Surrogates for Deep Learning with Provable Convergence. (arXiv:2202.12183v3 [cs.LG] UPDATED)
    NDCG, namely Normalized Discounted Cumulative Gain, is a widely used ranking metric in information retrieval and machine learning. However, efficient and provable stochastic methods for maximizing NDCG are still lacking, especially for deep models. In this paper, we propose a principled approach to optimize NDCG and its top-$K$ variant. First, we formulate a novel compositional optimization problem for optimizing the NDCG surrogate, and a novel bilevel compositional optimization problem for optimizing the top-$K$ NDCG surrogate. Then, we develop efficient stochastic algorithms with provable convergence guarantees for the non-convex objectives. Unlike existing NDCG optimization methods, the per-iteration complexity of our algorithms scales with the mini-batch size instead of the total number of items. To improve the effectiveness for deep learning, we further propose practical strategies using an initial warm-up and a stop-gradient operator. Experimental results on multiple datasets demonstrate that our methods outperform prior ranking approaches in terms of NDCG. To the best of our knowledge, this is the first time that stochastic algorithms are proposed to optimize NDCG with a provable convergence guarantee. Our proposed methods are implemented in the LibAUC library at https://libauc.org/.
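For readers unfamiliar with the metric itself, the following sketch computes plain top-K NDCG for one query. It assumes the common exponential gain 2^rel - 1 and a log2 position discount; the function names are hypothetical. This is the non-differentiable evaluation metric, not the surrogate objectives or stochastic algorithms proposed in the paper.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a list already in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based, so position = rank + 1
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(scores, relevances, k):
    """NDCG@k of the ranking induced by sorting items by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


relevances = [3, 0, 1]
print(ndcg_at_k([0.9, 0.1, 0.5], relevances, k=3))  # ranking matches the ideal order
print(ndcg_at_k([0.1, 0.9, 0.5], relevances, k=3))  # most relevant item ranked last
```

Because the metric depends on a sort, it is piecewise constant in the scores, which is why smooth surrogates are needed for gradient-based training.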
    Ordered Subgraph Aggregation Networks. (arXiv:2206.11168v1 [cs.LG])
    Numerous subgraph-enhanced graph neural networks (GNNs) have emerged recently, provably boosting the expressive power of standard (message-passing) GNNs. However, there is a limited understanding of how these approaches relate to each other and to the Weisfeiler--Leman hierarchy. Moreover, current approaches either use all subgraphs of a given size, sample them uniformly at random, or use hand-crafted heuristics instead of learning to select subgraphs in a data-driven manner. Here, we offer a unified way to study such architectures by introducing a theoretical framework and extending the known expressivity results of subgraph-enhanced GNNs. Concretely, we show that increasing subgraph size always increases the expressive power and develop a better understanding of their limitations by relating them to the established $k\text{-}\mathsf{WL}$ hierarchy. In addition, we explore different approaches for learning to sample subgraphs using recent methods for backpropagating through complex discrete probability distributions. Empirically, we study the predictive performance of different subgraph-enhanced GNNs, showing that our data-driven architectures increase prediction accuracy on standard benchmark datasets compared to non-data-driven subgraph-enhanced graph neural networks while reducing computation time.  ( 2 min )
    SMT-DTA: Improving Drug-Target Affinity Prediction with Semi-supervised Multi-task Training. (arXiv:2206.09818v2 [q-bio.BM] UPDATED)
    Drug-Target Affinity (DTA) prediction is an essential task for drug discovery and pharmaceutical research. Accurate predictions of DTA can greatly benefit the design of new drugs. As wet experiments are costly and time consuming, the supervised data for DTA prediction is extremely limited. This seriously hinders the application of deep learning based methods, which require large-scale supervised data. To address this challenge and improve the DTA prediction accuracy, we propose a framework with several simple yet effective strategies in this work: (1) a multi-task training strategy, which combines DTA prediction with the masked language modeling (MLM) task on the paired drug-target dataset; (2) a semi-supervised training method to empower the drug and target representation learning by leveraging large-scale unpaired molecules and proteins in training, which differs from previous pre-training and fine-tuning methods that only utilize molecules or proteins in pre-training; and (3) a cross-attention module to enhance the interaction between drug and target representation. Extensive experiments are conducted on three real-world benchmark datasets: BindingDB, DAVIS and KIBA. The results show that our framework significantly outperforms existing methods and achieves state-of-the-art performance, e.g., $0.712$ RMSE on BindingDB IC$_{50}$ measurement, a more than $5\%$ improvement over the previous best work. In addition, case studies on specific drug-target binding activities, drug feature visualizations, and real-world applications demonstrate the great potential of our work. The code and data are released at https://github.com/QizhiPei/SMT-DTA  ( 3 min )
    Fast Aquatic Swimmer Optimization with Differentiable Projective Dynamics and Neural Network Hydrodynamic Models. (arXiv:2204.12584v2 [cs.RO] UPDATED)
    Aquatic locomotion is a classic fluid-structure interaction (FSI) problem of interest to biologists and engineers. Solving the fully coupled FSI equations for incompressible Navier-Stokes and finite elasticity is computationally expensive. Optimizing robotic swimmer design within such a system generally involves cumbersome, gradient-free procedures on top of the already costly simulation. To address this challenge we present a novel, fully differentiable hybrid approach to FSI that combines a 2D direct numerical simulation for the deformable solid structure of the swimmer and a physics-constrained neural network surrogate to capture hydrodynamic effects of the fluid. For the deformable solid simulation of the swimmer's body, we use state-of-the-art techniques from the field of computer graphics to speed up the finite-element method (FEM). For the fluid simulation, we use a U-Net architecture trained with a physics-based loss function to predict the flow field at each time step. The pressure and velocity field outputs from the neural network are sampled around the boundary of our swimmer using an immersed boundary method (IBM) to compute its swimming motion accurately and efficiently. We demonstrate the computational efficiency and differentiability of our hybrid simulator on a 2D carangiform swimmer. Due to differentiability, the simulator can be used for computational design of controls for soft bodies immersed in fluids via direct gradient-based optimization.  ( 3 min )
    Federated Adaptation of Reservoirs via Intrinsic Plasticity. (arXiv:2206.11087v1 [cs.NE])
    We propose a novel algorithm for performing federated learning with Echo State Networks (ESNs) in a client-server scenario. In particular, our proposal focuses on the adaptation of reservoirs by combining Intrinsic Plasticity with Federated Averaging. The former is a gradient-based method for adapting the reservoir's non-linearity in a local and unsupervised manner, while the latter provides the framework for learning in the federated scenario. We evaluate our approach on real-world datasets from human monitoring, in comparison with the previous approach for federated ESNs in the literature. Results show that adapting the reservoir with our algorithm provides a significant improvement in the performance of the global model.  ( 2 min )
    RetrievalGuard: Provably Robust 1-Nearest Neighbor Image Retrieval. (arXiv:2206.11225v1 [cs.IR])
    Recent research works have shown that image retrieval models are vulnerable to adversarial attacks, where slightly modified test inputs could lead to problematic retrieval results. In this paper, we aim to design a provably robust image retrieval model which keeps the most important evaluation metric Recall@1 invariant to adversarial perturbation. We propose the first 1-nearest neighbor (NN) image retrieval algorithm, RetrievalGuard, which is provably robust against adversarial perturbations within an $\ell_2$ ball of calculable radius. The challenge is to design a provably robust algorithm that takes into consideration the 1-NN search and the high-dimensional nature of the embedding space. Algorithmically, given a base retrieval model and a query sample, we build a smoothed retrieval model by carefully analyzing the 1-NN search procedure in the high-dimensional embedding space. We show that the smoothed retrieval model has bounded Lipschitz constant and thus the retrieval score is invariant to $\ell_2$ adversarial perturbations. Experiments on image retrieval tasks validate the robustness of our RetrievalGuard method.  ( 2 min )
    Cold Posteriors through PAC-Bayes. (arXiv:2206.11173v1 [cs.LG])
    We investigate the cold posterior effect through the lens of PAC-Bayes generalization bounds. We argue that in the non-asymptotic setting, when the number of training samples is (relatively) small, discussions of the cold posterior effect should take into account that approximate Bayesian inference does not readily provide guarantees of performance on out-of-sample data. Instead, out-of-sample error is better described through a generalization bound. In this context, we explore the connections between the ELBO objective from variational inference and the PAC-Bayes objectives. We note that, while the ELBO and PAC-Bayes objectives are similar, the latter objectives naturally contain a temperature parameter $\lambda$ which is not restricted to be $\lambda=1$. For both regression and classification tasks, in the case of isotropic Laplace approximations to the posterior, we show how this PAC-Bayesian interpretation of the temperature parameter captures the cold posterior effect.  ( 2 min )
    Multi-Modality Image Super-Resolution using Generative Adversarial Networks. (arXiv:2206.09193v2 [eess.IV] UPDATED)
    Over the past few years, deep learning-based techniques such as Generative Adversarial Networks (GANs) have significantly improved solutions to image super-resolution and image-to-image translation problems. In this paper, we propose a solution to the joint problem of image super-resolution and multi-modality image-to-image translation. The problem can be stated as the recovery of a high-resolution image in a modality, given a low-resolution observation of the same image in an alternative modality. Our paper offers two models to address this problem, which will be evaluated on the recovery of high-resolution day images given low-resolution night images of the same scene. Promising qualitative and quantitative results will be presented for each model.  ( 2 min )
    Concentration inequalities and optimal number of layers for stochastic deep neural networks. (arXiv:2206.11241v1 [cs.LG])
    We state concentration and martingale inequalities for the output of the hidden layers of a stochastic deep neural network (SDNN), as well as for the output of the whole SDNN. These results allow us to introduce an expected classifier (EC), and to give a probabilistic upper bound for the classification error of the EC. We also state the optimal number of layers for the SDNN via an optimal stopping procedure. We apply our analysis to a stochastic version of a feedforward neural network with ReLU activation function.  ( 2 min )
    tntorch: Tensor Network Learning with PyTorch. (arXiv:2206.11128v1 [cs.LG])
    We present tntorch, a tensor learning framework that supports multiple decompositions (including Candecomp/Parafac, Tucker, and Tensor Train) under a unified interface. With our library, the user can learn and handle low-rank tensors with automatic differentiation, seamless GPU support, and the convenience of PyTorch's API. Besides decomposition algorithms, tntorch implements differentiable tensor algebra, rank truncation, cross-approximation, batch processing, comprehensive tensor arithmetics, and more.  ( 2 min )
    Neural Inverse Transform Sampler. (arXiv:2206.11172v1 [cs.LG])
    Any explicit functional representation $f$ of a density is hampered by two main obstacles when we wish to use it as a generative model: designing $f$ so that sampling is fast, and estimating $Z = \int f$ so that $Z^{-1}f$ integrates to 1. This becomes increasingly complicated as $f$ itself becomes complicated. In this paper, we show that when modeling one-dimensional conditional densities with a neural network, $Z$ can be exactly and efficiently computed by letting the network represent the cumulative distribution function of a target density, and applying a generalized fundamental theorem of calculus. We also derive a fast algorithm for sampling from the resulting representation by the inverse transform method. By extending these principles to higher dimensions, we introduce the Neural Inverse Transform Sampler (NITS), a novel deep learning framework for modeling and sampling from general, multidimensional, compactly-supported probability densities. NITS is a highly expressive density estimator that boasts end-to-end differentiability, fast sampling, and exact and cheap likelihood evaluation. We demonstrate the applicability of NITS by applying it to realistic, high-dimensional density estimation tasks: likelihood-based generative modeling on the CIFAR-10 dataset, and density estimation on the UCI suite of benchmark datasets, where NITS produces compelling results rivaling or surpassing the state of the art.  ( 2 min )
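    The inverse transform method mentioned in this abstract can be illustrated with a minimal sketch. It is not the NITS implementation: the toy density f(x) = 2x on [0, 1], the bisection-based inversion, and the names `cdf`, `inverse_cdf` and `sample` are all assumptions chosen for illustration.

```python
import random


def cdf(x):
    """CDF of the toy density f(x) = 2x on [0, 1] (illustrative assumption)."""
    return x * x


def inverse_cdf(u, lo=0.0, hi=1.0, tol=1e-10):
    """Invert the monotone CDF on [lo, hi] by bisection."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def sample(n, seed=0):
    """Draw n samples by pushing uniform draws through the inverse CDF."""
    rng = random.Random(seed)
    return [inverse_cdf(rng.random()) for _ in range(n)]


draws = sample(20000)
sample_mean = sum(draws) / len(draws)  # the exact mean of f is 2/3
```

    Because the CDF is normalized by construction (it runs from 0 to 1), no separate estimate of the normalizing constant is needed in this toy setting.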
    reStructured Pre-training. (arXiv:2206.11147v1 [cs.CL])
    In this work, we try to decipher the internal connection of NLP technology development in the past decades, searching for essence, which rewards us with a (potential) new learning paradigm for NLP tasks, dubbed as reStructured Pre-training (RST). In such a paradigm, the role of data will be re-emphasized, and model pre-training and fine-tuning of downstream tasks are viewed as a process of data storing and accessing. Based on that, we operationalize the simple principle that a good storage mechanism should not only have the ability to cache a large amount of data but also consider the ease of access. We achieve this by pre-training models over restructured data that consist of a variety of valuable information instead of raw data after overcoming several engineering challenges. Experimentally, RST models not only surpass strong competitors (e.g., T0) on 52/55 popular datasets from a variety of NLP tasks, but also achieve superior performance in National College Entrance Examination - English (Gaokao-English), the most authoritative examination in China. Specifically, the proposed system Qin achieves 40 points higher than the average scores made by students and 15 points higher than GPT3 with 1/16 parameters. In particular, Qin gets a high score of 138.5 (the full mark is 150) in the 2018 English exam (national paper III). We have released the Gaokao Benchmark with an online submission platform. In addition, we test our model in the 2022 College Entrance Examination English that happened a few days ago (2022.06.08), and it gets a total score of 134 (vs. GPT3's 108).  ( 2 min )
    StaDRe and StaDRo: Reliability and Robustness Estimation of ML-based Forecasting using Statistical Distance Measures. (arXiv:2206.11116v1 [cs.LG])
    Reliability estimation of Machine Learning (ML) models is becoming a crucial subject. This is particularly the case when such models are deployed in safety-critical applications, as the decisions based on model predictions can result in hazardous situations. In this regard, recent research has proposed methods to achieve safe, dependable, and reliable ML systems. One such method consists of detecting and analyzing distributional shift, and then measuring how such systems respond to these shifts. This was proposed in earlier work in SafeML. This work focuses on the use of SafeML for time series data, and on reliability and robustness estimation of ML-forecasting methods using statistical distance measures. To this end, distance measures based on the Empirical Cumulative Distribution Function (ECDF) proposed in SafeML are explored to measure Statistical-Distance Dissimilarity (SDD) across time series. We then propose SDD-based Reliability Estimate (StaDRe) and SDD-based Robustness (StaDRo) measures. With the help of a clustering technique, the similarity between the statistical properties of data seen during training and the forecasts is identified. The proposed method is capable of providing a link between dataset SDD and Key Performance Indicators (KPIs) of the ML models.  ( 2 min )
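    As a rough illustration of an ECDF-based distance between two samples, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, a standard ECDF-based measure. Whether this particular distance matches the ones used for StaDRe and StaDRo is an assumption; the helper names `ecdf` and `ks_distance` are hypothetical.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical CDF of a sample as a function t -> P(X <= t)."""
    xs = sorted(sample)
    n = len(xs)

    def F(t):
        # Number of observations <= t, divided by the sample size.
        return bisect_right(xs, t) / n

    return F


def ks_distance(a, b):
    """Largest absolute gap between the two ECDFs (Kolmogorov-Smirnov statistic)."""
    Fa, Fb = ecdf(a), ecdf(b)
    # Both ECDFs are step functions, so the supremum is attained at a data point.
    return max(abs(Fa(t) - Fb(t)) for t in sorted(set(a) | set(b)))


train_window = [1, 2, 3, 4]      # toy data seen during training
forecast_window = [3, 4, 5, 6]   # toy forecast values
shift = ks_distance(train_window, forecast_window)
```

    A value near 0 suggests the two windows are distributed similarly, while a value near 1 suggests a strong distributional shift.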
    KeyCLD: Learning Constrained Lagrangian Dynamics in Keypoint Coordinates from Images. (arXiv:2206.11030v1 [cs.LG])
    We present KeyCLD, a framework to learn Lagrangian dynamics from images. Learned keypoints represent semantic landmarks in images and can directly represent state dynamics. Interpreting this state as Cartesian coordinates coupled with explicit holonomic constraints allows expressing the dynamics with a constrained Lagrangian. Our method explicitly models kinetic and potential energy, thus allowing energy based control. We are the first to demonstrate learning of Lagrangian dynamics from images on the dm_control pendulum, cartpole and acrobot environments. This is a step forward towards learning Lagrangian dynamics from real-world images, since previous work in the literature was only applied to minimalistic images with monochromatic shapes on empty backgrounds. Please refer to our project page for code and additional results: https://rdaems.github.io/keycld/  ( 2 min )
    A view of mini-batch SGD via generating functions: conditions of convergence, phase transitions, benefit from negative momenta. (arXiv:2206.11124v1 [cs.LG])
    Mini-batch SGD with momentum is a fundamental algorithm for learning large predictive models. In this paper we develop a new analytic framework to analyze mini-batch SGD for linear models at different momenta and sizes of batches. Our key idea is to describe the loss value sequence in terms of its generating function, which can be written in a compact form assuming a diagonal approximation for the second moments of model weights. By analyzing this generating function, we deduce various conclusions on the convergence conditions, phase structure of the model, and optimal learning settings. As a few examples, we show that 1) the optimization trajectory can generally switch from the "signal-dominated" to the "noise-dominated" phase, at a time scale that can be predicted analytically; 2) in the "signal-dominated" (but not the "noise-dominated") phase it is favorable to choose a large effective learning rate; however, its value must be limited for any finite batch size to avoid divergence; 3) optimal convergence rate can be achieved at a negative momentum. We verify our theoretical predictions by extensive experiments with MNIST and synthetic problems, and find a good quantitative agreement.  ( 2 min )
    Understanding and Extending Subgraph GNNs by Rethinking Their Symmetries. (arXiv:2206.11140v1 [cs.LG])
    Subgraph GNNs are a recent class of expressive Graph Neural Networks (GNNs) which model graphs as collections of subgraphs. So far, the design space of possible Subgraph GNN architectures as well as their basic theoretical properties are still largely unexplored. In this paper, we study the most prominent form of subgraph methods, which employs node-based subgraph selection policies such as ego-networks or node marking and deletion. We address two central questions: (1) What is the upper-bound of the expressive power of these methods? and (2) What is the family of equivariant message passing layers on these sets of subgraphs? Our first step in answering these questions is a novel symmetry analysis which shows that modelling the symmetries of node-based subgraph collections requires a significantly smaller symmetry group than the one adopted in previous works. This analysis is then used to establish a link between Subgraph GNNs and Invariant Graph Networks (IGNs). We answer the questions above by first bounding the expressive power of subgraph methods by 3-WL, and then proposing a general family of message-passing layers for subgraph methods that generalises all previous node-based Subgraph GNNs. Finally, we design a novel Subgraph GNN dubbed SUN, which theoretically unifies previous architectures while providing better empirical performance on multiple benchmarks.  ( 2 min )
    3D Instance Segmentation of MVS Buildings. (arXiv:2112.09902v2 [cs.CV] UPDATED)
    We present a novel 3D instance segmentation framework for Multi-View Stereo (MVS) buildings in urban scenes. Unlike existing works focusing on semantic segmentation of urban scenes, the emphasis of this work lies in detecting and segmenting 3D building instances even if they are attached and embedded in a large and imprecise 3D surface model. Multi-view RGB images are first enhanced to RGBH images by adding a heightmap and are segmented to obtain all roof instances using a fine-tuned 2D instance segmentation neural network. Instance masks from different multi-view images are then clustered into global masks. Our mask clustering accounts for spatial occlusion and overlapping, which can eliminate segmentation ambiguities among multi-view images. Based on these global masks, 3D roof instances are segmented out by mask back-projections and extended to the entire building instances through a Markov random field optimization. A new dataset that contains instance-level annotation for both 3D urban scenes (roofs and buildings) and drone images (roofs) is provided. To the best of our knowledge, it is the first outdoor dataset dedicated to 3D instance segmentation with many more annotations of attached 3D buildings than existing datasets. Quantitative evaluations and ablation studies have shown the effectiveness of all major steps and the advantages of our multi-view framework over the orthophoto-based method.  ( 3 min )
    Beyond No Regret: Instance-Dependent PAC Reinforcement Learning. (arXiv:2108.02717v2 [cs.LG] UPDATED)
    The theory of reinforcement learning has focused on two fundamental problems: achieving low regret, and identifying $\epsilon$-optimal policies. While a simple reduction allows one to apply a low-regret algorithm to obtain an $\epsilon$-optimal policy and achieve the worst-case optimal rate, it is unknown whether low-regret algorithms can obtain the instance-optimal rate for policy identification. We show this is not possible -- there exists a fundamental tradeoff between achieving low regret and identifying an $\epsilon$-optimal policy at the instance-optimal rate. Motivated by our negative finding, we propose a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning which explicitly accounts for the attainable state visitation distributions in the underlying MDP. We then propose and analyze a novel, planning-based algorithm which attains this sample complexity -- yielding a complexity which scales with the suboptimality gaps and the "reachability" of a state. We show our algorithm is nearly minimax optimal, and on several examples that our instance-dependent sample complexity offers significant improvements over worst-case bounds.  ( 2 min )
    Constant-Factor Approximation Algorithms for Socially Fair $k$-Clustering. (arXiv:2206.11210v1 [cs.DS])
    We study approximation algorithms for the socially fair $(\ell_p, k)$-clustering problem with $m$ groups, whose special cases include the socially fair $k$-median ($p=1$) and socially fair $k$-means ($p=2$) problems. We present (1) a polynomial-time $(5+2\sqrt{6})^p$-approximation with at most $k+m$ centers, (2) a $(5+2\sqrt{6}+\epsilon)^p$-approximation with $k$ centers in time $n^{2^{O(p)}\cdot m^2}$, and (3) a $(15+6\sqrt{6})^p$-approximation with $k$ centers in time $k^{m}\cdot\text{poly}(n)$. The first result is obtained via a refinement of the iterative rounding method using a sequence of linear programs. The latter two results are obtained by converting a solution with up to $k+m$ centers to one with $k$ centers using sparsification methods for (2) and via an exhaustive search for (3). We also compare the performance of our algorithms with existing bicriteria algorithms as well as exactly $k$ center approximation algorithms on benchmark datasets, and find that our algorithms also outperform existing methods in practice.  ( 2 min )
    Generic E-Variables for Exact Sequential k-Sample Tests that allow for Optional Stopping. (arXiv:2106.02693v3 [stat.ME] UPDATED)
    We develop E-variables for testing whether two or more data streams come from the same source or not, and more generally, whether the difference between the sources is larger than some minimal effect size. These E-variables lead to exact, nonasymptotic tests that remain safe, i.e. keep their type-I error guarantees, under flexible sampling scenarios such as optional stopping and continuation. In special cases our E-variables also have an optimal 'growth' property under the alternative. While the construction is generic, we illustrate it through the special case of k x 2 contingency tables, where we also allow for the incorporation of different restrictions on a composite alternative. Comparisons to p-value analysis in simulations and a real-world example show that E-variables, through their flexibility, often allow for early stopping of data collection, thereby retaining similar power as classical methods, while also retaining the option of extending or combining data afterwards.  ( 2 min )
    Data-Augmented Contact Model for Rigid Body Simulation. (arXiv:1803.04019v4 [cs.RO] UPDATED)
    Accurately modeling contact behaviors for real-world, near-rigid materials remains a grand challenge for existing rigid-body physics simulators. This paper introduces a data-augmented contact model that incorporates analytical solutions with observed data to predict the 3D contact impulse which could result in rigid bodies bouncing, sliding or spinning in all directions. Our method enhances the expressiveness of the standard Coulomb contact model by learning the contact behaviors from the observed data, while preserving the fundamental contact constraints whenever possible. For example, a classifier is trained to approximate the transitions between static and dynamic friction, while the non-penetration constraint during collision is enforced analytically. Our method computes the aggregated effect of contact for the entire rigid body, instead of predicting the contact force for each contact point individually, maintaining the same simulation speed as the number of contact points increases for detailed geometries. Supplemental video: https://shorturl.at/eilwX Keywords: Physics Simulation Algorithms, Dynamics Learning, Contact Learning  ( 2 min )
    Langevin Monte Carlo for Contextual Bandits. (arXiv:2206.11254v1 [cs.LG])
    We study the efficiency of Thompson sampling for contextual bandits. Existing Thompson sampling-based algorithms need to construct a Laplace approximation (i.e., a Gaussian distribution) of the posterior distribution, which is inefficient to sample in high dimensional applications for general covariance matrices. Moreover, the Gaussian approximation may not be a good surrogate for the posterior distribution for general reward generating functions. We propose an efficient posterior sampling algorithm, viz., Langevin Monte Carlo Thompson Sampling (LMC-TS), that uses Markov Chain Monte Carlo (MCMC) methods to directly sample from the posterior distribution in contextual bandits. Our method is computationally efficient since it only needs to perform noisy gradient descent updates without constructing the Laplace approximation of the posterior distribution. We prove that the proposed algorithm achieves the same sublinear regret bound as the best Thompson sampling algorithms for a special case of contextual bandits, viz., linear contextual bandits. We conduct experiments on both synthetic data and real-world datasets on different contextual bandit models, which demonstrate that directly sampling from the posterior is both computationally efficient and competitive in performance.  ( 2 min )
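    The "noisy gradient descent updates" described in this abstract correspond to the generic unadjusted Langevin iteration, which the following sketch runs on a toy one-dimensional Gaussian target. This is not the LMC-TS algorithm itself: the target, the step size, and the names `langevin_samples`, `grad_u`, `MU` and `SIGMA` are assumptions made for illustration.

```python
import math
import random


def langevin_samples(grad_u, x0, step, n_steps, burn_in, seed=0):
    """Unadjusted Langevin iteration: x <- x - step * grad_u(x) + sqrt(2 * step) * noise."""
    rng = random.Random(seed)
    x = x0
    kept = []
    for t in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - step * grad_u(x) + math.sqrt(2.0 * step) * noise
        if t >= burn_in:  # discard early iterates that still depend on x0
            kept.append(x)
    return kept


# Toy target (assumption): a Gaussian with mean MU and standard deviation SIGMA,
# whose negative log-density is U(x) = (x - MU)^2 / (2 * SIGMA^2).
MU, SIGMA = 2.0, 1.0


def grad_u(x):
    return (x - MU) / SIGMA ** 2


samples = langevin_samples(grad_u, x0=0.0, step=0.05, n_steps=51000, burn_in=1000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

    Each update needs only a gradient of the negative log-density and one Gaussian draw, so no covariance matrix is formed or factorized. The finite step size introduces a small bias in the stationary variance.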
    On Uniform Boundedness Properties of SGD and its Momentum Variants. (arXiv:2201.10245v2 [cs.LG] UPDATED)
    A theoretical, and potentially also practical, problem with stochastic gradient descent is that trajectories may escape to infinity. In this note, we investigate uniform boundedness properties of iterates and function values along the trajectories of the stochastic gradient descent algorithm and its important momentum variant. Under smoothness and $R$-dissipativity of the loss function, we show that broad families of step-sizes, including the widely used step-decay and cosine with (or without) restart step-sizes, result in uniformly bounded iterates and function values. Several important applications that satisfy these assumptions, including phase retrieval problems, Gaussian mixture models, and some neural network classifiers, are discussed in detail. We further extend the uniform boundedness of SGD and its momentum variant under the generalized dissipativity for the functions whose tails grow slower than quadratic functions. This includes some interesting applications, for example, Bayesian logistic regression and logistic regression with $\ell_1$ regularization.  ( 2 min )
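    The step-size families named in this abstract (step-decay, and cosine with or without restart) can be written down in their common textbook forms, as in the sketch below. The exact parameterization analyzed in the paper may differ; the function names and default values here are assumptions.

```python
import math


def step_decay(t, eta0=0.1, drop=0.5, every=10):
    """Multiply the step size by `drop` once every `every` iterations."""
    return eta0 * drop ** (t // every)


def cosine(t, horizon, eta0=0.1, eta_min=0.0):
    """Cosine schedule without restart: decays from eta0 to eta_min over `horizon` steps."""
    return eta_min + 0.5 * (eta0 - eta_min) * (1.0 + math.cos(math.pi * t / horizon))


def cosine_with_restart(t, period, eta0=0.1, eta_min=0.0):
    """Cosine schedule that jumps back to eta0 at the start of every period."""
    return cosine(t % period, period, eta0, eta_min)


schedule = [step_decay(t) for t in range(30)]
```

    All three schedules stay within [eta_min, eta0], which is the kind of bounded step-size family the boundedness results concern.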
    Human Pose Estimation from Sparse Inertial Measurements through Recurrent Graph Convolution. (arXiv:2107.11214v3 [cs.CV] UPDATED)
    Conventional methods for human pose estimation either require a high degree of instrumentation, by relying on many inertial measurement units (IMUs), or constrain the recording space, by relying on extrinsic cameras. These deficits are tackled through the approach of human pose estimation from sparse IMU data. We define adjacency adaptive graph convolutional long short-term memory networks (AAGC-LSTM), to tackle human pose estimation based on six IMUs, while incorporating the human body graph structure directly into the network. The AAGC-LSTM combines both spatial and temporal dependency in a single network operation, more memory efficiently than previous approaches. This is made possible by equipping graph convolutions with adjacency adaptivity, which eliminates the problem of information loss in deep or recurrent graph networks, while it also allows for learning unknown dependencies between the human body joints. To further boost accuracy, we propose longitudinal loss weighting to consider natural movement patterns. With our presented approach, we are able to utilize the inherent graph nature of the human body, and thus can outperform the state of the art (SOTA) for human pose estimation from sparse IMU data.  ( 2 min )
    OpenXAI: Towards a Transparent Evaluation of Model Explanations. (arXiv:2206.11104v1 [cs.LG])
    While several types of post hoc explanation methods (e.g., feature attribution methods) have been proposed in recent literature, there is little to no work on systematically benchmarking these methods in an efficient and transparent manner. Here, we introduce OpenXAI, a comprehensive and extensible open source framework for evaluating and benchmarking post hoc explanation methods. OpenXAI comprises the following key components: (i) a flexible synthetic data generator and a collection of diverse real-world datasets, pre-trained models, and state-of-the-art feature attribution methods, (ii) open-source implementations of twenty-two quantitative metrics for evaluating faithfulness, stability (robustness), and fairness of explanation methods, and (iii) the first ever public XAI leaderboards to benchmark explanations. OpenXAI is easily extensible, as users can readily evaluate custom explanation methods and incorporate them into our leaderboards. Overall, OpenXAI provides an automated end-to-end pipeline that not only simplifies and standardizes the evaluation of post hoc explanation methods, but also promotes transparency and reproducibility in benchmarking these methods. OpenXAI datasets and data loaders, implementations of state-of-the-art explanation methods and evaluation metrics, as well as leaderboards are publicly available at https://open-xai.github.io/.
    Large-scale multi-objective influence maximisation with network downscaling. (arXiv:2204.06250v3 [cs.SI] UPDATED)
    Finding the most influential nodes in a network is a computationally hard problem with several possible applications in various kinds of network-based problems. While several methods have been proposed for tackling the influence maximisation (IM) problem, their runtime typically scales poorly when the network size increases. Here, we propose an original method, based on network downscaling, that allows a multi-objective evolutionary algorithm (MOEA) to solve the IM problem on a reduced scale network, while preserving the relevant properties of the original network. The downscaled solution is then upscaled to the original network, using a mechanism based on centrality metrics such as PageRank. Our results on eight large networks (including two with $\sim$50k nodes) demonstrate the effectiveness of the proposed method with a more than 10-fold runtime gain compared to the time needed on the original network, and an up to $82\%$ time reduction compared to CELF.  ( 2 min )
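    PageRank, cited in this abstract as an example centrality metric, can be computed by power iteration as in the minimal sketch below. It illustrates the metric only, not the paper's upscaling mechanism; the adjacency-dictionary format, the function name `pagerank`, and the toy graph are assumptions.

```python
def pagerank(adj, damping=0.85, iters=100):
    """Power iteration for PageRank.

    `adj` maps each node to the list of nodes it links to; every link target
    must also appear as a key of `adj`.
    """
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Teleportation term shared by all nodes.
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = adj[v]
            if out:
                share = damping * rank[v] / len(out)
                for w in out:
                    new[w] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for w in nodes:
                    new[w] += damping * rank[v] / n
        rank = new
    return rank


toy_graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
scores = pagerank(toy_graph)
most_central = max(scores, key=scores.get)
```

    In this toy graph node "a" receives links from both other nodes and ends up with the highest score, while "c" has no incoming links and keeps only the teleportation mass.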
    Active Learning with Safety Constraints. (arXiv:2206.11183v1 [cs.LG])
    Active learning methods have shown great promise in reducing the number of samples necessary for learning. As automated learning systems are adopted into real-time, real-world decision-making pipelines, it is increasingly important that such algorithms are designed with safety in mind. In this work we investigate the complexity of learning the best safe decision in interactive environments. We reduce this problem to a constrained linear bandits problem, where our goal is to find the best arm satisfying certain (unknown) safety constraints. We propose an adaptive experimental design-based algorithm, which we show efficiently trades off between the difficulty of showing an arm is unsafe vs suboptimal. To our knowledge, our results are the first on best-arm identification in linear bandits with safety constraints. In practice, we demonstrate that this approach performs well on synthetic and real-world datasets.
    Only Tails Matter: Average-Case Universality and Robustness in the Convex Regime. (arXiv:2206.09901v2 [math.OC] UPDATED)
    The recently developed average-case analysis of optimization methods allows a more fine-grained and representative convergence analysis than usual worst-case results. In exchange, this analysis requires a more precise hypothesis over the data generating process, namely assuming knowledge of the expected spectral distribution (ESD) of the random matrix associated with the problem. This work shows that the concentration of eigenvalues near the edges of the ESD determines a problem's asymptotic average complexity. A priori information on this concentration is a more grounded assumption than complete knowledge of the ESD. This approximate concentration is effectively a middle ground between the coarseness of the worst-case scenario convergence and the restrictive previous average-case analysis. We also introduce the Generalized Chebyshev method, asymptotically optimal under a hypothesis on this concentration and globally optimal when the ESD follows a Beta distribution. We compare its performance to classical optimization algorithms, such as gradient descent or Nesterov's scheme, and we show that, in the average-case context, Nesterov's method is universally nearly optimal asymptotically.
    Sharing pattern submodels for prediction with missing values. (arXiv:2206.11161v1 [cs.LG])
    Missing values are unavoidable in many applications of machine learning and present a challenge both during training and at test time. When variables are missing in recurring patterns, fitting separate pattern submodels has been proposed as a solution. However, independent models do not make efficient use of all available data. Conversely, fitting a shared model to the full data set typically relies on imputation which may be suboptimal when missingness depends on unobserved factors. We propose an alternative approach, called sharing pattern submodels, which makes predictions that a) are robust to missing values at test time, b) maintain or improve the predictive power of pattern submodels, and c) have a short description enabling improved interpretability. We identify cases where sharing is provably optimal, even when missingness itself is predictive and when the prediction target depends on unobserved variables. Classification and regression experiments on synthetic data and two healthcare data sets demonstrate that our models achieve a favorable trade-off between pattern specialization and information sharing.
    S2RL: Do We Really Need to Perceive All States in Deep Multi-Agent Reinforcement Learning?. (arXiv:2206.11054v1 [cs.LG])
    Collaborative multi-agent reinforcement learning (MARL) has been widely used in many practical applications, where each agent makes a decision based on its own observation. Most mainstream methods treat each local observation as an entirety when modeling the decentralized local utility functions. However, they ignore the fact that local observation information can be further divided into several entities, and only part of the entities is helpful to model inference. Moreover, the importance of different entities may change over time. To improve the performance of decentralized policies, the attention mechanism is used to capture features of local information. Nevertheless, existing attention models rely on dense fully connected graphs and cannot better perceive important states. To this end, we propose a sparse state based MARL (S2RL) framework, which utilizes a sparse attention mechanism to discard irrelevant information in local observations. The local utility functions are estimated through the self-attention and sparse attention mechanisms separately, then are combined into a standard joint value function and auxiliary joint value function in the central critic. We design the S2RL framework as a plug-and-play module, making it general enough to be applied to various methods. Extensive experiments on StarCraft II show that S2RL can significantly improve the performance of many state-of-the-art methods.
    Answer Fast: Accelerating BERT on the Tensor Streaming Processor. (arXiv:2206.11062v1 [cs.LG])
    Transformers have become a predominant machine learning workload; they are not only the de-facto standard for natural language processing tasks, but they are also being deployed in other domains such as vision and speech recognition. Many of the transformer-based applications are real-time systems such as machine translation and web search. These real-time systems often come with strict end-to-end inference latency requirements. Unfortunately, while the majority of the transformer computation comes from matrix multiplications, transformers also include several non-linear components that tend to become the bottleneck during an inference. In this work, we accelerate the inference of BERT models on the tensor streaming processor. By carefully fusing all the nonlinear components with the matrix multiplication components, we are able to efficiently utilize the on-chip matrix multiplication units resulting in a deterministic tail latency of 130 $\mu$s for a batch-1 inference through BERT-base, which is 6X faster than the current state-of-the-art.
    A Context-Integrated Transformer-Based Neural Network for Auction Design. (arXiv:2201.12489v2 [cs.GT] UPDATED)
    One of the central problems in auction design is developing an incentive-compatible mechanism that maximizes the auctioneer's expected revenue. While theoretical approaches have encountered bottlenecks in multi-item auctions, recently, there has been much progress on finding the optimal mechanism through deep learning. However, these works either focus on a fixed set of bidders and items, or restrict the auction to be symmetric. In this work, we overcome such limitations by factoring public contextual information of bidders and items into the auction learning framework. We propose $\mathtt{CITransNet}$, a context-integrated transformer-based neural network for optimal auction design, which maintains permutation-equivariance over bids and contexts while being able to find asymmetric solutions. We show by extensive experiments that $\mathtt{CITransNet}$ can recover the known optimal solutions in single-item settings, outperform strong baselines in multi-item auctions, and generalize well to cases other than those in training.  ( 2 min )
    Data-Free Quantization with Accurate Activation Clipping and Adaptive Batch Normalization. (arXiv:2204.04215v2 [cs.LG] UPDATED)
    Data-free quantization is a task that compresses the neural network to low bit-width without access to original training data. Most existing data-free quantization methods cause severe performance degradation due to inaccurate activation clipping range and quantization error, especially for low bit-width. In this paper, we present a simple yet effective data-free quantization method with accurate activation clipping and adaptive batch normalization. Accurate activation clipping (AAC) improves the model accuracy by exploiting accurate activation information from the full-precision model. Adaptive batch normalization is first proposed to address the quantization error from distribution changes by updating the batch normalization layer adaptively. Extensive experiments demonstrate that the proposed data-free quantization method can yield surprising performance, achieving 64.33% top-1 accuracy of ResNet18 on the ImageNet dataset, with 3.7% absolute improvement outperforming the existing state-of-the-art methods.
    LiT: Zero-Shot Transfer with Locked-image text Tuning. (arXiv:2111.07991v3 [cs.CV] UPDATED)
    This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training. In our empirical study we find that locked pre-trained image models with unlocked text models work best. We call this instance of contrastive-tuning "Locked-image Tuning" (LiT), which just teaches a text model to read out good representations from a pre-trained image model for new tasks. A LiT model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval. The proposed LiT is widely applicable; it works reliably with multiple pre-training methods (supervised and unsupervised) and across diverse architectures (ResNet, Vision Transformers and MLP-Mixer) using three different image-text datasets. With the transformer-based pre-trained ViT-g/14 model, the LiT model achieves 85.2% zero-shot transfer accuracy on the ImageNet test set, and 82.5% on the challenging out-of-distribution ObjectNet test set.  ( 2 min )
    $C^*$-algebra Net: A New Approach Generalizing Neural Network Parameters to $C^*$-algebra. (arXiv:2206.09513v2 [stat.ML] UPDATED)
    We propose a new framework that generalizes the parameters of neural network models to $C^*$-algebra-valued ones. $C^*$-algebra is a generalization of the space of complex numbers. A typical example is the space of continuous functions on a compact space. This generalization enables us to combine multiple models continuously and use tools for functions such as regression and integration. Consequently, we can learn features of data efficiently and adapt the models to problems continuously. We apply our framework to practical problems such as density estimation and few-shot learning and show that our framework enables us to learn features of data even with a limited number of samples. Our new framework highlights the possibility of applying the theory of $C^*$-algebra to general neural network models.
    Restless and Uncertain: Robust Policies for Restless Bandits via Deep Multi-Agent Reinforcement Learning. (arXiv:2107.01689v2 [cs.LG] UPDATED)
    We introduce robustness in \textit{restless multi-armed bandits} (RMABs), a popular model for constrained resource allocation among independent stochastic processes (arms). Nearly all RMAB techniques assume stochastic dynamics are precisely known. However, in many real-world settings, dynamics are estimated with significant \emph{uncertainty}, e.g., via historical data, which can lead to bad outcomes if ignored. To address this, we develop an algorithm to compute minimax regret -- robust policies for RMABs. Our approach uses a double oracle framework (oracles for \textit{agent} and \textit{nature}), which is often used for single-process robust planning but requires significant new techniques to accommodate the combinatorial nature of RMABs. Specifically, we design a deep reinforcement learning (RL) algorithm, DDLPO, which tackles the combinatorial challenge by learning an auxiliary "$\lambda$-network" in tandem with policy networks per arm, greatly reducing sample complexity, with guarantees on convergence. DDLPO, of general interest, implements our reward-maximizing agent oracle. We then tackle the challenging regret-maximizing nature oracle, a non-stationary RL challenge, by formulating it as a multi-agent RL problem between a policy optimizer and adversarial nature. This formulation is of general interest -- we solve it for RMABs by creating a multi-agent extension of DDLPO with a shared critic. We show our approaches work well in three experimental domains.
    sqSGD: Locally Private and Communication Efficient Federated Learning. (arXiv:2206.10565v2 [cs.LG] UPDATED)
    Federated learning (FL) is a technique that trains machine learning models from decentralized data sources. We study FL under local notions of privacy constraints, which provides strong protection against sensitive data disclosures via obfuscating the data before leaving the client. We identify two major concerns in designing practical privacy-preserving FL algorithms: communication efficiency and high-dimensional compatibility. We then develop a gradient-based learning algorithm called \emph{sqSGD} (selective quantized stochastic gradient descent) that addresses both concerns. The proposed algorithm is based on a novel privacy-preserving quantization scheme that uses a constant number of bits per dimension per client. Then we improve the base algorithm in three ways: first, we apply a gradient subsampling strategy that simultaneously offers better training performance and smaller communication costs under a fixed privacy budget. Secondly, we utilize randomized rotation as a preprocessing step to reduce quantization error. Thirdly, an adaptive gradient norm upper bound shrinkage strategy is adopted to improve accuracy and stabilize training. Finally, the practicality of the proposed framework is demonstrated on benchmark datasets. Experiment results show that sqSGD successfully learns large models like LeNet and ResNet with local privacy constraints. In addition, with fixed privacy and communication level, the performance of sqSGD significantly dominates that of various baseline algorithms.
    Bear the Query in Mind: Visual Grounding with Query-conditioned Convolution. (arXiv:2206.09114v2 [cs.CV] UPDATED)
    Visual grounding is a task that aims to locate a target object according to a natural language expression. As a multi-modal task, feature interaction between textual and visual inputs is vital. However, previous solutions mainly handle each modality independently before fusing them together, which does not take full advantage of relevant textual information while extracting visual features. To better leverage the textual-visual relationship in visual grounding, we propose a Query-conditioned Convolution Module (QCM) that extracts query-aware visual features by incorporating query information into the generation of convolutional kernels. With our proposed QCM, the downstream fusion module receives visual features that are more discriminative and focused on the desired object described in the expression, leading to more accurate predictions. Extensive experiments on three popular visual grounding datasets demonstrate that our method achieves state-of-the-art performance. In addition, the query-aware visual features are informative enough to achieve comparable performance to the latest methods when directly used for prediction without further multi-modal fusion.
    Multi-Modality Image Inpainting using Generative Adversarial Networks. (arXiv:2206.09210v2 [eess.IV] UPDATED)
    Deep learning techniques, especially Generative Adversarial Networks (GANs), have significantly improved image inpainting and image-to-image translation tasks over the past few years. To the best of our knowledge, the problem of combining the image inpainting task with multi-modality image-to-image translation remains unaddressed. In this paper, we propose a model to address this problem. The model is evaluated on combined night-to-day image translation and inpainting, and we report promising qualitative and quantitative results.
    Few-Max: Few-Shot Domain Adaptation for Unsupervised Contrastive Representation Learning. (arXiv:2206.10137v2 [cs.CV] UPDATED)
    Contrastive self-supervised learning methods learn to map data points such as images into a non-parametric representation space without requiring labels. While highly successful, current methods require a large amount of data in the training phase. In situations where the target training set is limited in size, generalization is known to be poor. Pretraining on a large source data set and fine-tuning on the target samples is prone to overfitting in the few-shot regime, where only a small number of target samples are available. Motivated by this, we propose a domain adaptation method for self-supervised contrastive learning, termed Few-Max, to address the issue of adaptation to a target distribution under few-shot learning. To quantify the representation quality, we evaluate Few-Max on a range of source and target datasets, including ImageNet, VisDA, and fastMRI, on which Few-Max consistently outperforms other approaches.
    Exploring Longitudinal Cough, Breath, and Voice Data for COVID-19 Progression Prediction via Sequential Deep Learning: Model Development and Validation. (arXiv:2201.01232v2 [cs.SD] UPDATED)
    Recent work has shown the potential of using audio data (eg, cough, breathing, and voice) in the screening for COVID-19. However, these approaches only focus on one-off detection and detect the infection given the current audio sample, but do not monitor disease progression in COVID-19. Limited exploration has been put forward to continuously monitor COVID-19 progression, especially recovery, through longitudinal audio data. Tracking disease progression characteristics could lead to more timely treatment. The primary objective of this study is to explore the potential of longitudinal audio samples over time for COVID-19 progression prediction and, especially, recovery trend prediction using sequential deep learning techniques. Crowdsourced respiratory audio data, including breathing, cough, and voice samples, from 212 individuals over 5-385 days were analyzed. We developed a deep learning-enabled tracking tool using gated recurrent units (GRUs) to detect COVID-19 progression by exploring the audio dynamics of the individuals' historical audio biomarkers. The investigation comprised 2 parts: (1) COVID-19 detection in terms of positive and negative (healthy) tests, and (2) longitudinal disease progression prediction over time in terms of probability of positive tests. The strong performance for COVID-19 detection, yielding an AUROC of 0.79, a sensitivity of 0.75, and a specificity of 0.71 supported the effectiveness of the approach compared to methods that do not leverage longitudinal dynamics. We further examined the predicted disease progression trajectory, displaying high consistency with test results with a correlation of 0.75 in the test cohort and 0.86 in a subset of the test cohort who reported recovery. Our findings suggest that monitoring COVID-19 evolution via longitudinal audio data has potential in the tracking of individuals' disease progression and recovery.
    Beyond the Quadratic Approximation: the Multiscale Structure of Neural Network Loss Landscapes. (arXiv:2204.11326v3 [cs.LG] UPDATED)
    A quadratic approximation of neural network loss landscapes has been extensively used to study the optimization process of these networks. Though it usually holds in a very small neighborhood of the minimum, it cannot explain many phenomena observed during the optimization process. In this work, we study the structure of neural network loss functions and its implications for optimization in a region beyond the reach of a good quadratic approximation. Numerically, we observe that neural network loss functions possess a multiscale structure, manifested in two ways: (1) in a neighborhood of minima, the loss mixes a continuum of scales and grows subquadratically, and (2) in a larger region, the loss shows several separate scales clearly. Using the subquadratic growth, we are able to explain the Edge of Stability phenomenon [5] observed for the gradient descent (GD) method. Using the separate scales, we explain the working mechanism of learning rate decay by simple examples. Finally, we study the origin of the multiscale structure and propose that the non-convexity of the models and the non-uniformity of training data are among the causes. By constructing a two-layer neural network problem we show that training data with different magnitudes give rise to different scales of the loss function, producing subquadratic growth and multiple separate scales.
    Adversarial Masking for Self-Supervised Learning. (arXiv:2201.13100v2 [cs.CV] UPDATED)
    We propose ADIOS, a masked image model (MIM) framework for self-supervised learning, which simultaneously learns a masking function and an image encoder using an adversarial objective. The image encoder is trained to minimise the distance between representations of the original and that of a masked image. The masking function, conversely, aims at maximising this distance. ADIOS consistently improves on state-of-the-art self-supervised learning (SSL) methods on a variety of tasks and datasets -- including classification on ImageNet100 and STL10, transfer learning on CIFAR10/100, Flowers102 and iNaturalist, as well as robustness evaluated on the backgrounds challenge (Xiao et al., 2021) -- while generating semantically meaningful masks. Unlike modern MIM models such as MAE, BEiT and iBOT, ADIOS does not rely on the image-patch tokenisation construction of Vision Transformers, and can be implemented with convolutional backbones. We further demonstrate that the masks learned by ADIOS are more effective in improving representation learning of SSL methods than masking schemes used in popular MIM models.
    Saute RL: Almost Surely Safe Reinforcement Learning Using State Augmentation. (arXiv:2202.06558v3 [cs.LG] UPDATED)
    Satisfying safety constraints almost surely (or with probability one) can be critical for the deployment of Reinforcement Learning (RL) in real-life applications. For example, plane landing and take-off should ideally occur with probability one. We address the problem by introducing Safety Augmented (Saute) Markov Decision Processes (MDPs), where the safety constraints are eliminated by augmenting them into the state-space and reshaping the objective. We show that Saute MDP satisfies the Bellman equation and moves us closer to solving Safe RL with constraints satisfied almost surely. We argue that Saute MDP allows viewing the Safe RL problem from a different perspective enabling new features. For instance, our approach has a plug-and-play nature, i.e., any RL algorithm can be "Sauteed". Additionally, state augmentation allows for policy generalization across safety constraints. We finally show that Saute RL algorithms can outperform their state-of-the-art counterparts when constraint satisfaction is of high importance.  ( 2 min )
    Universum-inspired Supervised Contrastive Learning. (arXiv:2204.10695v2 [cs.LG] UPDATED)
    Mixup is an efficient data augmentation method which generates additional samples through respective convex combinations of original data points and labels. Although theoretically dependent on data properties, Mixup is reported to perform well as a regularizer and calibrator, contributing reliable robustness and generalization to neural network training. In this paper, inspired by Universum Learning, which uses out-of-class samples to assist the target tasks, we investigate Mixup from a largely under-explored perspective - the potential to generate in-domain samples that belong to none of the target classes, that is, universum. We find that in the framework of supervised contrastive learning, universum-style Mixup produces surprisingly high-quality hard negatives, greatly relieving the need for a large batch size in contrastive learning. With these findings, we propose Universum-inspired Contrastive learning (UniCon), which incorporates the Mixup strategy to generate universum data as g-negatives and pushes them apart from anchor samples of the target classes. Our approach not only improves Mixup with hard labels, but also provides a novel means of generating universum data. With a linear classifier on the learned representations, our method achieves 81.68% top-1 accuracy on CIFAR-100, surpassing the state of the art by a significant margin of 5% with a much smaller batch size, typically, 256 in UniCon vs. 1024 in SupCon using ResNet-50.
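    The convex combination that Mixup performs can be sketched as follows. This is a minimal illustration in plain Python, not the UniCon implementation: the function names `mixup` and `sample_lambda` and the toy two-class data are assumptions made for the example. Drawing the mixing weight from Beta(alpha, alpha) follows the standard Mixup recipe.

```python
import random


def mixup(x1, y1, x2, y2, lam):
    """Return the convex combination of two samples and of their one-hot labels."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


def sample_lambda(alpha=1.0, rng=random):
    """Draw a mixing weight from Beta(alpha, alpha), as in standard Mixup."""
    return rng.betavariate(alpha, alpha)


# Hypothetical toy data: two samples from two different classes.
x_a, y_a = [1.0, 0.0], [1.0, 0.0]
x_b, y_b = [0.0, 1.0], [0.0, 1.0]

# With lam = 0.5 the mixed sample sits exactly between the two classes.
x_mix, y_mix = mixup(x_a, y_a, x_b, y_b, lam=0.5)
```

    A sample mixed with equal weight across different classes carries no dominant label, which is one way to read the abstract's "belong to none of the target classes"; how UniCon actually selects and weights such samples is not specified here.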
    Hybrid Intelligent Testing in Simulation-Based Verification. (arXiv:2205.09552v2 [cs.AR] UPDATED)
    Efficient and effective testing for simulation-based hardware verification is challenging. Using constrained random test generation, several millions of tests may be required to achieve coverage goals. The vast majority of tests do not contribute to coverage progress, yet they consume verification resources. In this paper, we propose a hybrid intelligent testing approach combining two methods that have previously been treated separately, namely Coverage-Directed Test Selection and Novelty-Driven Verification. Coverage-Directed Test Selection learns from coverage feedback to bias testing towards the most effective tests. Novelty-Driven Verification learns to identify and simulate stimuli that differ from previous stimuli, thereby reducing the number of simulations and increasing testing efficiency. We discuss the strengths and limitations of each method, and we show how our approach addresses each method's limitations, leading to hardware testing that is both efficient and effective.  ( 2 min )
    Deep reinforcement learning for fMRI prediction of Autism Spectrum Disorder. (arXiv:2206.11224v1 [q-bio.NC])
    Purpose: Because functional MRI (fMRI) data sets are in general small, we sought a data efficient approach to resting state fMRI classification of autism spectrum disorder (ASD) versus neurotypical (NT) controls. We hypothesized that a Deep Reinforcement Learning (DRL) classifier could learn effectively on a small fMRI training set. Methods: We trained a Deep Reinforcement Learning (DRL) classifier on 100 graph-label pairs from the Autism Brain Imaging Data Exchange (ABIDE) database. For comparison, we trained a Supervised Deep Learning (SDL) classifier on the same training set. Results: DRL significantly outperformed SDL, with a p-value of 2.4 x 10^(-7). DRL achieved superior results for a variety of classifier performance metrics, including an F1 score of 76, versus 67 for SDL. Whereas SDL quickly overfit the training data, DRL learned in a progressive manner that generalised to the separate testing set. Conclusion: DRL can learn to classify ASD versus NT in a data efficient manner, doing so for a small training set. Future work will involve optimizing the neural network for data efficiency and applying the approach to other fMRI data sets, namely for brain cancer patients.  ( 2 min )
    exploRNN: Understanding Recurrent Neural Networks through Visual Exploration. (arXiv:2012.06326v3 [cs.LG] UPDATED)
    Due to the success of deep learning (DL) and its growing job market, students and researchers from many areas are interested in learning about DL technologies. Visualization has proven to be of great help during this learning process. While most current educational visualizations are targeted towards one specific architecture or use case, recurrent neural networks (RNNs), which are capable of processing sequential data, are not covered yet. This is despite the fact that tasks on sequential data, such as text and function analysis, are at the forefront of DL research. Therefore, we propose exploRNN, the first interactively explorable educational visualization for RNNs. On the basis of making learning easier and more fun, we define educational objectives targeted towards understanding RNNs. We use these objectives to form guidelines for the visual design process. By means of exploRNN, which is accessible online, we provide an overview of the training process of RNNs at a coarse level, while also allowing a detailed inspection of the data flow within LSTM cells. In an empirical study, we assessed 37 subjects in a between-subjects design to investigate the learning outcomes and cognitive load of exploRNN compared to a classic text-based learning environment. While learners in the text group are ahead in superficial knowledge acquisition, exploRNN is particularly helpful for deeper understanding of the learning content. In addition, the complex content in exploRNN is perceived as significantly easier and causes less extraneous load than in the text group. The study shows that for difficult learning material such as recurrent networks, where deep understanding is important, interactive visualizations such as exploRNN can be helpful.
    Multi-hop RIS-Empowered Terahertz Communications: A DRL-based Hybrid Beamforming Design. (arXiv:2101.09137v2 [eess.SP] UPDATED)
    Wireless communication in the TeraHertz band (0.1--10 THz) is envisioned as one of the key enabling technologies for the future sixth generation (6G) wireless communication systems scaled up beyond massive multiple input multiple output (Massive-MIMO) technology. However, very high propagation attenuations and molecular absorptions of THz frequencies often limit the signal transmission distance and coverage range. Benefiting from the recent breakthroughs in reconfigurable intelligent surfaces (RIS) for realizing smart radio propagation environments, we propose a novel hybrid beamforming scheme for multi-hop RIS-assisted communication networks to improve the coverage range at THz-band frequencies. In particular, multiple passive and controllable RISs are deployed to assist the transmissions between the base station (BS) and multiple single-antenna users. We investigate the joint design of the digital beamforming matrix at the BS and the analog beamforming matrices at the RISs, leveraging recent advances in deep reinforcement learning (DRL) to combat the propagation loss. To improve the convergence of the proposed DRL-based algorithm, two algorithms are then designed to initialize the digital beamforming and the analog beamforming matrices utilizing the alternating optimization technique. Simulation results show that our proposed scheme improves the coverage range of THz communications by 50% compared with the benchmarks. Furthermore, it is also shown that our proposed DRL-based method is a state-of-the-art method to solve the NP-hard beamforming problem, especially when the signals at RIS-assisted THz communication networks experience multiple hops.
    Learning Optimal Treatment Strategies for Sepsis Using Offline Reinforcement Learning in Continuous Space. (arXiv:2206.11190v1 [cs.LG])
    Sepsis is a leading cause of death in the ICU. It is a disease requiring complex interventions in a short period of time, but its optimal treatment strategy remains uncertain. Evidence suggests that the practices of currently used treatment strategies are problematic and may cause harm to patients. To address this decision problem, we propose a new medical decision model based on historical data to help clinicians recommend the best reference option for real-time treatment. Our model combines offline reinforcement learning with deep reinforcement learning to address the problem that traditional reinforcement learning in healthcare cannot interact with the environment, enabling our model to make decisions in a continuous state-action space. We demonstrate that, on average, the treatments recommended by the model are more valuable and reliable than those recommended by clinicians. In a large validation dataset, we found that patients whose actual doses from clinicians matched the AI's decisions had the lowest mortality rates. Our model provides personalized, clinically interpretable treatment decisions for sepsis that can improve patient care.
    General Univariate Estimation-of-Distribution Algorithms. (arXiv:2206.11198v1 [cs.NE])
    We propose a general formulation of a univariate estimation-of-distribution algorithm (EDA). It naturally incorporates the three classic univariate EDAs \emph{compact genetic algorithm}, \emph{univariate marginal distribution algorithm} and \emph{population-based incremental learning} as well as the \emph{max-min ant system} with iteration-best update. Our unified description of the existing algorithms allows a unified analysis of these; we demonstrate this by providing an analysis of genetic drift that immediately gives the existing results proven separately for the four algorithms named above. Our general model also includes EDAs that are more efficient than the existing ones and these may not be difficult to find as we demonstrate for the OneMax and LeadingOnes benchmarks.
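    To make the univariate EDA idea concrete, here is a minimal sketch of one of the classic algorithms the abstract names, the univariate marginal distribution algorithm (UMDA), run on OneMax. It is not the paper's general formulation: the parameter names (`lam` for offspring count, `mu` for the number of selected individuals) and the frequency borders 1/n and 1 - 1/n are conventional choices assumed for this example.

```python
import random


def onemax(bits):
    """OneMax fitness: the number of one-bits."""
    return sum(bits)


def umda(n, lam, mu, generations, seed=0):
    """Run UMDA on OneMax; return the best individual seen and the final frequencies."""
    rng = random.Random(seed)
    p = [0.5] * n                      # one independent frequency per bit position
    lo, hi = 1.0 / n, 1.0 - 1.0 / n    # borders that keep every frequency away from 0 and 1
    best = None
    for _ in range(generations):
        # Sample lam individuals from the product distribution defined by p.
        pop = [[1 if rng.random() < pi else 0 for pi in p] for _ in range(lam)]
        pop.sort(key=onemax, reverse=True)
        if best is None or onemax(pop[0]) > onemax(best):
            best = pop[0]
        # Set each frequency to the fraction of ones among the mu best individuals.
        selected = pop[:mu]
        p = [min(hi, max(lo, sum(ind[i] for ind in selected) / mu)) for i in range(n)]
    return best, p


best, freqs = umda(n=20, lam=50, mu=10, generations=30, seed=1)
```

    The model is univariate because each bit has its own frequency and bits are sampled independently. Genetic drift, which the abstract analyses, shows up here as frequencies wandering toward a border even for bits that selection does not favour.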
    An Embedded Feature Selection Framework for Control. (arXiv:2206.11064v1 [cs.LG])
    Reducing sensor requirements while keeping optimal control performance is crucial to many industrial control applications to achieve robust, low-cost, and computation-efficient controllers. However, existing feature selection solutions for the typical machine learning domain can hardly be applied in the domain of control with changing dynamics. In this paper, we propose a novel framework, the Dual-world embedded Attentive Feature Selection (D-AFS), which can efficiently select the most relevant sensors for the system under dynamic control. Rather than the one world used in most Deep Reinforcement Learning (DRL) algorithms, D-AFS has both the real world and its virtual peer with twisted features. By analyzing the DRL's response in the two worlds, D-AFS can quantitatively identify the respective features' importance towards control. A well-known active flow control problem, cylinder drag reduction, is used for evaluation. Results show that D-AFS successfully finds an optimized five-probe layout with 18.7% drag reduction compared with the state-of-the-art solution with 151 probes and 49.2% reduction compared with the five-probe layout by human experts. We also apply this solution to four OpenAI classical control cases. In all cases, D-AFS achieves the same or better sensor configurations than the originally provided solutions. We argue that the results highlight a new way to achieve efficient and optimal sensor designs for experimental or industrial systems. Our source code is made publicly available at https://github.com/G-AILab/DAFSFluid.
    Surfer100: Generating Surveys From Web Resources, Wikipedia-style. (arXiv:2112.06377v4 [cs.CL] UPDATED)
    Fast-developing fields such as Artificial Intelligence (AI) often outpace the efforts of encyclopedic sources such as Wikipedia, which either do not completely cover recently-introduced topics or lack such content entirely. As a result, methods for automatically producing content are valuable tools to address this information overload. We show that recent advances in pretrained language modeling can be combined for a two-stage extractive and abstractive approach for Wikipedia lead paragraph generation. We extend this approach to generate longer Wikipedia-style summaries with sections and examine how such methods struggle in this application through detailed studies with 100 reference human-collected surveys. This is the first study on utilizing web resources for long Wikipedia-style summaries to the best of our knowledge.
    Beyond Greedy Search: Tracking by Multi-Agent Reinforcement Learning-based Beam Search. (arXiv:2205.09676v2 [cs.CV] UPDATED)
    To track the target in a video, current visual trackers usually adopt greedy search for target object localization in each frame, that is, the candidate region with the maximum response score will be selected as the tracking result of each frame. However, we found that this may not be an optimal choice, especially when encountering challenging tracking scenarios such as heavy occlusion and fast motion. To address this issue, we propose to maintain multiple tracking trajectories and apply a beam search strategy for visual tracking, so that the trajectory with fewer accumulated errors can be identified. Accordingly, this paper introduces a novel multi-agent reinforcement learning based beam search tracking strategy, termed BeamTracking. It is mainly inspired by the image captioning task, which takes an image as input and generates diverse descriptions using a beam search algorithm. Accordingly, we formulate the tracking as a sample selection problem fulfilled by multiple parallel decision-making processes, each of which aims at picking out one sample as its tracking result in each frame. Each maintained trajectory is associated with an agent to perform the decision-making and determine what actions should be taken to update related information. When all the frames are processed, we select the trajectory with the maximum accumulated score as the tracking result. Extensive experiments on seven popular tracking benchmark datasets validated the effectiveness of the proposed algorithm.
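    The difference between greedy search and maintaining several trajectories can be sketched with a plain beam search over per-frame candidates. This is only an illustration of the search strategy, not BeamTracking itself: the learned agents are replaced by fixed response scores, and the one-dimensional positions, the `jump_penalty` term and the toy `frames` data are all hypothetical.

```python
def beam_search(frames, beam_width, jump_penalty=1.0):
    """Keep the beam_width best trajectories across frames.

    frames: one list per frame, holding (position, response_score) candidates.
    Returns (accumulated_score, positions) of the best trajectory.
    """
    beams = [(0.0, [])]  # each beam is (accumulated score, chosen positions so far)
    for candidates in frames:
        expanded = []
        for total, path in beams:
            for pos, score in candidates:
                # Penalise large jumps between consecutive frames (assumed motion prior).
                penalty = jump_penalty * abs(pos - path[-1]) if path else 0.0
                expanded.append((total + score - penalty, path + [pos]))
        expanded.sort(key=lambda beam: beam[0], reverse=True)
        beams = expanded[:beam_width]
    return beams[0]


# Toy sequence: position 0 looks best in the first frame only.
frames = [
    [(0, 0.9), (5, 0.8)],
    [(0, 0.1), (5, 0.9)],
    [(0, 0.2), (5, 0.9)],
]

greedy_score, greedy_path = beam_search(frames, beam_width=1)
beam_score, beam_path = beam_search(frames, beam_width=2)
```

    With a beam width of 1 the search commits to position 0 in the first frame and stays there, while a width of 2 keeps the runner-up alive long enough for it to win on accumulated score.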
    Robust fine-tuning of zero-shot models. (arXiv:2109.01903v3 [cs.CV] UPDATED)
    Large pre-trained models such as CLIP or ALIGN offer consistent accuracy across a range of data distributions when performing zero-shot inference (i.e., without fine-tuning on a specific dataset). Although existing fine-tuning methods substantially improve accuracy on a given target distribution, they often reduce robustness to distribution shifts. We address this tension by introducing a simple and effective method for improving robustness while fine-tuning: ensembling the weights of the zero-shot and fine-tuned models (WiSE-FT). Compared to standard fine-tuning, WiSE-FT provides large accuracy improvements under distribution shift, while preserving high accuracy on the target distribution. On ImageNet and five derived distribution shifts, WiSE-FT improves accuracy under distribution shift by 4 to 6 percentage points (pp) over prior work while increasing ImageNet accuracy by 1.6 pp. WiSE-FT achieves similarly large robustness gains (2 to 23 pp) on a diverse set of six further distribution shifts, and accuracy gains of 0.8 to 3.3 pp compared to standard fine-tuning on seven commonly used transfer learning datasets. These improvements come at no additional computational cost during fine-tuning or inference.
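    Weight-space ensembling of two models with identical architecture can be sketched as a per-parameter linear interpolation. The snippet below is a minimal illustration under that assumption: the function name `interpolate_weights`, the mixing coefficient `alpha` and the dictionary-of-lists weight format are hypothetical stand-ins for real checkpoints.

```python
def interpolate_weights(zero_shot, fine_tuned, alpha):
    """Return (1 - alpha) * zero_shot + alpha * fine_tuned for every parameter.

    Both arguments map parameter names to flat lists of floats.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if zero_shot.keys() != fine_tuned.keys():
        raise ValueError("both models must share the same parameter names")
    merged = {}
    for name, w0 in zero_shot.items():
        w1 = fine_tuned[name]
        if len(w0) != len(w1):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - alpha) * a + alpha * b for a, b in zip(w0, w1)]
    return merged


# Hypothetical toy checkpoints with a single parameter tensor.
zero_shot = {"w": [0.0, 2.0]}
fine_tuned = {"w": [1.0, 4.0]}
merged = interpolate_weights(zero_shot, fine_tuned, alpha=0.5)
```

    Because the result is a single set of weights, inference costs the same as for either original model, which is consistent with the abstract's claim of no additional computational cost.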
    Graph Ordering Attention Networks. (arXiv:2204.05351v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have been successfully used in many problems involving graph-structured data, achieving state-of-the-art performance. GNNs typically employ a message-passing scheme, in which every node aggregates information from its neighbors using a permutation-invariant aggregation function. Standard well-examined choices such as the mean or sum aggregation functions have limited capabilities, as they are not able to capture interactions among neighbors. In this work, we formalize these interactions using an information-theoretic framework that notably includes synergistic information. Driven by this definition, we introduce the Graph Ordering Attention (GOAT) layer, a novel GNN component that captures interactions between nodes in a neighborhood. This is achieved by learning local node orderings via an attention mechanism and processing the ordered representations using a recurrent neural network aggregator. This design allows us to make use of a permutation-sensitive aggregator while maintaining the permutation-equivariance of the proposed GOAT layer. The GOAT model demonstrates its increased performance in modeling graph metrics that capture complex information, such as the betweenness centrality and the effective size of a node. In practical use-cases, its superior modeling capability is confirmed through its success in several real-world node classification benchmarks.
    EXACT: How to Train Your Accuracy. (arXiv:2205.09615v2 [cs.LG] UPDATED)
    Classification tasks are usually evaluated in terms of accuracy. However, accuracy is discontinuous and cannot be directly optimized using gradient ascent. Popular methods minimize cross-entropy, Hinge loss, or other surrogate losses, which can lead to suboptimal results. In this paper, we propose a new optimization framework by introducing stochasticity to a model's output and optimizing expected accuracy, i.e. accuracy of the stochastic model. Extensive experiments on image classification show that the proposed optimization method is a powerful alternative to widely used classification losses.
    Discriminative Bayesian filtering lends momentum to the stochastic Newton method for minimizing log-convex functions. (arXiv:2104.12949v2 [stat.ML] UPDATED)
    To minimize the average of a set of log-convex functions, the stochastic Newton method iteratively updates its estimate using subsampled versions of the full objective's gradient and Hessian. We contextualize this optimization problem as sequential Bayesian inference on a latent state-space model with a discriminatively-specified observation process. Applying Bayesian filtering then yields a novel optimization algorithm that considers the entire history of gradients and Hessians when forming an update. We establish matrix-based conditions under which the effect of older observations diminishes over time, in a manner analogous to Polyak's heavy ball momentum. We illustrate various aspects of our approach with an example and review other relevant innovations for the stochastic Newton method.
    Predicting Human Performance in Vertical Hierarchical Menu Selection in Immersive AR Using Hand-gesture and Head-gaze. (arXiv:2206.09480v1 [cs.HC] CROSS LISTED)
    There are currently limited guidelines on designing user interfaces (UI) for immersive augmented reality (AR) applications. Designers must reflect on their experience designing UI for desktop and mobile applications and conjecture how a UI will influence AR users' performance. In this work, we introduce a predictive model for determining users' performance for a target UI without the subsequent involvement of participants in user studies. The model is trained on participants' responses to objective performance measures such as consumed endurance (CE) and pointing time (PT) using hierarchical drop-down menus. Large variability in the depth and context of the menus is ensured by randomly and dynamically creating the hierarchical drop-down menus and associated user tasks from words contained in the lexical database WordNet. Subjective performance bias is reduced by incorporating the users' non-verbal standard performance WAIS-IV during the model training. The semantic information of the menu is encoded using the Universal Sentence Encoder. We present the results of a user study that demonstrates that the proposed predictive model achieves high accuracy in predicting the CE on hierarchical menus of users with various cognitive abilities. To the best of our knowledge, this is the first work on predicting CE in designing UI for immersive AR applications.
    Intuitive Shape Editing in Latent Space. (arXiv:2111.12488v3 [cs.CV] UPDATED)
    The use of autoencoders for shape editing or generation through latent space manipulation suffers from unpredictable changes in the output shape. Our autoencoder-based method enables intuitive shape editing in latent space by disentangling latent sub-spaces into style variables and control points on the surface that can be manipulated independently. The key idea is adding a Lipschitz-type constraint to the loss function, i.e. bounding the change of the output shape proportionally to the change in latent space, leading to interpretable latent space representations. The control points on the surface that are part of the latent code of an object can then be freely moved, allowing for intuitive shape editing directly in latent space. We evaluate our method by comparing to state-of-the-art data-driven shape editing methods. We further demonstrate the expressiveness of our learned latent space by leveraging it for unsupervised part segmentation.
    Descent Steps of a Relation-Aware Energy Produce Heterogeneous Graph Neural Networks. (arXiv:2206.11081v1 [cs.LG])
    Heterogeneous graph neural networks (GNNs) achieve strong performance on node classification tasks in a semi-supervised learning setting. However, as in the simpler homogeneous GNN case, message-passing-based heterogeneous GNNs may struggle to balance between resisting the oversmoothing occurring in deep models and capturing long-range dependencies in graph-structured data. Moreover, the complexity of this trade-off is compounded in the heterogeneous graph case due to the disparate heterophily relationships between nodes of different types. To address these issues, we propose a novel heterogeneous GNN architecture in which layers are derived from optimization steps that descend a novel relation-aware energy function. The corresponding minimizer is fully differentiable with respect to the energy function parameters, such that bilevel optimization can be applied to effectively learn a functional form whose minimum provides optimal node representations for subsequent classification tasks. In particular, this methodology allows us to model diverse heterophily relationships between different node types while avoiding oversmoothing effects. Experimental results on 8 heterogeneous graph benchmarks demonstrate that our proposed method can achieve competitive node classification accuracy.
    Correct and Certify: A New Approach to Self-Supervised 3D-Object Perception. (arXiv:2206.11215v1 [cs.CV])
    We consider an object pose estimation and model fitting problem, where - given a partial point cloud of an object - the goal is to estimate the object pose by fitting a CAD model to the sensor data. We solve this problem by combining (i) a semantic keypoint-based pose estimation model, (ii) a novel self-supervised training approach, and (iii) a certification procedure that not only verifies whether the output produced by the model is correct, but also flags uniqueness of the produced solution. The semantic keypoint detector model is initially trained in simulation and does not perform well on real data due to the domain gap. Our self-supervised training procedure uses a corrector and a certification module to improve the detector. The corrector module corrects the detected keypoints to compensate for the domain gap, and is implemented as a declarative layer, for which we develop a simple differentiation rule. The certification module declares whether the corrected output produced by the model is certifiable (i.e. correct) or not. At each iteration, the approach optimizes over the loss induced only by the certifiable input-output pairs. As training progresses, we see that the fraction of outputs that are certifiable increases, eventually reaching near $100\%$ in many cases. We also introduce the notion of strong certifiability wherein the model can determine if the predicted object model fit is unique or not. The detected semantic keypoints help us implement this in the forward pass. We conduct extensive experiments to evaluate the performance of the corrector, the certification, and the proposed self-supervised training using the ShapeNet and YCB datasets, and show the proposed approach achieves performance comparable to fully supervised baselines while not requiring pose or keypoint supervision on real data.
    Margin Calibration for Long-Tailed Visual Recognition. (arXiv:2112.07225v4 [cs.CV] UPDATED)
    The long-tailed class distribution in visual recognition tasks poses great challenges for neural networks on how to handle the biased predictions between head and tail classes, i.e., the model tends to classify tail classes as head classes. While existing research focused on data resampling and loss function engineering, in this paper, we take a different perspective: the classification margins. We study the relationship between the margins and logits (classification scores) and empirically observe the biased margins and the biased logits are positively correlated. We propose MARC, a simple yet effective MARgin Calibration function to dynamically calibrate the biased margins for unbiased logits. We validate MARC through extensive experiments on common long-tailed benchmarks including CIFAR-LT, ImageNet-LT, Places-LT, and iNaturalist-LT. Experimental results demonstrate that our MARC achieves favorable results on these benchmarks. In addition, MARC is extremely easy to implement with just three lines of code. We hope this simple method will motivate people to rethink the biased margins and biased logits in long-tailed visual recognition.
    Towards Unsupervised Content Disentanglement in Sentence Representations via Syntactic Roles. (arXiv:2206.11184v1 [cs.CL])
    Linking neural representations to linguistic factors is crucial in order to build and analyze NLP models interpretable by humans. Among these factors, syntactic roles (e.g. subjects, direct objects,$\dots$) and their realizations are essential markers since they can be understood as a decomposition of predicative structures and thus the meaning of sentences. Starting from a deep probabilistic generative model with attention, we measure the interaction between latent variables and realizations of syntactic roles and show that it is possible to obtain, without supervision, representations of sentences where different syntactic roles correspond to clearly identified different latent variables. The probabilistic model we propose is an Attention-Driven Variational Autoencoder (ADVAE). Drawing inspiration from Transformer-based machine translation models, ADVAEs enable the analysis of the interactions between latent variables and input tokens through attention. We also develop an evaluation protocol to measure disentanglement with regard to the realizations of syntactic roles. This protocol is based on attention maxima for the encoder and on latent variable perturbations for the decoder. Our experiments on raw English text from the SNLI dataset show that $\textit{i)}$ disentanglement of syntactic roles can be induced without supervision, $\textit{ii)}$ ADVAE separates syntactic roles better than classical sequence VAEs and Transformer VAEs, $\textit{iii)}$ realizations of syntactic roles can be separately modified in sentences by mere intervention on the associated latent variables. Our work constitutes a first step towards unsupervised controllable content generation. The code for our work is publicly available.
    Model-Based Deep Learning: On the Intersection of Deep Learning and Optimization. (arXiv:2205.02640v2 [eess.SP] UPDATED)
    Decision making algorithms are used in a multitude of different applications. Conventional approaches for designing decision algorithms employ principled and simplified modelling, based on which one can determine decisions via tractable optimization. More recently, deep learning approaches that use highly parametric architectures tuned from data without relying on mathematical models, are becoming increasingly popular. Model-based optimization and data-centric deep learning are often considered to be distinct disciplines. Here, we characterize them as edges of a continuous spectrum varying in specificity and parameterization, and provide a tutorial-style presentation to the methodologies lying in the middle ground of this spectrum, referred to as model-based deep learning. We accompany our presentation with running examples in super-resolution and stochastic control, and show how they are expressed using the provided characterization and specialized in each of the detailed methodologies. The gains of combining model-based optimization and deep learning are demonstrated using experimental results in various applications, ranging from biomedical imaging to digital communications.
    Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer. (arXiv:2206.11053v1 [cs.CV])
    Visual question answering (VQA) in surgery is largely unexplored. Expert surgeons are scarce and are often overloaded with clinical and academic workloads. This overload often limits their time answering questionnaires from patients, medical students or junior residents related to surgical procedures. At times, students and junior residents also refrain from asking too many questions during classes to reduce disruption. While computer-aided simulators and recordings of past surgical procedures have been made available for them to observe and improve their skills, they still rely heavily on medical experts to answer their questions. Having a Surgical-VQA system as a reliable 'second opinion' could act as a backup and ease the load on the medical experts in answering these questions. The lack of annotated medical data and the presence of domain-specific terms have limited the exploration of VQA for surgical procedures. In this work, we design a Surgical-VQA task that answers questionnaires on surgical procedures based on the surgical scene. Extending the MICCAI endoscopic vision challenge 2018 dataset and workflow recognition dataset further, we introduce two Surgical-VQA datasets with classification and sentence-based answers. To perform Surgical-VQA, we employ vision-text transformer models. We further introduce a residual MLP-based VisualBert encoder model that enforces interaction between visual and text tokens, improving performance in classification-based answering. Furthermore, we study the influence of the number of input image patches and temporal visual features on the model performance in both classification and sentence-based answering.
    Information Geometry of Dropout Training. (arXiv:2206.10936v1 [stat.ML])
    Dropout is one of the most popular regularization techniques in neural network training. Because of its power and simplicity of idea, dropout has been analyzed extensively and many variants have been proposed. In this paper, several properties of dropout are discussed in a unified manner from the viewpoint of information geometry. We show that dropout flattens the model manifold and that its regularization performance depends on the amount of the curvature. Then, we show that dropout essentially corresponds to a regularization that depends on the Fisher information, and support this result with numerical experiments. Such a theoretical analysis of the technique from a different perspective is expected to greatly assist in the understanding of neural networks, which is still in its infancy.
    World of Bugs: A Platform for Automated Bug Detection in 3D Video Games. (arXiv:2206.11037v1 [cs.SE])
    We present World of Bugs (WOB), an open platform that aims to support Automated Bug Detection (ABD) research in video games. We discuss some open problems in ABD and how they relate to the platform's design, arguing that learning-based solutions are required if further progress is to be made. The platform's key feature is a growing collection of common video game bugs that may be used for training and evaluating ABD approaches.
    Multi-task twin support vector machine with Universum data. (arXiv:2206.10978v1 [cs.LG])
    Multi-task learning (MTL) has emerged as a promising topic of machine learning in recent years, aiming to enhance the performance of numerous related learning tasks by exploiting beneficial information. During the training phase, most of the existing multi-task learning models concentrate entirely on the target task data and ignore the non-target task data contained in the target tasks. To address this issue, Universum data, that do not correspond to any class of a classification problem, may be used as prior knowledge in the training model. This study looks at the challenge of multi-task learning using Universum data to employ non-target task data, which leads to better performance. It proposes a multi-task twin support vector machine with Universum data (UMTSVM) and provides two approaches to its solution. The first approach takes into account the dual formulation of UMTSVM and tries to solve a quadratic programming problem. The second approach formulates a least-squares version of UMTSVM and refers to it as LS-UMTSVM to further increase the generalization performance. The solution of the two primal problems in LS-UMTSVM is simplified to solving just two systems of linear equations, resulting in an incredibly simple and quick approach. Numerical experiments on several popular multi-task data sets and medical data sets demonstrate the efficiency of the proposed methods.
    Auto-Encoding Adversarial Imitation Learning. (arXiv:2206.11004v1 [cs.LG])
    Reinforcement learning (RL) provides a powerful framework for decision-making, but its application in practice often requires a carefully designed reward function. Adversarial Imitation Learning (AIL) sheds light on automatic policy acquisition without access to the reward signal from the environment. In this work, we propose Auto-Encoding Adversarial Imitation Learning (AEAIL), a robust and scalable AIL framework. To induce expert policies from demonstrations, AEAIL utilizes the reconstruction error of an auto-encoder as a reward signal, which provides more information for optimizing policies than the prior discriminator-based ones. Subsequently, we use the derived objective functions to train the auto-encoder and the agent policy. Experiments show that our AEAIL outperforms state-of-the-art methods in the MuJoCo environments. More importantly, AEAIL shows much better robustness when the expert demonstrations are noisy. Specifically, our method achieves $16.4\%$ and $47.2\%$ relative improvement overall compared to the best baseline FAIRL and PWIL on clean and noisy expert data, respectively. Video results, open-source code and dataset are available at https://sites.google.com/view/auto-encoding-imitation.
    Fighting Fire with Fire: Avoiding DNN Shortcuts through Priming. (arXiv:2206.10816v1 [cs.LG])
    Across applications spanning supervised classification and sequential control, deep learning has been reported to find "shortcut" solutions that fail catastrophically under minor changes in the data distribution. In this paper, we show empirically that DNNs can be coaxed to avoid poor shortcuts by providing an additional "priming" feature computed from key input features, usually a coarse output estimate. Priming relies on approximate domain knowledge of these task-relevant key input features, which is often easy to obtain in practical settings. For example, one might prioritize recent frames over past frames in a video input for visual imitation learning, or salient foreground over background pixels for image classification. On NICO image classification, MuJoCo continuous control, and CARLA autonomous driving, our priming strategy works significantly better than several popular state-of-the-art approaches for feature selection and data augmentation. We connect these empirical findings to recent theoretical results on DNN optimization, and argue theoretically that priming distracts the optimizer away from poor shortcuts by creating better, simpler shortcuts.
    AI-based software for lung nodule detection in chest X-rays -- Time for a second reader approach?. (arXiv:2206.10912v1 [eess.IV])
    Objectives: To compare artificial intelligence (AI) as a second reader in detecting lung nodules on chest X-rays (CXR) versus radiologists of two binational institutions, and to evaluate AI performance when using two different modes: automated versus assisted (additional remote radiologist review). Methods: The CXR public database (n = 247) of the Japanese Society of Radiological Technology with various types and sizes of lung nodules was analyzed. Eight radiologists evaluated the CXR images with regard to the presence of lung nodules and nodule conspicuity. After radiologist review, the AI software processed and flagged the CXR with the highest probability of missed nodules. The calculated accuracy metrics were the area under the curve (AUC), sensitivity, specificity, F1 score, false negative case number (FN), and the effect of different AI modes (automated/assisted) on the accuracy of nodule detection. Results: For radiologists, the average AUC value was 0.77 $\pm$ 0.07, while the average FN was 52.63 $\pm$ 17.53 (all studies) and 32 $\pm$ 11.59 (studies containing a nodule of malignant etiology = 32% rate of missed malignant nodules). Both AI modes -- automated and assisted -- produced an average increase in sensitivity (by 14% and 12%) and of F1-score (5% and 6%) and a decrease in specificity (by 10% and 3%, respectively). Conclusions: Both AI modes flagged the pulmonary nodules missed by radiologists in a significant number of cases. AI as a second reader has a high potential to improve diagnostic accuracy and radiology workflow. AI might detect certain pulmonary nodules earlier than radiologists, with a potentially significant impact on patient outcomes.
    A Study on the Evaluation of Generative Models. (arXiv:2206.10935v1 [cs.LG])
    Implicit generative models, which do not return likelihood values, such as generative adversarial networks and diffusion models, have become prevalent in recent years. While it is true that these models have shown remarkable results, evaluating their performance is challenging. This issue is of vital importance to push research forward and distinguish meaningful gains from random noise. Currently, heuristic metrics such as the Inception score (IS) and Frechet Inception Distance (FID) are the most common evaluation metrics, but what they measure is not entirely clear. Additionally, there are questions regarding how meaningful their scores actually are. In this work, we study the evaluation metrics of generative models by generating a high-quality synthetic dataset on which we can estimate classical metrics for comparison. Our study shows that while FID and IS do correlate with several f-divergences, their ranking of close models can vary considerably, making them problematic when used for fine-grained comparison. We further use this experimental setting to study which evaluation metric best correlates with our probabilistic metrics. Lastly, we look into the base features used for metrics such as FID.
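    For reference, the two heuristic metrics named above are commonly defined as follows. This is the standard textbook form, not something taken from the paper itself; the symbols ($\mu_r, \Sigma_r$ for the mean and covariance of Inception features on real data, $\mu_g, \Sigma_g$ for generated data, $p(y \mid x)$ for the Inception classifier's label distribution, $p_g$ for the generator's distribution) are notation introduced here for illustration.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right),
\qquad
\mathrm{IS} = \exp\!\left( \mathbb{E}_{x \sim p_g}\,
  \mathrm{KL}\!\left( p(y \mid x) \,\Vert\, p(y) \right) \right),
\quad p(y) = \mathbb{E}_{x \sim p_g}\, p(y \mid x).
```

    Both quantities depend on a pretrained Inception network, which is one reason the choice of base features matters for the comparison described in the abstract.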
    FairGrad: Fairness Aware Gradient Descent. (arXiv:2206.10923v1 [cs.LG])
    We tackle the problem of group fairness in classification, where the objective is to learn models that do not unjustly discriminate against subgroups of the population. Most existing approaches are limited to simple binary tasks or involve difficult-to-implement training mechanisms. This reduces their practical applicability. In this paper, we propose FairGrad, a method to enforce fairness based on a reweighting scheme that iteratively learns group-specific weights based on whether the groups are advantaged or not. FairGrad is easy to implement and can accommodate various standard fairness definitions. Furthermore, we show that it is comparable to standard baselines over various datasets including ones used in natural language processing and computer vision.
    Automated GI tract segmentation using deep learning. (arXiv:2206.11048v1 [eess.IV])
    The job of radiation oncologists is to deliver X-ray beams pointed toward the tumor while avoiding the stomach and intestines. With MR-Linacs (magnetic resonance imaging and linear accelerator systems), oncologists can visualize the position of the tumor and deliver a precise dose according to tumor cell presence, which can vary from day to day. The current job is to outline the position of the stomach and intestines in order to adjust the direction of the X-ray beams so that the dose is delivered to the tumor while avoiding the organs. This is a time-consuming and labor-intensive process that can easily prolong treatments from 15 minutes to an hour a day unless deep learning methods can automate the segmentation process. This paper discusses an automated segmentation process using deep learning to make this process faster and allow more patients to get effective treatment.
    Defect Prediction Using Stylistic Metrics. (arXiv:2206.10959v1 [cs.SE])
    Defect prediction is one of the most popular research topics due to its potential to minimize software quality assurance efforts. Existing approaches have examined defect prediction from various perspectives such as complexity and developer metrics. However, none of these consider programming style for defect prediction. This paper aims at analyzing the impact of stylistic metrics on both within-project and cross-project defect prediction. For prediction, 4 widely used machine learning algorithms, namely Naive Bayes, Support Vector Machine, Decision Tree and Logistic Regression, are used. The experiment is conducted on 14 releases of 5 popular, open source projects. F1, Precision and Recall are inspected to evaluate the results. Results reveal that stylistic metrics are a good predictor of defects.
    Quantization Robust Federated Learning for Efficient Inference on Heterogeneous Devices. (arXiv:2206.10844v1 [cs.LG])
    Federated Learning (FL) is a machine learning paradigm to distributively learn machine learning models from decentralized data that remains on-device. Despite the success of standard federated optimization methods, such as Federated Averaging (FedAvg) in FL, the energy demands and hardware-induced constraints for on-device learning have not been considered sufficiently in the literature. Specifically, an essential demand for on-device learning is to enable trained models to be quantized to various bit-widths based on the energy needs and heterogeneous hardware designs across the federation. In this work, we introduce multiple variants of the federated averaging algorithm that train neural networks robust to quantization. Such networks can be quantized to various bit-widths with only limited reduction in full precision model accuracy. We perform extensive experiments on standard FL benchmarks to evaluate our proposed FedAvg variants for quantization robustness and provide a convergence analysis for our Quantization-Aware variants in FL. Our results demonstrate that integrating quantization robustness results in FL models that are significantly more robust to different bit-widths during quantized on-device inference.
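    As background for the abstract above, here is a minimal sketch of the plain FedAvg aggregation step followed by a simple uniform quantizer. It is not the paper's quantization-robust variant, whose details the abstract does not give. The function names `fedavg` and `quantize_uniform`, the min/max uniform quantization scheme, and the toy numbers are all assumptions made for illustration.

```python
from typing import List, Sequence


def fedavg(client_weights: Sequence[Sequence[float]],
           client_sizes: Sequence[int]) -> List[float]:
    """Average client parameter vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def quantize_uniform(weights: Sequence[float], bits: int) -> List[float]:
    """Hypothetical quantizer: snap each weight to one of 2**bits evenly
    spaced levels between the minimum and maximum weight."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        # A constant vector needs no quantization grid.
        return list(weights)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((w - lo) / step) * step for w in weights]


# Two toy clients holding 1 and 3 samples respectively (made-up values).
global_weights = fedavg([[0.0, 1.0], [1.0, 3.0]], client_sizes=[1, 3])
# The same aggregated model, quantized to a 2-bit grid for a low-power device.
quantized = quantize_uniform(global_weights, bits=2)
```

    The sketch shows the setting the abstract targets: one aggregated model that different devices may quantize to different bit-widths after training.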
    How to Combine Variational Bayesian Networks in Federated Learning. (arXiv:2206.10897v1 [cs.LG])
    Federated Learning enables multiple data centers to train a central model collaboratively without exposing any confidential data. Even though deterministic models are capable of achieving high prediction accuracy, their lack of calibration and inability to quantify uncertainty are problematic for safety-critical applications. Different from deterministic models, probabilistic models such as Bayesian neural networks are relatively well-calibrated and able to quantify uncertainty alongside their competitive prediction accuracy. Both approaches appear in the federated learning framework; however, the aggregation scheme of deterministic models cannot be directly applied to probabilistic models since weights correspond to distributions instead of point estimates. In this work, we study the effects of various aggregation schemes for variational Bayesian neural networks. With empirical results on three image classification datasets, we observe that the degree of spread for an aggregated distribution is a significant factor in the learning process. Hence, we present an investigation on the question of how to combine variational Bayesian networks in federated learning, while providing benchmarks for different aggregation settings.
    List-Decodable Covariance Estimation. (arXiv:2206.10942v1 [cs.DS])
    We give the first polynomial time algorithm for \emph{list-decodable covariance estimation}. For any $\alpha > 0$, our algorithm takes as input a sample $Y \subseteq \mathbb{R}^d$ of size $n\geq d^{\mathsf{poly}(1/\alpha)}$ obtained by adversarially corrupting $(1-\alpha)n$ points in an i.i.d. sample $X$ of size $n$ from the Gaussian distribution with unknown mean $\mu_*$ and covariance $\Sigma_*$. In $n^{\mathsf{poly}(1/\alpha)}$ time, it outputs a constant-size list of $k = k(\alpha)= (1/\alpha)^{\mathsf{poly}(1/\alpha)}$ candidate parameters that, with high probability, contains a $(\hat{\mu},\hat{\Sigma})$ such that the total variation distance $TV(\mathcal{N}(\mu_*,\Sigma_*),\mathcal{N}(\hat{\mu},\hat{\Sigma}))<1-O_{\alpha}(1)$. This is the statistically strongest notion of distance and implies multiplicative spectral and relative Frobenius distance approximation for parameters with dimension independent error. Our algorithm works more generally for $(1-\alpha)$-corruptions of any distribution $D$ that possesses low-degree sum-of-squares certificates of two natural analytic properties: 1) anti-concentration of one-dimensional marginals and 2) hypercontractivity of degree 2 polynomials. Prior to our work, the only known results for estimating covariance in the list-decodable setting were for the special cases of list-decodable linear regression and subspace recovery due to Karmarkar, Klivans, and Kothari (2019), Raghavendra and Yau (2019 and 2020) and Bakshi and Kothari (2020). These results need superpolynomial time for obtaining any subconstant error in the underlying dimension. Our result implies the first polynomial-time \emph{exact} algorithm for list-decodable linear regression and subspace recovery that allows one, in particular, to obtain $2^{-\mathsf{poly}(d)}$ error in polynomial time. Our result also implies an improved algorithm for clustering non-spherical mixtures.
    Neural Networks as Paths through the Space of Representations. (arXiv:2206.10999v1 [cs.LG])
    Deep neural networks implement a sequence of layer-by-layer operations that are each relatively easy to understand, but the resulting overall computation is generally difficult to understand. We develop a simple idea for interpreting the layer-by-layer construction of useful representations: the role of each layer is to reformat information to reduce the "distance" to the target outputs. We formalize this intuitive idea of "distance" by leveraging recent work on metric representational similarity, and show how it leads to a rich space of geometric concepts. With this framework, the layer-wise computation implemented by a deep neural network can be viewed as a path in a high-dimensional representation space. We develop tools to characterize the geometry of these paths in terms of distances, angles, and geodesics. We then ask three sets of questions of residual networks trained on CIFAR-10: (1) how straight are paths, and how does each layer contribute towards the target? (2) how do these properties emerge over training? and (3) how similar are the paths taken by wider versus deeper networks? We conclude by sketching additional ways that this kind of representational geometry can be used to understand and interpret network training, or to prescriptively improve network architectures to suit a task.
    ROSE: A RObust and SEcure DNN Watermarking. (arXiv:2206.11024v1 [cs.CR])
    Protecting the Intellectual Property rights of DNN models is of primary importance prior to their deployment. So far, the proposed methods either necessitate changes to internal model parameters or the machine learning pipeline, or they fail to meet both the security and robustness requirements. This paper proposes a lightweight, robust, and secure black-box DNN watermarking protocol that takes advantage of cryptographic one-way functions as well as the injection of in-task key image-label pairs during the training process. These pairs are later used to prove DNN model ownership during testing. The main feature is that the value of the proof and its security are measurable. Extensive experiments, which watermark image classification models for various datasets and expose them to a variety of attacks, show that the protocol provides protection while maintaining an adequate level of security and robustness.
    Guided Diffusion Model for Adversarial Purification from Random Noise. (arXiv:2206.10875v1 [cs.LG])
    In this paper, we propose a novel guided diffusion purification approach to provide a strong defense against adversarial attacks. Our model achieves 89.62% robust accuracy under PGD-L_inf attack (eps = 8/255) on the CIFAR-10 dataset. We first explore the essential correlations between unguided diffusion models and randomized smoothing, enabling us to apply the models to certified robustness. The empirical results show that our models outperform randomized smoothing by 5% when the certified L2 radius r is larger than 0.5.
    S2TNet: Spatio-Temporal Transformer Networks for Trajectory Prediction in Autonomous Driving. (arXiv:2206.10902v1 [cs.CV])
    To safely and rationally participate in dense and heterogeneous traffic, autonomous vehicles need to sufficiently analyze the motion patterns of surrounding traffic-agents and accurately predict their future trajectories. This is challenging because the trajectories of traffic-agents are not only influenced by the traffic-agents themselves but also by spatial interaction with each other. Previous methods usually rely on the sequential step-by-step processing of Long Short-Term Memory networks (LSTMs) and merely extract the interactions between spatial neighbors for single-type traffic-agents. We propose the Spatio-Temporal Transformer Networks (S2TNet), which model the spatio-temporal interactions with a spatio-temporal Transformer and deal with the temporal sequences with a temporal Transformer. We input additional category, shape and heading information into our networks to handle the heterogeneity of traffic-agents. The proposed method outperforms state-of-the-art methods on the ApolloScape Trajectory dataset by more than 7\% on both the weighted sum of Average and Final Displacement Error. Our code is available at https://github.com/chenghuang66/s2tnet.
    Optical Flow Regularization of Implicit Neural Representations for Video Frame Interpolation. (arXiv:2206.10886v1 [cs.CV])
    Recent works have shown the ability of Implicit Neural Representations (INR) to carry meaningful representations of signal derivatives. In this work, we leverage this property to perform Video Frame Interpolation (VFI) by explicitly constraining the derivatives of the INR to satisfy the optical flow constraint equation. We achieve state-of-the-art VFI on limited motion ranges using only a target video and its optical flow, without learning the interpolation operator from additional training data. We further show that constraining the INR derivatives not only yields better interpolation of intermediate frames but also improves the ability of narrow networks to fit the observed frames, which suggests potential applications to video compression and INR optimization.
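    For readers unfamiliar with it, the optical flow constraint equation mentioned above is usually written in the standard brightness-constancy form below. Here $I(x, y, t)$ is image intensity and $(u, v)$ is the flow field. The second line is only a hedged guess at how such a constraint might be turned into a penalty on an INR $f_\theta(x, y, t)$: the symbol $f_\theta$, the squared-residual form and the name $\mathcal{L}_{\mathrm{flow}}$ are assumptions, since the abstract does not state the paper's actual loss.

```latex
\frac{\partial I}{\partial x}\,u + \frac{\partial I}{\partial y}\,v + \frac{\partial I}{\partial t} = 0,
\qquad
\mathcal{L}_{\mathrm{flow}}(\theta) = \mathbb{E}_{(x,y,t)}
  \left[ \left( \frac{\partial f_\theta}{\partial x}\,u
  + \frac{\partial f_\theta}{\partial y}\,v
  + \frac{\partial f_\theta}{\partial t} \right)^{2} \right].
```

    Because an INR is differentiable in its input coordinates, the partial derivatives in such a penalty can be obtained by automatic differentiation rather than finite differences between frames.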
    Traffic Congestion Prediction Using Machine Learning Techniques. (arXiv:2206.10983v1 [cs.LG])
    The prediction of traffic congestion can play a crucial role in making future decisions. Although many studies have been conducted regarding congestion, most of these could not cover all the important factors (e.g., weather conditions). We propose a prediction model for traffic congestion that can predict congestion based on day, time and several weather data (e.g., temperature, humidity). We evaluate our model against the traffic data of New Delhi. With this model, congestion of a road can be predicted one week ahead with an average RMSE of 1.12. Therefore, this model can be used to take preventive measures beforehand.
    Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models. (arXiv:2206.02246v2 [cs.SD] UPDATED)
    We present a novel way of conditioning a pretrained denoising diffusion speech model to produce speech in the voice of a novel person unseen during training. The method requires a short (~3 seconds) sample from the target person, and generation is steered at inference time, without any training steps. At the heart of the method lies a sampling process that combines the estimation of the denoising model with a low-pass version of the new speaker's sample. The objective and subjective evaluations show that our sampling method can generate a voice similar to that of the target speaker in terms of frequency, with an accuracy comparable to state-of-the-art methods, and without training.
    On the Impossibility of Learning to Cooperate with Adaptive Partner Strategies in Repeated Games. (arXiv:2206.10614v1 [cs.GT])
    Learning to cooperate with other agents is challenging when those agents also possess the ability to adapt to our own behavior. Practical and theoretical approaches to learning in cooperative settings typically assume that other agents' behaviors are stationary, or else make very specific assumptions about other agents' learning processes. The goal of this work is to understand whether we can reliably learn to cooperate with other agents without such restrictive assumptions, which are unlikely to hold in real-world applications. Our main contribution is a set of impossibility results, which show that no learning algorithm can reliably learn to cooperate with all possible adaptive partners in a repeated matrix game, even if that partner is guaranteed to cooperate with some stationary strategy. Motivated by these results, we then discuss potential alternative assumptions which capture the idea that an adaptive partner will only adapt rationally to our behavior.
    Learning Neuro-Symbolic Skills for Bilevel Planning. (arXiv:2206.10680v1 [cs.RO])
    Decision-making is challenging in robotics environments with continuous object-centric states, continuous actions, long horizons, and sparse feedback. Hierarchical approaches, such as task and motion planning (TAMP), address these challenges by decomposing decision-making into two or more levels of abstraction. In a setting where demonstrations and symbolic predicates are given, prior work has shown how to learn symbolic operators and neural samplers for TAMP with manually designed parameterized policies. Our main contribution is a method for learning parameterized policies in combination with operators and samplers. These components are packaged into modular neuro-symbolic skills and sequenced together with search-then-sample TAMP to solve new tasks. In experiments in four robotics domains, we show that our approach -- bilevel planning with neuro-symbolic skills -- can solve a wide range of tasks with varying initial states, goals, and objects, outperforming six baselines and ablations. Video: https://youtu.be/PbFZP8rPuGg Code: https://tinyurl.com/skill-learning
    Generational Differences in Automobility: Comparing America's Millennials and Gen Xers Using Gradient Boosting Decision Trees. (arXiv:2206.11056v1 [cs.LG])
    Whether the Millennials are less auto-centric than the previous generations has been widely discussed in the literature. Most existing studies use regression models and assume that all factors are linear-additive in contributing to the young adults' driving behaviors. This study relaxes this assumption by applying a non-parametric statistical learning method, namely the gradient boosting decision trees (GBDT). Using U.S. nationwide travel surveys for 2001 and 2017, this study examines the non-linear dose-response effects of lifecycle, socio-demographic and residential factors on daily driving distances of Millennial and Gen-X young adults. Holding all other factors constant, Millennial young adults had shorter predicted daily driving distances than their Gen-X counterparts. Besides, residential and economic factors explain around 50% of young adults' daily driving distances, while the collective contributions for life course events and demographics are about 33%. This study also identifies the density ranges for formulating effective land use policies aiming at reducing automobile travel demand.
    Performance Prediction Under Dataset Shift. (arXiv:2206.10697v1 [cs.LG])
    ML models deployed in production often have to face unknown domain changes, fundamentally different from their training settings. Performance prediction models carry out the crucial task of measuring the impact of these changes on model performance. We study the generalization capabilities of various performance prediction models to new domains by learning on generated synthetic perturbations. Empirical validation on a benchmark of ten tabular datasets shows that models based upon state-of-the-art shift detection metrics are not expressive enough to generalize to unseen domains, while Error Predictors bring a consistent improvement in performance prediction under shift. We additionally propose a natural and effortless uncertainty estimation of the predicted accuracy that ensures reliable use of performance predictors. Our implementation is available at https://github.com/dataiku-research/performance_prediction_under_shift.
    Multi-level Domain Adaptation for Lane Detection. (arXiv:2206.10692v1 [cs.CV])
    We focus on bridging domain discrepancy in lane detection among different scenarios to greatly reduce extra annotation and re-training costs for autonomous driving. A critical factor hindering performance improvement in cross-domain lane detection is that conventional methods focus only on pixel-wise loss while ignoring shape and position priors of lanes. To address the issue, we propose the Multi-level Domain Adaptation (MLDA) framework, a new perspective to handle cross-domain lane detection at three complementary semantic levels of pixel, instance and category. Specifically, at pixel level, we propose to apply cross-class confidence constraints in self-training to tackle the imbalanced confidence distribution of lane and background. At instance level, we go beyond pixels to treat segmented lanes as instances and facilitate discriminative features in target domain with triplet learning, which effectively rebuilds the semantic context of lanes and contributes to alleviating the feature confusion. At category level, we propose an adaptive inter-domain embedding module to utilize the position prior of lanes during adaptation. On two challenging datasets, i.e., TuSimple and CULane, our approach improves lane detection performance by a large margin, with gains of 8.8% on accuracy and 7.4% on F1-score respectively, compared with state-of-the-art domain adaptation algorithms.
    Sparse Kernel Gaussian Processes through Iterative Charted Refinement (ICR). (arXiv:2206.10634v1 [cs.LG])
    Gaussian Processes (GPs) are highly expressive, probabilistic models. A major limitation is their computational complexity. Naively, exact GP inference requires $\mathcal{O}(N^3)$ computations with $N$ denoting the number of modeled points. Current approaches to overcome this limitation either rely on sparse, structured or stochastic representations of data or kernel respectively and usually involve nested optimizations to evaluate a GP. We present a new, generative method named Iterative Charted Refinement (ICR) to model GPs on nearly arbitrarily spaced points in $\mathcal{O}(N)$ time for decaying kernels without nested optimizations. ICR represents long- as well as short-range correlations by combining views of the modeled locations at varying resolutions with a user-provided coordinate chart. In our experiment with points whose spacings vary over two orders of magnitude, ICR's accuracy is comparable to state-of-the-art GP methods. ICR outperforms existing methods in terms of computational speed by one order of magnitude on the CPU and GPU and has already been successfully applied to model a GP with $122$ billion parameters.
    Learning Continuous Rotation Canonicalization with Radial Beam Sampling. (arXiv:2206.10690v1 [cs.CV])
    Nearly all state-of-the-art vision models are sensitive to image rotations. Existing methods often compensate for missing inductive biases by using augmented training data to learn pseudo-invariances. Alongside the resource-demanding data inflation process, predictions often generalize poorly. The inductive biases inherent to convolutional neural networks allow for translation equivariance through kernels acting parallel to the horizontal and vertical axes of the pixel grid. This inductive bias, however, does not allow for rotation equivariance. We propose a radial beam sampling strategy along with radial kernels operating on these beams to inherently incorporate center-rotation covariance. Together with an angle distance loss, we present a radial beam-based image canonicalization model, BIC for short. Our model allows for maximal continuous angle regression and canonicalizes arbitrary center-rotated input images. As a pre-processing model, this enables rotation-invariant vision pipelines with model-agnostic rotation-sensitive downstream predictions. We show that our end-to-end trained angle regressor is able to predict continuous rotation angles on several vision datasets, i.e. FashionMNIST, CIFAR10, COIL100, and LFW.
    TiCo: Transformation Invariance and Covariance Contrast for Self-Supervised Visual Representation Learning. (arXiv:2206.10698v1 [cs.CV])
    We present Transformation Invariance and Covariance Contrast (TiCo) for self-supervised visual representation learning. Similar to other recent self-supervised learning methods, our method is based on maximizing the agreement among embeddings of different distorted versions of the same image, which pushes the encoder to produce transformation invariant representations. To avoid the trivial solution where the encoder generates constant vectors, we regularize the covariance matrix of the embeddings from different images by penalizing low rank solutions. By jointly minimizing the transformation invariance loss and covariance contrast loss, we get an encoder that is able to produce useful representations for downstream tasks. We analyze our method and show that it can be viewed as a variant of MoCo with an implicit memory bank of unlimited size at no extra memory cost. This makes our method perform better than alternative methods when using small batch sizes. TiCo can also be seen as a modification of Barlow Twins. By connecting the contrastive and redundancy-reduction methods together, TiCo gives us new insights into how joint embedding methods work.
    Physics-informed machine learning with differentiable programming for heterogeneous underground reservoir pressure management. (arXiv:2206.10718v1 [physics.comp-ph])
    Avoiding over-pressurization in subsurface reservoirs is critical for applications like CO2 sequestration and wastewater injection. Managing the pressures by controlling injection/extraction is challenging because of complex heterogeneity in the subsurface. The heterogeneity typically requires high-fidelity physics-based models to make predictions on CO$_2$ fate. Furthermore, characterizing the heterogeneity accurately is fraught with parametric uncertainty. Accounting for both heterogeneity and uncertainty makes this a computationally intensive problem that is challenging for current reservoir simulators. To tackle this, we use differentiable programming with a full-physics model and machine learning to determine the fluid extraction rates that prevent over-pressurization at critical reservoir locations. We use the DPFEHM framework, which has trustworthy physics based on the standard two-point flux finite volume discretization and is also automatically differentiable like machine learning models. Our physics-informed machine learning framework uses convolutional neural networks to learn an appropriate extraction rate based on the permeability field. We also perform a hyperparameter search to improve the model's accuracy. Training and testing scenarios are executed to evaluate the feasibility of using physics-informed machine learning to manage reservoir pressures. We constructed and tested a sufficiently accurate simulator that is 400000 times faster than the underlying physics-based simulator, allowing for near real-time analysis and robust uncertainty quantification.
    Dynamic Restrained Uncertainty Weighting Loss for Multitask Learning of Vocal Expression. (arXiv:2206.11049v1 [cs.SD])
    We propose a novel Dynamic Restrained Uncertainty Weighting Loss to experimentally handle the problem of balancing the contributions of multiple tasks on the ICML ExVo 2022 Challenge. The multitask aims to recognize expressed emotions and demographic traits from vocal bursts jointly. Our strategy combines the advantages of Uncertainty Weight and Dynamic Weight Average, by extending weights with a restraint term to make the learning process more explainable. We use a lightweight multi-exit CNN architecture to implement our proposed loss approach. The experimental H-Mean score (0.394) shows a substantial improvement over the baseline H-Mean score (0.335).  ( 2 min )
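    The abstract above builds on two existing multitask weighting schemes, Uncertainty Weight and Dynamic Weight Average (DWA). The sketch below shows the commonly cited forms of those two ingredients only; it does not implement the paper's restraint term, whose form is not given in the abstract. The function names, the temperature value and the three toy task losses are illustrative assumptions.

```python
import math

def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Dynamic Weight Average: softmax over per-task loss ratios, scaled to sum to K."""
    # A ratio near 1 means the task loss barely decreased, so the task gets more weight.
    ratios = [a / b for a, b in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    k = len(ratios)
    return [k * e / sum(exps) for e in exps]

def uncertainty_weighted_loss(losses, log_vars):
    """Homoscedastic uncertainty weighting with s_i = log(sigma_i^2) per task."""
    # Equivalent to sum_i L_i / (2 * sigma_i^2) + log(sigma_i).
    return sum(0.5 * math.exp(-s) * l + 0.5 * s for l, s in zip(losses, log_vars))

# Three hypothetical task losses at the two previous epochs.
weights = dwa_weights([0.9, 0.5, 0.8], [1.0, 1.0, 1.0])
total = uncertainty_weighted_loss([0.9, 0.5, 0.8], [0.0, 0.0, 0.0])
```

    In practice the `log_vars` values would be trainable parameters rather than fixed zeros; they are held constant here only to keep the sketch dependency-free.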
    Learning Debiased Classifier with Biased Committee. (arXiv:2206.10843v1 [cs.LG])
    Neural networks are prone to be biased towards spurious correlations between classes and latent attributes exhibited in a major portion of training data, which ruins their generalization capability. This paper proposes a new method for training debiased classifiers with no spurious attribute label. The key idea of the method is to employ a committee of classifiers as an auxiliary module that identifies bias-conflicting data, i.e., data without spurious correlations, and assigns large weights to them when training the main classifier. The committee is learned as a bootstrapped ensemble so that a majority of its classifiers are biased as well as being diverse, and intentionally fail to predict classes of bias-conflicting data accordingly. The consensus within the committee on prediction difficulty thus provides a reliable cue for identifying and weighting bias-conflicting data. Moreover, the committee is also trained with knowledge transferred from the main classifier so that it gradually becomes debiased along with the main classifier and emphasizes more difficult data as training progresses. On five real-world datasets, our method outperforms existing methods using no spurious attribute label like ours and even surpasses those relying on bias labels occasionally.  ( 2 min )
    KiloNeuS: Implicit Neural Representations with Real-Time Global Illumination. (arXiv:2206.10885v1 [cs.CV])
    The latest trends in inverse rendering techniques for reconstruction use neural networks to learn 3D representations as neural fields. NeRF-based techniques fit multi-layer perceptrons (MLPs) to a set of training images to estimate a radiance field which can then be rendered from any virtual camera by means of volume rendering algorithms. Major drawbacks of these representations are the lack of well-defined surfaces and non-interactive rendering times, as wide and deep MLPs must be queried millions of times per single frame. These limitations have recently been singularly overcome, but managing to accomplish this simultaneously opens up new use cases. We present KiloNeuS, a new neural object representation that can be rendered in path-traced scenes at interactive frame rates. KiloNeuS enables the simulation of realistic light interactions between neural and classic primitives in shared scenes, and it demonstrably performs in real-time with plenty of room for future optimizations and extensions.  ( 2 min )
    $\texttt{FedBC}$: Calibrating Global and Local Models via Federated Learning Beyond Consensus. (arXiv:2206.10815v1 [cs.LG])
    In federated learning (FL), the objective of collaboratively learning a global model through aggregation of model updates across devices tends to oppose the goal of personalization via local information. In this work, we calibrate this tradeoff in a quantitative manner through a multi-criterion optimization-based framework, which we cast as a constrained program: the objective for a device is its local objective, which it seeks to minimize while satisfying nonlinear constraints that quantify the proximity between the local and the global model. By considering the Lagrangian relaxation of this problem, we develop an algorithm that allows each node to minimize its local component of the Lagrangian through queries to a first-order gradient oracle. Then, the server executes Lagrange multiplier ascent steps followed by a Lagrange multiplier-weighted averaging step. We call this instantiation of the primal-dual method Federated Learning Beyond Consensus ($\texttt{FedBC}$). Theoretically, we establish that $\texttt{FedBC}$ converges to a first-order stationary point at a rate that matches the state of the art, up to an additional error term that depends on the tolerance parameter that arises due to the proximity constraints. Overall, the analysis is a novel characterization of primal-dual methods applied to non-convex saddle point problems with nonlinear constraints. Finally, we demonstrate that $\texttt{FedBC}$ balances the global and local model test accuracy metrics across a suite of datasets (Synthetic, MNIST, CIFAR-10, Shakespeare), achieving competitive performance with the state of the art.  ( 3 min )
    Supermodular $\mf$-divergences and bounds on lossy compression and generalization error with mutual $\mf$-information. (arXiv:2206.11042v1 [cs.IT])
    In this paper, we introduce super-modular $\mf$-divergences and provide three applications for them: (i) we introduce Sanov's upper bound on the tail probability of sum of independent random variables based on super-modular $\mf$-divergence and show that our generalized Sanov's bound strictly improves over ordinary one, (ii) we consider the lossy compression problem which studies the set of achievable rates for a given distortion and code length. We extend the rate-distortion function using mutual $\mf$-information and provide new and strictly better bounds on achievable rates in the finite blocklength regime using super-modular $\mf$-divergences, and (iii) we provide a connection between the generalization error of algorithms with bounded input/output mutual $\mf$-information and a generalized rate-distortion problem. This connection allows us to bound the generalization error of learning algorithms using lower bounds on the rate-distortion function. Our bound is based on a new lower bound on the rate-distortion function that (for some examples) strictly improves over previously best-known bounds. Moreover, super-modular $\mf$-divergences are utilized to reduce the dimension of the problem and obtain single-letter bounds.
    POGEMA: Partially Observable Grid Environment for Multiple Agents. (arXiv:2206.10944v1 [cs.LG])
    We introduce POGEMA (https://github.com/AIRI-Institute/pogema), a sandbox for challenging partially observable multi-agent pathfinding (PO-MAPF) problems. This is a grid-based environment that was specifically designed to be a flexible, tunable and scalable benchmark. It can be tailored to a variety of PO-MAPF problems, which can serve as an excellent testing ground for planning and learning methods, and their combination, which will allow us to move towards filling the gap between AI planning and learning.  ( 2 min )
    Deep Reinforcement Learning for Turbulence Modeling in Large Eddy Simulations. (arXiv:2206.11038v1 [physics.flu-dyn])
    Over the last years, supervised learning (SL) has established itself as the state-of-the-art for data-driven turbulence modeling. In the SL paradigm, models are trained based on a dataset, which is typically computed a priori from a high-fidelity solution by applying the respective filter function, which separates the resolved and the unresolved flow scales. For implicitly filtered large eddy simulation (LES), this approach is infeasible, since here, the employed discretization itself acts as an implicit filter function. As a consequence, the exact filter form is generally not known and thus, the corresponding closure terms cannot be computed even if the full solution is available. The reinforcement learning (RL) paradigm can be used to avoid this inconsistency by training not on a previously obtained training dataset, but instead by interacting directly with the dynamical LES environment itself. This allows the potentially complex implicit LES filter to be incorporated into the training process by design. In this work, we apply a reinforcement learning framework to find an optimal eddy-viscosity for implicitly filtered large eddy simulations of forced homogeneous isotropic turbulence. For this, we formulate the task of turbulence modeling as an RL task with a policy network based on convolutional neural networks that adapts the eddy-viscosity in LES dynamically in space and time based on the local flow state only. We demonstrate that the trained models can provide long-term stable simulations and that they outperform established analytical models in terms of accuracy. In addition, the models generalize well to other resolutions and discretizations. We thus demonstrate that RL can provide a framework for consistent, accurate and stable turbulence modeling especially for implicitly filtered LES.  ( 3 min )
    Robust Universal Adversarial Perturbations. (arXiv:2206.10858v1 [cs.LG])
    Universal Adversarial Perturbations (UAPs) are imperceptible, image-agnostic vectors that cause deep neural networks (DNNs) to misclassify inputs from a data distribution with high probability. Existing methods do not create UAPs robust to transformations, thereby limiting their applicability as real-world attacks. In this work, we introduce a new concept and formulation of robust universal adversarial perturbations. Based on our formulation, we build a novel, iterative algorithm that leverages probabilistic robustness bounds for generating UAPs robust against transformations generated by composing arbitrary sub-differentiable transformation functions. We perform an extensive evaluation on the popular CIFAR-10 and ILSVRC 2012 datasets measuring robustness under human-interpretable semantic transformations, such as rotation, contrast changes, etc., that are common in the real world. Our results show that our generated UAPs are significantly more robust than those from baselines.  ( 2 min )
    Influence of uncertainty estimation techniques on false-positive reduction in liver lesion detection. (arXiv:2206.10911v1 [eess.IV])
    Deep learning techniques show success in detecting objects in medical images, but still suffer from false-positive predictions that may hinder accurate diagnosis. The estimated uncertainty of the neural network output has been used to flag incorrect predictions. We study the role played by features computed from neural network uncertainty estimates and shape-based features computed from binary predictions in reducing false positives in liver lesion detection by developing a classification-based post-processing step for different uncertainty estimation methods. We demonstrate an improvement in the lesion detection performance of the neural network (with respect to F1-score) for all uncertainty estimation methods on two datasets, comprising abdominal MR and CT images respectively. We show that features computed from neural network uncertainty estimates tend not to contribute much toward reducing false positives. Our results show that factors like class imbalance (true over false positive ratio) and shape-based features extracted from uncertainty maps play an important role in distinguishing false positive from true positive predictions.
    Graph Neural Networks as Gradient Flows. (arXiv:2206.10991v1 [cs.LG])
    Dynamical systems minimizing an energy are ubiquitous in geometry and physics. We propose a gradient flow framework for GNNs where the equations follow the direction of steepest descent of a learnable energy. This approach allows us to explain the GNN evolution from a multi-particle perspective as learning attractive and repulsive forces in feature space via the positive and negative eigenvalues of a symmetric "channel-mixing" matrix. We perform spectral analysis of the solutions and conclude that gradient flow graph convolutional models can induce dynamics dominated by the graph high frequencies, which is desirable for heterophilic datasets. We also describe structural constraints on common GNN architectures that allow them to be interpreted as gradient flows. We perform thorough ablation studies corroborating our theoretical analysis and show competitive performance of simple and lightweight models on real-world homophilic and heterophilic datasets.
    Agent-based Graph Neural Networks. (arXiv:2206.11010v1 [cs.LG])
    We present a novel graph neural network we call AgentNet, which is designed specifically for graph-level tasks. AgentNet is inspired by sublinear algorithms, featuring a computational complexity that is independent of the graph size. The architecture of AgentNet differs fundamentally from the architectures of known graph neural networks. In AgentNet, some trained \textit{neural agents} intelligently walk the graph, and then collectively decide on the output. We provide an extensive theoretical analysis of AgentNet: We show that the agents can learn to systematically explore their neighborhood and that AgentNet can distinguish some structures that are even indistinguishable by 3-WL. Moreover, AgentNet is able to separate any two graphs which are sufficiently different in terms of subgraphs. We confirm these theoretical results with synthetic experiments on hard-to-distinguish graphs and real-world graph classification tasks. In both cases, we compare favorably not only to standard GNNs but also to computationally more expensive GNN extensions.
    A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement. (arXiv:2206.11000v1 [eess.AS])
    Speech enhancement has seen great improvement in recent years using end-to-end neural networks. However, most models are agnostic to the spoken phonetic content. Recently, several studies suggested phonetic-aware speech enhancement, mostly using perceptual supervision. Yet, injecting phonetic features during model optimization can take additional forms (e.g., model conditioning). In this paper, we conduct a systematic comparison between different methods of incorporating phonetic information in a speech enhancement model. By conducting a series of controlled experiments, we observe the influence of different phonetic content models as well as various feature-injection techniques on enhancement performance, considering both causal and non-causal models. Specifically, we evaluate three settings for injecting phonetic information, namely: i) feature conditioning; ii) perceptual supervision; and iii) regularization. Phonetic features are obtained using an intermediate layer of either a supervised pre-trained Automatic Speech Recognition (ASR) model or by using a pre-trained Self-Supervised Learning (SSL) model. We further observe the effect of choosing different embedding layers on performance, considering both manual and learned configurations. Results suggest that using an SSL model as phonetic features outperforms the ASR one in most cases. Interestingly, the conditioning setting performs best among the evaluated configurations.  ( 2 min )
    Decentralized Gossip-Based Stochastic Bilevel Optimization over Communication Networks. (arXiv:2206.10870v1 [stat.ML])
    Bilevel optimization has gained growing interest, with numerous applications found in meta learning, minimax games, reinforcement learning, and nested composition optimization. This paper studies the problem of distributed bilevel optimization over a network where agents can only communicate with neighbors, including examples from multi-task, multi-agent learning and federated learning. In this paper, we propose a gossip-based distributed bilevel learning algorithm that allows networked agents to solve both the inner and outer optimization problems in a single timescale and share information via network propagation. We show that our algorithm enjoys the $\mathcal{O}(\frac{1}{K \epsilon^2})$ per-agent sample complexity for general nonconvex bilevel optimization and $\mathcal{O}(\frac{1}{K \epsilon})$ for strongly convex objective, achieving a speedup that scales linearly with the network size. The sample complexities are optimal in both $\epsilon$ and $K$. We test our algorithm on the examples of hyperparameter tuning and decentralized reinforcement learning. Simulated experiments confirm that our algorithm achieves state-of-the-art training efficiency and test accuracy.  ( 2 min )
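    The gossip-based communication that the abstract above relies on can be illustrated with plain consensus averaging, in which each agent repeatedly mixes its value with those of its neighbors. This is a minimal sketch of that building block only, not the paper's bilevel algorithm; the ring topology, the mixing matrix `W` and the initial values are assumptions chosen for illustration.

```python
def gossip_step(values, mixing):
    """One gossip round: each agent takes a weighted average of itself and its neighbors."""
    n = len(values)
    return [sum(mixing[i][j] * values[j] for j in range(n)) for i in range(n)]

# Ring of 4 agents; each row mixes an agent with its two ring neighbors.
# The matrix is symmetric and doubly stochastic, so the network average is preserved.
W = [
    [0.5, 0.25, 0.0, 0.25],
    [0.25, 0.5, 0.25, 0.0],
    [0.0, 0.25, 0.5, 0.25],
    [0.25, 0.0, 0.25, 0.5],
]
values = [4.0, 0.0, 8.0, 0.0]  # hypothetical local quantities, network average 3.0
for _ in range(50):
    values = gossip_step(values, W)
```

    After enough rounds every agent holds (approximately) the network-wide average, even though no agent ever communicates beyond its immediate neighbors.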
    COVYT: Introducing the Coronavirus YouTube and TikTok speech dataset featuring the same speakers with and without infection. (arXiv:2206.11045v1 [eess.AS])
    More than two years after its outbreak, the COVID-19 pandemic continues to plague medical systems around the world, putting a strain on scarce resources, and claiming human lives. From the very beginning, various AI-based COVID-19 detection and monitoring tools have been pursued in an attempt to stem the tide of infections through timely diagnosis. In particular, computer audition has been suggested as a non-invasive, cost-efficient, and eco-friendly alternative for detecting COVID-19 infections through vocal sounds. However, like all AI methods, computer audition is also heavily dependent on the quantity and quality of available data, and large-scale COVID-19 sound datasets are difficult to acquire -- amongst other reasons -- due to the sensitive nature of such data. To that end, we introduce the COVYT dataset -- a novel COVID-19 dataset collected from public sources containing more than 8 hours of speech from 65 speakers. As compared to other existing COVID-19 sound datasets, the unique feature of the COVYT dataset is that it comprises both COVID-19 positive and negative samples from all 65 speakers. We analyse the acoustic manifestation of COVID-19 on the basis of these perfectly speaker characteristic balanced `in-the-wild' data using interpretable audio descriptors, and investigate several classification scenarios that shed light on proper partitioning strategies for a fair speech-based COVID-19 detection.
    Predicting Team Performance with Spatial Temporal Graph Convolutional Networks. (arXiv:2206.10720v1 [cs.LG])
    This paper presents a new approach for predicting team performance from the behavioral traces of a set of agents. This spatiotemporal forecasting problem is very relevant to sports analytics challenges such as coaching and opponent modeling. We demonstrate that our proposed model, Spatial Temporal Graph Convolutional Networks (ST-GCN), outperforms other classification techniques at predicting game score from a short segment of player movement and game features. Our proposed architecture uses a graph convolutional network to capture the spatial relationships between team members and Gated Recurrent Units to analyze dynamic motion information. An ablative evaluation was performed to demonstrate the contributions of different aspects of our architecture.
    SpA-Former: Transformer image shadow detection and removal via spatial attention. (arXiv:2206.10910v1 [cs.CV])
    In this paper, we propose an end-to-end SpA-Former to recover a shadow-free image from a single shaded image. Unlike traditional methods that require two steps, shadow detection followed by shadow removal, the SpA-Former unifies these steps into a one-stage network that directly learns the mapping function between shadowed and shadow-free images and does not require separate shadow detection. Thus, SpA-Former is adaptable to real image de-shadowing for shadows projected on different semantic regions. SpA-Former consists of a transformer layer, a series of joint Fourier transform residual blocks, and two-wheel joint spatial attention. The network is able to handle the task while achieving very fast processing. Our code is released at https://github.com/zhangbaijin/Spatial-Transformer-shadow-removal
    Robust Bayesian Recourse. (arXiv:2206.10833v1 [cs.LG])
    Algorithmic recourse aims to recommend informative feedback to overturn an unfavorable machine learning decision. In this paper, we introduce the Bayesian recourse, a model-agnostic recourse that minimizes the posterior probability odds ratio. Further, we present its min-max robust counterpart with the goal of hedging against future changes in the machine learning model parameters. The robust counterpart explicitly takes into account possible perturbations of the data in a Gaussian mixture ambiguity set prescribed using the optimal transport (Wasserstein) distance. We show that the resulting worst-case objective function can be decomposed into solving a series of two-dimensional optimization subproblems, and the min-max recourse finding problem is thus amenable to a gradient descent algorithm. Contrary to existing methods for generating robust recourses, the robust Bayesian recourse does not require a linear approximation step. The numerical experiment demonstrates the effectiveness of our proposed robust Bayesian recourse facing model shifts. Our code is available at https://github.com/VinAIResearch/robust-bayesian-recourse.
    Bregman Power k-Means for Clustering Exponential Family Data. (arXiv:2206.10860v1 [stat.ML])
    Recent progress in center-based clustering algorithms combats poor local minima by implicit annealing, using a family of generalized means. These methods are variations of Lloyd's celebrated $k$-means algorithm, and are most appropriate for spherical clusters such as those arising from Gaussian data. In this paper, we bridge these algorithmic advances to classical work on hard clustering under Bregman divergences, which enjoy a bijection to exponential family distributions and are thus well-suited for clustering objects arising from a breadth of data generating mechanisms. The elegant properties of Bregman divergences allow us to maintain closed form updates in a simple and transparent algorithm, and moreover lead to new theoretical arguments for establishing finite sample bounds that relax the bounded support assumption made in the existing state of the art. Additionally, we consider thorough empirical analyses on simulated experiments and a case study on rainfall data, finding that the proposed method outperforms existing peer methods in a variety of non-Gaussian data settings.
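    The two ingredients named in this abstract, Bregman divergences and generalized (power) means, can be written down in a few lines. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the choice of the power mean as the annealing device is an assumption based on the "family of generalized means" mentioned above.

```python
import numpy as np

def bregman_divergence(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad_phi(y), x - y>."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return phi(x) - phi(y) - float(np.dot(grad_phi(y), x - y))

# phi(v) = ||v||^2 recovers the squared Euclidean distance (Gaussian case).
def sq_norm(v):
    return float(np.dot(v, v))

def sq_norm_grad(v):
    return 2.0 * v

# phi(v) = sum(v log v - v) recovers the generalized KL divergence
# for strictly positive vectors (Poisson-type data).
def neg_entropy(v):
    return float(np.sum(v * np.log(v) - v))

def neg_entropy_grad(v):
    return np.log(v)

def power_mean(distances, s):
    """Generalized mean with exponent s; as s -> -inf it approaches min(distances)."""
    d = np.asarray(distances, dtype=float)
    return float(np.mean(d ** s) ** (1.0 / s))
```

    With a very negative exponent the power mean behaves like the minimum distance to a center, which is the hard-assignment k-means cost; milder exponents give a smoother surrogate.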
    Learning Distribution Grid Topologies: A Tutorial. (arXiv:2206.10837v1 [math.OC])
    Unveiling feeder topologies from data is of paramount importance to advance situational awareness and proper utilization of smart resources in power distribution grids. This tutorial summarizes, contrasts, and establishes useful links between recent works on topology identification and detection schemes that have been proposed for power distribution grids. The primary focus is to highlight methods that overcome the limited availability of measurement devices in distribution grids, while enhancing topology estimates using conservation laws of power-flow physics and structural properties of feeders. Grid data from phasor measurement units or smart meters can be collected either passively in the traditional way, or actively, upon actuating grid resources and measuring the feeder's voltage response. Analytical claims on feeder identifiability and detectability are reviewed under disparate meter placement scenarios. Such topology learning claims can be attained exactly or approximately via algorithmic solutions with various levels of computational complexity, ranging from least-squares fits to convex optimization problems, and from polynomial-time searches over graphs to mixed-integer programs. This tutorial aspires to provide researchers and engineers with knowledge of the current state-of-the-art in tractable distribution grid learning and insights into future directions of work.
    Diagnostic Tool for Out-of-Sample Model Evaluation. (arXiv:2206.10982v1 [stat.ML])
    Assessment of model fitness is an important step in many problems. Models are typically fitted to training data by minimizing a loss function, such as the squared-error or negative log-likelihood, and it is natural to desire low losses on future data. This letter considers the use of a test data set to characterize the out-of-sample losses of a model. We propose a simple model diagnostic tool that provides finite-sample guarantees under weak assumptions. The tool is computationally efficient and can be interpreted as an empirical quantile. Several numerical experiments are presented to show how the proposed method quantifies the impact of distribution shifts, aids the analysis of regression, and enables model selection as well as hyper-parameter tuning.
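    To make the "empirical quantile" reading concrete, here is a minimal sketch of an order-statistic bound on out-of-sample loss. It is not taken from the letter itself: the function name is hypothetical, and the guarantee relies on the standard assumption that the test losses and the future loss are exchangeable, in which case a new loss exceeds the returned value with probability at most alpha.

```python
import math

def loss_quantile_bound(test_losses, alpha):
    """Return the k-th smallest test loss with k = ceil((n + 1) * (1 - alpha)).

    Under exchangeability, a future loss exceeds this value with
    probability at most alpha. If n is too small for the requested
    level, no finite bound is available and infinity is returned.
    """
    n = len(test_losses)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(test_losses)[k - 1]
```

    Comparing this bound across candidate models or hyper-parameter settings is one way such a diagnostic could support model selection.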
    Play It Cool: Dynamic Shifting Prevents Thermal Throttling. (arXiv:2206.10849v1 [cs.LG])
    Machine learning (ML) has entered the mobile era where an enormous number of ML models are deployed on edge devices. However, running common ML models on edge devices continuously may generate excessive heat from the computation, forcing the device to "slow down" to prevent overheating, a phenomenon called thermal throttling. This paper studies the impact of thermal throttling on mobile phones: when it occurs, the CPU clock frequency is reduced, and the model inference latency may increase dramatically. This unpleasant inconsistent behavior has a substantial negative effect on user experience, but it has been overlooked for a long time. To counter thermal throttling, we propose to utilize dynamic networks with shared weights and dynamically shift between large and small ML models seamlessly according to their thermal profile, i.e., shifting to a small model when the system is about to throttle. With the proposed dynamic shifting, the application runs consistently without experiencing CPU clock frequency degradation and latency increase. In addition, we also study the resulting accuracy when dynamic shifting is deployed and show that our approach provides a reasonable trade-off between model latency and model accuracy.  ( 2 min )
    Multi-Omic Data Integration and Feature Selection for Survival-based Patient Stratification via Supervised Concrete Autoencoders. (arXiv:2206.10699v1 [cs.LG])
    Cancer is a complex disease with significant social and economic impact. Advancements in high-throughput molecular assays and the reduced cost of performing high-quality multi-omics measurements have fuelled insights through machine learning. Previous studies have shown promise in using multiple omic layers to predict survival and stratify cancer patients. In this paper, we developed a Supervised Autoencoder (SAE) model for survival-based multi-omic integration which improves upon previous work, and report a Concrete Supervised Autoencoder model (CSAE), which uses feature selection to jointly reconstruct the input features as well as predict survival. Our experiments show that our models outperform or are on par with some of the most commonly used baselines, while either providing a better survival separation (SAE) or being more interpretable (CSAE). We also perform a feature selection stability analysis on our models and notice that there is a power-law relationship with features which are commonly associated with survival. The code for this project is available at: https://github.com/phcavelar/coxae
    A consistent and flexible framework for deep matrix factorizations. (arXiv:2206.10693v1 [cs.LG])
    Deep matrix factorizations (deep MFs) are recent unsupervised data mining techniques inspired by constrained low-rank approximations. They aim to extract complex hierarchies of features within high-dimensional datasets. Most of the loss functions proposed in the literature to evaluate the quality of deep MF models and the underlying optimization frameworks are not consistent because different losses are used at different layers. In this paper, we introduce two meaningful loss functions for deep MF and present a generic framework to solve the corresponding optimization problems. We illustrate the effectiveness of this approach through the integration of various constraints and regularizations, such as sparsity, nonnegativity and minimum-volume. The models are successfully applied on both synthetic and real data, namely for hyperspectral unmixing and extraction of facial features.
    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation. (arXiv:2206.10789v1 [cs.CV])
    We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrate the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.  ( 2 min )
    Jointist: Joint Learning for Multi-instrument Transcription and Its Applications. (arXiv:2206.10805v1 [cs.SD])
    In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of the instrument recognition module that conditions the other modules: the transcription module that outputs instrument-specific piano rolls, and the source separation module that utilizes instrument information and transcription results. The instrument conditioning is designed for an explicit multi-instrument functionality while the connection between the transcription and source separation modules is for better transcription performance. Our challenging problem formulation makes the model highly useful in the real world given that modern popular music typically consists of multiple instruments. However, its novelty necessitates a new perspective on how to evaluate such a model. During the experiment, we assess the model from various aspects, providing a new evaluation perspective for multi-instrument transcription. We also argue that transcription models can be utilized as a preprocessing module for other music analysis tasks. In the experiment on several downstream tasks, the symbolic representation provided by our transcription model turned out to be helpful to spectrograms in solving downbeat detection, chord recognition, and key estimation.  ( 2 min )
    DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation. (arXiv:2206.10848v1 [cs.IR])
    Recently, one critical issue looms large in the field of recommender systems -- there are no effective benchmarks for rigorous evaluation -- which consequently leads to unreproducible evaluation and unfair comparison. We, therefore, conduct studies from the perspectives of practical theory and experiments, aiming at benchmarking recommendation for rigorous evaluation. Regarding the theoretical study, a series of hyper-factors affecting recommendation performance throughout the whole evaluation chain are systematically summarized and analyzed via an exhaustive review of 141 papers published at eight top-tier conferences within 2017-2020. We then classify them into model-independent and model-dependent hyper-factors, and different modes of rigorous evaluation are defined and discussed in-depth accordingly. For the experimental study, we release the DaisyRec 2.0 library by integrating these hyper-factors to perform rigorous evaluation, whereby a holistic empirical study is conducted to unveil the impacts of different hyper-factors on recommendation performance. Supported by the theoretical and experimental studies, we finally create benchmarks for rigorous evaluation by proposing standardized procedures and providing the performance of ten state-of-the-art methods across six evaluation metrics on six datasets as a reference for later study. Overall, our work sheds light on the issues in recommendation evaluation, provides potential solutions for rigorous evaluation, and lays a foundation for further investigation.  ( 2 min )
    Automated Cancer Subtyping via Vector Quantization Mutual Information Maximization. (arXiv:2206.10801v1 [cs.LG])
    Cancer subtyping is crucial for understanding the nature of tumors and providing suitable therapy. However, existing labelling methods are medically controversial, and have driven the process of subtyping away from teaching signals. Moreover, cancer genetic expression profiles are high-dimensional, scarce, and have complicated dependence, thereby posing a serious challenge to existing subtyping models for outputting sensible clustering. In this study, we propose a novel clustering method for exploiting genetic expression profiles and distinguishing subtypes in an unsupervised manner. The proposed method adaptively learns categorical correspondence from latent representations of expression profiles to the subtypes output by the model. By maximizing the problem-agnostic mutual information between input expression profiles and output subtypes, our method can automatically decide a suitable number of subtypes. Through experiments, we demonstrate that our proposed method can refine existing controversial labels, and, by further medical analysis, this refinement is proven to have a high correlation with cancer survival rates.  ( 2 min )
    Sharp Constants in Uniformity Testing via the Huber Statistic. (arXiv:2206.10722v1 [stat.ML])
    Uniformity testing is one of the most well-studied problems in property testing, with many known test statistics, including ones based on counting collisions, singletons, and the empirical TV distance. It is known that the optimal sample complexity to distinguish the uniform distribution on $m$ elements from any $\epsilon$-far distribution with $1-\delta$ probability is $n = \Theta\left(\frac{\sqrt{m \log (1/\delta)}}{\epsilon^2} + \frac{\log (1/\delta)}{\epsilon^2}\right)$, which is achieved by the empirical TV tester. Yet in simulation, these theoretical analyses are misleading: in many cases, they do not correctly rank order the performance of existing testers, even in an asymptotic regime of all parameters tending to $0$ or $\infty$. We explain this discrepancy by studying the \emph{constant factors} required by the algorithms. We show that the collisions tester achieves a sharp maximal constant in the number of standard deviations of separation between uniform and non-uniform inputs. We then introduce a new tester based on the Huber loss, and show that it not only matches this separation, but also has tails corresponding to a Gaussian with this separation. This leads to a sample complexity of $(1 + o(1))\frac{\sqrt{m \log (1/\delta)}}{\epsilon^2}$ in the regime where this term is dominant, unlike all other existing testers.  ( 2 min )
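    As an illustration of the collision-based statistic mentioned above, here is a minimal sketch of a collisions tester. The function names and the midpoint threshold are assumptions made for this example and are not taken from the paper; the sketch relies only on the standard facts that the collision probability is $1/m$ under the uniform distribution on $m$ elements and at least $(1 + 4\epsilon^2)/m$ for any distribution that is $\epsilon$-far in TV distance.

```python
from collections import Counter

def collision_statistic(samples):
    """Fraction of sample pairs that collide (take the same value)."""
    n = len(samples)
    counts = Counter(samples)
    colliding_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return colliding_pairs / (n * (n - 1) / 2)

def accept_uniform(samples, m, epsilon):
    """Accept if the collision rate is below a midpoint threshold.

    The threshold (1 + 2 * epsilon**2) / m sits halfway between the
    uniform value 1/m and the epsilon-far lower bound (1 + 4*eps^2)/m;
    this particular choice is an illustrative assumption.
    """
    return collision_statistic(samples) <= (1 + 2 * epsilon ** 2) / m
```

    The constant-factor analysis in the abstract concerns how many standard deviations separate the statistic under uniform and non-uniform inputs, which this sketch does not attempt to reproduce.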
    Towards OOD Detection in Graph Classification from Uncertainty Estimation Perspective. (arXiv:2206.10691v1 [cs.LG])
    The problem of out-of-distribution detection for graph classification is far from being solved. The existing models tend to be overconfident about OOD examples or completely ignore the detection task. In this work, we consider this problem from the uncertainty estimation perspective and perform the comparison of several recently proposed methods. In our experiment, we find that there is no universal approach for OOD detection, and it is important to consider both graph representations and predictive categorical distribution.
    TraSE: Towards Tackling Authorial Style from a Cognitive Science Perspective. (arXiv:2206.10706v1 [cs.CL])
    Stylistic analysis of text is a key task in research areas ranging from authorship attribution to forensic analysis and personality profiling. The existing approaches for stylistic analysis are plagued by issues like topic influence, lack of discriminability for a large number of authors and the requirement for large amounts of diverse data. In this paper, the sources of these issues are identified along with the necessity for a cognitive perspective on authorial style in addressing them. A novel feature representation, called Trajectory-based Style Estimation (TraSE), is introduced to support this purpose. Authorship attribution experiments with over 27,000 authors and 1.4 million samples in a cross-domain scenario resulted in 90% attribution accuracy, suggesting that the feature representation is immune to such negative influences and an excellent candidate for stylistic analysis. Finally, a qualitative analysis is performed on TraSE using physical human characteristics, like age, to validate its claim on capturing cognitive traits.
    Imitation Learning for Generalizable Self-driving Policy with Sim-to-real Transfer. (arXiv:2206.10797v1 [cs.LG])
    Imitation Learning uses the demonstrations of an expert to uncover the optimal policy and it is suitable for real-world robotics tasks as well. In this case, however, the training of the agent is carried out in a simulation environment due to safety, economic and time constraints. Later, the agent is applied in the real-life domain using sim-to-real methods. In this paper, we apply Imitation Learning methods that solve a robotics task in a simulated environment and use transfer learning to apply these solutions in the real-world environment. Our task is set in the Duckietown environment, where the robotic agent has to follow the right lane based on the input images of a single forward-facing camera. We present three Imitation Learning and two sim-to-real methods capable of achieving this task. A detailed comparison is provided on these techniques to highlight their advantages and disadvantages.  ( 2 min )
    Federated Latent Class Regression for Hierarchical Data. (arXiv:2206.10783v1 [cs.LG])
    Federated Learning (FL) allows a number of agents to participate in training a global machine learning model without disclosing locally stored data. Compared to traditional distributed learning, the heterogeneity (non-IID) of the agents slows down the convergence in FL. Furthermore, many datasets, being too noisy or too small, are easily overfitted by complex models, such as deep neural networks. Here, we consider the problem of using FL regression on noisy, hierarchical and tabular datasets in which user distributions are significantly different. Inspired by Latent Class Regression (LCR), we propose a novel probabilistic model, Hierarchical Latent Class Regression (HLCR), and its extension to Federated Learning, FEDHLCR. FEDHLCR consists of a mixture of linear regression models, allowing better accuracy than simple linear regression, while at the same time maintaining its analytical properties and avoiding overfitting. Our inference algorithm, being derived from Bayesian theory, provides strong convergence guarantees and good robustness to overfitting. Experimental results show that FEDHLCR offers fast convergence even in non-IID datasets.  ( 2 min )
    On the Statistical Efficiency of Reward-Free Exploration in Non-Linear RL. (arXiv:2206.10770v1 [cs.LG])
    We study reward-free reinforcement learning (RL) under general non-linear function approximation, and establish sample efficiency and hardness results under various standard structural assumptions. On the positive side, we propose the RFOLIVE (Reward-Free OLIVE) algorithm for sample-efficient reward-free exploration under minimal structural assumptions, which covers the previously studied settings of linear MDPs (Jin et al., 2020b), linear completeness (Zanette et al., 2020b) and low-rank MDPs with unknown representation (Modi et al., 2021). Our analyses indicate that the explorability or reachability assumptions, previously made for the latter two settings, are not necessary statistically for reward-free exploration. On the negative side, we provide a statistical hardness result for both reward-free and reward-aware exploration under linear completeness assumptions when the underlying features are unknown, showing an exponential separation between low-rank and linear completeness settings.  ( 2 min )
    Efficient Interdependent Systems Recovery Modeling with DeepONets. (arXiv:2206.10829v1 [cs.LG])
    Modeling the recovery of interdependent critical infrastructure is a key component of quantifying and optimizing societal resilience to disruptive events. However, simulating the recovery of large-scale interdependent systems under random disruptive events is computationally expensive. Therefore, we propose the application of Deep Operator Networks (DeepONets) in this paper to accelerate the recovery modeling of interdependent systems. DeepONets are ML architectures which identify mathematical operators from data. The form of the governing equations that DeepONets identify is similar to the governing equation of the interdependent systems recovery model. Therefore, we hypothesize that DeepONets can efficiently model the interdependent systems recovery with little training data. We applied DeepONets to a simple case of four interdependent systems with sixteen states. Overall, DeepONets performed satisfactorily in predicting the recovery of these interdependent systems on data outside the training sample when compared to reference results.  ( 2 min )
    BiometricBlender: Ultra-high dimensional, multi-class synthetic data generator to imitate biometric feature space. (arXiv:2206.10747v1 [cs.LG])
    The lack of freely available (real-life or synthetic) high or ultra-high dimensional, multi-class datasets may hamper the rapidly growing research on feature screening, especially in the field of biometrics, where the usage of such datasets is common. This paper reports a Python package called BiometricBlender, which is an ultra-high dimensional, multi-class synthetic data generator to benchmark a wide range of feature screening methods. During the data generation process, the overall usefulness and the intercorrelations of blended features can be controlled by the user, thus the synthetic feature space is able to imitate the key properties of a real biometric dataset.  ( 2 min )
    Quantum-Enhanced Selection Operators for Evolutionary Algorithms. (arXiv:2206.10743v1 [quant-ph])
    Genetic algorithms have unique properties which are useful when applied to black box optimization. Using selection, crossover, and mutation operators, candidate solutions may be obtained without the need to calculate a gradient. In this work, we study results obtained from using quantum-enhanced operators within the selection mechanism of a genetic algorithm. Our approach frames the selection process as a minimization of a binary quadratic model with which we encode fitness and distance between members of a population, and we leverage a quantum annealing system to sample low energy solutions for the selection mechanism. We benchmark these quantum-enhanced algorithms against classical algorithms over various black-box objective functions, including the OneMax function, and functions from the IOHProfiler library for black-box optimization. We observe a performance gain in average number of generations to convergence for the quantum-enhanced elitist selection operator in comparison to classical on the OneMax function. We also find that the quantum-enhanced selection operator with non-elitist selection outperforms benchmarks on functions with fitness perturbation from the IOHProfiler library. Additionally, we find that in the case of elitist selection, the quantum-enhanced operators outperform classical benchmarks on functions with varying degrees of dummy variables and neutrality.  ( 2 min )
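    The idea of framing selection as minimization of a binary quadratic model can be sketched classically. Everything below is an assumption made for illustration: the energy function, its weights, and the function names are hypothetical and are not the paper's formulation, and a brute-force search stands in for the quantum annealer (so it only works for very small populations).

```python
from itertools import product

def onemax(bits):
    """OneMax fitness: the number of ones in the bit string."""
    return sum(bits)

def hamming(a, b):
    """Number of positions at which two bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def selection_energy(x, population, k, diversity_weight=0.1, penalty=10.0):
    """Energy of a binary selection vector x (x[i] = 1 keeps member i).

    Linear terms reward fitness, pairwise terms reward distance between
    selected members, and a quadratic penalty pushes the selection size
    towards k. All three terms are illustrative assumptions.
    """
    n = len(population)
    fitness_term = -sum(onemax(population[i]) * x[i] for i in range(n))
    diversity_term = -diversity_weight * sum(
        hamming(population[i], population[j]) * x[i] * x[j]
        for i in range(n) for j in range(i + 1, n)
    )
    size_term = penalty * (sum(x) - k) ** 2
    return fitness_term + diversity_term + size_term

def select_by_brute_force(population, k, **weights):
    """Enumerate all 2^n selections and keep the lowest-energy one."""
    best = min(
        product((0, 1), repeat=len(population)),
        key=lambda x: selection_energy(x, population, k, **weights),
    )
    return [member for member, chosen in zip(population, best) if chosen]
```

    Setting `diversity_weight=0.0` reduces this sketch to plain elitist selection of the k fittest members, which is the classical baseline the abstract compares against.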
    Imitate then Transcend: Multi-Agent Optimal Execution with Dual-Window Denoise PPO. (arXiv:2206.10736v1 [cs.LG])
    A novel framework for solving the optimal execution and placement problems using reinforcement learning (RL) with imitation was proposed. The RL agents trained from the proposed framework consistently outperformed the industry benchmark time-weighted average price (TWAP) strategy in execution cost and showed great generalization across out-of-sample trading dates and tickers. The impressive performance was achieved from three aspects. First, our RL network architecture called Dual-window Denoise PPO enabled efficient learning in a noisy market environment. Second, a reward scheme with imitation learning was designed, and a comprehensive set of market features was studied. Third, our flexible action formulation allowed the RL agent to tackle optimal execution and placement collectively resulting in better performance than solving individual problems separately. The RL agent's performance was evaluated in our multi-agent realistic historical limit order book simulator in which price impact was accurately assessed. In addition, ablation studies were also performed, confirming the superiority of our framework.  ( 2 min )
    Beyond Uniform Lipschitz Condition in Differentially Private Optimization. (arXiv:2206.10713v1 [cs.LG])
    Most prior convergence results on differentially private stochastic gradient descent (DP-SGD) are derived under the simplistic assumption of uniform Lipschitzness, i.e., the per-sample gradients are uniformly bounded. This assumption is unrealistic in many problems, e.g., linear regression with Gaussian data. We relax uniform Lipschitzness by instead assuming that the per-sample gradients have \textit{sample-dependent} upper bounds, i.e., per-sample Lipschitz constants, which themselves may be unbounded. We derive new convergence results for DP-SGD on both convex and nonconvex functions when the per-sample Lipschitz constants have bounded moments. Furthermore, we provide principled guidance on choosing the clip norm in DP-SGD for convex settings satisfying our relaxed version of Lipschitzness, without making distributional assumptions on the Lipschitz constants. We verify the effectiveness of our recommendation via experiments on benchmarking datasets.  ( 2 min )
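    For readers unfamiliar with the role of the clip norm, the following is a minimal sketch of one standard DP-SGD update (per-sample clipping followed by Gaussian noise), together with the per-sample gradient of least-squares regression. The function names are hypothetical and the sketch does not implement the paper's guidance on choosing the clip norm. It does show why uniform Lipschitzness fails for linear regression: the per-sample gradient norm scales with both the residual and the norm of the input, so it is unbounded for Gaussian data.

```python
import numpy as np

def linreg_per_sample_grads(w, X, y):
    """Gradients of 0.5 * (x.w - y)^2 for each sample; shape (n, d)."""
    residuals = X @ w - y
    return residuals[:, None] * X

def dp_sgd_step(w, per_sample_grads, clip_norm, noise_multiplier, lr, rng):
    """One DP-SGD step: clip each gradient to clip_norm, add noise, average."""
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    # Scale down only those gradients whose norm exceeds clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_sample_grads * scale
    # Gaussian noise calibrated to the clipping threshold.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(per_sample_grads)
    return w - lr * noisy_mean
```

    A small clip norm biases the update because many gradients are truncated, while a large one requires more noise; this is the trade-off that makes the choice of clip norm matter.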
    Generative Pretraining for Black-Box Optimization. (arXiv:2206.10786v1 [cs.LG])
    Many problems in science and engineering involve optimizing an expensive black-box function over a high-dimensional space. For such black-box optimization (BBO) problems, we typically assume a small budget for online function evaluations, but also often have access to a fixed, offline dataset for pretraining. Prior approaches seek to utilize the offline data to approximate the function or its inverse but are not sufficiently accurate far from the data distribution. We propose Black-box Optimization Transformer (BOOMER), a generative framework for pretraining black-box optimizers using offline datasets. In BOOMER, we train an autoregressive model to imitate trajectory runs of implicit black-box function optimizers. Since these trajectories are unavailable by default, we develop a simple randomized heuristic to synthesize trajectories by sorting random points from offline data. We show theoretically that this heuristic induces trajectories that mimic transitions from diverse low-fidelity (exploration) to high-fidelity (exploitation) samples. Further, we introduce mechanisms to control the rate at which a trajectory transitions from exploration to exploitation, and use it to generalize outside the offline data at test-time. Empirically, we instantiate BOOMER using a causally masked Transformer and evaluate it on Design-Bench, where we rank the best on average, outperforming state-of-the-art baselines.  ( 2 min )
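    The trajectory-synthesis heuristic can be pictured with a small sketch. This is an assumption-laden reading of "sorting random points from offline data", not the authors' procedure: the function name is hypothetical, points are drawn uniformly without replacement, and the sort is by ascending function value so that the sequence moves from low-scoring to high-scoring samples for a maximization problem.

```python
import numpy as np

def synthesize_trajectory(X, y, length, rng):
    """Sample `length` offline points and order them by increasing value.

    X: array of shape (n, d) with offline inputs.
    y: array of shape (n,) with the corresponding function values.
    Returns the selected inputs and values, sorted so that y is
    non-decreasing along the trajectory.
    """
    idx = rng.choice(len(X), size=length, replace=False)
    order = idx[np.argsort(y[idx])]
    return X[order], y[order]
```

    Repeating the call with different random draws yields many distinct trajectories from the same offline dataset, which could then serve as training sequences for an autoregressive model.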
    Multi-Resolution, Multi-Horizon Distributed Solar PV Power Forecasting with Forecast Combinations. (arXiv:2206.10795v1 [cs.LG])
    Distributed, small-scale solar photovoltaic (PV) systems are being installed at a rapidly increasing rate. This can cause major impacts on distribution networks and energy markets. As a result, there is a significant need for improved forecasting of the power generation of these systems at different time resolutions and horizons. However, the performance of forecasting models depends on the resolution and horizon. Forecast combinations (ensembles), which combine the forecasts of multiple models into a single forecast, may be robust in such cases. Therefore, in this paper, we provide comparisons and insights into the performance of five state-of-the-art forecast models and existing forecast combinations at multiple resolutions and horizons. We propose a forecast combination approach based on particle swarm optimization (PSO) that will enable a forecaster to produce accurate forecasts for the task at hand by weighting the forecasts produced by individual models. Furthermore, we compare the performance of the proposed combination approach with existing forecast combination approaches. A comprehensive evaluation is conducted using a real-world residential PV power data set measured at 25 houses located in three locations in the United States. The results across four different resolutions and four different horizons show that the PSO-based forecast combination approach outperforms the use of any individual forecast model and other forecast combination counterparts, with an average Mean Absolute Scaled Error reduction of 3.81% compared to the best performing individual model. Our approach enables a solar forecaster to produce accurate forecasts for their application regardless of the forecast resolution or horizon.  ( 3 min )
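    A weighted forecast combination and the Mean Absolute Scaled Error used to score it can be sketched as follows. The function names are hypothetical, the PSO search itself is not implemented here, and the use of a naive (lag-`m`) in-sample forecast as the MASE scaling baseline follows the usual definition of the metric rather than anything stated in the abstract.

```python
import numpy as np

def combine_forecasts(forecasts, weights):
    """Weighted average of model forecasts.

    forecasts: array of shape (n_models, horizon).
    weights: non-negative weights, normalized here to sum to one.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w @ np.asarray(forecasts, dtype=float)

def mase(actual, forecast, train, m=1):
    """Mean Absolute Scaled Error against a lag-m naive forecast on `train`."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    train = np.asarray(train, dtype=float)
    scale = np.mean(np.abs(train[m:] - train[:-m]))
    return float(np.mean(np.abs(actual - forecast)) / scale)
```

    Any optimizer, PSO included, could then search over `weights` to minimize `mase` of the combined forecast on a validation window.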
    Meta Reinforcement Learning with Finite Training Tasks -- a Density Estimation Approach. (arXiv:2206.10716v1 [cs.LG])
    In meta reinforcement learning (meta RL), an agent learns from a set of training tasks how to quickly solve a new task, drawn from the same task distribution. The optimal meta RL policy, a.k.a. the Bayes-optimal behavior, is well defined, and guarantees optimal reward in expectation, taken with respect to the task distribution. The question we explore in this work is how many training tasks are required to guarantee approximately optimal behavior with high probability. Recent work provided the first such PAC analysis for a model-free setting, where a history-dependent policy was learned from the training tasks. In this work, we propose a different approach: directly learn the task distribution, using density estimation techniques, and then train a policy on the learned task distribution. We show that our approach leads to bounds that depend on the dimension of the task distribution. In particular, in settings where the task distribution lies in a low-dimensional manifold, we extend our analysis to use dimensionality reduction techniques and account for such structure, obtaining significantly better bounds than previous work, which strictly depend on the number of states and actions. The key of our approach is the regularization implied by the kernel density estimation method. We further demonstrate that this regularization is useful in practice, when plugged into the state-of-the-art VariBAD meta RL algorithm.  ( 2 min )
    Efficient and effective training of language and graph neural network models. (arXiv:2206.10781v1 [cs.LG])
    Can we combine heterogeneous graph structure with text to learn high-quality semantic and behavioural representations? Graph neural networks (GNNs) encode numerical node attributes and graph structure to achieve impressive performance in a variety of supervised learning tasks. Current GNN approaches are challenged by textual features, which typically need to be encoded to a numerical vector before being provided to the GNN, a step that may incur some information loss. In this paper, we put forth an efficient and effective framework termed language model GNN (LM-GNN) to jointly train large-scale language models and graph neural networks. The effectiveness of our framework is achieved by applying stage-wise fine-tuning of the BERT model first with heterogeneous graph information and then with a GNN model. Several system and design optimizations are proposed to enable scalable and efficient training. LM-GNN accommodates node and edge classification as well as link prediction tasks. We evaluate the LM-GNN framework on different datasets and showcase the effectiveness of the proposed approach. LM-GNN provides competitive results in an Amazon query-purchase-product application.  ( 2 min )
    Does the Data Induce Capacity Control in Deep Learning?. (arXiv:2110.14163v3 [cs.LG] UPDATED)
    We show that the input correlation matrix of typical classification datasets has an eigenspectrum where, after a sharp initial drop, a large number of small eigenvalues are distributed uniformly over an exponentially large range. This structure is mirrored in a network trained on this data: we show that the Hessian and the Fisher Information Matrix (FIM) have eigenvalues that are spread uniformly over exponentially large ranges. We call such eigenspectra "sloppy" because sets of weights corresponding to small eigenvalues can be changed by large magnitudes without affecting the loss. Networks trained on atypical datasets with non-sloppy inputs do not share these traits and deep networks trained on such datasets generalize poorly. Inspired by this, we study the hypothesis that sloppiness of inputs aids generalization in deep networks. We show that if the Hessian is sloppy, we can compute non-vacuous PAC-Bayes generalization bounds analytically. By exploiting our empirical observation that training predominantly takes place in the non-sloppy subspace of the FIM, we develop data-distribution dependent PAC-Bayes priors that lead to accurate generalization bounds using numerical optimization.  ( 2 min )
    Explain to Not Forget: Defending Against Catastrophic Forgetting with XAI. (arXiv:2205.01929v4 [cs.LG] UPDATED)
    The ability to continuously process and retain new information, as humans do naturally, is a feat that is highly sought after when training neural networks. Unfortunately, traditional optimization algorithms often require large amounts of data to be available at training time, and updates with respect to new data are difficult after the training process has been completed. In fact, when new data or tasks arise, previous progress may be lost as neural networks are prone to catastrophic forgetting. Catastrophic forgetting describes the phenomenon when a neural network completely forgets previous knowledge when given new information. We propose a novel training algorithm called training by explaining in which we leverage Layer-wise Relevance Propagation in order to retain the information a neural network has already learned in previous tasks when training on new data. The method is evaluated on a range of benchmark datasets as well as more complex data. Our method not only successfully retains the knowledge of old tasks within the neural networks but does so more resource-efficiently than other state-of-the-art solutions.  ( 3 min )
    Derivative Informed Neural Operator: An Efficient Framework for High-Dimensional Parametric Derivative Learning. (arXiv:2206.10745v1 [math.NA])
    Neural operators have gained significant attention recently due to their ability to approximate high-dimensional parametric maps between function spaces. At present, only parametric function approximation has been addressed in the neural operator literature. In this work we investigate incorporating parametric derivative information in neural operator training; this information can improve function approximations, and it can additionally be used to improve the approximation of the derivative with respect to the parameter, which is often the key to scalable solution of high-dimensional outer-loop problems (e.g. Bayesian inverse problems). Parametric Jacobian information is formally intractable to incorporate due to its high dimensionality; to address this concern, we propose strategies based on reduced SVD, randomized sketching and the use of reduced basis surrogates. All of these strategies require only $O(r)$ Jacobian actions to construct sample Jacobian data, and allow us to reduce the linear algebra and memory costs associated with the Jacobian training from the product of the input and output dimensions down to $O(r^2)$, where $r$ is the dimensionality associated with the dimension reduction technique. Numerical results for parametric PDE problems demonstrate that the addition of derivative information to the training problem can significantly improve the parametric map approximation, particularly given few data. When Jacobian actions are inexpensive compared to the parametric map, this information can be economically substituted for parametric map data. Additionally, we show that Jacobian error approximations improve significantly with the introduction of Jacobian training data. This result opens the door to the use of derivative informed neural operators (DINOs) in outer-loop algorithms where they can amortize the additional training data cost via repeated evaluations.
    MASER: Multi-Agent Reinforcement Learning with Subgoals Generated from Experience Replay Buffer. (arXiv:2206.10607v1 [cs.LG])
    In this paper, we consider cooperative multi-agent reinforcement learning (MARL) with sparse reward. To tackle this problem, we propose a novel method named MASER: MARL with subgoals generated from experience replay buffer. Under the widely-used assumption of centralized training with decentralized execution and consistent Q-value decomposition for MARL, MASER automatically generates proper subgoals for multiple agents from the experience replay buffer by considering both individual Q-value and total Q-value. Then, MASER designs individual intrinsic reward for each agent based on actionable representation relevant to Q-learning so that the agents reach their subgoals while maximizing the joint action value. Numerical results show that MASER significantly outperforms other state-of-the-art MARL algorithms on the StarCraft II micromanagement benchmark.  ( 2 min )
    A Survey on Computational Intelligence-based Transfer Learning. (arXiv:2206.10593v1 [cs.AI])
    The goal of transfer learning (TL) is to provide a framework for exploiting knowledge acquired from source data on target data. Compared to traditional machine learning approaches, transfer learning approaches are capable of modeling better data patterns from the current domain. However, the performance of vanilla TL can be improved by using computational intelligence-based TL. This paper studies computational intelligence-based transfer learning techniques and categorizes them into neural network-based, evolutionary algorithm-based, swarm intelligence-based and fuzzy logic-based transfer learning.  ( 2 min )
    Differentially Private Maximal Information Coefficients. (arXiv:2206.10685v1 [cs.CR])
    The Maximal Information Coefficient (MIC) is a powerful statistic to identify dependencies between variables. However, it may be applied to sensitive data, and publishing it could leak private information. As a solution, we present algorithms to approximate MIC in a way that provides differential privacy. We show that the natural application of the classic Laplace mechanism yields insufficient accuracy. We therefore introduce the MICr statistic, which is a new MIC approximation that is more compatible with differential privacy. We prove MICr is a consistent estimator for MIC, and we provide two differentially private versions of it. We perform experiments on a variety of real and synthetic datasets. The results show that the private MICr statistics significantly outperform direct application of the Laplace mechanism. Moreover, experiments on real-world datasets show accuracy that is usable when the sample size is at least moderately large.  ( 2 min )
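    For reference, the classic Laplace mechanism that the abstract uses as its baseline can be sketched as follows. The statistic value, sensitivity, and epsilon are placeholder assumptions, not the MIC sensitivity analysis or the MICr construction from the paper.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, built as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release value with epsilon-differential privacy for the given L1 sensitivity."""
    return value + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(42)
true_stat = 0.8  # hypothetical statistic value
releases = [
    laplace_mechanism(true_stat, sensitivity=1.0, epsilon=1.0, rng=rng)
    for _ in range(20000)
]
mean_release = sum(releases) / len(releases)
variance = sum((r - mean_release) ** 2 for r in releases) / len(releases)
```

    The noise scale is `sensitivity / epsilon`, so a statistic with high sensitivity relative to its range is swamped by noise. That is consistent with the abstract's remark that a direct application yields insufficient accuracy.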
    Can Foundation Models Talk Causality?. (arXiv:2206.10591v1 [cs.AI])
    Foundation models are subject to an ongoing heated debate, leaving open the question of progress towards AGI and dividing the community into two camps: the ones who see the arguably impressive results as evidence for the scaling hypothesis, and the others who are worried about the lack of interpretability and reasoning capabilities. By investigating to what extent causal representations might be captured by these large scale language models, we make a humble effort towards resolving the ongoing philosophical conflicts.  ( 2 min )
    Asymmetric Learned Image Compression with Multi-Scale Residual Block, Importance Map, and Post-Quantization Filtering. (arXiv:2206.10618v1 [eess.IV])
    Recently, deep learning-based image compression has made significant progress, and has achieved better rate-distortion (R-D) performance than the latest traditional method, H.266/VVC, in both subjective metric and the more challenging objective metric. However, a major problem is that many leading learned schemes cannot maintain a good trade-off between performance and complexity. In this paper, we propose an efficient and effective image coding framework, which achieves similar R-D performance with lower complexity than the state of the art. First, we develop an improved multi-scale residual block (MSRB) that can expand the receptive field and more easily obtain global information. It can further capture and reduce the spatial correlation of the latent representations. Second, a more advanced importance map network is introduced to adaptively allocate bits to different regions of the image. Third, we apply a 2D post-quantization filter (PQF) to reduce the quantization error, motivated by the Sample Adaptive Offset (SAO) filter in video coding. Moreover, we find that the complexity of encoder and decoder have different effects on image compression performance. Based on this observation, we design an asymmetric paradigm, in which the encoder employs three stages of MSRBs to improve the learning capacity, whereas the decoder only needs one stage of MSRB to yield satisfactory reconstruction, thereby reducing the decoding complexity without sacrificing performance. Experimental results show that compared to the state-of-the-art method, the encoding and decoding time of the proposed method are about 17 times faster, and the R-D performance is only reduced by less than 1% on both Kodak and Tecnick datasets, which is still better than H.266/VVC(4:4:4) and other recent learning-based methods. Our source code is publicly available at https://github.com/fengyurenpingsheng.  ( 3 min )
    On the Maximum Hessian Eigenvalue and Generalization. (arXiv:2206.10654v1 [cs.LG])
    The mechanisms by which certain training interventions, such as increasing learning rates and applying batch normalization, improve the generalization of deep networks remain a mystery. Prior works have speculated that "flatter" solutions generalize better than "sharper" solutions to unseen data, motivating several metrics for measuring flatness (particularly $\lambda_{max}$, the largest eigenvalue of the Hessian of the loss); and algorithms, such as Sharpness-Aware Minimization (SAM) [1], that directly optimize for flatness. Other works question the link between $\lambda_{max}$ and generalization. In this paper, we present findings that call $\lambda_{max}$'s influence on generalization further into question. We show that: (1) while larger learning rates reduce $\lambda_{max}$ for all batch sizes, generalization benefits sometimes vanish at larger batch sizes; (2) by scaling batch size and learning rate simultaneously, we can change $\lambda_{max}$ without affecting generalization; (3) while SAM produces smaller $\lambda_{max}$ for all batch sizes, generalization benefits (also) vanish with larger batch sizes; (4) for dropout, excessively high dropout probabilities can degrade generalization, even as they promote smaller $\lambda_{max}$; and (5) while batch-normalization does not consistently produce smaller $\lambda_{max}$, it nevertheless confers generalization benefits. While our experiments affirm the generalization benefits of large learning rates and SAM for minibatch SGD, the GD-SGD discrepancy demonstrates limits to $\lambda_{max}$'s ability to explain generalization in neural networks.  ( 2 min )
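    A common way to estimate $\lambda_{max}$ is power iteration, sketched below on a small explicit matrix. The 2x2 Hessian is a hypothetical stand-in; for a real network the matrix-vector product would be replaced by a Hessian-vector product, and the paper's own measurement procedure is not specified in this abstract.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def power_iteration(matrix, iterations=200):
    """Estimate the eigenvalue of largest magnitude of a symmetric matrix."""
    vector = [1.0] * len(matrix)
    for _ in range(iterations):
        product = matvec(matrix, vector)
        norm = sum(p * p for p in product) ** 0.5
        vector = [p / norm for p in product]
    # Rayleigh quotient of the unit-norm vector.
    return sum(v * p for v, p in zip(vector, matvec(matrix, vector)))


# Hypothetical 2x2 Hessian of a quadratic loss.
hessian = [[2.0, 0.5], [0.5, 1.0]]
lambda_max = power_iteration(hessian)
```

    Power iteration converges to the eigenvalue of largest magnitude, which coincides with $\lambda_{max}$ only when no negative eigenvalue has a larger absolute value.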
    Artificial intelligence system based on multi-value classification of fully connected neural network for construction management. (arXiv:2206.10604v1 [cs.LG])
    This study addresses the problem of determining the professional adaptive capabilities of construction management staff using artificial intelligence systems. A Fully Connected Feed-Forward Neural Network architecture is proposed, and empirical modeling is performed to create a data set. The model of the artificial intelligence system allows evaluating the processes in a Fully Connected Feed-Forward Neural Network during the execution of multi-value classification of professional areas. A method has been developed for the training process of a machine learning model, which reflects the internal connections between the components of an artificial intelligence system that allow it to learn from training data. To train the neural network, a data set of 35 input parameters and 29 output parameters was used; the amount of data in the set is 936 data lines. Neural network training occurred in the proportion of 10% and 90%, respectively. The results of this study can be used to further improve the knowledge and skills necessary for successful professional realization.  ( 2 min )
    Generating Diverse Indoor Furniture Arrangements. (arXiv:2206.10608v1 [cs.LG])
    We present a method for generating arrangements of indoor furniture from human-designed furniture layout data. Our method creates arrangements that target specified diversity, such as the total price of all furniture in the room and the number of pieces placed. To generate realistic furniture arrangements, we train a generative adversarial network (GAN) on human-designed layouts. To target specific diversity in the arrangements, we optimize the latent space of the GAN via a quality diversity algorithm to generate a diverse arrangement collection. Experiments show our approach discovers a set of arrangements that are similar to human-designed layouts but vary in price and number of furniture pieces.  ( 2 min )
    Epicasting: An Ensemble Wavelet Neural Network (EWNet) for Forecasting Epidemics. (arXiv:2206.10696v1 [cs.LG])
    Infectious diseases remain among the top contributors to human illness and death worldwide, and many of them produce epidemic waves of infection. The unavailability of specific drugs and ready-to-use vaccines to prevent most of these epidemics makes the situation worse. This forces public health officials, health care providers, and policymakers to rely on early warning systems generated by reliable and accurate forecasts of epidemics. Accurate forecasts of epidemics can assist stakeholders in tailoring countermeasures, such as vaccination campaigns, staff scheduling, and resource allocation, to the situation at hand, which could translate to reductions in the impact of a disease. Unfortunately, most of these past epidemics (e.g., dengue, malaria, hepatitis, influenza, and most recently, COVID-19) exhibit nonlinear and non-stationary characteristics due to their spreading fluctuations based on seasonal-dependent variability and the nature of these epidemics. We analyze a wide variety of epidemic time series datasets using a maximal overlap discrete wavelet transform (MODWT) based autoregressive neural network and call it EWNet. MODWT techniques effectively characterize non-stationary behavior and seasonal dependencies in the epidemic time series and improve the forecasting scheme of the autoregressive neural network in the proposed ensemble wavelet network framework. From a nonlinear time series viewpoint, we explore the asymptotic stationarity of the proposed EWNet model to show the asymptotic behavior of the associated Markov Chain. We also theoretically investigate the effect of learning stability and the choice of hidden neurons in the proposed EWNet model. From a practical perspective, we compare our proposed EWNet framework with several statistical, machine learning, and deep learning models that have been previously used for epidemic forecasting.  ( 3 min )
    ConTraNet: A single end-to-end hybrid network for EEG-based and EMG-based human machine interfaces. (arXiv:2206.10677v1 [q-bio.NC])
    Objective: Electroencephalography (EEG) and electromyography (EMG) are two non-invasive bio-signals, which are widely used in human machine interface (HMI) technologies (EEG-HMI and EMG-HMI paradigm) for the rehabilitation of physically disabled people. Successful decoding of EEG and EMG signals into respective control commands is a pivotal step in the rehabilitation process. Recently, several convolutional neural network (CNN) based architectures have been proposed that directly map the raw time-series signal into decision space, so that meaningful feature extraction and classification are performed simultaneously. However, these networks are tailored to learn the expected characteristics of the given bio-signal and are limited to a single paradigm. In this work, we address the question of whether we can build a single architecture which is able to learn distinct features from different HMI paradigms and still successfully classify them. Approach: In this work, we introduce a single hybrid model called ConTraNet, which is based on CNN and Transformer architectures and is equally useful for EEG-HMI and EMG-HMI paradigms. ConTraNet uses a CNN block to introduce inductive bias in the model and learn local dependencies, whereas the Transformer block uses the self-attention mechanism to learn the long-range dependencies in the signal, which are crucial for the classification of EEG and EMG signals. Main results: We evaluated and compared ConTraNet with state-of-the-art methods on three publicly available datasets which belong to EEG-HMI and EMG-HMI paradigms. ConTraNet outperformed its counterparts in all the different category tasks (2-class, 3-class, 4-class, and 10-class decoding tasks). Significance: The results suggest that ConTraNet is robust in learning distinct features from different HMI paradigms and generalizes well compared to the current state-of-the-art algorithms.  ( 3 min )
    Demystifying the Base and Novel Performances for Few-shot Class-incremental Learning. (arXiv:2206.10596v1 [cs.LG])
    Few-shot class-incremental learning (FSCIL) has addressed challenging real-world scenarios where unseen novel classes continually arrive with few samples. In these scenarios, it is required to develop a model that recognizes the novel classes without forgetting prior knowledge. In other words, FSCIL aims to maintain the base performance and improve the novel performance simultaneously. However, few studies investigate the two performances separately. In this paper, we first decompose the entire model into four types of parameters and demonstrate that the tendency of the two performances varies greatly with the updated parameters when the novel classes appear. Based on the analysis, we propose a simple method for FSCIL, coined as NoNPC, which uses normalized prototype classifiers without further training for incremental novel classes. It is shown that our straightforward method has comparable performance with the sophisticated state-of-the-art algorithms.  ( 2 min )
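    A generic normalized prototype classifier, of the kind the abstract names, can be sketched as follows: each class is represented by the unit-norm mean of its feature vectors, and a query is assigned to the most cosine-similar prototype. The class labels, 2-D features, and helper names are hypothetical, and this is not the NoNPC implementation itself.

```python
import math


def normalize(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def build_prototypes(features_by_class):
    """One unit-norm prototype per class: the normalised mean feature vector."""
    prototypes = {}
    for label, features in features_by_class.items():
        mean = [sum(column) / len(features) for column in zip(*features)]
        prototypes[label] = normalize(mean)
    return prototypes


def classify(feature, prototypes):
    """Predict the class whose prototype has the highest cosine similarity."""
    query = normalize(feature)
    return max(
        prototypes,
        key=lambda label: sum(q * p for q, p in zip(query, prototypes[label])),
    )


# Hypothetical 2-D features from a frozen encoder: one base class, one few-shot novel class.
prototypes = build_prototypes({
    "base_cat": [[1.0, 0.1], [0.9, 0.0]],
    "novel_owl": [[0.1, 1.0]],
})
prediction = classify([0.2, 0.8], prototypes)
```

    Adding a novel class only requires computing one more prototype from its few samples, so no gradient update is needed when new classes arrive.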
    The Right Tool for the Job: Open-Source Auditing Tools in Machine Learning. (arXiv:2206.10613v1 [cs.LG])
    In recent years, discussions about fairness in machine learning, AI ethics and algorithm audits have increased. Many entities have developed framework guidance to establish a baseline rubric for fairness and accountability. However, in spite of increased discussions and multiple frameworks, algorithm and data auditing still remain difficult to execute in practice. Many open-source auditing tools are available, but users aren't always aware of the tools, what they are useful for, or how to access them. Model auditing and evaluation are not frequently emphasized skills in machine learning. There are also legal reasons for the proactive adoption of these tools that extend beyond the desire for greater fairness in machine learning. There are positive social issues of public perception and goodwill that matter in our highly connected global society. Greater awareness of these tools and the reasons for actively utilizing them may be helpful to the entire continuum of programmers, data scientists, engineers, researchers, users and consumers of AI and machine learning products. It is important for everyone to better understand the input and output differentials, how they are occurring, and what can be done to promote FATE (fairness, accountability, transparency, and ethics) in machine- and deep learning. The ability to freely access open-source auditing tools removes barriers to fairness assessment at the most basic levels of machine learning. This paper aims to reinforce the urgent need to actually use these tools and provides motivations for doing so. The exemplary tools highlighted herein are open-source with software or code-base repositories available that can be used immediately by anyone worldwide.  ( 3 min )
    CoCoPIE XGen: A Full-Stack AI-Oriented Optimizing Framework. (arXiv:2206.10620v1 [cs.LG])
    There is a growing demand for shifting the delivery of AI capability from data centers on the cloud to edge or end devices, exemplified by the fast emerging real-time AI-based apps running on smartphones, AR/VR devices, autonomous vehicles, and various IoT devices. The shift has however been seriously hampered by the large growing gap between DNN computing demands and the computing power on edge or end devices. This article presents the design of XGen, an optimizing framework for DNN designed to bridge the gap. XGen takes cross-cutting co-design as its first-order consideration. Its full-stack AI-oriented optimizations consist of a number of innovative optimizations at every layer of the DNN software stack, all designed in a cooperative manner. The unique technology makes XGen able to optimize various DNNs, including those with an extreme depth (e.g., BERT, GPT, other transformers), and generate code that runs several times faster than those from existing DNN frameworks, while delivering the same level of accuracy.  ( 2 min )
    Good Time to Ask: A Learning Framework for Asking for Help in Embodied Visual Navigation. (arXiv:2206.10606v1 [cs.LG])
    In reality, it is often more efficient to ask for help than to search the entire space to find an object with an unknown location. We present a learning framework that enables an agent to actively ask for help in such embodied visual navigation tasks, where the feedback informs the agent of where the goal is in its view. To emulate the real-world scenario that a teacher may not always be present, we propose a training curriculum where feedback is not always available. We formulate an uncertainty measure of where the goal is and use empirical results to show that through this approach, the agent learns to ask for help effectively while remaining robust when feedback is not available.  ( 2 min )
    Identifying Electrocardiogram Abnormalities Using a Handcrafted-Rule-Enhanced Neural Network. (arXiv:2206.10592v1 [cs.AI])
    A large number of people suffer from life-threatening cardiac abnormalities, and electrocardiogram (ECG) analysis is beneficial to determining whether an individual is at risk of such abnormalities. Automatic ECG classification methods, especially the deep learning based ones, have been proposed to detect cardiac abnormalities using ECG records, showing good potential to improve clinical diagnosis and help early prevention of cardiovascular diseases. However, the predictions of the known neural networks still do not satisfactorily meet the needs of clinicians, and this phenomenon suggests that some information used in clinical diagnosis may not be well captured and utilized by these methods. In this paper, we introduce some rules into convolutional neural networks, which help present clinical knowledge to deep learning based ECG analysis, in order to improve automated ECG diagnosis performance. Specifically, we propose a Handcrafted-Rule-enhanced Neural Network (called HRNN) for ECG classification with standard 12-lead ECG input, which consists of a rule inference module and a deep learning module. Experiments on two large-scale public ECG datasets show that our new approach considerably outperforms existing state-of-the-art methods. Further, our proposed approach not only can improve the diagnosis performance, but also can assist in detecting mislabelled ECG samples. Our codes are available at https://github.com/alwaysbyx/ecg_processing.  ( 2 min )
    Autoencoder-based Attribute Noise Handling Method for Medical Data. (arXiv:2206.10609v1 [cs.LG])
    Medical datasets are particularly subject to attribute noise, that is, missing and erroneous values. Attribute noise is known to be largely detrimental to learning performance. To maximize future learning performance, it is essential to deal with attribute noise before any inference. We propose a simple autoencoder-based preprocessing method that can correct mixed-type tabular data corrupted by attribute noise. No other method currently exists to handle attribute noise in tabular data. We experimentally demonstrate that our method outperforms both state-of-the-art imputation methods and noise correction methods on several real-world medical datasets.  ( 2 min )
    Metareview-informed Explainable Cytokine Storm Detection during CAR-T cell Therapy. (arXiv:2206.10612v1 [q-bio.QM])
    Cytokine release syndrome (CRS), also known as cytokine storm, is one of the most consequential adverse effects of chimeric antigen receptor therapies that have shown promising results in cancer treatment. When emerging, CRS could be identified by the analysis of specific cytokine and chemokine profiles that tend to exhibit similarities across patients. In this paper, we exploit these similarities using machine learning algorithms and set out to pioneer a meta-review informed method for the identification of CRS based on specific cytokine peak concentrations and evidence from previous clinical studies. We argue that such methods could support clinicians in analyzing suspect cytokine profiles by matching them against CRS knowledge from past clinical studies, with the ultimate aim of swift CRS diagnosis. During evaluation with real-world CRS clinical data, we emphasize the potential of our proposed method of producing interpretable results, in addition to being effective in identifying the onset of cytokine storm.  ( 2 min )
    Neural Activation Patterns (NAPs): Visual Explainability of Learned Concepts. (arXiv:2206.10611v1 [cs.LG])
    A key to deciphering the inner workings of neural networks is understanding what a model has learned. Promising methods for discovering learned features are based on analyzing activation values, whereby current techniques focus on analyzing high activation values to reveal interesting features on a neuron level. However, analyzing high activation values limits layer-level concept discovery. We present a method that instead takes into account the entire activation distribution. By extracting similar activation profiles within the high-dimensional activation space of a neural network layer, we find groups of inputs that are treated similarly. These input groups represent neural activation patterns (NAPs) and can be used to visualize and interpret learned layer concepts. We release a framework with which NAPs can be extracted from pre-trained models and provide a visual introspection tool that can be used to analyze NAPs. We tested our method with a variety of networks and show how it complements existing methods for analyzing neural network activation values.  ( 2 min )
    Deep Inverse Reinforcement Learning for Route Choice Modeling. (arXiv:2206.10598v1 [cs.LG])
    Route choice modeling, i.e., the process of estimating the likely path that individuals follow during their journeys, is a fundamental task in transportation planning and demand forecasting. Classical methods generally adopt the discrete choice model (DCM) framework with linear utility functions and high-level route characteristics. While several recent studies have started to explore the applicability of deep learning for travel choice modeling, they are all path-based with relatively simple model architectures and cannot take advantage of detailed link-level features. Existing link-based models, while theoretically promising, are generally not as scalable or flexible enough to account for the destination characteristics. To address these issues, this study proposes a general deep inverse reinforcement learning (IRL) framework for link-based route choice modeling, which is capable of incorporating high-dimensional features and capturing complex relationships. Specifically, we adapt an adversarial IRL model to the route choice problem for efficient estimation of destination-dependent reward and policy functions. Experiment results based on taxi GPS data from Shanghai, China validate the improved performance of the proposed model over conventional DCMs and other imitation learning baselines, even for destinations unseen in the training data. We also demonstrate the model interpretability using explainable AI techniques. The proposed methodology provides a new direction for future development of route choice models. It is general and should be adaptable to other route choice problems across different modes and networks.  ( 2 min )
    Stop ordering machine learning algorithms by their explainability! A user-centered investigation of performance and explainability. (arXiv:2206.10610v1 [cs.LG])
    Machine learning algorithms enable advanced decision making in contemporary intelligent systems. Research indicates that there is a tradeoff between their model performance and explainability. Machine learning models with higher performance are often based on more complex algorithms and therefore lack explainability and vice versa. However, there is little to no empirical evidence of this tradeoff from an end user perspective. We aim to provide empirical evidence by conducting two user experiments. Using two distinct datasets, we first measure the tradeoff for five common classes of machine learning algorithms. Second, we address the problem of end user perceptions of explainable artificial intelligence augmentations aimed at increasing the understanding of the decision logic of high-performing complex models. Our results diverge from the widespread assumption of a tradeoff curve and indicate that the tradeoff between model performance and explainability is much less gradual in the end user's perception. This is a stark contrast to assumed inherent model interpretability. Further, we found the tradeoff to be situational for example due to data complexity. Results of our second experiment show that while explainable artificial intelligence augmentations can be used to increase explainability, the type of explanation plays an essential role in end user perception.  ( 2 min )
  • Open

    $C^*$-algebra Net: A New Approach Generalizing Neural Network Parameters to $C^*$-algebra. (arXiv:2206.09513v2 [stat.ML] UPDATED)
    We propose a new framework that generalizes the parameters of neural network models to $C^*$-algebra-valued ones. $C^*$-algebra is a generalization of the space of complex numbers. A typical example is the space of continuous functions on a compact space. This generalization enables us to combine multiple models continuously and use tools for functions such as regression and integration. Consequently, we can learn features of data efficiently and adapt the models to problems continuously. We apply our framework to practical problems such as density estimation and few-shot learning and show that our framework enables us to learn features of data even with a limited number of samples. Our new framework highlights the potential possibility of applying the theory of $C^*$-algebra to general neural network models.  ( 2 min )
    Tree-Guided Rare Feature Selection and Logic Aggregation with Electronic Health Records Data. (arXiv:2206.09107v1 [cs.LG] CROSS LISTED)
    Statistical learning with a large number of rare binary features is commonly encountered in analyzing electronic health records (EHR) data, especially in the modeling of disease onset with prior medical diagnoses and procedures. Dealing with the resulting highly sparse and large-scale binary feature matrix is notoriously challenging as conventional methods may suffer from a lack of power in testing and inconsistency in model fitting while machine learning methods may suffer from the inability of producing interpretable results or clinically-meaningful risk factors. To improve EHR-based modeling and utilize the natural hierarchical structure of disease classification, we propose a tree-guided feature selection and logic aggregation approach for large-scale regression with rare binary features, in which dimension reduction is achieved through not only a sparsity pursuit but also an aggregation promoter with the logic operator of ``or''. We convert the combinatorial problem into a convex linearly-constrained regularized estimation, which enables scalable computation with theoretical guarantees. In a suicide risk study with EHR data, our approach is able to select and aggregate prior mental health diagnoses as guided by the diagnosis hierarchy of the International Classification of Diseases. By balancing the rarity and specificity of the EHR diagnosis records, our strategy improves both prediction and model interpretation. We identify important higher-level categories and subcategories of mental health conditions and simultaneously determine the level of specificity needed for each of them in predicting suicide risk.  ( 3 min )
    Multiple Testing Framework for Out-of-Distribution Detection. (arXiv:2206.09522v2 [stat.ML] UPDATED)
    We study the problem of Out-of-Distribution (OOD) detection, that is, detecting whether a learning algorithm's output can be trusted at inference time. While a number of tests for OOD detection have been proposed in prior work, a formal framework for studying this problem is lacking. We propose a definition for the notion of OOD that includes both the input distribution and the learning algorithm, which provides insights for the construction of powerful tests for OOD detection. We propose a multiple hypothesis testing inspired procedure to systematically combine any number of different statistics from the learning algorithm using conformal p-values. We further provide strong guarantees on the probability of incorrectly classifying an in-distribution sample as OOD. In our experiments, we find that threshold-based tests proposed in prior work perform well in specific settings, but not uniformly well across different types of OOD instances. In contrast, our proposed method that combines multiple statistics performs uniformly well across different datasets and neural networks.  ( 2 min )
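    As a rough illustration of the conformal p-values mentioned in the entry above, the sketch below computes a p-value for one test score against held-out in-distribution calibration scores. The function names, the toy scores, and the convention that larger scores mean "more atypical" are assumptions made for this example. The Bonferroni rule used to combine several p-values is a simple stand-in, not the combination procedure proposed in the paper.

```python
def conformal_p_value(calibration_scores, test_score):
    """Conformal p-value of a test score against calibration scores.

    Assumption: larger scores mean "more atypical" (more likely OOD).
    """
    n = len(calibration_scores)
    at_least_as_extreme = sum(1 for s in calibration_scores if s >= test_score)
    return (1 + at_least_as_extreme) / (n + 1)


def bonferroni_reject(p_values, alpha):
    """Flag a sample as OOD if any p-value is at most alpha / K.

    Bonferroni is an illustrative stand-in for a multiple testing procedure.
    """
    return min(p_values) <= alpha / len(p_values)


# Hypothetical calibration scores from in-distribution data.
calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
p_typical = conformal_p_value(calibration, 0.35)  # six calibration scores are >= 0.35
p_extreme = conformal_p_value(calibration, 2.0)   # no calibration score is >= 2.0
```

    With only nine calibration points the smallest attainable p-value is 1/10, so a larger calibration set is needed before small significance levels become reachable.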
    Inference of Multiscale Gaussian Graphical Model. (arXiv:2202.05775v2 [stat.ML] UPDATED)
    Gaussian Graphical Models (GGMs) are widely used for exploratory data analysis in various fields such as genomics, ecology, and psychometry. In a high-dimensional setting, when the number of variables exceeds the number of observations by several orders of magnitude, estimating a GGM is a difficult and unstable optimization problem. Clustering of variables or variable selection is often performed prior to GGM estimation. We propose a new method that simultaneously infers a hierarchical clustering structure and the graphs describing the structure of independence at each level of the hierarchy. This method is based on solving a convex optimization problem combining a graphical lasso penalty with a fused type lasso penalty. Results on real and synthetic data are presented.
    Beyond No Regret: Instance-Dependent PAC Reinforcement Learning. (arXiv:2108.02717v2 [cs.LG] UPDATED)
    The theory of reinforcement learning has focused on two fundamental problems: achieving low regret, and identifying $\epsilon$-optimal policies. While a simple reduction allows one to apply a low-regret algorithm to obtain an $\epsilon$-optimal policy and achieve the worst-case optimal rate, it is unknown whether low-regret algorithms can obtain the instance-optimal rate for policy identification. We show this is not possible -- there exists a fundamental tradeoff between achieving low regret and identifying an $\epsilon$-optimal policy at the instance-optimal rate. Motivated by our negative finding, we propose a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning which explicitly accounts for the attainable state visitation distributions in the underlying MDP. We then propose and analyze a novel, planning-based algorithm which attains this sample complexity -- yielding a complexity which scales with the suboptimality gaps and the "reachability" of a state. We show our algorithm is nearly minimax optimal, and on several examples that our instance-dependent sample complexity offers significant improvements over worst-case bounds.
    Nonparametric Multi-shape Modeling with Uncertainty Quantification. (arXiv:2206.09127v2 [stat.ML] UPDATED)
    The modeling and uncertainty quantification of closed curves is an important problem in the field of shape analysis, and can have significant ramifications for subsequent statistical tasks. Many of these tasks involve collections of closed curves, which often exhibit structural similarities at multiple levels. Modeling multiple closed curves in a way that efficiently incorporates such between-curve dependence remains a challenging problem. In this work, we propose and investigate a multiple-output (a.k.a. multi-output), multi-dimensional Gaussian process modeling framework. We illustrate the proposed methodological advances, and demonstrate the utility of meaningful uncertainty quantification, on several curve and shape-related tasks. This model-based approach not only addresses the problem of inference on closed curves (and their shapes) with kernel constructions, but also opens doors to nonparametric modeling of multi-level dependence for functional objects in general.
    Private and polynomial time algorithms for learning Gaussians and beyond. (arXiv:2111.11320v3 [stat.ML] UPDATED)
    We present a fairly general framework for reducing $(\varepsilon, \delta)$ differentially private (DP) statistical estimation to its non-private counterpart. As the main application of this framework, we give a polynomial time and $(\varepsilon,\delta)$-DP algorithm for learning (unrestricted) Gaussian distributions in $\mathbb{R}^d$. The sample complexity of our approach for learning the Gaussian up to total variation distance $\alpha$ is $\widetilde{O}(d^2/\alpha^2 + d^2\sqrt{\ln(1/\delta)}/\alpha \varepsilon + d\ln(1/\delta) / \alpha \varepsilon)$ matching (up to logarithmic factors) the best known information-theoretic (non-efficient) sample complexity upper bound due to Aden-Ali, Ashtiani, and Kamath (ALT'21). In an independent work, Kamath, Mouzakis, Singhal, Steinke, and Ullman (arXiv:2111.04609) proved a similar result using a different approach and with $O(d^{5/2})$ sample complexity dependence on $d$. As another application of our framework, we provide the first polynomial time $(\varepsilon, \delta)$-DP algorithm for robust learning of (unrestricted) Gaussians with sample complexity $\widetilde{O}(d^{3.5})$. In another independent work, Kothari, Manurangsi, and Velingker (arXiv:2112.03548) also provided a polynomial time $(\varepsilon, \delta)$-DP algorithm for robust learning of Gaussians with sample complexity $\widetilde{O}(d^8)$.
    Convergence Rates for Learning Linear Operators from Noisy Data. (arXiv:2108.12515v2 [math.ST] UPDATED)
    This paper studies the learning of linear operators between infinite-dimensional Hilbert spaces. The training data comprises pairs of random input vectors in a Hilbert space and their noisy images under an unknown self-adjoint linear operator. Assuming that the operator is diagonalizable in a known basis, this work solves the equivalent inverse problem of estimating the operator's eigenvalues given the data. Adopting a Bayesian approach, the theoretical analysis establishes posterior contraction rates in the infinite data limit with Gaussian priors that are not directly linked to the forward map of the inverse problem. The main results also include learning-theoretic generalization error guarantees for a wide range of distribution shifts. These convergence rates quantify the effects of data smoothness and true eigenvalue decay or growth, for compact or unbounded operators, respectively, on sample complexity. Numerical evidence supports the theory in diagonal and non-diagonal settings.
    MMD Aggregated Two-Sample Test. (arXiv:2110.15073v2 [stat.ML] UPDATED)
    We propose a novel nonparametric two-sample test based on the Maximum Mean Discrepancy (MMD), which is constructed by aggregating tests with different kernel bandwidths. This aggregation procedure, called MMDAgg, ensures that test power is maximised over the collection of kernels used, without requiring held-out data for kernel selection (which results in a loss of test power), or arbitrary kernel choices such as the median heuristic. We work in the non-asymptotic framework, and prove that our aggregated test is minimax adaptive over Sobolev balls. Our guarantees are not restricted to a specific kernel, but hold for any product of one-dimensional translation invariant characteristic kernels which are absolutely and square integrable. Moreover, our results apply for popular numerical procedures to determine the test threshold, namely permutations and the wild bootstrap. Through numerical experiments on both synthetic and real-world datasets, we demonstrate that MMDAgg outperforms alternative state-of-the-art approaches to MMD kernel adaptation for two-sample testing.
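    To make the building block of the entry above concrete, here is a minimal sketch of a single-bandwidth MMD permutation test on one-dimensional samples with a Gaussian kernel. It is not MMDAgg itself: the aggregation over several bandwidths and its level correction are omitted. The function names, the biased (V-statistic) estimator, the bandwidth, and the toy data are all assumptions made for illustration.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth):
    """Biased (V-statistic) estimate of the squared MMD between two 1-D samples."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def permutation_p_value(xs, ys, bandwidth, num_permutations=200, seed=0):
    """P-value from re-splitting the pooled sample at random."""
    rng = random.Random(seed)
    observed = mmd_squared(xs, ys, bandwidth)
    pooled = list(xs) + list(ys)
    count = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)
        permuted = mmd_squared(pooled[:len(xs)], pooled[len(xs):], bandwidth)
        if permuted >= observed:
            count += 1
    return (1 + count) / (1 + num_permutations)


# Toy data: the second sample is the first one shifted by 5.
sample_a = [0.1 * i for i in range(10)]
sample_b = [5.0 + 0.1 * i for i in range(10)]
p_shifted = permutation_p_value(sample_a, sample_b, bandwidth=1.0)
```

    The result depends on the chosen bandwidth, which is the sensitivity that aggregating over a collection of bandwidths is meant to remove.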
    Efficient Online Linear Control with Stochastic Convex Costs and Unknown Dynamics. (arXiv:2203.01170v2 [math.OC] UPDATED)
    We consider the problem of controlling an unknown linear dynamical system under a stochastic convex cost and full feedback of both the state and cost function. We present a computationally efficient algorithm that attains an optimal $\sqrt{T}$ regret-rate compared to the best stabilizing linear controller in hindsight. In contrast to previous work, our algorithm is based on the Optimism in the Face of Uncertainty paradigm. This results in a substantially improved computational complexity and a simpler analysis.
    Regression-based projection for learning Mori-Zwanzig operators. (arXiv:2205.05135v2 [math.DS] UPDATED)
    We propose to adopt statistical regression as the projection operator to enable data-driven learning of the operators in the Mori--Zwanzig formalism. We present a principled method to extract the Markov and memory operators for any regression models. We show that the choice of linear regression results in a recently proposed data-driven learning algorithm based on Mori's projection operator, which is a higher-order approximate Koopman learning method. We show that more expressive nonlinear regression models naturally fill in the gap between the highly idealized and computationally efficient Mori's projection operator and the most optimal yet computationally infeasible Zwanzig's projection operator. We performed numerical experiments and extracted the operators for an array of regression-based projections, including linear, polynomial, spline, and neural-network-based regressions, showing a progressive improvement as the complexity of the regression model increased. Our proposition provides a general framework to extract memory-dependent corrections and can be readily applied to an array of data-driven learning methods for stationary dynamical systems in the literature.
    Large-scale Stochastic Optimization of NDCG Surrogates for Deep Learning with Provable Convergence. (arXiv:2202.12183v3 [cs.LG] UPDATED)
    NDCG, namely Normalized Discounted Cumulative Gain, is a widely used ranking metric in information retrieval and machine learning. However, efficient and provable stochastic methods for maximizing NDCG are still lacking, especially for deep models. In this paper, we propose a principled approach to optimize NDCG and its top-$K$ variant. First, we formulate a novel compositional optimization problem for optimizing the NDCG surrogate, and a novel bilevel compositional optimization problem for optimizing the top-$K$ NDCG surrogate. Then, we develop efficient stochastic algorithms with provable convergence guarantees for the non-convex objectives. Different from existing NDCG optimization methods, the per-iteration complexity of our algorithms scales with the mini-batch size instead of the number of total items. To improve the effectiveness for deep learning, we further propose practical strategies by using initial warm-up and stop gradient operator. Experimental results on multiple datasets demonstrate that our methods outperform prior ranking approaches in terms of NDCG. To the best of our knowledge, this is the first time that stochastic algorithms are proposed to optimize NDCG with a provable convergence guarantee. Our proposed methods are implemented in the LibAUC library at https://libauc.org/.
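    For readers unfamiliar with the metric in the entry above, the following sketch computes NDCG and its top-K truncation for a single query. It assumes the common exponential gain 2^rel - 1 with a log2 rank discount, and the function names are hypothetical. It evaluates the metric only and does not implement the stochastic surrogate optimization described in the paper or the LibAUC API.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a ranked relevance list, truncated at k."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(scores, relevances, k):
    """NDCG@k of the ranking induced by sorting items by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


relevances = [3, 2, 0]
perfect = ndcg_at_k([0.9, 0.5, 0.1], relevances, k=3)   # scores agree with relevance
reversed_ = ndcg_at_k([0.1, 0.5, 0.9], relevances, k=3)  # scores invert the order
```

    Because the metric depends on the scores only through a sort, it is piecewise constant in the model parameters, which is why differentiable surrogates are optimized in its place.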
    Least Squares Estimation Using Sketched Data with Heteroskedastic Errors. (arXiv:2007.07781v3 [stat.ML] UPDATED)
    Researchers may perform regressions using a sketch of data of size $m$ instead of the full sample of size $n$ for a variety of reasons. This paper considers the case when the regression errors do not have constant variance and heteroskedasticity robust standard errors would normally be needed for test statistics to provide accurate inference. We show that estimates using data sketched by random projections will behave `as if' the errors were homoskedastic. Estimation by random sampling would not have this property. The result arises because the sketched estimates in the case of random projections can be expressed as degenerate $U$-statistics, and under certain conditions, these statistics are asymptotically normal with homoskedastic variance. We verify that the conditions hold not only in the case of least squares regression when the covariates are exogenous, but also in instrumental variables estimation when the covariates are endogenous. The result implies that inference, including first-stage F tests for instrument relevance, can be simpler than the full sample case if the sketching scheme is appropriately chosen.
    Minimax Semiparametric Learning With Approximate Sparsity. (arXiv:1912.12213v4 [math.ST] UPDATED)
    This paper is about the feasibility and means of root-n consistently estimating linear, mean-square continuous functionals of a high dimensional, approximately sparse regression. Such objects include a wide variety of interesting parameters such as regression coefficients, average derivatives, and the average treatment effect. We give lower bounds on the convergence rate of estimators of a regression slope and an average derivative and find that these bounds are substantially larger than in a low dimensional, semiparametric setting. We also give debiased machine learners that are root-n consistent under either a minimal approximate sparsity condition or rate double robustness. These estimators improve on existing estimators in being root-n consistent under more general conditions than previously known.
    Discriminative Bayesian filtering lends momentum to the stochastic Newton method for minimizing log-convex functions. (arXiv:2104.12949v2 [stat.ML] UPDATED)
    To minimize the average of a set of log-convex functions, the stochastic Newton method iteratively updates its estimate using subsampled versions of the full objective's gradient and Hessian. We contextualize this optimization problem as sequential Bayesian inference on a latent state-space model with a discriminatively-specified observation process. Applying Bayesian filtering then yields a novel optimization algorithm that considers the entire history of gradients and Hessians when forming an update. We establish matrix-based conditions under which the effect of older observations diminishes over time, in a manner analogous to Polyak's heavy ball momentum. We illustrate various aspects of our approach with an example and review other relevant innovations for the stochastic Newton method.
    Algorithms that get old : the case of generative deep neural networks. (arXiv:2202.03008v2 [stat.ML] UPDATED)
    Generative deep neural networks used in machine learning, such as Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs), produce new objects each time they are asked to do so, with the constraint that the new objects remain similar to some list of examples given as input. However, this behavior is unlike that of human artists, who change their style as time goes by and seldom return to their initial creations. We investigate a situation where VAEs are used to sample from a probability measure described by some empirical dataset. Based on recent works on Radon-Sobolev statistical distances, we propose a numerical paradigm, to be used in conjunction with a generative algorithm, that satisfies the following two requirements: the objects created do not repeat and evolve to fill the entire target probability measure.
    Noisy $\ell^{0}$-Sparse Subspace Clustering on Dimensionality Reduced Data. (arXiv:2206.11079v1 [stat.ML])
    Sparse subspace clustering methods with sparsity induced by the $\ell^{0}$-norm, such as $\ell^{0}$-Sparse Subspace Clustering ($\ell^{0}$-SSC)~\citep{YangFJYH16-L0SSC-ijcv}, have been shown to be more effective than their $\ell^{1}$ counterparts such as Sparse Subspace Clustering (SSC)~\citep{ElhamifarV13}. However, the theoretical analysis of $\ell^{0}$-SSC is restricted to clean data that lie exactly in subspaces. Real data often suffer from noise and may only lie close to subspaces. In this paper, we show that an optimal solution to the optimization problem of noisy $\ell^{0}$-SSC achieves the subspace detection property (SDP), a key element with which data from different subspaces are separated, under deterministic and semi-random models. Our results provide, for the first time, a theoretical guarantee on the correctness of noisy $\ell^{0}$-SSC in terms of SDP on noisy data, which reveals the advantage of noisy $\ell^{0}$-SSC in terms of a much less restrictive condition on subspace affinity. In order to improve the efficiency of noisy $\ell^{0}$-SSC, we propose Noisy-DR-$\ell^{0}$-SSC, which provably recovers the subspaces on dimensionality reduced data. Noisy-DR-$\ell^{0}$-SSC first projects the data onto a lower dimensional space by random projection, then performs noisy $\ell^{0}$-SSC on the projected data for improved efficiency. Experimental results demonstrate the effectiveness of Noisy-DR-$\ell^{0}$-SSC.
    Langevin Monte Carlo for Contextual Bandits. (arXiv:2206.11254v1 [cs.LG])
    We study the efficiency of Thompson sampling for contextual bandits. Existing Thompson sampling-based algorithms need to construct a Laplace approximation (i.e., a Gaussian distribution) of the posterior distribution, which is inefficient to sample in high dimensional applications for general covariance matrices. Moreover, the Gaussian approximation may not be a good surrogate for the posterior distribution for general reward generating functions. We propose an efficient posterior sampling algorithm, viz., Langevin Monte Carlo Thompson Sampling (LMC-TS), that uses Markov Chain Monte Carlo (MCMC) methods to directly sample from the posterior distribution in contextual bandits. Our method is computationally efficient since it only needs to perform noisy gradient descent updates without constructing the Laplace approximation of the posterior distribution. We prove that the proposed algorithm achieves the same sublinear regret bound as the best Thompson sampling algorithms for a special case of contextual bandits, viz., linear contextual bandits. We conduct experiments on both synthetic data and real-world datasets on different contextual bandit models, which demonstrates that directly sampling from the posterior is both computationally efficient and competitive in performance.
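    The "noisy gradient descent updates" in the entry above can be sketched on a toy problem. The code below runs the unadjusted Langevin update on a scalar Gaussian posterior (standard normal prior, unit-variance observations) whose exact mean is known, so the chain can be checked. The step size, chain length, data, and function names are assumptions for illustration. The bandit part of LMC-TS (using a posterior sample to choose an arm) is not shown.

```python
import math
import random


def langevin_samples(grad_neg_log_post, theta0, step_size, num_steps, seed=0):
    """Unadjusted Langevin chain: a gradient step plus Gaussian noise."""
    rng = random.Random(seed)
    theta = theta0
    samples = []
    for _ in range(num_steps):
        noise = rng.gauss(0.0, 1.0)
        theta = (theta
                 - step_size * grad_neg_log_post(theta)
                 + math.sqrt(2.0 * step_size) * noise)
        samples.append(theta)
    return samples


# Toy model: prior theta ~ N(0, 1), observations y_i ~ N(theta, 1).
observations = [1.0, 2.0, 3.0, 2.0]


def grad_neg_log_post(theta):
    # Gradient of theta^2 / 2 + sum_i (theta - y_i)^2 / 2.
    return theta + sum(theta - y for y in observations)


chain = langevin_samples(grad_neg_log_post, theta0=0.0, step_size=0.01, num_steps=20000)
kept = chain[2000:]  # discard an initial burn-in segment
estimated_mean = sum(kept) / len(kept)
exact_mean = sum(observations) / (len(observations) + 1)  # 8 / 5 for this toy model
```

    No covariance matrix is built or factorized at any point, which is the computational advantage the abstract attributes to avoiding the Laplace approximation.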
    Model-free Representation Learning and Exploration in Low-rank MDPs. (arXiv:2102.07035v2 [cs.LG] UPDATED)
    The low rank MDP has emerged as an important model for studying representation learning and exploration in reinforcement learning. With a known representation, several model-free exploration strategies exist. In contrast, all algorithms for the unknown representation setting are model-based, thereby requiring the ability to model the full dynamics. In this work, we present the first model-free representation learning algorithms for low rank MDPs. The key algorithmic contribution is a new minimax representation learning objective, for which we provide variants with differing tradeoffs in their statistical and computational properties. We interleave this representation learning step with an exploration strategy to cover the state space in a reward-free manner. The resulting algorithms are provably sample efficient and can accommodate general function approximation to scale to complex environments.
    Scaling and Scalability: Provable Nonconvex Low-Rank Tensor Estimation from Incomplete Measurements. (arXiv:2104.14526v3 [cs.LG] UPDATED)
    Tensors, which provide a powerful and flexible model for representing multi-attribute data and multi-way interactions, play an indispensable role in modern data science across various fields in science and engineering. A fundamental task is to faithfully recover the tensor from highly incomplete measurements in a statistically and computationally efficient manner. Harnessing the low-rank structure of tensors in the Tucker decomposition, this paper develops a scaled gradient descent (ScaledGD) algorithm to directly recover the tensor factors with tailored spectral initializations, and shows that it provably converges at a linear rate independent of the condition number of the ground truth tensor for two canonical problems -- tensor completion and tensor regression -- as soon as the sample size is above the order of $n^{3/2}$ ignoring other parameter dependencies, where $n$ is the dimension of the tensor. This leads to an extremely scalable approach to low-rank tensor estimation compared with prior art, which suffers from at least one of the following drawbacks: extreme sensitivity to ill-conditioning, high per-iteration costs in terms of memory and computation, or poor sample complexity guarantees. To the best of our knowledge, ScaledGD is the first algorithm that achieves near-optimal statistical and computational complexities simultaneously for low-rank tensor completion with the Tucker decomposition. Our algorithm highlights the power of appropriate preconditioning in accelerating nonconvex statistical estimation, where the iteration-varying preconditioners promote desirable invariance properties of the trajectory with respect to the underlying symmetry in low-rank tensor factorization.
    Optimal transport meets noisy label robust loss and MixUp regularization for domain adaptation. (arXiv:2206.11180v1 [cs.CV])
    It is common in computer vision to be confronted with domain shift: images which have the same class but different acquisition conditions. In domain adaptation (DA), one wants to classify unlabeled target images using source labeled images. Unfortunately, deep neural networks trained on a source training set perform poorly on target images which do not belong to the training domain. One strategy to improve these performances is to align the source and target image distributions in an embedded space using optimal transport (OT). However OT can cause negative transfer, i.e. aligning samples with different labels, which leads to overfitting especially in the presence of label shift between domains. In this work, we mitigate negative alignment by explaining it as a noisy label assignment to target images. We then mitigate its effect by appropriate regularization. We propose to couple the MixUp regularization \citep{zhang2018mixup} with a loss that is robust to noisy labels in order to improve domain adaptation performance. We show in an extensive ablation study that a combination of the two techniques is critical to achieve improved performance. Finally, we evaluate our method, called \textsc{mixunbot}, on several benchmarks and real-world DA problems.
    Active Learning with Safety Constraints. (arXiv:2206.11183v1 [cs.LG])
    Active learning methods have shown great promise in reducing the number of samples necessary for learning. As automated learning systems are adopted into real-time, real-world decision-making pipelines, it is increasingly important that such algorithms are designed with safety in mind. In this work we investigate the complexity of learning the best safe decision in interactive environments. We reduce this problem to a constrained linear bandits problem, where our goal is to find the best arm satisfying certain (unknown) safety constraints. We propose an adaptive experimental design-based algorithm, which we show efficiently trades off between the difficulty of showing an arm is unsafe vs suboptimal. To our knowledge, our results are the first on best-arm identification in linear bandits with safety constraints. In practice, we demonstrate that this approach performs well on synthetic and real world datasets.
    Ordered Subgraph Aggregation Networks. (arXiv:2206.11168v1 [cs.LG])
    Numerous subgraph-enhanced graph neural networks (GNNs) have emerged recently, provably boosting the expressive power of standard (message-passing) GNNs. However, there is a limited understanding of how these approaches relate to each other and to the Weisfeiler--Leman hierarchy. Moreover, current approaches either use all subgraphs of a given size, sample them uniformly at random, or use hand-crafted heuristics instead of learning to select subgraphs in a data-driven manner. Here, we offer a unified way to study such architectures by introducing a theoretical framework and extending the known expressivity results of subgraph-enhanced GNNs. Concretely, we show that increasing subgraph size always increases the expressive power and develop a better understanding of their limitations by relating them to the established $k\text{-}\mathsf{WL}$ hierarchy. In addition, we explore different approaches for learning to sample subgraphs using recent methods for backpropagating through complex discrete probability distributions. Empirically, we study the predictive performance of different subgraph-enhanced GNNs, showing that our data-driven architectures increase prediction accuracy on standard benchmark datasets compared to non-data-driven subgraph-enhanced graph neural networks while reducing computation time.
    Sharing pattern submodels for prediction with missing values. (arXiv:2206.11161v1 [cs.LG])
    Missing values are unavoidable in many applications of machine learning and present a challenge both during training and at test time. When variables are missing in recurring patterns, fitting separate pattern submodels has been proposed as a solution. However, independent models do not make efficient use of all available data. Conversely, fitting a shared model to the full data set typically relies on imputation, which may be suboptimal when missingness depends on unobserved factors. We propose an alternative approach, called sharing pattern submodels, which a) makes predictions that are robust to missing values at test time, b) maintains or improves the predictive power of pattern submodels, and c) has a short description enabling improved interpretability. We identify cases where sharing is provably optimal, even when missingness itself is predictive and when the prediction target depends on unobserved variables. Classification and regression experiments on synthetic data and two healthcare data sets demonstrate that our models achieve a favorable trade-off between pattern specialization and information sharing.
    A view of mini-batch SGD via generating functions: conditions of convergence, phase transitions, benefit from negative momenta. (arXiv:2206.11124v1 [cs.LG])
    Mini-batch SGD with momentum is a fundamental algorithm for learning large predictive models. In this paper we develop a new analytic framework to analyze mini-batch SGD for linear models at different momenta and sizes of batches. Our key idea is to describe the loss value sequence in terms of its generating function, which can be written in a compact form assuming a diagonal approximation for the second moments of model weights. By analyzing this generating function, we deduce various conclusions on the convergence conditions, phase structure of the model, and optimal learning settings. As a few examples, we show that 1) the optimization trajectory can generally switch from the "signal-dominated" to the "noise-dominated" phase, at a time scale that can be predicted analytically; 2) in the "signal-dominated" (but not the "noise-dominated") phase it is favorable to choose a large effective learning rate, however its value must be limited for any finite batch size to avoid divergence; 3) optimal convergence rate can be achieved at a negative momentum. We verify our theoretical predictions by extensive experiments with MNIST and synthetic problems, and find a good quantitative agreement.
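    The algorithm analyzed in the entry above can be written down in a few lines. The sketch below runs heavy-ball mini-batch SGD on a scalar, noise-free linear model, once with a positive and once with a negative momentum. The data, learning rate, momentum values, and function name are assumptions chosen for illustration. It shows only that both settings converge on this toy problem and does not reproduce the paper's generating-function analysis or its optimality claims.

```python
import random


def sgd_momentum(xs, ys, lr, momentum, batch_size, num_steps, seed=0):
    """Heavy-ball mini-batch SGD for the scalar linear model y ~ w * x."""
    rng = random.Random(seed)
    w, velocity = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(num_steps):
        batch = rng.sample(indices, batch_size)
        # Gradient of the batch mean of 0.5 * (w * x - y)^2 with respect to w.
        grad = sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        velocity = momentum * velocity - lr * grad
        w += velocity
    return w


xs = [0.5, 1.0, 1.5, 2.0]
ys = [3.0 * x for x in xs]  # noise-free targets with true weight 3
w_positive = sgd_momentum(xs, ys, lr=0.1, momentum=0.5, batch_size=2, num_steps=500)
w_negative = sgd_momentum(xs, ys, lr=0.1, momentum=-0.3, batch_size=2, num_steps=500)
```

    A negative momentum partially cancels the previous step, which lowers the effective learning rate lr / (1 - momentum) and can damp the noise introduced by batch sampling.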
    Discussion of `Multiscale Fisher's Independence Test for Multivariate Dependence'. (arXiv:2206.11142v1 [stat.ME])
    We discuss how MultiFIT, the Multiscale Fisher's Independence Test for Multivariate Dependence proposed by Gorsky and Ma (2022), compares to existing linear-time kernel tests based on the Hilbert-Schmidt independence criterion (HSIC). We highlight the fact that the levels of the kernel tests at any finite sample size can be controlled exactly, as is the case with the level of MultiFIT. In our experiments, we observe some of the performance limitations of MultiFIT in terms of test power.
    Agent-based Graph Neural Networks. (arXiv:2206.11010v1 [cs.LG])
    We present a novel graph neural network we call AgentNet, which is designed specifically for graph-level tasks. AgentNet is inspired by sublinear algorithms, featuring a computational complexity that is independent of the graph size. The architecture of AgentNet differs fundamentally from the architectures of known graph neural networks. In AgentNet, some trained \textit{neural agents} intelligently walk the graph, and then collectively decide on the output. We provide an extensive theoretical analysis of AgentNet: We show that the agents can learn to systematically explore their neighborhood and that AgentNet can distinguish some structures that are even indistinguishable by 3-WL. Moreover, AgentNet is able to separate any two graphs which are sufficiently different in terms of subgraphs. We confirm these theoretical results with synthetic experiments on hard-to-distinguish graphs and real-world graph classification tasks. In both cases, we compare favorably not only to standard GNNs but also to computationally more expensive GNN extensions.
    Bregman Power k-Means for Clustering Exponential Family Data. (arXiv:2206.10860v1 [stat.ML])
    Recent progress in center-based clustering algorithms combats poor local minima by implicit annealing, using a family of generalized means. These methods are variations of Lloyd's celebrated $k$-means algorithm, and are most appropriate for spherical clusters such as those arising from Gaussian data. In this paper, we bridge these algorithmic advances to classical work on hard clustering under Bregman divergences, which enjoy a bijection to exponential family distributions and are thus well-suited for clustering objects arising from a breadth of data generating mechanisms. The elegant properties of Bregman divergences allow us to maintain closed form updates in a simple and transparent algorithm, and moreover lead to new theoretical arguments for establishing finite sample bounds that relax the bounded support assumption made in the existing state of the art. Additionally, we consider thorough empirical analyses on simulated experiments and a case study on rainfall data, finding that the proposed method outperforms existing peer methods in a variety of non-Gaussian data settings.
    Sharp Constants in Uniformity Testing via the Huber Statistic. (arXiv:2206.10722v1 [stat.ML])
    Uniformity testing is one of the most well-studied problems in property testing, with many known test statistics, including ones based on counting collisions, singletons, and the empirical TV distance. It is known that the optimal sample complexity to distinguish the uniform distribution on $m$ elements from any $\epsilon$-far distribution with $1-\delta$ probability is $n = \Theta\left(\frac{\sqrt{m \log (1/\delta)}}{\epsilon^2} + \frac{\log (1/\delta)}{\epsilon^2}\right)$, which is achieved by the empirical TV tester. Yet in simulation, these theoretical analyses are misleading: in many cases, they do not correctly rank order the performance of existing testers, even in an asymptotic regime of all parameters tending to $0$ or $\infty$. We explain this discrepancy by studying the \emph{constant factors} required by the algorithms. We show that the collisions tester achieves a sharp maximal constant in the number of standard deviations of separation between uniform and non-uniform inputs. We then introduce a new tester based on the Huber loss, and show that it not only matches this separation, but also has tails corresponding to a Gaussian with this separation. This leads to a sample complexity of $(1 + o(1))\frac{\sqrt{m \log (1/\delta)}}{\epsilon^2}$ in the regime where this term is dominant, unlike all other existing testers.
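    The collisions statistic mentioned in this abstract is simple to compute. The sketch below only counts colliding pairs and gives their expectation under the uniform distribution; the rejection threshold and the Huber-based tester from the paper are not reproduced, and the function names are hypothetical.

```python
from collections import Counter
from math import comb

def collision_count(samples):
    """Number of unordered pairs (i, j), i < j, with samples[i] == samples[j]."""
    counts = Counter(samples)
    return sum(comb(c, 2) for c in counts.values())

def expected_collisions_uniform(n, m):
    """Expected collision count for n i.i.d. draws from the uniform distribution on m elements."""
    return comb(n, 2) / m

samples = [0, 1, 1, 2, 2, 2]
observed = collision_count(samples)                          # 0 + 1 + 3 = 4
expected = expected_collisions_uniform(len(samples), m=3)    # 15 / 3 = 5.0
```

    A tester would reject uniformity when the observed count exceeds the expectation by a suitable margin; choosing that margin is exactly where the constant factors discussed in the abstract come in.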
    On the Maximum Hessian Eigenvalue and Generalization. (arXiv:2206.10654v1 [cs.LG])
    The mechanisms by which certain training interventions, such as increasing learning rates and applying batch normalization, improve the generalization of deep networks remain a mystery. Prior works have speculated that "flatter" solutions generalize better than "sharper" solutions to unseen data, motivating several metrics for measuring flatness (particularly $\lambda_{max}$, the largest eigenvalue of the Hessian of the loss); and algorithms, such as Sharpness-Aware Minimization (SAM) [1], that directly optimize for flatness. Other works question the link between $\lambda_{max}$ and generalization. In this paper, we present findings that call $\lambda_{max}$'s influence on generalization further into question. We show that: (1) while larger learning rates reduce $\lambda_{max}$ for all batch sizes, generalization benefits sometimes vanish at larger batch sizes; (2) by scaling batch size and learning rate simultaneously, we can change $\lambda_{max}$ without affecting generalization; (3) while SAM produces smaller $\lambda_{max}$ for all batch sizes, generalization benefits (also) vanish with larger batch sizes; (4) for dropout, excessively high dropout probabilities can degrade generalization, even as they promote smaller $\lambda_{max}$; and (5) while batch-normalization does not consistently produce smaller $\lambda_{max}$, it nevertheless confers generalization benefits. While our experiments affirm the generalization benefits of large learning rates and SAM for minibatch SGD, the GD-SGD discrepancy demonstrates limits to $\lambda_{max}$'s ability to explain generalization in neural networks.
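    For readers unfamiliar with $\lambda_{max}$, it is usually estimated with power iteration on Hessian-vector products rather than by forming the Hessian. The sketch below is an assumption-laden illustration: it uses an explicit small matrix in place of a network Hessian, and `largest_eigenvalue` and `hvp` are hypothetical names. Power iteration converges to the eigenvalue of largest magnitude, which equals $\lambda_{max}$ only when that eigenvalue is positive.

```python
import numpy as np

def largest_eigenvalue(hvp, dim, iters=200, seed=0):
    """Power iteration using only a Hessian-vector product callable `hvp`."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = hvp(v)
        lam = float(v @ hv)            # Rayleigh quotient for the current unit vector
        norm = np.linalg.norm(hv)
        if norm == 0.0:                # v lies in the null space; nothing more to do
            break
        v = hv / norm
    return lam

# Stand-in for a loss Hessian (assumed for the example).
H = np.diag([5.0, 2.0, 1.0])
lam = largest_eigenvalue(lambda v: H @ v, dim=3)
```

    In a deep learning framework the `hvp` callable would typically come from automatic differentiation, so the full Hessian is never materialized.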
    SoccerCPD: Formation and Role Change-Point Detection in Soccer Matches Using Spatiotemporal Tracking Data. (arXiv:2206.10926v1 [stat.AP])
    In fluid team sports such as soccer and basketball, analyzing team formation is one of the most intuitive ways to understand tactics from domain participants' point of view. However, existing approaches either assume that team formation is consistent throughout a match or assign formations frame-by-frame, which disagree with real situations. To tackle this issue, we propose a change-point detection framework named SoccerCPD that distinguishes tactically intended formation and role changes from temporary changes in soccer matches. We first assign roles to players frame-by-frame and perform two-step change-point detections: (1) formation change-point detection based on the sequence of role-adjacency matrices and (2) role change-point detection based on the sequence of role permutations. The evaluation of SoccerCPD using the ground truth annotated by domain experts shows that our method accurately detects the points of tactical changes and estimates the formation and role assignment per segment. Lastly, we introduce practical use-cases that domain participants can easily interpret and utilize.
    A consistent and flexible framework for deep matrix factorizations. (arXiv:2206.10693v1 [cs.LG])
    Deep matrix factorizations (deep MFs) are recent unsupervised data mining techniques inspired by constrained low-rank approximations. They aim to extract complex hierarchies of features within high-dimensional datasets. Most of the loss functions proposed in the literature to evaluate the quality of deep MF models and the underlying optimization frameworks are not consistent because different losses are used at different layers. In this paper, we introduce two meaningful loss functions for deep MF and present a generic framework to solve the corresponding optimization problems. We illustrate the effectiveness of this approach through the integration of various constraints and regularizations, such as sparsity, nonnegativity and minimum-volume. The models are successfully applied on both synthetic and real data, namely for hyperspectral unmixing and extraction of facial features.
    List-Decodable Covariance Estimation. (arXiv:2206.10942v1 [cs.DS])
    We give the first polynomial time algorithm for \emph{list-decodable covariance estimation}. For any $\alpha > 0$, our algorithm takes as input a sample $Y \subseteq \mathbb{R}^d$ of size $n\geq d^{\mathsf{poly}(1/\alpha)}$ obtained by adversarially corrupting $(1-\alpha)n$ points in an i.i.d. sample $X$ of size $n$ from the Gaussian distribution with unknown mean $\mu_*$ and covariance $\Sigma_*$. In $n^{\mathsf{poly}(1/\alpha)}$ time, it outputs a constant-size list of $k = k(\alpha)= (1/\alpha)^{\mathsf{poly}(1/\alpha)}$ candidate parameters that, with high probability, contains a $(\hat{\mu},\hat{\Sigma})$ such that the total variation distance $TV(\mathcal{N}(\mu_*,\Sigma_*),\mathcal{N}(\hat{\mu},\hat{\Sigma}))<1-O_{\alpha}(1)$. This is the statistically strongest notion of distance and implies multiplicative spectral and relative Frobenius distance approximation for parameters with dimension independent error. Our algorithm works more generally for $(1-\alpha)$-corruptions of any distribution $D$ that possesses low-degree sum-of-squares certificates of two natural analytic properties: 1) anti-concentration of one-dimensional marginals and 2) hypercontractivity of degree 2 polynomials. Prior to our work, the only known results for estimating covariance in the list-decodable setting were for the special cases of list-decodable linear regression and subspace recovery due to Karmarkar, Klivans, and Kothari (2019), Raghavendra and Yau (2019 and 2020) and Bakshi and Kothari (2020). These results need superpolynomial time for obtaining any subconstant error in the underlying dimension. Our result implies the first polynomial-time \emph{exact} algorithm for list-decodable linear regression and subspace recovery that allows, in particular, to obtain $2^{-\mathsf{poly}(d)}$ error in polynomial-time. Our result also implies an improved algorithm for clustering non-spherical mixtures.
    Information Geometry of Dropout Training. (arXiv:2206.10936v1 [stat.ML])
    Dropout is one of the most popular regularization techniques in neural network training. Because of its power and simplicity of idea, dropout has been analyzed extensively and many variants have been proposed. In this paper, several properties of dropout are discussed in a unified manner from the viewpoint of information geometry. We showed that dropout flattens the model manifold and that its regularization performance depends on the amount of the curvature. Then, we showed that dropout essentially corresponds to a regularization that depends on the Fisher information, and support this result from numerical experiments. Such a theoretical analysis of the technique from a different perspective is expected to greatly assist in the understanding of neural networks, which are still in their infancy.
    Diagnostic Tool for Out-of-Sample Model Evaluation. (arXiv:2206.10982v1 [stat.ML])
    Assessment of model fitness is an important step in many problems. Models are typically fitted to training data by minimizing a loss function, such as the squared-error or negative log-likelihood, and it is natural to desire low losses on future data. This letter considers the use of a test data set to characterize the out-of-sample losses of a model. We propose a simple model diagnostic tool that provides finite-sample guarantees under weak assumptions. The tool is computationally efficient and can be interpreted as an empirical quantile. Several numerical experiments are presented to show how the proposed method quantifies the impact of distribution shifts, aids the analysis of regression, and enables model selection as well as hyper-parameter tuning.
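    The idea of reading a finite-sample guarantee off an empirical quantile of test losses can be sketched as follows. This is a generic order-statistic bound, not necessarily the tool proposed in the letter: if the test losses and a future loss are exchangeable, the k-th smallest of n test losses is at least as large as the future loss with probability at least k / (n + 1). The function name `loss_quantile_bound` is hypothetical.

```python
import math
import numpy as np

def loss_quantile_bound(test_losses, alpha=0.1):
    """Level that a future loss stays below with probability >= 1 - alpha,
    assuming exchangeability of test and future losses (illustrative)."""
    losses = np.sort(np.asarray(test_losses, dtype=float))
    n = losses.size
    k = math.ceil((1 - alpha) * (n + 1))   # rank of the order statistic to report
    if k > n:
        return math.inf                    # too few test points for this alpha
    return float(losses[k - 1])

bound = loss_quantile_bound([9, 1, 8, 2, 7, 3, 6, 4, 5], alpha=0.5)  # k = 5
too_few = loss_quantile_bound([0.2, 0.4, 0.1], alpha=0.1)            # k = 4 > n = 3
```

    Comparing such bounds across candidate models, or between an in-distribution and a shifted test set, is one way the quantile view supports model selection and shift diagnosis.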
    Beyond Uniform Lipschitz Condition in Differentially Private Optimization. (arXiv:2206.10713v1 [cs.LG])
    Most prior convergence results on differentially private stochastic gradient descent (DP-SGD) are derived under the simplistic assumption of uniform Lipschitzness, i.e., the per-sample gradients are uniformly bounded. This assumption is unrealistic in many problems, e.g., linear regression with Gaussian data. We relax uniform Lipschitzness by instead assuming that the per-sample gradients have \textit{sample-dependent} upper bounds, i.e., per-sample Lipschitz constants, which themselves may be unbounded. We derive new convergence results for DP-SGD on both convex and nonconvex functions when the per-sample Lipschitz constants have bounded moments. Furthermore, we provide principled guidance on choosing the clip norm in DP-SGD for convex settings satisfying our relaxed version of Lipschitzness, without making distributional assumptions on the Lipschitz constants. We verify the effectiveness of our recommendation via experiments on benchmarking datasets.
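    The clip norm discussed in this abstract enters DP-SGD through per-sample gradient clipping followed by Gaussian noise. The sketch below shows the standard form of that step in NumPy; the function name, the argument names, and the example values are assumptions, and no privacy accounting is performed.

```python
import numpy as np

def dp_sgd_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-sample gradient to `clip_norm`, sum, add Gaussian noise, average."""
    g = np.asarray(per_sample_grads, dtype=float)            # shape (batch, dim)
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = g * scale                                       # norms now at most clip_norm
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=g.shape[1])
    return (clipped.sum(axis=0) + noise) / g.shape[0]

rng = np.random.default_rng(0)
grads = [[3.0, 4.0], [0.3, 0.4]]    # norms 5.0 and 0.5
noiseless = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
```

    When per-sample gradient norms are unbounded, as in the relaxed setting the abstract describes, the clip norm decides how many samples are rescaled, which is why its choice affects both bias and noise.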
    On the Statistical Efficiency of Reward-Free Exploration in Non-Linear RL. (arXiv:2206.10770v1 [cs.LG])
    We study reward-free reinforcement learning (RL) under general non-linear function approximation, and establish sample efficiency and hardness results under various standard structural assumptions. On the positive side, we propose the RFOLIVE (Reward-Free OLIVE) algorithm for sample-efficient reward-free exploration under minimal structural assumptions, which covers the previously studied settings of linear MDPs (Jin et al., 2020b), linear completeness (Zanette et al., 2020b) and low-rank MDPs with unknown representation (Modi et al., 2021). Our analyses indicate that the explorability or reachability assumptions, previously made for the latter two settings, are not necessary statistically for reward-free exploration. On the negative side, we provide a statistical hardness result for both reward-free and reward-aware exploration under linear completeness assumptions when the underlying features are unknown, showing an exponential separation between low-rank and linear completeness settings.
    Sparse Kernel Gaussian Processes through Iterative Charted Refinement (ICR). (arXiv:2206.10634v1 [cs.LG])
    Gaussian Processes (GPs) are highly expressive, probabilistic models. A major limitation is their computational complexity. Naively, exact GP inference requires $\mathcal{O}(N^3)$ computations with $N$ denoting the number of modeled points. Current approaches to overcome this limitation either rely on sparse, structured or stochastic representations of data or kernel respectively and usually involve nested optimizations to evaluate a GP. We present a new, generative method named Iterative Charted Refinement (ICR) to model GPs on nearly arbitrarily spaced points in $\mathcal{O}(N)$ time for decaying kernels without nested optimizations. ICR represents long- as well as short-range correlations by combining views of the modeled locations at varying resolutions with a user-provided coordinate chart. In our experiment with points whose spacings vary over two orders of magnitude, ICR's accuracy is comparable to state-of-the-art GP methods. ICR outperforms existing methods in terms of computational speed by one order of magnitude on the CPU and GPU and has already been successfully applied to model a GP with $122$ billion parameters.
    Concentration inequalities and optimal number of layers for stochastic deep neural networks. (arXiv:2206.11241v1 [cs.LG])
    We state concentration and martingale inequalities for the output of the hidden layers of a stochastic deep neural network (SDNN), as well as for the output of the whole SDNN. These results allow us to introduce an expected classifier (EC), and to give probabilistic upper bound for the classification error of the EC. We also state the optimal number of layers for the SDNN via an optimal stopping procedure. We apply our analysis to a stochastic version of a feedforward neural network with ReLU activation function.
    Does the Data Induce Capacity Control in Deep Learning?. (arXiv:2110.14163v3 [cs.LG] UPDATED)
    We show that the input correlation matrix of typical classification datasets has an eigenspectrum where, after a sharp initial drop, a large number of small eigenvalues are distributed uniformly over an exponentially large range. This structure is mirrored in a network trained on this data: we show that the Hessian and the Fisher Information Matrix (FIM) have eigenvalues that are spread uniformly over exponentially large ranges. We call such eigenspectra "sloppy" because sets of weights corresponding to small eigenvalues can be changed by large magnitudes without affecting the loss. Networks trained on atypical datasets with non-sloppy inputs do not share these traits and deep networks trained on such datasets generalize poorly. Inspired by this, we study the hypothesis that sloppiness of inputs aids generalization in deep networks. We show that if the Hessian is sloppy, we can compute non-vacuous PAC-Bayes generalization bounds analytically. By exploiting our empirical observation that training predominantly takes place in the non-sloppy subspace of the FIM, we develop data-distribution dependent PAC-Bayes priors that lead to accurate generalization bounds using numerical optimization.
    KSD Aggregated Goodness-of-fit Test. (arXiv:2202.00824v3 [stat.ML] UPDATED)
    We investigate properties of goodness-of-fit tests based on the Kernel Stein Discrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg, which aggregates multiple tests with different kernels. KSDAgg avoids splitting the data to perform kernel selection (which leads to a loss in test power), and rather maximises the test power over a collection of kernels. We provide theoretical guarantees on the power of KSDAgg: we show it achieves the smallest uniform separation rate of the collection, up to a logarithmic term. KSDAgg can be computed exactly in practice as it relies either on a parametric bootstrap or on a wild bootstrap to estimate the quantiles and the level corrections. In particular, for the crucial choice of bandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such as median or standard deviation) or to data splitting. We find on both synthetic and real-world data that KSDAgg outperforms other state-of-the-art adaptive KSD-based goodness-of-fit testing procedures.
    From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses. (arXiv:2205.07704v2 [stat.ML] UPDATED)
    We propose the Bayes-UCBVI algorithm for reinforcement learning in tabular, stage-dependent, episodic Markov decision process: a natural extension of the Bayes-UCB algorithm by Kaufmann et al. (2012) for multi-armed bandits. Our method uses the quantile of a Q-value function posterior as upper confidence bound on the optimal Q-value function. For Bayes-UCBVI, we prove a regret bound of order $\widetilde{O}(\sqrt{H^3SAT})$ where $H$ is the length of one episode, $S$ is the number of states, $A$ the number of actions, $T$ the number of episodes, that matches the lower-bound of $\Omega(\sqrt{H^3SAT})$ up to poly-$\log$ terms in $H,S,A,T$ for a large enough $T$. To the best of our knowledge, this is the first algorithm that obtains an optimal dependence on the horizon $H$ (and $S$) without the need for an involved Bernstein-like bonus or noise. Crucial to our analysis is a new fine-grained anti-concentration bound for a weighted Dirichlet sum that can be of independent interest. We then explain how Bayes-UCBVI can be easily extended beyond the tabular setting, exhibiting a strong link between our algorithm and Bayesian bootstrap (Rubin, 1981).
    Cold Posteriors through PAC-Bayes. (arXiv:2206.11173v1 [cs.LG])
    We investigate the cold posterior effect through the lens of PAC-Bayes generalization bounds. We argue that in the non-asymptotic setting, when the number of training samples is (relatively) small, discussions of the cold posterior effect should take into account that approximate Bayesian inference does not readily provide guarantees of performance on out-of-sample data. Instead, out-of-sample error is better described through a generalization bound. In this context, we explore the connections between the ELBO objective from variational inference and the PAC-Bayes objectives. We note that, while the ELBO and PAC-Bayes objectives are similar, the latter objectives naturally contain a temperature parameter $\lambda$ which is not restricted to be $\lambda=1$. For both regression and classification tasks, in the case of isotropic Laplace approximations to the posterior, we show how this PAC-Bayesian interpretation of the temperature parameter captures the cold posterior effect.
    Decentralized Gossip-Based Stochastic Bilevel Optimization over Communication Networks. (arXiv:2206.10870v1 [stat.ML])
    Bilevel optimization has gained growing interest, with numerous applications found in meta learning, minimax games, reinforcement learning, and nested composition optimization. This paper studies the problem of distributed bilevel optimization over a network where agents can only communicate with neighbors, including examples from multi-task, multi-agent learning and federated learning. In this paper, we propose a gossip-based distributed bilevel learning algorithm that allows networked agents to solve both the inner and outer optimization problems in a single timescale and share information via network propagation. We show that our algorithm enjoys the $\mathcal{O}(\frac{1}{K \epsilon^2})$ per-agent sample complexity for general nonconvex bilevel optimization and $\mathcal{O}(\frac{1}{K \epsilon})$ for strongly convex objectives, achieving a speedup that scales linearly with the network size. The sample complexities are optimal in both $\epsilon$ and $K$. We test our algorithm on the examples of hyperparameter tuning and decentralized reinforcement learning. Simulated experiments confirmed that our algorithm achieves the state-of-the-art training efficiency and test accuracy.
    Graph Neural Networks as Gradient Flows. (arXiv:2206.10991v1 [cs.LG])
    Dynamical systems minimizing an energy are ubiquitous in geometry and physics. We propose a gradient flow framework for GNNs where the equations follow the direction of steepest descent of a learnable energy. This approach allows one to explain the GNN evolution from a multi-particle perspective as learning attractive and repulsive forces in feature space via the positive and negative eigenvalues of a symmetric "channel-mixing" matrix. We perform spectral analysis of the solutions and conclude that gradient flow graph convolutional models can induce a dynamics dominated by the graph high frequencies, which is desirable for heterophilic datasets. We also describe structural constraints on common GNN architectures that allow them to be interpreted as gradient flows. We perform thorough ablation studies corroborating our theoretical analysis and show competitive performance of simple and lightweight models on real-world homophilic and heterophilic datasets.

  • Open

    Summary Papers in RL [D]
    I'm new to RL research and I find reading papers incredibly inefficient - I don't know if anyone else agrees. If you're new to the topic, a lot of the time, the "Preliminaries" section isn't detailed enough for you to fully understand the problem. Other times, a new architecture idea is introduced, a bunch of (toy) experiments are run, and for someone with limited experience, it's hard to gain any insight other than to say: maybe this architecture is indeed better, maybe it's due to the data/hyperparameters, or the improvement is marginal at best. Sometimes you see theoretical results that are very complicated, but it's hard to see their impact. It feels like a calculus student not being told the importance of the Fundamental Theorem of Calculus or Stokes' Theorem, and having to discern it for …  ( 84 min )
    Does the value of the reward matter?
    Hello, I'm just wondering, what effect does the value of the reward have on the learning process. For example, let's say I have a problem where the agent gets a reward of 100 if they were able to solve a maze, how would the learning be affected if the reward was 1 or 1000000 instead? submitted by /u/AhmedNizam_ [link] [comments]  ( 84 min )
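    One way to explore this question is a tiny tabular experiment. The sketch below runs exact value iteration on a five-state chain "maze" (an assumed toy environment, with hypothetical names) for two goal rewards. In this tabular, exactly solved setting the Q-values scale linearly with the reward and the greedy policy does not change; with function approximation and gradient-based updates, reward scale also changes the size of the updates, which is one reason rewards are often normalized or clipped in practice.

```python
import numpy as np

def q_value_iteration(goal_reward, n_states=5, gamma=0.9, iters=200):
    """Exact Q-value iteration on a chain: actions 0 = left, 1 = right.
    Entering the last state pays `goal_reward` and ends the episode."""
    q = np.zeros((n_states, 2))
    for _ in range(iters):
        new_q = np.zeros_like(q)
        for s in range(n_states - 1):                 # last state is terminal
            for a, step in enumerate((-1, 1)):
                nxt = min(max(s + step, 0), n_states - 1)
                done = nxt == n_states - 1
                reward = goal_reward if done else 0.0
                bootstrap = 0.0 if done else gamma * q[nxt].max()
                new_q[s, a] = reward + bootstrap
        q = new_q
    return q

q_small = q_value_iteration(1.0)
q_large = q_value_iteration(100.0)
same_policy = np.array_equal(q_small.argmax(axis=1), q_large.argmax(axis=1))
```

    Here `q_large` is 100 times `q_small` entry by entry, so only the scale of the values differs, not the ordering of actions.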
    Value-based rl with advantage function in actor-critic setting
    Hi, I wonder why value-based actor-critic algorithms don't use the advantage function? I understand that the advantage function lowers the variance in the actor-critic setting, but why is it not usually adopted in value-based algorithms? If possible, could you point me to an RL algorithm that uses the advantage function? Thanks for reading. submitted by /u/Spiritual_Fig3632 [link] [comments]  ( 83 min )
    Question on Score Function in Policy Gradient, looking for help on this question I had in r/learnmachinelearning
    submitted by /u/100M-900 [link] [comments]  ( 83 min )
    How to train the DRL model for Unmanned aerial vehicles?
    We train the Deep reinforcement Learning model for IoT devices/Unmanned aerial vehicles at GPU and we have enough resources to train over there, what if we have to train that model on IoTs/UAVs, is it possible for UAV to compute that model? submitted by /u/ShSalmanHassan [link] [comments]  ( 84 min )
  • Open

    Exploring Octoparse for Data Preparations and Product Assessment
    In this article, let’s discuss one of the trendy and handy web-scraping tools, Octoparse, and its key features and how to use it for our data-driven solutions. Hope you all are familiar with “WEB SCRAPING” techniques, and the captured data has been used to analyze business perceptions further. If you look at the end-to-end process… The post Exploring Octoparse for Data Preparations and Product Assessment appeared first on Data Science Central.  ( 24 min )
  • Open

    [P] less stress for the NHS
    Hi everyone, I am an intensive care nurse and I work for public health in the UK. I have recently taken up a job that involves aiding the transition to a new software called Epic Healthcare. I should say up front that I have a clinical background but little to no experience with AI and machine learning, so sorry if I butcher terms and concepts in the following post. I am getting involved in the digital side of patient care, which I find really fascinating, although I realise it is a couple of decades behind in terms of technology and user interface. One of the problems I am facing at the moment is device integration. A lot of devices are used, especially in intensive care, and their parameters are fed into the electronic health record (EHR) at an hourly cadence in order to …  ( 89 min )
    [D] What is the current SOTA for open-source AutoML?
    I've never really used AutoML--I prefer to code up my models and data engineering by hand, but I'm beginning to wonder if I can use AutoML as a starting point, e.g., the built-in hyper-parameter optimization or NAS finds a good neural network hyper-params/architecture for me, and I can build on that. With that in mind, what's the SOTA right now? Ideally, it would be as white-box as possible, telling me the models it tries, what worked and didn't, etc. Alternatively, what has worked best for you in your workflows? submitted by /u/FlyingQuokka [link] [comments]  ( 84 min )
    [P] Multidimensional array batch indexing for pytorch and numpy
    Batch indexing into multidimensional tensors/arrays is kind of tricky, I made this project explaining the builtin syntax and also made wrappers for simplifying the interface, with additional features for underlying coordinate grid data (like signed distance functions) that need to be indexed by coordinate value rather than integer indices directly https://github.com/LemonPi/multidim_indexing submitted by /u/LemonByte [link] [comments]  ( 83 min )
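    For context, the builtin syntax the post refers to can be sketched in plain NumPy. This example does not use the linked library; `coords_to_indices`, the grid, the origin, and the resolution are hypothetical and only illustrate indexing by coordinate value rather than by integer index.

```python
import numpy as np

grid = np.arange(2 * 3 * 4).reshape(2, 3, 4)      # toy 3-D array

# A batch of B = 3 integer index triples, one row per query.
idx = np.array([[0, 0, 0],
                [1, 2, 3],
                [0, 1, 2]])

# Advanced indexing: pass one index array per dimension.
values = grid[idx[:, 0], idx[:, 1], idx[:, 2]]
values_alt = grid[tuple(idx.T)]                   # same result, any number of dims

def coords_to_indices(coords, origin, resolution):
    """Map real-valued coordinates to the nearest integer grid indices (illustrative)."""
    return np.rint((np.asarray(coords, dtype=float) - origin) / resolution).astype(int)

# Coordinates on a grid with origin 0.0 and cell size 0.5 (assumed values).
cidx = coords_to_indices([[0.0, 1.0, 1.5]], origin=0.0, resolution=0.5)
by_coord = grid[tuple(cidx.T)]
```

    Writing `grid[idx]` instead would index only the first dimension with every entry of `idx`, which is the usual source of confusion with batched lookups.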
    [P] Building a Source of Truth for Inventory with Disparate Data Sources
    One of the most challenging shifts from food delivery to grocery is managing inventory. Although restaurant menu items can sometimes go out of stock, grocery store inventories have far more SKUs and many different ways to track their inventory levels. This complexity of grocery makes it a lot harder to ensure items customers buy are actually available. Knowing what the ground truth is, so that customers can order groceries with confidence, is the subject of a new engineering blog post I wrote, "Building a Source of Truth for an Inventory with Disparate Data Sources". The article explains how we crowd sourced our inventory data from a number of different sources which enabled us to predict which items are likely still on the shelves when customers place an order. Take a look and let me know what you think submitted by /u/Relative_Collection1 [link] [comments]  ( 84 min )
    [R] Announcing DAMP 2.0: Allowing SOTA Anomaly Detection in Massive Time Series Datasets
    Dear Colleagues We are happy to announce the release of DAMP 2.0 [a]. DAMP (Discord Aware Matrix Profile) is an anomaly detection framework that allows you to search datasets with millions or billions of datapoints, all on a conventional machine [b]. We are not normally so vainglorious as to announce the publication of a paper, however: 1) The code comes bundled with some great new anomaly detection datasets, and there is a real dearth of good datasets in the community (see [c]) 2) Some researchers are working on problems that use anomaly detection as a subroutine, and that is their main computational bottleneck. Because DAMP can be up to 10,000 times faster than other approaches, this may be of interest to the community Best wishes, Yue [a] Matrix Profile XXIV:Scaling Time Series Anomaly Detection to Trillions of Datapoints and Ultra-fast Arriving Data Streams. Yue Lu , Renjie Wu , Abdullah Mueen , Maria A. Zuluaga and Eamonn Keogh. ACM SIGKDD 2022. https://www.cs.ucr.edu/~eamonn/DAMP_long_version.pdf [b] https://sites.google.com/view/discord-aware-matrix-profile [c] Irrational Exuberance Why we should not believe 95% of papers on Time Series Anomaly Detection. https://www.youtube.com/watch?v=Vg1p3DouX8w [d] https://drive.google.com/file/d/1hEgOKtoTuHGPMqR1wty8ff_jes93ra9a/view submitted by /u/ylu175 [link] [comments]  ( 85 min )
    [D] Is audio style transfer a thing ?
    So we have image style transfer, there's a lot of good papers and implementations. Is there such thing as audio style transfer, where 1 song keeps its lyrics and melody, but get the other song's style ? e.g. pop music with rock style ? If yes - can you please share a link ? submitted by /u/keremidk0 [link] [comments]  ( 84 min )
    [R] Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (Google - Parti)
    Google published results from a seq2seq transformer model for autoregressive image generation. Website: https://parti.research.google/ Paper: https://gweb-research-parti.web.app/parti_paper.pdf submitted by /u/htrp [link] [comments]  ( 85 min )
    [R] Black box adversarial attacks that do not require output labels
    For those who specialize in adversarial machine learning, are there any black box attacks that do not require the model's output labels when generating adversarial images? I can't seem to find any submitted by /u/berimbolo21 [link] [comments]  ( 83 min )
    [D] Questions about the Fastformer
    Yannic made a video about it, and the Fastformer was discussed on reddit before, so I figured I'd ask here: https://preview.redd.it/s12egeoh37791.png?width=528&format=png&auto=webp&s=1a033fbf6e01353aee463f7768fc49048fd44791 Do I understand it correctly that they are just measuring the attention bit, and not the whole layer's performance (as the Y-axis label implies) ? Is the Fastformer appropriate at all for the kinds of tasks that are the bread and butter of the Transformer, like language models and translation? Has anyone here tried the Fastformer on those? submitted by /u/we_are_mammals [link] [comments]  ( 84 min )
    [D] Any way to speed up simple mathematical functions without implementing cuda kernels for pytorch?
    I am working on a pytorch project and I have a custom computation that I am so far unable to express as a combination of pre-defined pytorch functions (because it's essentially some loops around conv2d calls where I juggle some indices in a 5-d tensor). So currently I use python-loops with some smart padding but that's not the fastest. The only way to speed this up would be, I think, to implement custom cuda kernels. While the computation is not that trivial, it is simple in a mathematical way. It can be defined in a single line using lots of indices and sums. I wonder whether there is really nothing I can do? What I am thinking of is something like tensor-comprehensions, but that's deprecated and I didn't get it to install. Is there any modern alternative to tensor-comprehensions, or should I switch the language to e.g. julia? Is it possible to define a slightly different conv2d there and have it run natively on the GPU? I don't expect performance comparable to the handwritten conv2d kernels, but the python loops are just quite slow. submitted by /u/LeanderKu [link] [comments]  ( 86 min )
    [R] 🔎 How I found external data for #1 Private Leaderboard solution on Kaggle
    Intro Competition: TPS January 2022, SMAPE as a target metric 📚 In this notebook we'll use: Upgini - Low-code Feature search and enrichment library for supervised machine learning applications.📷 GitHub Baseline model in this notebook is based on u/ambrosm notebook (first place) with some minor changes: Feature engineering part was slightly changed, so we can prepare main features and external features separately; SimpleImputer was added to dataprep pipeline to deal with missing values while adding new external features; Constant scaling factor for the test predictions was removed. How external data & features might help on Kaggle? Kaggle is always about learning and leader board progress (hopefully from learning, not cheating ;-)) And every Kaggler wants to progress as f…  ( 94 min )
    [D] How to compare model performance when you add data with label noise?
    Let's say I'm trying to categorize vendors based on their description using some NLP technique. I have a limited dataset of vendors with high quality (low noise) labels. I split it into train/test, and score, say, 90% accuracy. I then get hold of a dataset for third-party vendors, which will have much noisier (but still useful) data. Now when I train the model I get an 89% accuracy. How do I interpret this? The noisier data will also go in the test split, and the model is expected to perform worse on those, so even if it's exactly as good as the prior model on the old data, it should have a worse average performance on the new dataset. It could even be better, say scoring 91% on the old data, but 85% on the new data, so the average accuracy looks lower even though you have a better model. Testing the old model on the new test set I guess would settle this? Just curious if there are any best practices. submitted by /u/bandalorian [link] [comments]  ( 88 min )
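    One way to make the comparison in the question concrete is to score both models on the same test rows and report accuracy per data source instead of one pooled number. The sketch below is only an illustration: the labels, predictions, and the "clean"/"noisy" source tags are made-up toy values, and the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy_by_source(y_true, y_pred, sources):
    """Accuracy computed separately for each data source tag."""
    groups = {}
    for t, p, s in zip(y_true, y_pred, sources):
        labels, preds = groups.setdefault(s, ([], []))
        labels.append(t)
        preds.append(p)
    return {s: accuracy(labels, preds) for s, (labels, preds) in groups.items()}


# Toy test set (assumed values): 4 clean rows followed by 4 noisy third-party rows.
y_true = ["a", "b", "a", "b", "a", "b", "a", "b"]
sources = ["clean"] * 4 + ["noisy"] * 4
old_model_pred = ["a", "b", "a", "a", "b", "b", "b", "a"]
new_model_pred = ["a", "b", "a", "b", "a", "b", "b", "a"]

old_scores = accuracy_by_source(y_true, old_model_pred, sources)
new_scores = accuracy_by_source(y_true, new_model_pred, sources)
print("old model:", old_scores)
print("new model:", new_scores)
```

    Because both models see identical rows, a drop in the pooled score can be traced to the noisy slice rather than blamed on the model. Keeping the original clean test split fixed across experiments gives a stable reference point.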
    [P] Bottom-up look at the new Lightning Framework for building anything from production-ready ML systems to research demos
    The open-source lightning.ai framework just launched last week, introducing the concept of Lightning Apps. It's basically meant for building anything from production-ready ML systems running on multi-node GPU clusters in the cloud to simple research demos. Starting with a simple use case, a research demo, I wrote a "short" article about it to explain how it roughly works under the hood: Sharing Deep Learning Research Models with Lightning Part 1: Building A Super Resolution App Looking forward to hearing your feedback. I am planning to put together more "substantial" examples, but I was thinking of doing that one step at a time. I will be attending a conference in 3 weeks and am planning to create a research demo alongside the paper I will be presenting, and I was wondering, besides Gradio/Dash, what are your typical tools and workflows for making research demos? Any cool examples for inspiration? Disclaimer: I recently joined Lightning when I saw an early prototype. As someone who has spent most of my time on research models, I was always intrigued by putting ML models to production. However, I was also always turned off by the tooling that it involved. submitted by /u/seraschka [link] [comments]  ( 86 min )
    [D] Have you ever been asked to work on a software project you found unethical? We’d like to hear from you!
    We are researchers at Carnegie Mellon University studying how software developers identify and act on ethical concerns at work. If you’re interested in helping us advance research in software ethics, please fill out this survey and we’ll reach out to you for a quick interview! P.S. You can check out this Stack Overflow blog post to read more about the direction of our research. Anything you disclose to us during the survey / interview may appear in our study but will not be traceable to you. submitted by /u/curious_cow_99 [link] [comments]  ( 88 min )
    [R] Breaking Down Out-of-Distribution Detection
    TL;DR: Many OOD detectors that are trained with samples from an (unrelated) OOD dataset can be understood by isolating a binary discriminator between in-distribution and OOD. We just published it on arXiv and will present it at ICML 2022. Questions and discussion are very welcome! Full title: Breaking Down Out-of-Distribution Detection: Many Methods Based on OOD Training Data Estimate a Combination of the Same Core Quantities by Julian Bitterwolf, Alexander Meinke, Maximilian Augustin, Matthias Hein. submitted by /u/JBitterwolf [link] [comments]  ( 84 min )
    [D][R] Is there any benchmark task set for computer vision?
    I know that in NLP, there are some benchmark task sets like GLUE, SuperGLUE, etc. I wonder whether there is any similar benchmark task set for computer vision that lets us easily test many tasks in a unified way? submitted by /u/singularpanda [link] [comments]  ( 84 min )
    [R][P] Best Approach to do Image Inpainting in Video Files (Image Timeseries)
    First time posting here. I am working with image timeseries of satellite images. These are essentially 1 hour long video files with an image size of 384 x 384 pixels. The images have chunks of data missing, say 20 x 20 pixels, at different parts of the image. I would say that the missing part of the image is roughly 20%-25%. I have the ground truths to train a neural network, but what I am struggling with is which primary architecture I should begin with: CNN, LSTM, CNN-LSTM, U-Net? I found this paper: https://arxiv.org/abs/2112.09262 - which exploits a U-net autoencoder architecture to solve the image inpainting problem, but I am not sure how robust this is for 3D (x,y,t) image cubes. Is there anyone experienced here who has worked on image inpainting on video files? Can you please share your experience? If you can point me towards reliable literature, that would be a big help! submitted by /u/bahauddin_onar [link] [comments]  ( 86 min )
    [Discussion] Iteration of Machine Learning Systems
    Engineering systems progress by addressing use cases of increasing levels of complexity. For example, you start with a 'minimum viable product' and then slowly add features or complexity as things progress. However, this is not how machine learning systems progress. You don't start with 10 positive/negative samples, and then iteratively add more. It's not even wise to start with one (or a few) 'tasks' and then add new ones as things progress. Clearly, iteration (or progress) in machine learning systems does not follow the same pattern as traditional engineering systems. Is there another way to think about iteration? submitted by /u/TheFibo1123 [link] [comments]  ( 89 min )
    [R] EnvPool: A Highly Parallel Reinforcement Learning Environment Execution Engine
    submitted by /u/hardmaru [link] [comments]  ( 83 min )
  • Open

    Visual inspection automation using Amazon SageMaker JumpStart
    According to Gartner, hyperautomation is the number one trend in 2022 and will continue advancing in future. One of the main barriers to hyperautomation is in areas where we’re still struggling to reduce human involvement. Intelligent systems have a hard time matching human visual recognition abilities, despite great advancements in deep learning in computer vision. […]  ( 7 min )
  • Open

    BCI Controlled Robot Arm For Amputee | Breakthrough 3D Printing Tech Builds Robot In 1 Step
    submitted by /u/getrich_or_diemining [link] [comments]  ( 82 min )
    New Tutorial Disco Diffusion video
    ​ Just finished part 1 of my new tutorial series on Video/Animation with disco diffusion, first one just covers the basics of 2d/3d mode and I also show how to use prompt weights and keyframes to change the scene, like changing from summer to winter in this video ​ ​ https://www.youtube.com/watch?v=HbPz2K40e_k ​ https://reddit.com/link/vibqmd/video/d543o3bus7791/player submitted by /u/prfitofthesngularity [link] [comments]  ( 82 min )
    Deploy, run, and monitor ML/AI models for free with Modzy Basic+
    MLOps for free: with Modzy Basic+, you can deploy, run, integrate, and monitor up to five of your own ML/AI models at scale. With Modzy Basic+, you gain access to an enterprise-grade MLOps platform, without the price. Deploy up to five of your own models that can run on a CPU and 4GB of RAM. From there, easily integrate your models into web apps, mobile apps, pipelines or any other tool using our APIs and SDKs, and run up to 10,000 inferences per day. Finally, monitor your models in production to ensure peak performance. To get started running your AI models at scale, sign up for Modzy Basic+ today. The Modzy platform accelerates the deployment, integration, and governance of production-ready AI. With integrations for the leading data science and DevOps tools, teams count on Modzy to quickly and easily build AI-enabled applications in standard, repeatable, and secure ways. By leveraging Modzy as a central location for monitoring all AI across the enterprise or at the edge, teams can establish governance and security while generating higher returns from AI. Get started running your AI at scale with Modzy Basic+ today. submitted by /u/modzykirsten [link] [comments]  ( 83 min )
    New Pathways Text to Image model
    submitted by /u/manOnPavementWaving [link] [comments]  ( 82 min )
    Amazon AI Researchers Open-Source ‘Syne Tune’: A Novel Python Library For Distributed HPO With An Emphasis On Enabling Reproducible Machine Learning Research
    Deep learning models with billions of parameters are trained through gradient-based stochastic optimization, thanks to powerful algorithms, systems, and hardware advancements. These algorithms include several hyperparameters that are essential for effective performance. Hyperparameter tuning is required to control the behavior of a machine learning model. If the hyperparameters are not set correctly, the fitted model parameters will not minimize the loss function well, and metrics such as accuracy or the confusion matrix will be worse. Many hyperparameters exist, like learning rate, regularisation type, degree, and size of neural network layers. Automating the setting of these hyperparameters and accelerating the training of neural network weights are necessary if domain experts and industry practitioners are to benefit from the most recent deep learning technologies. Even for specialists, tuning them takes a lot of time and effort; choosing the best hyperparameter configuration frequently depends on factors like cost or latency. Continue reading | Check out the paper, github submitted by /u/Embarrassed-Fee5513 [link] [comments]  ( 83 min )
    8 Famous Definitions of Artificial Intelligence
    submitted by /u/Philo167 [link] [comments]  ( 82 min )
    It’s data visualisation of NYTimes articles from 1851 until now
    submitted by /u/galacticfarthole [link] [comments]  ( 82 min )
    Nvidia 3D MoMa: Neural Inverse Rendering turns photos into 3D objects within an hour
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 82 min )
    A beautiful sunset over a colourful and detailed tropical landscape created on Pixelz.ai
    submitted by /u/pixelz_ai [link] [comments]  ( 82 min )
    A celebrated AI has learned a new trick: How to do chemistry
    submitted by /u/estasfuera [link] [comments]  ( 82 min )
    [Research] Data Labeling Research
    Hi! I'm doing some market research for a data labeling product and want to ask the actual people in the industry (you) for opinions and what actually is the reality for the industry. Any/all responses are super helpful, so thank you in advance if you answer/are able to answer my questions. Does your company use a data labeling tool? If so, what? If not a specific tool, how do you label your data? Who actually does the labeling? Is it engineers? Outsourced? Someone on Fiverr? Are you aware of data labeling tools that exist on the market? If so can you name a company or two that comes to mind? What is the single greatest issue/missing functionality of a current tool you use (if you have one)? Feel free to mention the tool, if that helps add context to the data medium (text, audio, video, image). I'm currently trying to determine what the most important product features are for text/audio labeling, what would those be for you? (e.g. a specific use case, UI/UX functionality, integrations, automation, etc.) What do you think is a fair price for a tool to do data labeling? (specifically text/audio) Even if you can only answer one or several questions, all responses are extremely helpful! Again, thank you so much for your time and for the help. submitted by /u/AnGrAnHo [link] [comments]  ( 83 min )
    GOLDEN STATE WARRIORS | 2022 NBA WORLD CHAMPIONS | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
    Why Data Scientists Are Increasingly Quitting Their Jobs: Lack of Skills or Different Expectations?
    submitted by /u/saik2363 [link] [comments]  ( 83 min )
    What are the skills that AIs already have, such as talking to each other, writing texts, generating images, is there a website that lists these skills?
    submitted by /u/xXNOdrugsForMEXx [link] [comments]  ( 82 min )
    What are the best chat AIs like LaMDA which I can use?
    submitted by /u/xXLisa28Xx [link] [comments]  ( 82 min )
    What is the best AI chatbot to talk to it?
    submitted by /u/NextDream [link] [comments]  ( 82 min )
    Weight in image?
    Is it possible for an A.I to figure out what weight is only from the image itself and no external data? Thanks submitted by /u/OneFinding1429 [link] [comments]  ( 82 min )
    "Islands" 🏝️ created on pixelz.ai
    submitted by /u/PixelzJ [link] [comments]  ( 82 min )
    "Space in a jar" 🌌 created on pixelz.ai
    submitted by /u/PixelzJ [link] [comments]  ( 82 min )
    "The End - Los Angeles" 🌆 created on pixelz.ai
    submitted by /u/PixelzJ [link] [comments]  ( 82 min )
  • Open

    Brain Computer Interface + AI Controlled Limbs For Amputees | New Neuromorphic AI Chip
    submitted by /u/tohelpyou88 [link] [comments]  ( 82 min )
    AlexNet paper architecture
    Why, in the paper, is the architecture split into two equal-size stacks after the first step? I.e., instead of having a single 55x55 stack with 96 filters after the first step, it has two 55x55 stacks with 48 filters each. Correct me if I'm wrong, but I believe it is because they had to divide it due to not having enough computational power, right? submitted by /u/PlentyRadiant4191 [link] [comments]  ( 82 min )
    non-programmer theorizing on multithreaded neural subnetworks
    I am currently taking courses in python, but I won't be up to attempting this for... I don't know how long. But one of my goals with python is to create an evolution simulator similar to r/TheBibites, and while considering some limitations with creatures not being able to tell a prey item apart from their own child, I theorycrafted this as a way to give creatures more information about the things they're looking at. I can't find any sources about something like this being done before, but I don't know how to search for those sources given that others probably wouldn't name this the same way I did. So my main question is "does this sound like anything that you already know about?" with the follow up question, "does this sound like it would work?" - - - - - - So the idea is that the creatu…  ( 84 min )
  • Open

    Quantum Advantage in Learning from Experiments
    Posted by Jarrod McClean, Staff Research Scientist, Google Quantum AI, and Hsin-Yuan Huang, Graduate Student, Caltech In efforts to learn about the quantum world, scientists face a big obstacle: their classical experience of the world. Whenever a quantum system is measured, the act of this measurement destroys the “quantumness” of the state. For example, if the quantum state is in a superposition of two locations, where it can seem to be in two places at the same time, once it is measured, it will randomly appear either ”here” or “there”, but not both. We only ever see the classical shadows cast by this strange quantum world. A growing number of experiments are implementing machine learning (ML) algorithms to aid in analyzing data, but these have the same limitations as the people they a…  ( 28 min )
    Mapping Urban Trees Across North America with the Auto Arborist Dataset
    Posted by Sara Beery, Student Researcher, and Jonathan Huang, Research Scientist, Google Research, Perception Team Over four billion people live in cities around the globe, and while most people interact daily with others — at the grocery store, on public transit, at work — they may take for granted their frequent interactions with the diverse plants and animals that comprise fragile urban ecosystems. Trees in cities, called urban forests, provide critical benefits for public health and wellbeing and will prove integral to urban climate adaptation. They filter air and water, capture stormwater runoff, sequester atmospheric carbon dioxide, and limit erosion and drought. Shade from urban trees reduces energy-expensive cooling costs and mitigates urban heat islands. In the US alone, urban fo…  ( 27 min )
  • Open

    Meet the Omnivore: Director of Photography Revs Up NVIDIA Omniverse to Create Sleek Car Demo
    A camera begins in the sky, flies through some trees and smoothly exits the forest, all while precisely tracking a car driving down a dirt path. This would be all but impossible in the real world, according to film and photography director Brett Danton. The post Meet the Omnivore: Director of Photography Revs Up NVIDIA Omniverse to Create Sleek Car Demo appeared first on NVIDIA Blog.  ( 6 min )
    Artem Cherkasov and Olexandr Isayev on Democratizing Drug Discovery With NVIDIA GPUs
    It may seem intuitive that AI and deep learning can speed up workflows — including novel drug discovery, a typically years-long and several-billion-dollar endeavor. But professors Artem Cherkasov and Olexandr Isayev were surprised to find that no recent academic papers provided a comprehensive, global research review of how deep learning and GPU-accelerated computing impact drug Read article > The post Artem Cherkasov and Olexandr Isayev on Democratizing Drug Discovery With NVIDIA GPUs appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    Is the healthcare sector reaping the benefits of RPA?
    Robotics Process Automation (RPA) is all about incorporating solutions that handle repetitive tasks faster and more efficiently. These…  ( 9 min )
  • Open

    Numerically evaluating a theta function
    Theta functions pop up throughout pure and applied mathematics. For example, they’re common in analytic number theory, and they’re solutions to the heat equation. Theta functions are analogous in some ways to trigonometric functions, and like trigonometric functions they satisfy a lot of identities. This post will comment briefly on an identity that makes a […] Numerically evaluating a theta function first appeared on John D. Cook.  ( 5 min )
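    The summary is cut off before it names the identity, so the following is only an assumed illustration of how an identity can help numerically. It uses the classical Jacobi relation for theta(t) = sum over all integers n of exp(-pi n^2 t), namely theta(1/t) = sqrt(t) * theta(t); the function names and term counts are hypothetical choices. For small t the series converges slowly, but the identity moves the evaluation to 1/t, where a handful of terms suffices.

```python
import math


def theta(t, terms=50):
    """Direct sum: theta(t) = sum over all integers n of exp(-pi * n^2 * t), t > 0."""
    if t <= 0:
        raise ValueError("t must be positive")
    # n = 0 contributes 1; the terms for n and -n are equal.
    return 1.0 + 2.0 * sum(math.exp(-math.pi * n * n * t) for n in range(1, terms + 1))


def theta_fast(t, terms=5):
    """Same function, but the series is always summed at an argument >= 1."""
    if t <= 0:
        raise ValueError("t must be positive")
    if t >= 1:
        return theta(t, terms)
    # Jacobi's identity rearranged: theta(t) = theta(1/t) / sqrt(t)
    return theta(1.0 / t, terms) / math.sqrt(t)


slow = theta(0.01, terms=200)   # many terms needed at small t
fast = theta_fast(0.01)         # five terms at 1/t = 100
print(slow, fast)
```

    The design choice is simply to sum wherever the terms decay fastest: at an argument of at least 1, the n-th term is at most exp(-pi n^2), so the tail is negligible after a few terms.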

  • Open

    Relevant XKCD (make sure to read the alt-text)
    submitted by /u/webbitor [link] [comments]  ( 82 min )
    General AI Sentience
    submitted by /u/PrincePaulSMamakos [link] [comments]  ( 82 min )
    What do you have to say for yourselves now, flat-earthers?
    submitted by /u/Strawberrwies [link] [comments]  ( 82 min )
    Sam Harris on the Dangers of AI With Superhuman Intelligence - "It is a failure of imagination to think that being in relationship to something more intelligent than yourself isn't, in most cases, a circumstance of real peril." (short audio clip)
    submitted by /u/biohacker045 [link] [comments]  ( 86 min )
    Do you think Imagen is really as good as it looks like in the promo images?
    Just looking at them makes you believe everything is shopped, because ALL of those images are just way too detailed and can't possibly be that much on point. submitted by /u/ghostryder333 [link] [comments]  ( 82 min )
    HumanNeRF can render people in 3D from a regular video - using just a single camera perspective
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 82 min )
    Global Skills Report 2022
    submitted by /u/awsconsultant [link] [comments]  ( 82 min )
    Well, I would say that this AI is not at all accomplished!
    ​ a prediction of the Today's 6-figure EuroMillions draw submitted by /u/StantheBrain [link] [comments]  ( 82 min )
    This iteration of the weekly AI digest newsletter focuses on Dalle mini, a free, open-source AI that produces amazing images from text inputs. Here’s how it works and some commentary by our AI ethicist Lauren Keegan
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 82 min )
    How To Reduce Bias in Machine Learning
    Researchers and engineers have already applied several positive practices to reduce ML bias. This article covers each step in the machine learning project pipeline and discusses how to reduce machine learning bias at each stage. https://www.toolbox.com/tech/artificial-intelligence/guest-article/how-to-reduce-bias-in-machine-learning/ submitted by /u/lklimusheuskaja [link] [comments]  ( 82 min )
    Is there an AI that searches a list for similar sentences? Because I would want to make a list of all the jokes and copy them all off the internet and see which ones are duplicates.
    submitted by /u/xXNOdrugsForMEXx [link] [comments]  ( 83 min )
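    For the duplicate-joke question above, a plain baseline needs no trained model: normalise each sentence to a set of words and compare the sets with Jaccard similarity. This is a minimal sketch under stated assumptions. The example jokes, the 0.8 threshold, and the function names are all made up for illustration, and word overlap will not catch paraphrases (sentence embeddings are the usual next step for that).

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercased set of words, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Size of the shared vocabulary divided by size of the combined vocabulary."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def find_duplicates(sentences, threshold=0.8):
    """Index pairs (i, j) whose similarity is at least `threshold`."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(sentences), 2)
        if jaccard(a, b) >= threshold
    ]


jokes = [
    "Why did the chicken cross the road? To get to the other side.",
    "why did the chicken cross the road?? to get to the other side!",
    "I told my computer I needed a break, and it froze.",
]
pairs = find_duplicates(jokes)
print(pairs)
```

    Comparing every pair is quadratic in the number of sentences, which is fine for a few thousand jokes but would need an index (for example MinHash) for much larger lists.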
    Is there an AI to mash 2 people to get a similar picture like this?
    submitted by /u/xXLisa28Xx [link] [comments]  ( 82 min )
    THE VATICAN | FAST MODE! DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
    Best image generator AI for stylized characters, no realism but not solely anime as well
    I’m building out a comic project and would love to use a generator that can give me a stylized base to work off for characters. I don’t want like an anime only one, though those seem cool as well. What would be the ones that can match this? Am I asking for too much? I’m still a beginner, I liked what I’ve used but they seem mostly for environments so far. submitted by /u/ChrisMFerguson [link] [comments]  ( 82 min )
    16 Funny Insurance Memes That We Can All Relate To
    submitted by /u/flipsis [link] [comments]  ( 82 min )
    /g/ - Is this really the future of AI? - Technology (GPT-4chan generated greentext)
    submitted by /u/Aspie96 [link] [comments]  ( 83 min )
  • Open

    [Discussion] Tired of cleaning data?
    If so, we open-sourced a data cleaning tool (https://github.com/mage-ai/mage-ai) that will help you easily identify issues, quickly improve data quality, and repeat the process in any environment. Would love to get some feedback, and happy to hop on a Zoom call if you have any questions or need help setting up. Feel free to join our slack: https://www.mage.ai/chat Thanks, appreciate it! submitted by /u/ollie_wollie_rocks [link] [comments]  ( 84 min )
    [D] Techniques for dealing with classic statistical data gathering problems: selection bias, differential attrition, experimenter bias, etc. in Machine Learning?
    Can anyone suggest papers or techniques in ML to deal with some of the statistical bias problems outlined in the title? (selection bias, differential attrition, experimenter bias, etc.) submitted by /u/Upstairs-Jicama-8347 [link] [comments]  ( 83 min )
    [N] What do you think of Andrew Ng's new Machine Learning Specialization that launched last week on Coursera?
    Specialization Intro video: https://youtu.be/g7dv-Lnuor4 Specialization on Coursera: https://www.coursera.org/specializations/machine-learning-introduction submitted by /u/manocormen [link] [comments]  ( 84 min )
    [R] - Call For Participants SocialDisNER (SMM4H@COLING 2022) on Detection of Disease Mentions in Social Media
    CFP- SocialDisNER track: Detection of Disease Mentions in Social Media (SMM4H Shared Task at COLING2022) https://temu.bsc.es/socialdisner/ Despite the high impact & practical relevance of detecting diseases automatically from social media for a diversity of applications, few manually annotated corpora generated by healthcare practitioners to train/evaluate advanced entity recognition tools are currently available. Developing disease recognition tools for social media is critical for: Real-time disease outbreak surveillance/monitoring Characterization of patient-reported symptoms Post-market drug safety Epidemiology and population health, Public opinion mining & sentiment analysis of diseases Detection of hate speech/exclusion of sick people Prevalence of work-associated d…  ( 86 min )
    [N] [D] OpenAI, which runs DALL-E 2, allegedly threatened creator of DALL-E Mini
    Trying to cross-post what I think is a discussion that is relevant to this community. This is my third attempt, I hope I'm doing it correctly this time: https://www.reddit.com/r/dalle2/comments/vgtgdc/openai_who_runs_dalle2_alleged_threatened_creator/ EDIT: here are the original pre-prints for added context: DALL-E: Zero-Shot Text-to-Image Generation - The only place the term "DALL-E" appears is the URL to the github repo. Dall-E 2: Hierarchical Text-Conditional Image Generation with CLIP Latents - They consistently refer to the first paper as "DALL-E", but refer to the work being described in the new paper as "unCLIP" and are careful to only use 'DALL-E 2' in the context of a product description, e.g. "DALL·E 2 Preview platform (the first deployment of an unCLIP model)" submitted by /u/DigThatData [link] [comments]  ( 91 min )
    [D] Get input required of a neural network for a given output
    Hello Folks! I'm gathering information on how to obtain the scope of inputs (it can be more than one) required for a given output on a simple neural network. Let's suppose I'm using a vanilla fully connected network with 1 hidden layer and a non-linear activation function. I've come across a few options, like numerically solving the inverse equation (given its non-linearity, I'm not sure how one would solve it analytically, but we can analytically end up with multiple equations from ReLUs), or using backpropagation with a defined cost on a small perturbation from the desired output. So, I wanted to know if you guys know of any literature on this, or have opinions or tricks or anything that might prove useful! Thanks in advance! submitted by /u/FlavorfulArtichoke [link] [comments]  ( 86 min )
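    The backpropagation option mentioned in the post can be sketched in plain Python: freeze the weights, treat the input as the variable, and run gradient descent on the squared error between the network output and the desired output. Everything concrete here is an assumption for illustration: the weights are arbitrary, the activation is tanh instead of ReLU, and the names `forward`, `loss_and_grad`, and `invert` are hypothetical.

```python
import math

# Assumed fixed weights for a tiny 2-input, 3-hidden-unit, 1-output network.
W1 = [[0.5, -0.3], [0.8, 0.2], [-0.6, 0.9]]
B1 = [0.1, -0.2, 0.05]
W2 = [1.0, -0.7, 0.4]
B2 = 0.0


def forward(x):
    """Return (output, hidden activations) for input vector x."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, B1)]
    y = sum(w * hi for w, hi in zip(W2, h)) + B2
    return y, h


def loss_and_grad(x, target):
    """Squared error to the target and its gradient with respect to the input."""
    y, h = forward(x)
    err = y - target
    # Chain rule: dL/dx_j = 2 * err * sum_k W2[k] * (1 - h_k^2) * W1[k][j]
    grad = [
        2.0 * err * sum(W2[k] * (1.0 - h[k] ** 2) * W1[k][j] for k in range(len(h)))
        for j in range(len(x))
    ]
    return err * err, grad


def invert(target, x0, lr=0.5, steps=200):
    """Gradient descent on the input; a step is kept only if the loss drops."""
    x = list(x0)
    loss, grad = loss_and_grad(x, target)
    for _ in range(steps):
        candidate = [xi - lr * g for xi, g in zip(x, grad)]
        new_loss, new_grad = loss_and_grad(candidate, target)
        if new_loss < loss:
            x, loss, grad = candidate, new_loss, new_grad
        else:
            lr *= 0.5  # step was too large, so shrink it and retry
    return x, loss


start = [0.0, 0.0]
initial_loss, _ = loss_and_grad(start, target=0.3)
x_found, final_loss = invert(target=0.3, x0=start)
print(x_found, final_loss)
```

    The result depends on the starting point: with two inputs and one output, many inputs map to the same output, so running `invert` from several different `x0` values is one crude way to explore the set of inputs the post asks about.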
    [R] DoWhy-GCM: An extension of DoWhy for causal inference in graphical causal models
    Abs: We introduce DoWhy-GCM, an extension of the DoWhy Python library, that leverages graphical causal models. Unlike existing causality libraries, which mainly focus on effect estimation questions, with DoWhy-GCM, users can ask a wide range of additional causal questions, such as identifying the root causes of outliers and distributional changes, causal structure learning, attributing causal influences, and diagnosis of causal structures. To this end, DoWhy-GCM users first model cause-effect relations between variables in a system under study through a graphical causal model, fit the causal mechanisms of variables next, and then ask the causal question. All these steps take only a few lines of code in DoWhy-GCM. Paper: https://arxiv.org/abs/2206.06821 Code: https://github.com/py-why/dowhy submitted by /u/bikeskata [link] [comments]  ( 84 min )
    [D] How to best extract product benefits/problems from customer reviews using NLP?
    I am working on a prototype that takes in a list of customer reviews about a specific product and returns a list of (unique) benefits and problems from these reviews. These should be non-generic, e.g. for a camera, a benefit might be "great for panoramic photos" and not just "good quality". My initial idea was to go about this in two steps: Use NER to identify phrases describing benefits or problems Use text summarization to create the final output When starting to create some NER labels, I realized that benefits and problems are often mixed, spread across multiple sentences, or mentioned cryptically or indirectly, making it extremely hard to come up with concise labeling instructions. I therefore assume that the model will also have quite a hard time correctly extracting benefits and problems. Does anyone have an idea of how to tackle this in a different, more promising way? Any kind of feedback is more than welcome 🙏 submitted by /u/AdPlenty6685 [link] [comments]  ( 86 min )
    [D] Running experiments, tuning, analysing results, how do you organise your time on this?
    Hi people, I would like to ask how you organise yourselves for running experiments, tuning your models, and analysing your results. Do you run a massive grid search and then analyse everything at the end? Do you run one/a few experiments and see how it went, and repeat the process? Have you learned some insights on how to do this efficiently? I often find myself running several searches over one or a couple of parameters at a time, based on the premise that some regions of a big grid search may be completely useless and a waste of time. The downside of this is that for every search I need to analyse its results and, based on them, try to pick a good set of hyperparams for the next one, whereas with a massive grid search over all of the possible hyperparams, I would just pick the best model once it is done. I would like to hear what you do! submitted by /u/juanigp [link] [comments]  ( 85 min )
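    One middle ground between the two workflows described in the post is a coarse-to-fine search: a small grid over the full range, then a second small grid centred on the winner. The sketch below is only an illustration. `toy_validation_loss` is a made-up stand-in for a real training run, and the hyperparameter names and grid values are assumptions.

```python
import itertools


def toy_validation_loss(log_lr, dropout):
    """Hypothetical stand-in for an expensive training run (lower is better)."""
    return (log_lr + 3.0) ** 2 + (dropout - 0.3) ** 2


def grid_search(log_lrs, dropouts):
    """Evaluate every combination; return (best_loss, best_log_lr, best_dropout)."""
    return min(
        (toy_validation_loss(l, d), l, d)
        for l, d in itertools.product(log_lrs, dropouts)
    )


# Stage 1: coarse grid over the whole plausible range (3 x 3 = 9 runs).
coarse = grid_search([-5.0, -3.5, -2.0], [0.0, 0.25, 0.5])

# Stage 2: finer grid centred on the coarse winner (3 x 3 = 9 more runs).
_, l0, d0 = coarse
fine = grid_search([l0 - 0.5, l0, l0 + 0.5], [d0 - 0.1, d0, d0 + 0.1])

print("coarse best:", coarse)
print("fine best:", fine)
```

    This spends 18 runs instead of the 81 a single 9 x 9 grid would need, at the cost of possibly missing a good region that the coarse grid stepped over. The analysis between stages is reduced to "take the best point", which removes some of the manual overhead the post describes.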
    [D] NVlabs finally released the code for EG3D, but no inversion script?
    Hi So we can finally play around with the cool NVLabs EG3D, but they refuse to release the inversion script. Has anyone had success passing an image and reconstructing a face in this project? I am not having success when trying to do this, so I would greatly appreciate it if anyone could share how to do it, or point to an existing fork. submitted by /u/mobani [link] [comments]  ( 84 min )
    [D] Machine learning books for free offered with full source document (LaTeX)
    Top quality machine learning papers and books, not only for free, but offered with full LaTeX source, bib file, and raw figures, so that anyone can easily incorporate parts of these books (formulas, tables, pictures, text, references etc.) into their PhD thesis, articles, or reports. Users could even fix any typos they find and then print an enhanced version of the book, for private (or public) use. Does that sound like a dream? I am actually thinking of offering this with my numerous papers / books. My question is this: is it a good idea? Should I charge a fee (in other words: would you pay for it?) I understand some will use the material for plagiarism, but I am not too concerned about it. Or should I be? My first candidate book for this is the following: https://mltechniques.com/2022/03/22/book-stochastic-processes-and-simulations/. I just finished converting all the Perl code into Python, and will soon publish the 2nd edition, this time in Python [if it comes with LaTeX code, it means that the user can easily extract the Python code from the book, though it is also on GitHub]. submitted by /u/MLRecipes [link] [comments]  ( 88 min )
  • Open

    Accelerate your career with ML skills through the AWS Machine Learning Engineer Scholarship
    Amazon Web Services and Udacity are partnering to offer free services to educate developers of all skill levels on machine learning (ML) concepts with the AWS Machine Learning Engineer Scholarship program. The program offers free enrollment to the AWS Machine Learning Foundations course and 325 scholarships awarded to the AWS Machine Learning Engineer Nanodegree, a […]  ( 5 min )
    Identify mangrove forests using satellite image features using Amazon SageMaker Studio and Amazon SageMaker Autopilot – Part 2
    Mangrove forests are an important part of a healthy ecosystem, and human activities are one of the major reasons for their gradual disappearance from coastlines around the world. Using a machine learning (ML) model to identify mangrove regions from a satellite image gives researchers an effective way to monitor the size of the forests over […]  ( 10 min )
    Identify mangrove forests using satellite image features using Amazon SageMaker Studio and Amazon SageMaker Autopilot – Part 1
    The increasing ubiquity of satellite data over the last two decades is helping scientists observe and monitor the health of our constantly changing planet. By tracking specific regions of the Earth’s surface, scientists can observe how regions like forests, water bodies, or glaciers change over time. One such region of interest for geologists is mangrove […]  ( 14 min )
  • Open

    How To Build Multi-Layer Perceptron Neural Network Models with Keras
    The Keras Python library for deep learning focuses on the creation of models as a sequence of layers. In this post you will discover the simple components that you can use to create neural networks and simple deep learning models using Keras from TensorFlow. Let’s get started. May 2016: First version Update Mar/2017: Updated example […] The post How To Build Multi-Layer Perceptron Neural Network Models with Keras appeared first on Machine Learning Mastery.  ( 18 min )
  • Open

    CodaLab - Competition
    [ML Competition announcement] Improve aerial navigation by determining the camera pose of aerial images (in 6D: x, y, z coordinates and aerial camera angle). Prize: 10'000 CHF Dataset: Over 16 000 HD aerial images are available for training. Timeline: Starts 21 June 2022, ends 21 December 2022 Link: https://codalab.lisn.upsaclay.fr/competitions/5481 Happy coding! submitted by /u/Kindly_Toe_440 [link] [comments]  ( 82 min )
    Looking for Papers/Conferences on solving moral problems (as opposed to social/ethical problems)
    RL seems to be a go-to for solving this sort of "philosophical" problem - I've personally seen it applied as Sequential Social Dilemmas (SSDs) and Markov Game SDs (MGSDs). I am intrigued to know if the same level of concern is placed on moral problems. Considering I can't find nearly as much research on this subject, my initial feeling is 'no', though with the same consideration I can't say this for sure. Are there any good conferences (similar to AIES/ACM conferences), papers, or even non-profit research hubs which could be a good starting point for diving into this sort of research? (N.B. This is a personal interest, so if there are less formal articles/sites like GitHub repos and whatnot, feel free to mention them too!) submitted by /u/Background-Cable-491 [link] [comments]  ( 83 min )
    Convergence of Loss and MAE in Deep Q Network
    Hello everyone! I have been learning about RL and DQNs and wanted to apply these to a simple custom environment. I've been able to achieve decent results but I have noticed the following and was hoping someone could help me understand this better: The Loss and MAE values grow indefinitely without converging, even when the agent has reached the optimal value while training. Is there an issue with the agent or the environment? I looked for resources related to this specifically but could not find anything. Is convergence of loss and MAE not necessary for a DQN to function? I have noticed that the agent diverges from the optimal value when I increase the number of steps to larger values. Any particular reason for this to happen? Thanks in advance! submitted by /u/AakashK12 [link] [comments]  ( 84 min )
    Resources for reinforcement learning?
    What would be the minimum hardware expectation to train a 3D model to learn parkour using reinforcement learning? Any free hardware resources for a research project? University doesn’t provide hardware resources and I have a GTX 1650ti mobile GPU. Edit: The environment would be a static simulated physics environment of around 2 to 3 blocks. The agent would be a 3D walker with hand and feet movement. submitted by /u/Live-Pass-7157 [link] [comments]  ( 85 min )
  • Open

    Swin Transformer supports 3-billion-parameter vision models that can train with higher-resolution images for greater task applicability
    Early last year, our research team from the Visual Computing Group introduced Swin Transformer, a Transformer-based general-purpose computer vision architecture that for the first time beat convolutional neural networks on the important vision benchmark of COCO object detection and did so by a large margin. Convolutional neural networks (CNNs) have long been the architecture of […] The post Swin Transformer supports 3-billion-parameter vision models that can train with higher-resolution images for greater task applicability appeared first on Microsoft Research.  ( 14 min )
  • Open

    Towards Ethical AI
    Implications of Becoming One with the Machine Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 11 min )
    Google JAX vs PyTorch vs TensorFlow: Which is the best framework for machine learning?
    Google JAX is a powerful framework for machine learning that offers many benefits over other popular frameworks such as PyTorch and…  ( 10 min )
  • Open

    Can someone explain, in simple terms, what the term "age" means in a Neural Gas Network?
    I am studying the topic of Neural Gas, and at the beginning the process starts with two neurons connected by an edge, displayed in an n-dimensional crossplot, with each axis of the crossplot representing an attribute/feature (for example house pricing, square feet, etc.). That edge is assigned an "age" of 0 that changes with time (when the process is adding new neurons to the edge according to certain parameters), but I don't quite understand the concept of "age of an edge", except that if it reaches a certain value, the edge is cut to form different clusters with data. submitted by /u/marveloustom [link] [comments]  ( 82 min )
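    The edge "age" in the post can be illustrated with a minimal sketch. It assumes the Growing Neural Gas variant (the one that starts from two connected neurons and cuts old edges); the helper name `update_edge_ages` and the dictionary layout are hypothetical, chosen only for illustration. On each input, edges touching the closest neuron (the winner) age by one, the edge between the winner and the second-closest neuron is reset to age 0, and edges older than a threshold are removed.

```python
def update_edge_ages(edges, winner, runner_up, max_age):
    """One edge-aging step of Growing Neural Gas (hypothetical helper).

    edges maps frozenset({a, b}) -> age for each undirected edge.
    """
    # Every edge touching the winner gets one step older.
    for edge in edges:
        if winner in edge:
            edges[edge] += 1
    # The winner/runner-up edge is refreshed (or created) with age 0.
    edges[frozenset((winner, runner_up))] = 0
    # Edges that have not been refreshed for too long are cut.
    return {edge: age for edge, age in edges.items() if age <= max_age}


# Neuron 0 wins, neuron 1 is second closest; the stale edge 0-2 is cut.
edges = {frozenset((0, 1)): 0, frozenset((0, 2)): 3}
edges = update_edge_ages(edges, winner=0, runner_up=1, max_age=3)
print(edges)
```

    Under this reading, age counts how long two neurons have gone without being the two closest units to the same input, which is why cutting old edges separates the graph into clusters.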
    Deep Learning on Edge Devices (Jetson Nano and TX2). Help!
    Hi everyone, This is my first time seeking deep learning help on forums but I'm desperate, so please help out! I decided to create a face recognition system and deploy it on two edge devices. For this purpose, I used the FaceBoxes model for face detection and the FaceNet model for creating 128-D embeddings of the detected faces. For classification, I used the MLP classifier, which I trained on Google Colab. I took the trained Colab MLP model and deployed it on Jetson Nano and Jetson TX2. All the major packages (Python, OpenCV, TensorFlow, NumPy, etc.) used the same versions on both devices. Even the JetPack on both devices was the same (4.4.1). The recognition results on each device, individually, were constant. For example, if I ran face recognition on a video on Jetson Nano, it would always give the same accuracy: 98%. Same for Jetson TX2, constant accuracy result: 99%. BUT I have to justify in my course why the two devices show different accuracy results on the SAME TEST VIDEO, using the SAME MODEL, trained on COLAB. Unfortunately, I am not a hardware expert. I thought maybe it could be a difference of quantization or FP16/FP32 or something. But I don't even know what these terms mean. So some help in justifying why the accuracies are different on the two platforms would be HIGHLY APPRECIATED. Please guide me. Thanks! BTW, I used the scikit-learn library for my implementation of the MLP classifier. And I used TensorFlow 2.3.1 for running the models. submitted by /u/Tired__Engineer [link] [comments]  ( 84 min )
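    For readers unfamiliar with the FP16/FP32 terms raised in the post, here is a small sketch of what they mean. It only shows that the same number is stored with different rounding error in 16-bit and 32-bit floating point; whether this explains the accuracy gap between the two Jetson boards is the poster's guess, not something this example establishes. The value `0.1` and the helper `round_trip` are illustrative assumptions.

```python
import struct


def round_trip(value, fmt):
    """Store a Python float in the given IEEE 754 format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


weight = 0.1  # illustrative value, not taken from the post
as_fp32 = round_trip(weight, "f")  # single precision, 32 bits
as_fp16 = round_trip(weight, "e")  # half precision, 16 bits

print(f"FP32 rounding error: {abs(as_fp32 - weight):.2e}")
print(f"FP16 rounding error: {abs(as_fp16 - weight):.2e}")
```

    Small per-value differences like this can accumulate through a network, so two devices that run the same model at different precisions may produce slightly different embeddings and, near a decision boundary, different predictions.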
    GoogleNet from scratch
    I have been trying to use the pre-trained model in PyTorch to do some classification but only in 10 classes. However, I don't know how to change the last layer and train the model again. Now I am considering creating the model from scratch. However, in the paper, it seems like they have 3 softmax outputs to ensure that after some layers, some classification is done. These are only used for training. Can I get away with not adding the 3 softmax outputs and only keeping one for training, or won't it be as good? submitted by /u/Capable-Effective-93 [link] [comments]  ( 83 min )
    I want to ask your opinion on whether I have gathered enough data or whether I should gather more.
    Hello, I want to create a neural network that will read DMG dealt fields and output them from a picture like this. So far I have 1677 of them (they mostly have 3 fields, but some have 2 or 1). Do you think that's enough to label, or should I gather more? And one more question: is it a good idea to try to train it on these pictures, or should I split the pictures so each one is an individual DMG dealt field? submitted by /u/buxA_ [link] [comments]  ( 83 min )
  • Open

    Researchers release open-source photorealistic simulator for autonomous driving
    MIT scientists unveil the first open-source simulation engine capable of constructing realistic environments for deployable training and testing of autonomous vehicles.  ( 7 min )
  • Open

    Google at CVPR 2022
    Posted by Shaina Mehta and Kristen Borg, Program Managers This week marks the beginning of the premier annual Computer Vision and Pattern Recognition conference (CVPR 2022), held both in-person in New Orleans, LA and virtually. As a leader in computer vision research and a Platinum Sponsor, Google will have a strong presence across CVPR 2022 with over 80 papers being presented at the main conference and active involvement in a number of conference workshops and tutorials. If you are attending CVPR this year, please stop by our booth and chat with our researchers who are actively exploring the latest machine learning techniques for application to various areas of machine perception. Our researchers will also be available to talk about and demo several recent efforts, including on-device M…  ( 34 min )
  • Open

    AI in the Big Easy: NVIDIA Research Lets Content Creators Improvise With 3D Objects
    Jazz is all about improvisation — and NVIDIA is paying tribute to the genre with AI research that could one day enable graphics creators to improvise with 3D objects created in the time it takes to hold a jam session. The method, NVIDIA 3D MoMa, could empower architects, designers, concept artists and game developers to Read article > The post AI in the Big Easy: NVIDIA Research Lets Content Creators Improvise With 3D Objects appeared first on NVIDIA Blog.  ( 6 min )
    NVIDIA Joins Forum to Help Lay the Foundation of the Metaverse
    The metaverse is the next big step in the evolution of the internet — the 3D web — which presents a major opportunity for every industry from entertainment to automotive to manufacturing, robotics and beyond. That’s why NVIDIA is joining our partners in the Metaverse Standards Forum, an open venue for all interested parties to Read article > The post NVIDIA Joins Forum to Help Lay the Foundation of the Metaverse appeared first on NVIDIA Blog.  ( 6 min )
    3D Artist Jae Solina Goes Cyberpunk This Week ‘In the NVIDIA Studio’
    3D artist Jae Solina, who goes by the stage name JSFILMZ, steps In the NVIDIA Studio this week to share his unique 3D creative workflow in the making of Cyberpunk Short Film — a story shrouded in mystery with a tense exchange between two secretive contacts. The post 3D Artist Jae Solina Goes Cyberpunk This Week ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.  ( 6 min )
    NVIDIA Accelerates Open Data Center Innovation
    NVIDIA today became a founding member of the Linux Foundation’s Open Programmable Infrastructure (OPI) project, while making its NVIDIA DOCA networking software APIs widely available to foster innovation in the data center. Businesses are embracing open data centers, which require applications and services that are easily integrated with other solutions for simplified, lower-cost and sustainable Read article > The post NVIDIA Accelerates Open Data Center Innovation appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    AI and Blockchain Cloud Services Orchestrate Digital Business Transformation
    The growing ubiquity of IoT and AI has left no industry untouched. Businesses have unlocked their transformational value in meeting the modern needs of consumers, with cloud computing posing as the key enabler and accelerator. Evidently, we are witnessing the action in a panoply of applications. Most evident are in supply chain innovation, healthcare IT,… Read More »AI and Blockchain Cloud Services Orchestrate Digital Business Transformation The post AI and Blockchain Cloud Services Orchestrate Digital Business Transformation appeared first on Data Science Central.  ( 19 min )

  • Open

    "Edge of the universe" 🌌 created on pixelz.ai
    submitted by /u/PixelzJ [link] [comments]  ( 82 min )
    Last Week in AI: Controversy over Google's "sentient" chatbot, DALL-E Mini goes viral, Reddit bans deepfakes sub, AI to improve video calls, and more!
    submitted by /u/regalalgorithm [link] [comments]  ( 82 min )
    In this article, we showcase how to automate your data labeling using transformer models.
    submitted by /u/UBIAI [link] [comments]  ( 82 min )
    VQGAN+CLIP Resource for Text Prompts.
    I've been doing some art lately turning my abstract ink drawings into AI art using VQGAN+CLIP. Does anyone know a resource on how to structure the prompts like targeting a style vs a rendering type or using a specific artist style? Thanks. https://preview.redd.it/bmxcoz0y0u691.jpg?width=3000&format=pjpg&auto=webp&s=1480917b0c307d445b9ad14883e34ca886d8de35 submitted by /u/toaster_artist [link] [comments]  ( 82 min )
    AI Dream 57 - Incredible Cosmic Dream - vqgan clip
    submitted by /u/LordPewPew777 [link] [comments]  ( 82 min )
    AI: Respectfully, I can take Batman.
    submitted by /u/Ania_IntelligentAF [link] [comments]  ( 82 min )
    Is there an AI which I can use to create rap songs?
    It would be amazing, because I love to rap and I am interested in whether an AI could help me write some songs. submitted by /u/xXNOdrugsForMEXx [link] [comments]  ( 82 min )
    Salesforce AI Open-Sources ‘OmniXAI’: A Python-based Machine Learning Library That Provides One-Stop Explainable AI (XAI) Solution To analyze, Debug, And Interprets AI Models
    Salesforce has built an open-source machine learning framework called OmniXAI, which stands for Omni eXplainable AI. This library takes an “omni-directional” approach to XAI, with extensive interpretable ML features that address many problems with explaining ML model decisions in reality. OmniXAI is a one-stop comprehensive library that makes explainable AI accessible to academics requiring explanations for each stage of the machine learning process. This is not limited to data exploration, feature engineering, model development, evaluation, decision making, etc. 🚦 A one-stop solution for analyzing different stages in a standard ML pipeline in real-world applications. 🚦 Two types of explanations — local and global 🚦 Includes most popular explanation methods, such as feature-attribution/importance explanation (LIME [1], SHAP [2], Integrated Gradients (IG) [3], Grad-CAM [4], L2X), counterfactual explanation (MACE [5]), partial dependence plots (PDP), and model-specific methods (linear and tree models) 🚦 Can be applied on tabular, vision, NLP, and time-series tasks. Continue reading | Checkout the paper, article, github, dashboard submitted by /u/No_Coffee_4638 [link] [comments]  ( 83 min )
    Chills… simply beautiful xpost r/singularity
    submitted by /u/ViperOrel23 [link] [comments]  ( 82 min )
    Budgeted reinforcement learning problem
    Consider a budgeted sequential decision problem where we want to maximize the cumulative reward R over a finite horizon H by deciding how much of a budget B we allocate to channel x and channel y per timestep t. We can think of R as sales. The horizon is set to 30 days. The cumulative spent budget must not exceed the set budget B. At each timestep, we decide how much budget we want to allocate to each of the channels, and at each timestep we see the amount of sales the allocation generated. We cannot see how much sales one sole channel generated but only the total sales both of the channels generated. We can also retrieve some contextual variables that could be thought of as a state/observation for each channel, let's call them exogenous variables = {exog1, exog2, exog3 .... exog 10…  ( 87 min )
    Need help upscaling an image 5x using Gigapixel
    Hi, I'm looking to 5x the image to print a playmat for my board game, but the original resolution is not high enough for its size. I tried a bunch of online tools but none seems good enough. Any help is highly appreciated submitted by /u/Rodcy [link] [comments]  ( 82 min )
    Using machine learning in the travel industry - CHALLENGE
    Hello everyone! I am from tryp.com, a travel-tech startup that is using AI to create complex travel itineraries on the go, from minimal user constraints. Trips created in <15s for defined time search range and start location Currently we are embarking on a new challenge, to improve our offering: Creating an AI, trained from screen recordings of purchases in 100s of websites, that can purchase travel tickets from any website, in any language. Has anyone worked on a similar challenge? We are looking to form a team to tackle such a challenge! submitted by /u/arangel96 [link] [comments]  ( 82 min )
    Quasi - A platform where people use AI to create with zero code
    submitted by /u/roblox22y [link] [comments]  ( 82 min )
    FLYING THRU SPACE AT 432HZ | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
    “Sentience” is the wrong discussion to have on AI right now
    submitted by /u/bendee983 [link] [comments]  ( 84 min )
    I need help with my major project
    I have a data set of 500 graphs. I want to compare the input graph with the 500 graphs I already have. Is there a way to do it? I really need this to be done. If the graph thing isn't possible, is there a way to compare the co-ordinates or parameters used to construct the graph? submitted by /u/sexyhoooman_hmu [link] [comments]  ( 82 min )
    I think language models can't be sentient, but the creatures they write about can be.
    Ok guys, I write this opinion often in comments, but I think it deserves a separate post. I think most of us agree, that language models can't be sentient, because all in all LMs are just mathematical concepts, that describe probability of some combination of letters to occur in text. But, I believe, that the characters described in generated texts can fulfill any definition of what is "sentient", if language model is good enough. Look: Can they react to events that happen in their universe? Yes. Can they make plans in their universe? Yes. Can they express feelings in their imaginary universe? Yes Will they avoid pain in their imaginary universe? Most of them will. Will they seek pleasure in their imaginary universe? Most of them will. Whatever criterion you can come up with, a good enough LM can write a text with a character, that satisfies that criterion. And those characters are sentient, but just not in our universe, but in their own universes, that get born in imagination of the combined human+computer system, when we read the generated texts. If we view those characters from this perspective, then we can also solve the question, "which moral rules should we apply to the artificially sentient beings": since those beings exist in imagination of some sort of a system, then we should apply same moral standards as we apply to any other imaginary creatures. submitted by /u/Arqwer [link] [comments]  ( 85 min )
    How well does an upscaling AI really work?
    I want to upscale some images from the internet and want to make some posters out of them. What is the best way to do it? submitted by /u/xXLisa28Xx [link] [comments]  ( 82 min )
    Joe Biden falling off a bicycle . (A.I generation)
    submitted by /u/OneFinding1429 [link] [comments]  ( 82 min )
    Memory requirements for tabular Q-learning vs deep neural network?
    I want to compare the space complexity/memory requirement of tabular Q-learning v.s. deep neural Q-network (DQN). I think DQN would be faster and Q-table has a disadvantage at large table sizes but consider the following case. A Q-table has the size 14 states *169 actions= 2366 entries and (say) a fully connected DNN whose number of parameters comes out to be like >8000. Space complexity/memory-wise, isn't storing a look-up q-table of 2366 size better than storing 8000 parameters of neural net? I never implemented a DNN before so no idea how much space neural net parameters take. Please give your opinions on this scenario. Moreover, do you think thus a 2366-sized Q-table is large as per Q-learning norms which people use? I couldn't find any rule of thumb... submitted by /u/Simple-Soil-230 [link] [comments]  ( 83 min )
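    The comparison in the post comes down to simple arithmetic. The sketch below uses the figures quoted there (14 states, 169 actions, 8000 network parameters as a lower bound) and assumes every value is stored as a 32-bit float; that storage format, and ignoring any framework overhead, are assumptions rather than details from the post.

```python
BYTES_PER_FLOAT32 = 4  # assumption: every value is a 32-bit float

n_states, n_actions = 14, 169  # figures quoted in the post
dqn_parameters = 8000          # lower bound quoted in the post

q_table_entries = n_states * n_actions
q_table_bytes = q_table_entries * BYTES_PER_FLOAT32
dqn_bytes = dqn_parameters * BYTES_PER_FLOAT32

print(f"Q-table: {q_table_entries} entries, {q_table_bytes} bytes")
print(f"DQN:     {dqn_parameters} parameters, {dqn_bytes} bytes")
```

    Under these assumptions both fit in a few tens of kilobytes, so at this problem size the table is the smaller of the two, and memory is unlikely to be the deciding factor either way.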
    FLYING THRU SPACE AT 432HZ | FAST MODE | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
    A news script generated by InferKit/Talk to Transformer (the words in bold are what I typed)
    Breaking News, a man attempted to steal thousands of Nintendo games from all stores to sell in Eastern Europe. He left many owners out of pocket. The man was spotted by a member of staff in one of the shops. He then jumped on the counter and grabbed the game bundles. This was spotted by one of the customers and the member of staff proceeded to chase him down and hold him until the police arrived. In a statement Nintendo said the man had just finished his shift at a nearby shop and decided to take the games as he had them with him anyway. Nintendo called the incident ‘unusual’ submitted by /u/Wat3rb0t [link] [comments]  ( 83 min )
    Is it fair to describe a human as a system running 2 mandatory functions
    Is it fair to describe a human as a system that's based on the same instructions as all systems above us, including the universe itself, which involve 2 functions always set to ON; those functions being Self-Correct (survive, adapt) Self-Duplicate (procreate) in that order, too (obviously), because that allows for the parent to have been faced with novel challenges and threats to overcome, updating their DNA, then procreating and releasing the new "patch", along with another person's updated DNA. So - constant progression. If I described any form of "life", including a human, as this, would you say I am incorrect? submitted by /u/PrimalJohnStone [link] [comments]  ( 87 min )
  • Open

    How to determine the receptive fields of various layers in CNN?
    submitted by /u/__hy23__ [link] [comments]  ( 82 min )
    Having trouble implementing the derivative of softMax function.
    I'm pretty bad at math but I'm trying to make my own Neural Network, and on the output layer I use the softMax function. The problem is that I've looked at most guides, StackOverflow posts, and GitHub repositories, and I just cannot figure out how to implement it in code. All my weights, biases, nodes, and activated nodes are stored in matrices. (I'm not looking for math explanations, just an implementation of how to get the deltaOutputWeights and Biases) submitted by /u/uvuvwevwevwe_osas2 [link] [comments]  ( 82 min )
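    One common way to get the output-layer deltas the post asks about is sketched below. It assumes the loss is cross-entropy with a one-hot target, in which case the gradient with respect to the pre-softmax values reduces to `probabilities - target`. The function names, the plain-list matrix layout (one row of weights per hidden node), and the sample numbers are all hypothetical; the post does not say which loss or layout is used.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of pre-activation values."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def output_layer_gradients(hidden_activations, logits, one_hot_target):
    """Gradients for a softmax output layer trained with cross-entropy loss."""
    probs = softmax(logits)
    # With cross-entropy, the error at each output node is simply p - y.
    delta = [p - y for p, y in zip(probs, one_hot_target)]
    # Weight gradient: outer product of hidden activations and output error.
    delta_weights = [[a * d for d in delta] for a in hidden_activations]
    # Bias gradient equals the output error itself.
    delta_biases = list(delta)
    return delta_weights, delta_biases


d_weights, d_biases = output_layer_gradients(
    hidden_activations=[1.0, 2.0], logits=[0.0, 0.0], one_hot_target=[1, 0]
)
print(d_weights, d_biases)
```

    With a different loss (for example mean squared error) the shortcut `p - y` no longer applies and the full softmax Jacobian is needed, so this sketch should not be reused unchanged in that case.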
    Could we create a computer that works like the human brain? 🤔
    submitted by /u/tnkrbel2954o8 [link] [comments]  ( 82 min )
  • Open

    [D] Two flaws in discussions surrounding the recent LaMDA controversy: it's not stateless, and it is dual process; but whether it's sentient is far less important than how it would edit Wikipedia
    I'm sure everyone here has heard about the LaMDA sentience controversy by now, so in addition to linking to its arxiv full text ("LaMDA: Language Models for Dialog Applications" by Thoppilan, et al., 2022), I'd also like to correct a few points that I see most people getting wrong. First, unlike plain GPT-3, Davinci, and the like, LaMDA is not stateless. Its sensibleness metric (including whether responses contradict anything said earlier) is fine-tuned by pre-conditioning each turn with many of the most recent interactions, on a user-by-user basis. Its grounding mechanism has the potential to add a great deal more state, if the interactions become part of a database it can query to formulate responses, but as far as I know they haven't done that yet. Secondly, that grounding mechanism m…  ( 91 min )
    [D] Attending ICML 2022 Fully Virtual Attendance
    Hi everyone. I wanted to start a discussion to see whether other accepted authors were planning on attending ICML 2022 fully virtually? From my understanding, we pre-record our talk and they are considering a virtual poster session. Are there any in-person obligations we have as authors? For context, my entire PhD has overlapped with COVID, so everything has been virtual. I would rather not travel and have some other personal plans that overlap with the duration of the conference. I would be interested in other people's views and whether I may be missing a lot by not attending in-person. edit: lol sorry for the incoherent title submitted by /u/generic_r [link] [comments]  ( 84 min )
    [D] Any relatively new text2image models with fine tuning?
    I have a relatively small dataset of 256x256 images with text captions, and it's definitely not the best solution to train something from scratch with that, so I wonder what ways I have to fine-tune something on my dataset. I tried to use something from the DALL-E mini repo, but it does not provide exact code for fine-tuning or enough documentation for me, and I failed to write my own. Similar story with the Latent diffusion repo: I couldn't use their training code to fine-tune an existing model, and it seems they didn't even provide enough code for training a text2image model, as their config is not working. The only things I could find were the ruDALL-E and ruDOLPH models, but they are relatively old and, most importantly, they work with Russian and not English text, which is not what I need. I found some methods for fine-tuning the CLIP model, and it seems pretty easy, but I don't know what to do next with it, as something like VQGAN+CLIP works pretty badly in comparison with this year's SOTA solutions. So, if anybody knows of any guides, repos, colabs, etc. for fine-tuning text2image models, they are welcome submitted by /u/Chelokot [link] [comments]  ( 84 min )
    [D] In your experience, what's the thing that can boost an ML model's performance the most? Is it the hyperparameter tuning, feature engineering or ensembling? Or is it something else?
    I'm interested to know which part of ML do engineers invest their time in that actually pays off a lot when it comes to getting well-performing models. Just so I know whether it is right to spend more time trying out different X (say, Feature Eng) configurations in favour of Y (say, Ensembling) configurations. submitted by /u/4bedoe [link] [comments]  ( 92 min )
    [P] Using machine learning in the travel industry - CHALLENGE
    Hello everyone! I am from tryp.com, a travel-tech startup that is using AI to create complex travel itineraries on the go, from minimal user constraints. Trips created in <15s for defined time search range and start location Currently we are embarking on a new challenge, to improve our offering: Creating an AI, trained from screen recordings of purchases in 100s of websites, that can purchase travel tickets from any website, in any language. Has anyone worked on a similar challenge? We are looking to form a team to tackle such a challenge! submitted by /u/arangel96 [link] [comments]  ( 85 min )
    [P] Colab Themes: A Chrome Extension to Customize the Style of Google Colab
    Changes the page CSS and text editor and generates Python code to change Matplotlib styles to match the theme the user chooses. Users may import themes or use any of the 50+ provided. Colab Themes enhances the data science experience by transforming the way users view their code and their data! Check it out on GitHub or install it via the Chrome Web Store submitted by /u/d8aDev [link] [comments]  ( 85 min )
    [R] PowerShap: A power-full Shapley feature selection method.
    This method uses statistical hypothesis testing and power calculations on Shapley values, enabling fast and intuitive wrapper-based feature selection. The complete library and methods are fully compatible with Sklearn, LightGBM, CatBoost, and more are coming in further following releases and the library can be found here: https://github.com/predict-idlab/powershap! The library is open-source and usable out-of-the-box as shown in the video! The paper is already released on arXiv: https://arxiv.org/abs/2206.08394. Furthermore, the work will be presented at ECML PKDD 2022. How does it work? The complete method is built on the assumption that a random feature, that contains no information, should have a lower impact on the predictions compared to an informative feature. To test this, PowerS…  ( 86 min )
    [D] Best program (text editor) to use for creating a neural network (GAN) in python?
    I am a master's student writing my dissertation about using GANs to generate classical music. I am studying operations research (applied math) so all my coding experience is with R, except for one Python class I took in 2017 where we used Thonny as an interface. I am comfortable with the mathematical theory behind neural networks and deep learning, and can create them comfortably in R, but my supervisor (as well as an earlier post in this sub) recommends using Python for GANs. I am very familiar with R (and always use Rstudio) but am essentially a rookie when it comes to Python. Thus I am curious about what text editor you think would be best suited for this task (my friends have mentioned Atom but wanted to check here too). I will only be using this editor for creating the generative adversarial network, so if it's intuitive and easy to use that's ideal. I assume that the easiest way to run the code is just through terminal, unless you have any suggestions about that as well? Also, if you generally have any tips for creating NNs in python that simplify the process or pro-tips, that would be much appreciated too! Thank you:) submitted by /u/carl535 [link] [comments]  ( 86 min )
    [D] Reducing bias when forecasting retail sales with boosting model
    I'm forecasting future sales for products in retail stores, using a LightGBM model. My model has a decent forecast accuracy, but the forecasts are biased (the average forecast error is negative, the model is consistently under-forecasting). Do you have any idea or tips on how to avoid bias when forecasting time series with boosting models? Here are some more details: I'm making forecasts at the Day x Product x Store granularity (i.e 1 forecast every day for each product in each store). The forecasting horizon is +7 days. I'm training a single model to forecast all products, stores and time horizons. The main features are lags of sales, calendar info (day of the week, month...), product info (category, price) and store info. Evaluation is made with a time-based cross-validation. Thank you for your help! submitted by /u/ML-ATF [link] [comments]  ( 84 min )
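    The bias described in the post can be measured separately from accuracy. The sketch below uses the sign convention implied there (error = forecast minus actual, so a negative mean means under-forecasting); the helper names and the sample numbers are made up for illustration and do not come from the post.

```python
def forecast_bias(actuals, forecasts):
    """Mean signed error; negative values indicate under-forecasting."""
    errors = [f - a for f, a in zip(forecasts, actuals)]
    return sum(errors) / len(errors)


def mean_absolute_error(actuals, forecasts):
    """Average magnitude of the error, ignoring its direction."""
    errors = [abs(f - a) for f, a in zip(forecasts, actuals)]
    return sum(errors) / len(errors)


# Illustrative daily sales for one product in one store.
actuals = [10, 12, 8]
forecasts = [9, 10, 8]

print("bias:", forecast_bias(actuals, forecasts))
print("MAE: ", mean_absolute_error(actuals, forecasts))
```

    Tracking both numbers per product, store, and horizon in the time-based cross-validation makes it easier to see whether the under-forecasting is uniform or concentrated in a few segments.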
    [D] What's the current state of the art in image style transfer?
    Diffusion models like Dall E are producing incredible images. What's the current state of the art for taking one image and combining it with the style from another? Could anyone point me to a handful of references please? submitted by /u/Razcle [link] [comments]  ( 85 min )
    [D] When to post on Arxiv?
    I ask the question with respect to culture rather than practice (i.e. I could obviously post just about anything!) but as I'm new to research in the field I am curious to know if it is used to post working papers or whether it is more typical to prepublish work that has already been sent to a conference/journal? If an Arxiv paper gets traction/interest can it then be sent to a conference or journal later on without self plagiarising? submitted by /u/Swimming-Pool397 [link] [comments]  ( 89 min )
    [D] Any research specific PyTorch based boilerplate code?
    Any research specific PyTorch based boilerplate code? I am a PhD student working in Deep Learning based NLP methods. I am trying to develop a boilerplate code of my own. Looking for inspirations or ideas? submitted by /u/Relative_Tip_3647 [link] [comments]  ( 85 min )
    [D] Laptops with NVIDIA Mobile GPUs are better option than Apple Silicon for ML/DL Tasks
    It is really disappointing to find out that Apple Silicon based machines do not keep up with even the mobile Nvidia GPUs present in laptops. They marketed the machine like it is the best with its unique unified memory architecture, astonishing memory bandwidth, powerful GPU cores, etc. They released M1 Pro, M1 Max and even M1 Ultra. All of these are just overpriced chips offering no significant value for money. One can easily get any laptop with NVIDIA 3080 mobile GPU, and it would 1) be cheaper and 2) have much better performance than even the M1 Ultra. Sure, the battery life and the ecosystem of Apple is good. However, if it is gonna take 30 mins per epoch on M1 Pro/Max, whereas it will just take 5 mins per epoch on these Nvidia Mobile GPUs, I think it's a no-brainer to just go with Nvidia based laptops for ML/DL workflows. Would love to hear the opinions of others on this. If anyone has some more benchmarks, do share them here. You could make use of the unified memory, increase the batch size and then try to compare how much of a performance improvement it makes. But still I think it might not be able to compete with Nvidia 3080 Mobile. EDIT: I'm just saying that if you ever have to train something on your laptop in a local environment just for testing purposes before you actually use cloud resources to train the final model, the process would be slower when using Apple silicon compared to Nvidia Mobile GPUs. Cloud based resources charge you per hour, so it is better to test things out locally and then do just the training part in the cloud. My complaint was that Apple could definitely up their game and they still have a long way to go. They have been comparing their chip with dedicated GPUs like NVIDIA in their presentations and keynotes. They keep showing that it's better than these dedicated GPUs. However in reality it depends on the task, and it definitely is not better in ML/DL tasks. submitted by /u/Rohit901 [link] [comments]  ( 94 min )
    [D] Higher order arity in image-based object detection models? Transfer learning: objects → attributes → relations
    Convolutional neural networks have a well-known track record when it comes to detecting objects in images. A person, a cat, a helicopter; given enough examples pretty much any discrete visible entity is learnable. But from the perspective of human language, this kind of model only produces nouns. Or in terms of arity (aka adicity/degree/valency/rank), one might say these are all nullary functions/clauses. In other words, they're concepts that can be expressed without any contextual variables/arguments. One step up on the arity scale are of course unary functions. Simply put: attributes. "Large", "narrow", "heavy", "soft", "green" etc are concepts that only make sense in combination with a context argument defining the object described/modified by the attribute. Binary (and any larger arity) functions are what we usually think of as relations. "larger than", "attached to", "on top of", "behind", "next to" etc are concepts that need (at least) two context arguments. Anyway, back to machine learning. It seems to me that concepts with higher order arity too should be learnable from image examples just fine, provided that context-defining features are included in the input data along with the raw visual data. For example, spatial relations such as "behind"/"in front of" and "below"/"above" should be inferrable when 2 bounding boxes (or polygons, etc) are included in the input samples. I imagine this pattern to be quite amenable to transfer learning, given that those bounding boxes themselves could be the outputs of a conventional object detection model. Are there popular models out there that can make such relational predictions? Also, is there an established convention on how to encode context-defining features? What words should I Google to read up on relevant literature? (Sorry about the noob(-ish?!) content, but I didn't get any response over at /r/MLQuestions.) submitted by /u/WouldNotLickYourAnus [link] [comments]  ( 86 min )
    [R] Evolution through Large Models
    submitted by /u/hardmaru [link] [comments]  ( 84 min )
  • Open

    Reinventing or Reusing? Home-made vs Third-party Solutions
    Say you need to implement some machine learning system. Should you purchase a product, re-use open-source code, or develop your own algorithms? The decision does not need to be a binary one. I discuss the pluses and minuses of both options. Combining them offers the best of both worlds. I explain with examples how to… The post Reinventing or Reusing? Home-made vs Third-party Solutions appeared first on Data Science Central.  ( 21 min )
    What type of Data Does a Sankey Diagram Generally Use?
    Operating in an environment that deals with complex data types may seem extremely stressful, especially if you are not backed up. Data visualization breaks down complex data values into simple and flexible elements that you can easily deal with without being worried. However, you need to have a good data visualization tool that can make… The post What type of Data Does a Sankey Diagram Generally Use? appeared first on Data Science Central.  ( 21 min )
  • Open

    Trying to create an observation space, but nothing I do seems to work
    So just to preface, my reset() needs to return 7 integers, 3 of them are either 1 or 0, and the other 4 can be any number from 0-6. Initially, I tried to use the spaces.Dict method of creating the spaces: In the init() space = { "left_line": spaces.Box(low=np.array([0]), high=np.array([1]), dtype=np.int32), "mid_line": spaces.Box(low=np.array([0]), high=np.array([1]), dtype=np.int32), "right_line": spaces.Box(low=np.array([0]), high=np.array([1]), dtype=np.int32), "left_prox": spaces.Box(low=np.array([0]), high=np.array([6]), dtype=np.int32), "front_left_prox": spaces.Box(low=np.array([0]), high=np.array([6]), dtype=np.int32), "front_right_prox": spaces.Box(low=np.array([0]), high=np.array([6]), dtype=np.int32), "right_prox": spaces.Box(low=np.array([0]), high=np.array([6]), dtype=np.…  ( 84 min )
    Double Q-learning in SB3's SAC implementation?
    Hello, According to this change, SAC and TD3 in the SB3 implementation can take an arbitrary number of critics. Indeed, if we check the source code for e.g. SAC's train function, we find: next_q_values = th.cat(self.critic_target(replay_data.next_observations, next_actions), dim=1) next_q_values, _ = th.min(next_q_values, dim=1, keepdim=True) # ... q_values_pi = th.cat(self.critic(replay_data.observations, actions_pi), dim=1) min_qf_pi, _ = th.min(q_values_pi, dim=1, keepdim=True) There, the minimum value of the n=2 critic networks is taken across the batch to calculate both the actor and critic loss. I looked everywhere, but I found no particular documentation of why this is being done. I assume this is simply the double Q-learning trick being applied. Can someone confirm or refute this? Further, is it best practice to simply slap double Q-learning into any value-based RL method? Does anyone have experience with more than `n_critics=2` aka does n-fold Q-learning stabilize training significantly beyond just double Q-learning? Just some thoughts that I had nobody else to share with... submitted by /u/IAmMiddy [link] [comments]  ( 83 min )
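    The minimum over critics quoted above matches what the TD3 paper calls clipped double Q-learning, which SAC implementations commonly reuse. Because the critic outputs are concatenated along dim=1, the minimum runs across the critics for each sample, not across the batch. The sketch below restates that target computation in plain Python for illustration only; the function name, argument names, and numbers are hypothetical and this is not SB3's code.

```python
def clipped_double_q_target(rewards, dones, next_q_per_critic, next_log_prob,
                            gamma=0.99, ent_coef=0.2):
    """SAC-style target that uses the minimum over critics for each sample.

    next_q_per_critic holds one list of next-state Q-values per critic.
    """
    targets = []
    for i, (r, d) in enumerate(zip(rewards, dones)):
        # Minimum across critics for sample i (pessimistic value estimate).
        min_q = min(q[i] for q in next_q_per_critic)
        # Entropy-regularized value of the next state.
        soft_value = min_q - ent_coef * next_log_prob[i]
        # No bootstrapping when the episode has terminated.
        targets.append(r + (1.0 - d) * gamma * soft_value)
    return targets

# Hypothetical batch of two transitions and two critics.
rewards = [1.0, 0.0]
dones = [0.0, 1.0]
next_q = [[2.0, 5.0], [3.0, 4.0]]
next_log_prob = [-1.0, -0.5]
targets = clipped_double_q_target(rewards, dones, next_q, next_log_prob,
                                  gamma=0.5, ent_coef=0.1)
```

    Passing more than two inner lists in `next_q` corresponds to the `n_critics > 2` case raised in the question; whether that stabilizes training further is an empirical matter the sketch does not address.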
    [QUESTION] Number of possible joint policies in a Dec-POMDP and the time required to evaluate each one.
    Hi everyone, I was reading a book about Dec-POMDPs and came across this curious result where the author specifies the number of possible joint policies to evaluate and the time needed to evaluate a single joint policy, but I can't understand how he got to these results. Can anyone please explain the logic used here? https://preview.redd.it/etef8zsmks691.png?width=900&format=png&auto=webp&s=89f1864b24798bda08cbfbbe76e5c5b03c5f3937 submitted by /u/souhaielbensalem [link] [comments]  ( 84 min )
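    The screenshot is not reproduced here, so the following is a sketch under an assumption: that the result in question is the usual brute-force count for deterministic, finite-horizon policies. Under that assumption, each agent's policy assigns one action to every observation history of length 0 to h-1, so an agent with |O| observations has 1 + |O| + ... + |O|^(h-1) histories and |A| raised to that number of policies. The function names are hypothetical, and all agents are assumed to have the same number of actions and observations.

```python
def num_observation_histories(num_obs, horizon):
    """Observation histories of length 0 .. horizon-1 for a single agent."""
    return sum(num_obs ** t for t in range(horizon))

def num_joint_policies(num_agents, num_actions, num_obs, horizon):
    """Deterministic joint policies: one action choice per history, per agent."""
    per_agent = num_actions ** num_observation_histories(num_obs, horizon)
    return per_agent ** num_agents

def num_joint_observation_histories(num_agents, num_obs, horizon):
    """Joint observation histories of length exactly `horizon`."""
    return (num_obs ** num_agents) ** horizon

# Two agents, two actions, two observations, horizon 3.
histories = num_observation_histories(2, 3)
joint_policies = num_joint_policies(2, 2, 2, 3)
joint_histories = num_joint_observation_histories(2, 2, 3)
```

    The count of policies is doubly exponential in the horizon, and evaluating a single joint policy has to account for joint observation histories, whose number grows as |O|^(n·h); that is where the per-policy evaluation cost comes from under this reading.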
    'numpy.random._generator.Generator' object has no attribute 'randint'
    So I heard that this error was a bug in the stable_baselines3 module. How do I fix this? submitted by /u/ableflyer [link] [comments]  ( 82 min )
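    The error message itself reflects an API difference in NumPy: objects of type `numpy.random.Generator` provide `integers` rather than `randint`, while the legacy `RandomState` still provides `randint`. The post does not say which code makes the failing call, so the sketch below only demonstrates that difference; where the call actually needs changing (or which package versions need aligning) is not established here.

```python
import numpy as np

rng = np.random.default_rng(seed=0)    # returns a numpy.random.Generator
has_randint = hasattr(rng, "randint")  # Generator does not define randint
sample = rng.integers(0, 7, size=4)    # integers drawn from [0, 7)

legacy = np.random.RandomState(0)      # legacy API
legacy_sample = legacy.randint(0, 7, size=4)  # randint exists here, also [0, 7)
```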
    V-MPO - what do you think
    V-MPO seems to be the state of the art used by DeepMind nowadays. It has been 3 years since the paper was published; however, there are very few public implementations online. I was wondering why, and whether anybody has ever managed to reproduce some of the results? I couldn’t with the version I’ve partially recoded from the internet, but this may come from misunderstandings on my side. submitted by /u/Jogima-cyber [link] [comments]  ( 82 min )
    POV: You’re an Animo watching your entire island burn in our reinforcement learning game🤖🔥🏝️
    submitted by /u/AnimoIsland [link] [comments]  ( 82 min )
    Memory requirements for tabular Q-learning vs deep neural network?
    I want to compare the space complexity/memory requirement of tabular Q-learning vs. a deep Q-network (DQN). I think DQN would be faster and a Q-table has a disadvantage at large table sizes, but consider the following case. A Q-table has the size 14 states * 169 actions = 2366 entries and (say) a fully connected DNN has more than 8000 parameters. Space complexity/memory-wise, isn't storing a look-up Q-table of 2366 entries better than storing 8000 neural network parameters? I have never implemented a DNN before, so I have no idea how much space neural network parameters take. Please give your opinions on this scenario. Moreover, do you think a 2366-entry Q-table is large by the norms of Q-learning in practice? I couldn't find any rule of thumb... submitted by /u/Simple-Soil-230 [link] [comments]  ( 86 min )
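    The comparison in the question comes down to simple arithmetic once a storage format is fixed. The sketch below assumes 32-bit floats for both the table entries and the network parameters, and counts only the values themselves; optimizer state, target networks, and a replay buffer (all of which a DQN setup would typically add) are deliberately left out.

```python
BYTES_PER_FLOAT32 = 4  # assumption: everything is stored as 32-bit floats

q_table_entries = 14 * 169                      # states * actions
q_table_bytes = q_table_entries * BYTES_PER_FLOAT32

dnn_params = 8000                               # lower bound from the post
dnn_bytes = dnn_params * BYTES_PER_FLOAT32

ratio = dnn_bytes / q_table_bytes
```

    Under these assumptions the table needs about 9.5 kB and the network at least 32 kB, so for this problem size the table is the smaller of the two, and both are tiny by current hardware standards.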
    Why do DQN learning-based methods dominate the leaderboards for Atari Games?
    ​ https://preview.redd.it/efbfdsbhmo691.png?width=964&format=png&auto=webp&s=4593b345d28e393447c4cf66af2abdbca72309c9 Everywhere that I have read, Policy-Based methods are supposed to be more robust and converge faster than Value-Based methods. Why does this table contradict that? Edit: Link to image: Atari games Benchmark (Atari Games) | Papers With Code submitted by /u/atomicburn125 [link] [comments]  ( 87 min )
  • Open

    Build an appointment scheduler interface integrated with Meta using Amazon Lex and Amazon Connect
    This blog post is co-written with Nick Vargas and Anna Schreiber from Accenture. Scheduling customer appointments is often a manual and labor-intensive process. You can utilize advances in self-service technology to automate appointment scheduling. In this blog post, we show you how to build a self-service appointment scheduling solution built with Amazon Lex and Amazon […]  ( 10 min )
  • Open

    The King’s Swedish: AI Rewrites the Book in Scandinavia
    If the King of Sweden wants help drafting his annual Christmas speech this year, he could ask the same AI model that’s available to his 10 million subjects. As a test, researchers prompted the model, called GPT-SW3, to draft one of the royal messages, and it did a pretty good job, according to Magnus Sahlgren, The post The King’s Swedish: AI Rewrites the Book in Scandinavia appeared first on NVIDIA Blog.  ( 6 min )
  • Open

    Achieving Fairness at No Utility Cost via Data Reweighing with Influence. (arXiv:2202.00787v2 [cs.LG] UPDATED)
    With the fast development of algorithmic governance, fairness has become a compulsory property for machine learning models to suppress unintentional discrimination. In this paper, we focus on the pre-processing aspect for achieving fairness, and propose a data reweighing approach that only adjusts the weight for samples in the training phase. Different from most previous reweighing methods which usually assign a uniform weight for each (sub)group, we granularly model the influence of each training sample with regard to fairness-related quantity and predictive utility, and compute individual weights based on influence under the constraints from both fairness and utility. Experimental results reveal that previous methods achieve fairness at a non-negligible cost of utility, while as a significant advantage, our approach can empirically release the tradeoff and obtain cost-free fairness for equal opportunity. We demonstrate the cost-free fairness through vanilla classifiers and standard training processes, compared to baseline methods on multiple real-world tabular datasets. Code available at https://github.com/brandeis-machine-learning/influence-fairness.  ( 2 min )
    Channel-wise Mixed-precision Assignment for DNN Inference on Constrained Edge Nodes. (arXiv:2206.08852v1 [cs.LG])
    Quantization is widely employed in both cloud and edge systems to reduce the memory occupation, latency, and energy consumption of deep neural networks. In particular, mixed-precision quantization, i.e., the use of different bit-widths for different portions of the network, has been shown to provide excellent efficiency gains with limited accuracy drops, especially with optimized bit-width assignments determined by automated Neural Architecture Search (NAS) tools. State-of-the-art mixed-precision works layer-wise, i.e., it uses different bit-widths for the weights and activations tensors of each network layer. In this work, we widen the search space, proposing a novel NAS that selects the bit-width of each weight tensor channel independently. This gives the tool the additional flexibility of assigning a higher precision only to the weights associated with the most informative features. Testing on the MLPerf Tiny benchmark suite, we obtain a rich collection of Pareto-optimal models in the accuracy vs model size and accuracy vs energy spaces. When deployed on the MPIC RISC-V edge processor, our networks reduce the memory and energy for inference by up to 63% and 27% respectively compared to a layer-wise approach, for the same accuracy.  ( 2 min )
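    To illustrate the difference between layer-wise and channel-wise bit-width assignment described in the abstract, the sketch below quantizes each channel of a toy weight tensor with its own bit-width and compares the resulting storage. It is not the paper's NAS: the bit-widths are hand-picked, the symmetric uniform quantizer is a generic choice, and all names and numbers are hypothetical.

```python
def quantize_channel(weights, bits):
    """Symmetric uniform quantization of one channel to a given bit-width."""
    levels = 2 ** (bits - 1) - 1                 # e.g. 127 for 8 bits
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights]

def weight_bits(channels, bitwidths):
    """Total storage in bits when channel i uses bitwidths[i] bits per weight."""
    return sum(len(ch) * b for ch, b in zip(channels, bitwidths))

# Toy layer with two output channels of four weights each.
channels = [[0.5, -1.0, 0.25, 0.75], [0.01, -0.02, 0.03, 0.0]]

layer_wise = weight_bits(channels, [8, 8])    # one bit-width for the layer
channel_wise = weight_bits(channels, [8, 2])  # lower precision for channel 1

quantized = [quantize_channel(ch, b) for ch, b in zip(channels, [8, 2])]
```

    The point of the finer granularity is that precision can be spent only where it matters; how to pick the per-channel bit-widths automatically is exactly what the paper's search addresses and is not covered by this sketch.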
    Adapting the Linearised Laplace Model Evidence for Modern Deep Learning. (arXiv:2206.08900v1 [stat.ML])
    The linearised Laplace method for estimating model uncertainty has received renewed attention in the Bayesian deep learning community. The method provides reliable error bars and admits a closed-form expression for the model evidence, allowing for scalable selection of model hyperparameters. In this work, we examine the assumptions behind this method, particularly in conjunction with model selection. We show that these interact poorly with some now-standard tools of deep learning--stochastic approximation methods and normalisation layers--and make recommendations for how to better adapt this classic method to the modern setting. We provide theoretical support for our recommendations and validate them empirically on MLPs, classic CNNs, residual networks with and without normalisation layers, generative autoencoders and transformers.
    What do navigation agents learn about their environment?. (arXiv:2206.08500v1 [cs.CV])
    Today's state of the art visual navigation agents typically consist of large deep learning models trained end to end. Such models offer little to no interpretability about the learned skills or the actions of the agent taken in response to its environment. While past works have explored interpreting deep learning models, little attention has been devoted to interpreting embodied AI systems, which often involve reasoning about the structure of the environment, target characteristics and the outcome of one's actions. In this paper, we introduce the Interpretability System for Embodied agEnts (iSEE) for Point Goal and Object Goal navigation agents. We use iSEE to probe the dynamic representations produced by these agents for the presence of information about the agent as well as the environment. We demonstrate interesting insights about navigation agents using iSEE, including the ability to encode reachable locations (to avoid obstacles), visibility of the target, progress from the initial spawn location as well as the dramatic effect on the behaviors of agents when we mask out critical individual neurons. The code is available at: https://github.com/allenai/iSEE  ( 2 min )
    Detecting Adversarial Examples in Batches -- a geometrical approach. (arXiv:2206.08738v1 [cs.LG])
    Many deep learning methods have successfully solved complex tasks in computer vision and speech recognition applications. Nonetheless, the robustness of these models has been found to be vulnerable to perturbed inputs or adversarial examples, which are imperceptible to the human eye, but lead the model to erroneous output decisions. In this study, we adapt and introduce two geometric metrics, density and coverage, and evaluate their use in detecting adversarial samples in batches of unseen data. We empirically study these metrics using MNIST and two real-world biomedical datasets from MedMNIST, subjected to two different adversarial attacks. Our experiments show promising results for both metrics to detect adversarial examples. We believe that this work can lay the ground for further study on these metrics' use in deployed machine learning systems to monitor for possible attacks by adversarial examples or related pathologies such as dataset shift.
    SafeRL-Kit: Evaluating Efficient Reinforcement Learning Methods for Safe Autonomous Driving. (arXiv:2206.08528v1 [cs.LG])
    Safe reinforcement learning (RL) has achieved significant success on risk-sensitive tasks and shown promise in autonomous driving (AD) as well. Considering the distinctiveness of this community, efficient and reproducible baselines are still lacking for safe AD. In this paper, we release SafeRL-Kit to benchmark safe RL methods for AD-oriented tasks. Concretely, SafeRL-Kit contains several latest algorithms specific to zero-constraint-violation tasks, including Safety Layer, Recovery RL, off-policy Lagrangian method, and Feasible Actor-Critic. In addition to existing approaches, we propose a novel first-order method named Exact Penalty Optimization (EPO) and sufficiently demonstrate its capability in safe AD. All algorithms in SafeRL-Kit are implemented (i) under the off-policy setting, which improves sample efficiency and can better leverage past logs; (ii) with a unified learning framework, providing off-the-shelf interfaces for researchers to incorporate their domain-specific knowledge into fundamental safe RL methods. Conclusively, we conduct a comparative evaluation of the above algorithms in SafeRL-Kit and shed light on their efficacy for safe autonomous driving. The source code is available at https://github.com/zlr20/saferl_kit.
    On Integrating Prior Knowledge into Gaussian Processes for Prognostic Health Monitoring. (arXiv:2206.08600v1 [stat.ML])
    Gaussian process regression is a powerful method for predicting states based on given data. It has been successfully applied for probabilistic predictions of structural systems to quantify, for example, the crack growth in mechanical structures. Typically, predefined mean and covariance functions are employed to construct the Gaussian process model. Then, the model is updated using current data during operation while prior information based on previous data is ignored. However, predefined mean and covariance functions without prior information reduce the potential of Gaussian processes. This paper proposes a method to improve the predictive capabilities of Gaussian processes. We integrate prior knowledge by deriving the mean and covariance functions from previous data. More specifically, we first approximate previous data by a weighted sum of basis functions and then derive the mean and covariance functions directly from the estimated weight coefficients. Basis functions may be either estimated or derived from problem-specific governing equations to incorporate physical information. The applicability and effectiveness of this approach are demonstrated for fatigue crack growth, laser degradation, and milling machine wear data. We show that well-chosen mean and covariance functions, like those based on previous data, significantly increase look-ahead time and accuracy. Using physical basis functions further improves accuracy. In addition, computation effort for training is significantly reduced.
    All Mistakes Are Not Equal: Comprehensive Hierarchy Aware Multi-label Predictions (CHAMP). (arXiv:2206.08653v1 [cs.LG])
    This paper considers the problem of Hierarchical Multi-Label Classification (HMC), where (i) several labels can be present for each example, and (ii) labels are related via a domain-specific hierarchy tree. Guided by the intuition that all mistakes are not equal, we present Comprehensive Hierarchy Aware Multi-label Predictions (CHAMP), a framework that penalizes a misprediction depending on its severity as per the hierarchy tree. While there have been works that apply such an idea to single-label classification, to the best of our knowledge, there are limited such works for multilabel classification focusing on the severity of mistakes. The key reason is that there is no clear way of quantifying the severity of a misprediction a priori in the multilabel setting. In this work, we propose a simple but effective metric to quantify the severity of a mistake in HMC, naturally leading to CHAMP. Extensive experiments on six public HMC datasets across modalities (image, audio, and text) demonstrate that incorporating hierarchical information leads to substantial gains as CHAMP improves both AUPRC (2.6% median percentage improvement) and hierarchical metrics (2.85% median percentage improvement), over stand-alone hierarchical or multilabel classification methods. Compared to standard multilabel baselines, CHAMP provides improved AUPRC in both robustness (8.87% mean percentage improvement) and less data regimes. Further, our method provides a framework to enhance existing multilabel classification algorithms with better mistakes (18.1% mean percentage increment).
    Strategic Representation. (arXiv:2206.08542v1 [cs.LG])
    Humans have come to rely on machines for reducing excessive information to manageable representations. But this reliance can be abused -- strategic machines might craft representations that manipulate their users. How can a user make good choices based on strategic representations? We formalize this as a learning problem, and pursue algorithms for decision-making that are robust to manipulation. In our main setting of interest, the system represents attributes of an item to the user, who then decides whether or not to consume. We model this interaction through the lens of strategic classification (Hardt et al. 2016), reversed: the user, who learns, plays first; and the system, which responds, plays second. The system must respond with representations that reveal `nothing but the truth' but need not reveal the entire truth. Thus, the user faces the problem of learning set functions under strategic subset selection, which presents distinct algorithmic and statistical challenges. Our main result is a learning algorithm that minimizes error despite strategic representations, and our theoretical analysis sheds light on the trade-off between learning effort and susceptibility to manipulation.
    Reconstructing vehicles from orthographic drawings using deep neural networks. (arXiv:2206.08789v1 [cs.CV])
    This paper explores the current state-of-the-art of object reconstruction from multiple orthographic drawings using deep neural networks. It proposes two algorithms to extract multiple views from a single image. The paper proposes a system based on pixel-aligned implicit functions (PIFu) and develops an advanced sampling strategy to generate signed distance samples. It also compares this approach to depth map regression from multiple views. Additionally, the paper uses a novel dataset for vehicle reconstruction from the racing game Assetto Corsa, which features higher quality models than the commonly used ShapeNET dataset. The trained neural network generalizes well to real-world inputs and creates plausible and detailed reconstructions.  ( 2 min )
    Accelerating Shapley Explanation via Contributive Cooperator Selection. (arXiv:2206.08529v1 [cs.LG])
    Even though the Shapley value provides an effective explanation for a DNN model prediction, the computation relies on the enumeration of all possible input feature coalitions, which leads to exponentially growing complexity. To address this problem, we propose a novel method SHEAR to significantly accelerate the Shapley explanation for DNN models, where only a few coalitions of input features are involved in the computation. The selection of the feature coalitions follows our proposed Shapley chain rule to minimize the absolute error from the ground-truth Shapley values, such that the computation can be both efficient and accurate. To demonstrate the effectiveness, we comprehensively evaluate SHEAR across multiple metrics including the absolute error from the ground-truth Shapley value, the faithfulness of the explanations, and running speed. The experimental results indicate SHEAR consistently outperforms state-of-the-art baseline methods across different evaluation metrics, which demonstrates its potential in real-world applications where the computational resource is limited.
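    The exponential cost mentioned in the abstract is easy to see in a direct implementation of the exact Shapley value, which visits every coalition of the other features for each feature. The sketch below shows that baseline computation only; it is not SHEAR, and the toy value function and its numbers are hypothetical.

```python
from itertools import combinations
from math import factorial

def exact_shapley(value_fn, num_features):
    """Exact Shapley values; each feature needs 2**(n-1) coalition evaluations."""
    n = num_features
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of feature i to coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

def toy_value(coalition):
    """Additive payoffs plus an interaction bonus when features 0 and 1 co-occur."""
    base = {0: 1.0, 1: 2.0, 2: 3.0}
    total = sum(base[j] for j in coalition)
    if 0 in coalition and 1 in coalition:
        total += 1.0
    return total

phi = exact_shapley(toy_value, 3)
```

    The interaction bonus is split equally between features 0 and 1, and the values sum to the payoff of the full coalition. With n features the inner loops touch 2^(n-1) coalitions per feature, which is what approximation methods aim to avoid.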
    Plotly-Resampler: Effective Visual Analytics for Large Time Series. (arXiv:2206.08703v1 [cs.HC])
    Visual analytics is arguably the most important step in getting acquainted with your data. This is especially the case for time series, as this data type is hard to describe and cannot be fully understood when using for example summary statistics. To realize effective time series visualization, four requirements have to be met; a tool should be (1) interactive, (2) scalable to millions of data points, (3) integrable in conventional data science environments, and (4) highly configurable. We observe that open source Python visualization toolkits empower data scientists in most visual analytics tasks, but lack the combination of scalability and interactivity to realize effective time series visualization. As a means to facilitate these requirements, we created Plotly-Resampler, an open source Python library. Plotly-Resampler is an add-on for Plotly's Python bindings, enhancing line chart scalability on top of an interactive toolkit by aggregating the underlying data depending on the current graph view. Plotly-Resampler is built to be snappy, as the reactivity of a tool qualitatively affects how analysts visually explore and analyze data. A benchmark task highlights how our toolkit scales better than alternatives in terms of number of samples and time series. Additionally, Plotly-Resampler's flexible data aggregation functionality paves the path towards researching novel aggregation techniques. Plotly-Resampler's integrability, together with its configurability, convenience, and high scalability, makes it possible to effectively analyze high-frequency data in your day-to-day Python environment.
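    The idea of aggregating the underlying data depending on the current graph view can be sketched with a simple min-max downsampler that is re-run whenever the visible x-range changes. This is an illustration of the general technique only: the function below is hypothetical, it is not Plotly-Resampler's API, and min-max binning is just one possible aggregation choice.

```python
def minmax_downsample(xs, ys, x_min, x_max, num_bins):
    """Keep the lowest and highest point of each bin inside the visible x-range."""
    visible = [(x, y) for x, y in zip(xs, ys) if x_min <= x <= x_max]
    if len(visible) <= 2 * num_bins:
        return visible                      # few enough points: show them all
    bin_size = len(visible) / num_bins
    out = []
    for b in range(num_bins):
        chunk = visible[int(b * bin_size):int((b + 1) * bin_size)]
        lo = min(chunk, key=lambda p: p[1])
        hi = max(chunk, key=lambda p: p[1])
        out.extend(sorted({lo, hi}))        # keep x order, drop a duplicate point
    return out

# Synthetic series with 1000 samples.
xs = list(range(1000))
ys = [(x * 37) % 101 for x in xs]

overview = minmax_downsample(xs, ys, 0, 999, num_bins=10)   # zoomed out
zoomed = minmax_downsample(xs, ys, 100, 119, num_bins=10)   # zoomed in
```

    Zoomed out, the 1000 samples collapse to 20 points that still contain the global extremes; zoomed in on 20 samples, every point is returned unchanged, so detail reappears as the view narrows.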
    The Open Catalyst 2022 (OC22) Dataset and Challenges for Oxide Electrocatalysis. (arXiv:2206.08917v1 [cond-mat.mtrl-sci])
    Computational catalysis and machine learning communities have made considerable progress in developing machine learning models for catalyst discovery and design. Yet, a general machine learning potential that spans the chemical space of catalysis is still out of reach. A significant hurdle is obtaining access to training data across a wide range of materials. One important class of materials where data is lacking are oxides, which inhibits models from studying the Oxygen Evolution Reaction and oxide electrocatalysis more generally. To address this we developed the Open Catalyst 2022 (OC22) dataset, consisting of 62,521 Density Functional Theory (DFT) relaxations (~9,884,504 single point calculations) across a range of oxide materials, coverages, and adsorbates (*H, *O, *N, *C, *OOH, *OH, *OH2, *O2, *CO). We define generalized tasks to predict the total system energy that are applicable across catalysis, develop baseline performance of several graph neural networks (SchNet, DimeNet++, ForceNet, SpinConv, PaiNN, GemNet-dT, GemNet-OC), and provide pre-defined dataset splits to establish clear benchmarks for future efforts. For all tasks, we study whether combining datasets leads to better results, even if they contain different materials or adsorbates. Specifically, we jointly train models on Open Catalyst 2020 (OC20) Dataset and OC22, or fine-tune pretrained OC20 models on OC22. In the most general task, GemNet-OC sees a ~32% improvement in energy predictions through fine-tuning and a ~9% improvement in force predictions via joint training. Surprisingly, joint training on both the OC20 and much smaller OC22 datasets also improves total energy predictions on OC20 by ~19%. The dataset and baseline models are open sourced, and a public leaderboard will follow to encourage continued community developments on the total energy tasks and data.
    Orthonormal Expansions for Translation-Invariant Kernels. (arXiv:2206.08648v1 [math.CA])
    We present a general Fourier analytic technique for constructing orthonormal basis expansions of translation-invariant kernels from orthonormal bases of $\mathscr{L}_2(\mathbb{R})$. This allows us to derive explicit expansions on the real line for (i) Matérn kernels of all half-integer orders in terms of associated Laguerre functions, (ii) the Cauchy kernel in terms of rational functions, and (iii) the Gaussian kernel in terms of Hermite functions.
    Towards Human-Level Bimanual Dexterous Manipulation with Reinforcement Learning. (arXiv:2206.08686v1 [cs.RO])
    Achieving human-level dexterity is an important open problem in robotics. However, tasks of dexterous hand manipulation, even at the baby level, are challenging to solve through reinforcement learning (RL). The difficulty lies in the high degrees of freedom and the required cooperation among heterogeneous agents (e.g., joints of fingers). In this study, we propose the Bimanual Dexterous Hands Benchmark (Bi-DexHands), a simulator that involves two dexterous hands with tens of bimanual manipulation tasks and thousands of target objects. Specifically, tasks in Bi-DexHands are designed to match different levels of human motor skills according to cognitive science literature. We built Bi-DexHands in Isaac Gym; this enables highly efficient RL training, reaching 30,000+ FPS with a single NVIDIA RTX 3090. We provide a comprehensive benchmark for popular RL algorithms under different settings; this includes Single-agent/Multi-agent RL, Offline RL, Multi-task RL, and Meta RL. Our results show that the PPO type of on-policy algorithms can master simple manipulation tasks that are equivalent up to 48-month human babies (e.g., catching a flying object, opening a bottle), while multi-agent RL can further help to master manipulations that require skilled bimanual cooperation (e.g., lifting a pot, stacking blocks). Despite the success on each single task, when it comes to acquiring multiple manipulation skills, existing RL algorithms fail to work in most of the multi-task and the few-shot learning settings, which calls for more substantial development from the RL community. Our project is open sourced at https://github.com/PKU-MARL/DexterousHands.
    TLETA: Deep Transfer Learning and Integrated Cellular Knowledge for Estimated Time of Arrival Prediction. (arXiv:2206.08513v1 [cs.LG])
    Vehicle arrival time prediction has been studied widely. With the emergence of IoT devices and deep learning techniques, estimated time of arrival (ETA) has become a critical component in intelligent transportation systems. Though many tools exist for ETA, ETA for special vehicles, such as ambulances and fire engines, is still challenging due to the limited amount of traffic data for special vehicles. Existing works use one model for all types of vehicles, which can lead to low accuracy. To tackle this, as the first in the field, we propose a deep transfer learning framework, TLETA, for driving time prediction. TLETA constructs cellular spatial-temporal knowledge grids for extracting driving patterns, combined with the road network structure embedding to build a deep neural network for ETA. TLETA contains transferable layers to support knowledge transfer between different categories of vehicles. Importantly, our transfer models only train the last layers to map the transferred knowledge, which reduces the training time significantly. The experimental studies show that our model predicts travel time with high accuracy and outperforms many state-of-the-art approaches.
    Learning Fair Representation via Distributional Contrastive Disentanglement. (arXiv:2206.08743v1 [cs.LG])
    Learning fair representation is crucial for achieving fairness or debiasing sensitive information. Most existing works rely on adversarial representation learning to inject some invariance into representation. However, adversarial learning methods are known to suffer from relatively unstable training, and this might harm the balance between fairness and predictiveness of representation. We propose a new approach, learning FAir Representation via distributional CONtrastive Variational AutoEncoder (FarconVAE), which induces the latent space to be disentangled into sensitive and nonsensitive parts. We first construct the pair of observations with different sensitive attributes but with the same labels. Then, FarconVAE enforces each non-sensitive latent to be closer, while sensitive latents to be far from each other and also far from the non-sensitive latent by contrasting their distributions. We provide a new type of contrastive loss motivated by Gaussian and Student-t kernels for distributional contrastive learning with theoretical analysis. Besides, we adopt a new swap-reconstruction loss to boost the disentanglement further. FarconVAE shows superior performance on fairness, pretrained model debiasing, and domain generalization tasks from various modalities, including tabular, image, and text.
    Digital Twin Data Modelling by Randomized Orthogonal Decomposition and Deep Learning. (arXiv:2206.08659v1 [math.NA])
    A digital twin is a surrogate model whose main feature is to mirror the behavior of the original process. Associating the dynamical process with a digital twin model of reduced complexity has the significant advantage of mapping the dynamics with high accuracy, at reduced cost in CPU time and hardware, over timescales on which the process changes significantly and is therefore difficult to explore. This paper introduces a new framework for creating efficient digital twin models of fluid flows. We introduce a novel algorithm that combines the advantages of Krylov-based dynamic mode decomposition with proper orthogonal decomposition and improves the selection of the most influential modes. We prove that the randomized orthogonal decomposition algorithm provides several advantages over SVD empirical orthogonal decomposition methods and mitigates the projection error by formulating a multiobjective optimization problem. We use state-of-the-art deep learning (DL) to perform a real-time adaptive calibration of the digital twin model, with increasing fidelity. The output is a high-fidelity digital twin data model of the fluid flow dynamics, with the advantage of a reduced complexity. The new modelling tools are investigated in the numerical simulation of three wave phenomena with increasing complexity. We show that the outputs are consistent with the original source data. We perform a thorough assessment of the performance of the new digital twin data models, in terms of numerical accuracy and computational efficiency, including a time simulation response feature study.
    Prediction of Solar Radiation Based on Spatial and Temporal Embeddings for Solar Generation Forecast. (arXiv:2206.08832v1 [cs.LG])
    A novel method is proposed for real-time solar generation forecasting using weather data that exploits both spatial and temporal structural dependencies. The network observed over time is projected to a lower-dimensional representation where a variety of weather measurements are used to train a structured regression model, while the weather forecast is used at the inference stage. Experiments were conducted at 288 locations in the San Antonio, TX area on data obtained from the National Solar Radiation Database. The model predicts solar irradiance with good accuracy ($R^2$ of 0.91 for the summer, 0.85 for the winter, and 0.89 for the global model). The best accuracy was obtained by the Random Forest Regressor. Multiple experiments were conducted to characterize the influence of missing data and different time horizons, providing evidence that the new algorithm is robust to data missing not only completely at random but also when the missingness mechanism is spatial and temporal.
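    The accuracy figures quoted in this abstract are presumably the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. As a reading aid, here is a minimal sketch of how that score is computed; the function name `r2_score` and the sample values are illustrative assumptions and do not come from the paper.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Made-up irradiance-like values, purely for illustration.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
score = r2_score(observed, predicted)
```

    A score of 1 means perfect prediction, and a score of 0 means the model does no better than predicting the mean of the observations.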
    Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning. (arXiv:2206.08657v1 [cs.CV])
    Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a cross-modal encoder, or feed the last-layer uni-modal features directly into the top cross-modal encoder, ignoring the semantic information at the different levels in the deep uni-modal encoders. Both approaches possibly restrict vision-language representation learning and limit model performance. In this paper, we introduce multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables comprehensive bottom-up interactions between visual and textual representations at different semantic levels, resulting in more effective cross-modal alignment and fusion. Our proposed Bridge-Tower, pre-trained with only $4$M images, achieves state-of-the-art performance on various downstream vision-language tasks. On the VQAv2 test-std set, Bridge-Tower achieves an accuracy of $78.73\%$, outperforming the previous state-of-the-art METER model by $1.09\%$ with the same pre-training data and almost no additional parameters and computational cost. Notably, when further scaling the model, Bridge-Tower achieves an accuracy of $81.15\%$, surpassing models that are pre-trained on orders-of-magnitude larger datasets. Code is available at https://github.com/microsoft/BridgeTower.
    Sparse Double Descent: Where Network Pruning Aggravates Overfitting. (arXiv:2206.08684v1 [cs.LG])
    People usually believe that network pruning not only reduces the computational cost of deep networks, but also prevents overfitting by decreasing model capacity. However, our work surprisingly discovers that network pruning sometimes even aggravates overfitting. We report an unexpected sparse double descent phenomenon that, as we increase model sparsity via network pruning, test performance first gets worse (due to overfitting), then gets better (due to relieved overfitting), and gets worse at last (due to forgetting useful information). While recent studies focused on the deep double descent with respect to model overparameterization, they failed to recognize that sparsity may also cause double descent. In this paper, we have three main contributions. First, we report the novel sparse double descent phenomenon through extensive experiments. Second, for this phenomenon, we propose a novel learning distance interpretation that the curve of $\ell_{2}$ learning distance of sparse models (from initialized parameters to final parameters) may correlate with the sparse double descent curve well and reflect generalization better than minima flatness. Third, in the context of sparse double descent, a winning ticket in the lottery ticket hypothesis surprisingly may not always win.
    Machine Learning-Driven Process of Alumina Ceramics Laser Machining. (arXiv:2206.08747v1 [cs.CE])
    Laser machining is a highly flexible non-contact manufacturing technique that has been employed widely across academia and industry. Due to nonlinear interactions between light and matter, simulation methods are extremely crucial, as they help enhance the machining quality by offering comprehension of the inter-relationships between the laser processing parameters. On the other hand, experimental processing parameter optimization recommends a systematic, and consequently time-consuming, investigation over the available processing parameter space. An intelligent strategy is to employ machine learning (ML) techniques to capture the relationship between picosecond laser machining parameters for finding proper parameter combinations to create the desired cuts on industrial-grade alumina ceramic with deep, smooth and defect-free patterns. Laser parameters such as beam amplitude and frequency, scanner passing speed and the number of passes over the surface, as well as the vertical distance of the scanner from the sample surface, are used for predicting the depth, top width, and bottom width of the engraved channels using ML models. Owing to the complex correlation between laser parameters, it is shown that Neural Networks (NN) are the most efficient in predicting the outputs. Equipped with an ML model that captures the interconnection between laser parameters and the engraved channel dimensions, one can predict the required input parameters to achieve a target channel geometry. This strategy significantly reduces the cost and effort of experimental laser machining during the development phase, without compromising accuracy or performance. The developed techniques can be applied to a wide range of ceramic laser machining processes.
    Fast Lossless Neural Compression with Integer-Only Discrete Flows. (arXiv:2206.08869v1 [cs.LG])
    By applying entropy codecs with learned data distributions, neural compressors have significantly outperformed traditional codecs in terms of compression ratio. However, the high inference latency of neural networks hinders the deployment of neural compressors in practical applications. In this work, we propose Integer-only Discrete Flows (IODF), an efficient neural compressor with integer-only arithmetic. Our work is built upon integer discrete flows, which consists of invertible transformations between discrete random variables. We propose efficient invertible transformations with integer-only arithmetic based on 8-bit quantization. Our invertible transformation is equipped with learnable binary gates to remove redundant filters during inference. We deploy IODF with TensorRT on GPUs, achieving 10x inference speedup compared to the fastest existing neural compressors, while retaining the high compression rates on ImageNet32 and ImageNet64.
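    To make "invertible transformations between discrete random variables" concrete, here is a minimal sketch of an additive coupling on integers, the standard building block of integer discrete flows. It is not the IODF architecture: the `shift` function is a hypothetical stand-in for a learned network, written with integer multiplies and a right shift only to echo the integer-only theme.

```python
def shift(xa):
    """Hypothetical stand-in for a learned network (integer arithmetic only).

    Fixed-point style: integer multiply-adds followed by a right shift,
    which floors deterministically for negative values in Python.
    """
    return (3 * xa * xa - 12 * xa) >> 3


def forward(xa, xb):
    """Additive coupling: keep xa, translate xb by an integer that depends on xa."""
    return xa, xb + shift(xa)


def inverse(za, zb):
    """Exact inverse: za equals xa, so the same shift can be recomputed and subtracted."""
    return za, zb - shift(za)


za, zb = forward(5, -2)
xa, xb = inverse(za, zb)  # recovers (5, -2) exactly
```

    Because the shift is an integer computed from the untouched half, the map is a bijection on integer pairs with no rounding error, which is what a lossless codec needs.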
    DFG-NAS: Deep and Flexible Graph Neural Architecture Search. (arXiv:2206.08582v1 [cs.LG])
    Graph neural networks (GNNs) have been intensively applied to various graph-based applications. Despite their success, manually designing well-behaved GNNs requires immense human expertise, and it is thus inefficient to discover the potentially optimal data-specific GNN architecture. This paper proposes DFG-NAS, a new neural architecture search (NAS) method that enables the automatic search of very deep and flexible GNN architectures. Unlike most existing methods that focus on micro-architectures, DFG-NAS highlights another level of design: the search for macro-architectures on how atomic propagation (\textbf{\texttt{P}}) and transformation (\textbf{\texttt{T}}) operations are integrated and organized into a GNN. To this end, DFG-NAS proposes a novel search space for \textbf{\texttt{P-T}} permutations and combinations based on message-passing dis-aggregation, defines four custom-designed macro-architecture mutations, and employs an evolutionary algorithm to conduct an efficient and effective search. Empirical studies on four node classification tasks demonstrate that DFG-NAS outperforms state-of-the-art manual designs and NAS methods for GNNs.
    Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection. (arXiv:2206.08726v1 [cs.SE])
    Code clones are pairs of code snippets that implement similar functionality. Clone detection is a fundamental branch of automatic source code comprehension, having many applications in refactoring recommendation, plagiarism detection, and code summarization. A particularly interesting case of clone detection is the detection of semantic clones, i.e., code snippets that have the same functionality but significantly differ in implementation. A promising approach to detecting semantic clones is contrastive learning (CL), a machine learning paradigm popular in computer vision but not yet commonly adopted for code processing. Our work aims to evaluate the most popular CL algorithms combined with three source code representations on two tasks. The first task is code clone detection, which we evaluate on the POJ-104 dataset containing implementations of 104 algorithms. The second task is plagiarism detection. To evaluate the models on this task, we introduce CodeTransformator, a tool for transforming source code. We use it to create a dataset that mimics plagiarised code based on competitive programming solutions. We trained nine models for both tasks and compared them with six existing approaches, including traditional tools and modern pre-trained neural models. The results of our evaluation show that the proposed models perform differently across tasks; however, the performance of the graph-based models is generally above the others. Among CL algorithms, SimCLR and SwAV lead to better results, while MoCo is the most robust approach. Our code and trained models are available at https://doi.org/10.5281/zenodo.6360627, https://doi.org/10.5281/zenodo.5596345.
    Federated learning with incremental clustering for heterogeneous data. (arXiv:2206.08752v1 [cs.LG])
    Federated learning enables different parties to collaboratively build a global model under the orchestration of a server while keeping the training data on clients' devices. However, performance is affected when clients have heterogeneous data. To cope with this problem, we assume that despite data heterogeneity, there are groups of clients who have similar data distributions that can be clustered. In previous approaches, in order to cluster clients the server requires clients to send their parameters simultaneously. However, this can be problematic in a context where there is a significant number of participants that may have limited availability. To prevent such a bottleneck, we propose FLIC (Federated Learning with Incremental Clustering), in which the server exploits the updates sent by clients during federated training instead of asking them to send their parameters simultaneously. Hence no additional communications between the server and the clients are necessary other than what classical federated learning requires. We empirically demonstrate for various non-IID cases that our approach successfully splits clients into groups following the same data distributions. We also identify the limitations of FLIC by studying its capability to partition clients at the early stages of the federated learning process efficiently. We further address attacks on models as a form of data heterogeneity and empirically show that FLIC is a robust defense against poisoning attacks even when the proportion of malicious clients is higher than 50\%.
    Fast Population-Based Reinforcement Learning on a Single Machine. (arXiv:2206.08888v1 [cs.LG])
    Training populations of agents has demonstrated great promise in Reinforcement Learning for stabilizing training, improving exploration and asymptotic performance, and generating a diverse set of solutions. However, population-based training is often not considered by practitioners as it is perceived to be either prohibitively slow (when implemented sequentially), or computationally expensive (if agents are trained in parallel on independent accelerators). In this work, we compare implementations and revisit previous studies to show that the judicious use of compilation and vectorization allows population-based training to be performed on a single machine with one accelerator with minimal overhead compared to training a single agent. We also show that, when provided with a few accelerators, our protocols extend to large population sizes for applications such as hyperparameter tuning. We hope that this work and the public release of our code will encourage practitioners to use population-based learning more frequently for their research and applications.
    Fast Finite Width Neural Tangent Kernel. (arXiv:2206.08720v1 [cs.LG])
    The Neural Tangent Kernel (NTK), defined as $\Theta_\theta^f(x_1, x_2) = \left[\partial f(\theta, x_1)\big/\partial \theta\right] \left[\partial f(\theta, x_2)\big/\partial \theta\right]^T$ where $\left[\partial f(\theta, \cdot)\big/\partial \theta\right]$ is a neural network (NN) Jacobian, has emerged as a central object of study in deep learning. In the infinite width limit, the NTK can sometimes be computed analytically and is useful for understanding training and generalization of NN architectures. At finite widths, the NTK is also used to better initialize NNs, compare the conditioning across models, perform architecture search, and do meta-learning. Unfortunately, the finite width NTK is notoriously expensive to compute, which severely limits its practical utility. We perform the first in-depth analysis of the compute and memory requirements for NTK computation in finite width networks. Leveraging the structure of neural networks, we further propose two novel algorithms that change the exponent of the compute and memory requirements of the finite width NTK, dramatically improving efficiency. Our algorithms can be applied in a black box fashion to any differentiable function, including those implementing neural networks. We open-source our implementations within the Neural Tangents package (arXiv:1912.02803) at https://github.com/google/neural-tangents.
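    As a reading aid for the definition above, here is a minimal sketch that evaluates the finite-width NTK directly as a product of Jacobians. It is the naive method, not either of the paper's fast algorithms, and the three-parameter toy network `f` and all parameter values are assumptions made for illustration.

```python
import math


def f(theta, x):
    """Toy scalar network f(theta, x) = v * tanh(w * x + b), with theta = (w, b, v)."""
    w, b, v = theta
    return v * math.tanh(w * x + b)


def jacobian(theta, x):
    """Analytic Jacobian of f with respect to theta: [df/dw, df/db, df/dv]."""
    w, b, v = theta
    t = math.tanh(w * x + b)
    dt = 1.0 - t * t  # derivative of tanh at the pre-activation
    return [v * dt * x, v * dt, t]


def empirical_ntk(theta, x1, x2):
    """Theta(x1, x2) = J(x1) J(x2)^T; a scalar here because f has one output."""
    j1 = jacobian(theta, x1)
    j2 = jacobian(theta, x2)
    return sum(a * b for a, b in zip(j1, j2))


def ntk_matrix(theta, xs):
    """Kernel (Gram) matrix over a batch of inputs."""
    return [[empirical_ntk(theta, a, b) for b in xs] for a in xs]


theta = (0.5, -0.1, 1.2)
xs = [-1.0, 0.0, 2.0]
K = ntk_matrix(theta, xs)
```

    The cost of this direct approach grows with the number of parameters times the number of input pairs (and, for multi-output networks, the squared output size), which is the expense the abstract refers to.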
    TUSK: Task-Agnostic Unsupervised Keypoints. (arXiv:2206.08460v1 [cs.CV])
    Existing unsupervised methods for keypoint learning rely heavily on the assumption that a specific keypoint type (e.g. elbow, digit, abstract geometric shape) appears only once in an image. This greatly limits their applicability, as each instance must be isolated before applying the method, an issue that is never discussed or evaluated. We thus propose a novel method to learn Task-agnostic, UnSupervised Keypoints (TUSK) which can deal with multiple instances. To achieve this, instead of the commonly-used strategy of detecting multiple heatmaps, each dedicated to a specific keypoint type, we use a single heatmap for detection, and enable unsupervised learning of keypoint types through clustering. Specifically, we encode semantics into the keypoints by teaching them to reconstruct images from a sparse set of keypoints and their descriptors, where the descriptors are forced to form distinct clusters in feature space around learned prototypes. This makes our approach amenable to a wider range of tasks than any previous unsupervised keypoint method: we show experiments on multiple-instance detection and classification, object discovery, and landmark detection, all unsupervised, with performance on par with the state of the art, while also being able to deal with multiple instances.
    TKIL: Tangent Kernel Approach for Class Balanced Incremental Learning. (arXiv:2206.08492v1 [cs.LG])
    When learning new tasks in a sequential manner, deep neural networks tend to forget tasks that they previously learned, a phenomenon called catastrophic forgetting. Class incremental learning methods aim to address this problem by keeping a memory of a few exemplars from previously learned tasks, and distilling knowledge from them. However, existing methods struggle to balance the performance across classes since they typically overfit the model to the latest task. In our work, we propose to address these challenges with the introduction of a novel methodology of Tangent Kernel for Incremental Learning (TKIL) that achieves class-balanced performance. The approach preserves the representations across classes and balances the accuracy for each class, and as such achieves better overall accuracy and variance. TKIL approach is based on Neural Tangent Kernel (NTK), which describes the convergence behavior of neural networks as a kernel function in the limit of infinite width. In TKIL, the gradients between feature layers are treated as the distance between the representations of these layers and can be defined as Gradients Tangent Kernel loss (GTK loss) such that it is minimized along with averaging weights. This allows TKIL to automatically identify the task and to quickly adapt to it during inference. Experiments on CIFAR-100 and ImageNet datasets with various incremental learning settings show that these strategies allow TKIL to outperform existing state-of-the-art methods.
    Capturing Actionable Dynamics with Structured Latent Ordinary Differential Equations. (arXiv:2202.12932v2 [stat.ML] UPDATED)
    End-to-end learning of dynamical systems with black-box models, such as neural ordinary differential equations (ODEs), provides a flexible framework for learning dynamics from data without prescribing a mathematical model for the dynamics. Unfortunately, this flexibility comes at the cost of understanding the dynamical system, for which ODEs are used ubiquitously. Further, experimental data are collected under various conditions (inputs), such as treatments, or grouped in some way, such as part of sub-populations. Understanding the effects of these system inputs on system outputs is crucial to have any meaningful model of a dynamical system. To that end, we propose a structured latent ODE model that explicitly captures system input variations within its latent representation. Building on a static latent variable specification, our model learns (independent) stochastic factors of variation for each input to the system, thus separating the effects of the system inputs in the latent space. This approach provides actionable modeling through the controlled generation of time-series data for novel input combinations (or perturbations). Additionally, we propose a flexible approach for quantifying uncertainties, leveraging a quantile regression formulation. Results on challenging biological datasets show consistent improvements over competitive baselines in the controlled generation of observational data and inference of biologically meaningful system inputs.
    Variational Estimators of the Degree-corrected Latent Block Model for Bipartite Networks. (arXiv:2206.08465v1 [stat.ML])
    Biclustering on bipartite graphs is an unsupervised learning task that simultaneously clusters the two types of objects in the graph, for example, users and movies in a movie review dataset. The latent block model (LBM) has been proposed as a model-based tool for biclustering. Biclustering results by the LBM are, however, usually dominated by the row and column sums of the data matrix, i.e., degrees. We propose a degree-corrected latent block model (DC-LBM) to accommodate degree heterogeneity in row and column clusters, which greatly outperforms the classical LBM on the MovieLens dataset and simulated data. We develop an efficient variational expectation-maximization algorithm by observing that the row and column degrees maximize the objective function in the M step given any probability assignment on the cluster labels. We prove the label consistency of the variational estimator under the DC-LBM, which allows the expected graph density to go to zero as long as the average expected degrees of rows and columns go to infinity.
    Reframed GES with a Neural Conditional Dependence Measure. (arXiv:2206.08531v1 [stat.ML])
    In a nonparametric setting, the causal structure is often identifiable only up to Markov equivalence, and for the purpose of causal inference, it is useful to learn a graphical representation of the Markov equivalence class (MEC). In this paper, we revisit the Greedy Equivalence Search (GES) algorithm, which is widely cited as a score-based algorithm for learning the MEC of the underlying causal structure. We observe that in order to make the GES algorithm consistent in a nonparametric setting, it is not necessary to design a scoring metric that evaluates graphs. Instead, it suffices to plug in a consistent estimator of a measure of conditional dependence to guide the search. We therefore present a reframing of the GES algorithm, which is more flexible than the standard score-based version and readily lends itself to the nonparametric setting with a general measure of conditional dependence. In addition, we propose a neural conditional dependence (NCD) measure, which utilizes the expressive power of deep neural networks to characterize conditional independence in a nonparametric manner. We establish the optimality of the reframed GES algorithm under standard assumptions and the consistency of using our NCD estimator to decide conditional independence. Together these results justify the proposed approach. Experimental results demonstrate the effectiveness of our method in causal discovery, as well as the advantages of using our NCD measure over kernel-based measures.
    FiT: Parameter Efficient Few-shot Transfer Learning for Personalized and Federated Image Classification. (arXiv:2206.08671v1 [stat.ML])
    Modern deep learning systems are increasingly deployed in situations such as personalization and federated learning where it is necessary to support i) learning on small amounts of data, and ii) communication efficient distributed training protocols. In this work we develop FiLM Transfer (FiT) which fulfills these requirements in the image classification setting. FiT uses an automatically configured Naive Bayes classifier on top of a fixed backbone that has been pretrained on large image datasets. Parameter efficient FiLM layers are used to modulate the backbone, shaping the representation for the downstream task. The network is trained via an episodic fine-tuning protocol. The approach is parameter efficient which is key for enabling few-shot learning, inexpensive model updates for personalization, and communication efficient federated learning. We experiment with FiT on a wide range of downstream datasets and show that it achieves better classification accuracy than the state-of-the-art Big Transfer (BiT) algorithm at low-shot and on the challenging VTAB-1k benchmark, with fewer than 1% of the updateable parameters. Finally, we demonstrate the parameter efficiency of FiT in distributed low-shot applications including model personalization and federated learning where model update size is an important performance metric.
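    For readers unfamiliar with FiLM (feature-wise linear modulation), here is a minimal sketch of the operation: each channel of a feature map is scaled by its own $\gamma$ and shifted by its own $\beta$. How FiT places and trains these layers is not specified in the abstract, so the function `film`, the list-of-lists layout, and the sample values are illustrative assumptions.

```python
def film(features, gamma, beta):
    """Apply a per-channel affine map: out[c][i] = gamma[c] * features[c][i] + beta[c].

    features: list of channels, each a list of activations.
    gamma, beta: one scale and one shift per channel (the only learned parameters).
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("need exactly one gamma and one beta per channel")
    return [
        [g * value + b for value in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 channels, 2 activations each
gamma = [1.0, 0.5, 2.0]
beta = [0.0, 1.0, -1.0]
out = film(features, gamma, beta)
# out == [[1.0, 2.0], [2.5, 3.0], [9.0, 11.0]]
```

    A layer with C channels adds only 2C parameters regardless of spatial size, which is the sense in which modulating a frozen backbone this way is parameter efficient.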
    A Parametric Class of Approximate Gradient Updates for Policy Optimization. (arXiv:2206.08499v1 [cs.LG])
    Approaches to policy optimization have been motivated from diverse principles, based on how the parametric model is interpreted (e.g. value versus policy representation) or how the learning objective is formulated, yet they share a common goal of maximizing expected return. To better capture the commonalities and identify key differences between policy optimization methods, we develop a unified perspective that re-expresses the underlying updates in terms of a limited choice of gradient form and scaling function. In particular, we identify a parameterized space of approximate gradient updates for policy optimization that is highly structured, yet covers both classical and recent examples, including PPO. As a result, we obtain novel yet well motivated updates that generalize existing algorithms in a way that can deliver benefits both in terms of convergence speed and final result quality. An experimental investigation demonstrates that the additional degrees of freedom provided in the parameterized family of updates can be leveraged to obtain non-trivial improvements both in synthetic domains and on popular deep RL benchmarks.
    Holistic Transformer: A Joint Neural Network for Trajectory Prediction and Decision-Making of Autonomous Vehicles. (arXiv:2206.08809v1 [cs.LG])
    Trajectory prediction and behavioral decision-making are two important tasks for autonomous vehicles that require good understanding of the environmental context; behavioral decisions are better made by referring to the outputs of trajectory predictions. However, most current solutions perform these two tasks separately. Therefore, a joint neural network that combines multiple cues is proposed and named as the holistic transformer to predict trajectories and make behavioral decisions simultaneously. To better explore the intrinsic relationships between cues, the network uses existing knowledge and adopts three kinds of attention mechanisms: the sparse multi-head type for reducing noise impact, feature selection sparse type for optimally using partial prior knowledge, and multi-head with sigmoid activation type for optimally using posteriori knowledge. Compared with other trajectory prediction models, the proposed model has better comprehensive performance and good interpretability. Perceptual noise robustness experiments demonstrate that the proposed model has good noise robustness. Thus, simultaneous trajectory prediction and behavioral decision-making combining multiple cues can reduce computational costs and enhance semantic relationships between scenes and agents.
    A Theoretical Analysis on Independence-driven Importance Weighting for Covariate-shift Generalization. (arXiv:2111.02355v2 [cs.LG] UPDATED)
    Covariate-shift generalization, a typical case in out-of-distribution (OOD) generalization, requires a good performance on the unknown test distribution, which varies from the accessible training distribution in the form of covariate shift. Recently, independence-driven importance weighting algorithms in stable learning literature have shown empirical effectiveness to deal with covariate-shift generalization on several learning models, including regression algorithms and deep neural networks, while their theoretical analyses are missing. In this paper, we theoretically prove the effectiveness of such algorithms by explaining them as feature selection processes. We first specify a set of variables, named minimal stable variable set, that is the minimal and optimal set of variables to deal with covariate-shift generalization for common loss functions, such as the mean squared loss and binary cross-entropy loss. Afterward, we prove that under ideal conditions, independence-driven importance weighting algorithms could identify the variables in this set. Analysis of asymptotic properties is also provided. These theories are further validated in several synthetic experiments.
    Beyond Worst-Case Analysis in Stochastic Approximation: Moment Estimation Improves Instance Complexity. (arXiv:2006.04429v3 [math.OC] UPDATED)
    We study oracle complexity of gradient based methods for stochastic approximation problems. Though in many settings optimal algorithms and tight lower bounds are known for such problems, these optimal algorithms do not achieve the best performance when used in practice. We address this theory-practice gap by focusing on instance-dependent complexity instead of worst case complexity. In particular, we first summarize known instance-dependent complexity results and categorize them into three levels. We identify the domination relation between different levels and propose a fourth instance-dependent bound that dominates existing ones. We then provide a sufficient condition according to which an adaptive algorithm with moment estimation can achieve the proposed bound without knowledge of noise levels. Our proposed algorithm and its analysis provide a theoretical justification for the success of moment estimation as it achieves improved instance complexity.
    Online Algorithms with Multiple Predictions. (arXiv:2205.03921v2 [cs.LG] UPDATED)
    This paper studies online algorithms augmented with multiple machine-learned predictions. While online algorithms augmented with a single prediction have been extensively studied in recent years, the literature for the multiple predictions setting is sparse. In this paper, we give a generic algorithmic framework for online covering problems with multiple predictions that obtains an online solution that is competitive against the performance of the best predictor. Our algorithm incorporates the use of predictions in the classic potential-based analysis of online algorithms. We apply our algorithmic framework to solve classical problems such as online set cover, (weighted) caching, and online facility location in the multiple predictions setting. Our algorithm can also be robustified, i.e., the algorithm can be simultaneously made competitive against the best prediction and the performance of the best online algorithm (without prediction).
    Near-Optimal No-Regret Learning for General Convex Games. (arXiv:2206.08742v1 [cs.GT])
    A recent line of work has established uncoupled learning dynamics such that, when employed by all players in a game, each player's \emph{regret} after $T$ repetitions grows polylogarithmically in $T$, an exponential improvement over the traditional guarantees within the no-regret framework. However, so far these results have only been limited to certain classes of games with structured strategy spaces -- such as normal-form and extensive-form games. The question as to whether $O(\text{polylog} T)$ regret bounds can be obtained for general convex and compact strategy sets -- which occur in many fundamental models in economics and multiagent systems -- while retaining efficient strategy updates is an important question. In this paper, we answer this in the positive by establishing the first uncoupled learning algorithm with $O(\log T)$ per-player regret in general \emph{convex games}, that is, games with concave utility functions supported on arbitrary convex and compact strategy sets. Our learning dynamics are based on an instantiation of optimistic follow-the-regularized-leader over an appropriately \emph{lifted} space using a \emph{self-concordant regularizer} that is, peculiarly, not a barrier for the feasible region. Further, our learning dynamics are efficiently implementable given access to a proximal oracle for the convex strategy set, leading to $O(\log\log T)$ per-iteration complexity; we also give extensions when access to only a \emph{linear} optimization oracle is assumed. Finally, we adapt our dynamics to guarantee $O(\sqrt{T})$ regret in the adversarial regime. Even in those special cases where prior results apply, our algorithm improves over the state-of-the-art regret bounds either in terms of the dependence on the number of iterations or on the dimension of the strategy sets.
    Random projections and Kernelised Leave One Cluster Out Cross-Validation: Universal baselines and evaluation tools for supervised machine learning for materials properties. (arXiv:2206.08841v1 [cs.LG])
    With machine learning being a popular topic in current computational materials science literature, creating representations for compounds has become commonplace. These representations are rarely compared, as evaluating their performance - and the performance of the algorithms that they are used with - is non-trivial. With many materials datasets containing bias and skew caused by the research process, leave one cluster out cross validation (LOCO-CV) has been introduced as a way of measuring the performance of an algorithm in predicting previously unseen groups of materials. This raises the question of the impact, and control, of the range of cluster sizes on the LOCO-CV measurement outcomes. We present a thorough comparison between composition-based representations, and investigate how kernel approximation functions can be used to better separate data to enhance LOCO-CV applications. We find that domain knowledge does not improve machine learning performance in most tasks tested, with band gap prediction being the notable exception. We also find that the radial basis function improves the linear separability of chemical datasets in all 10 datasets tested and provide a framework for the application of this function in the LOCO-CV process to improve the outcome of LOCO-CV measurements regardless of machine learning algorithm, choice of metric, and choice of compound representation. We recommend kernelised LOCO-CV as a training paradigm for those looking to measure the extrapolatory power of an algorithm on materials data.
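The leave-one-cluster-out idea described above can be sketched as a split generator. This is only an illustrative sketch: the function name `loco_splits` is hypothetical, and the cluster labels are assumed to come from some clustering step that is not shown here (the abstract does not specify one).

```python
def loco_splits(cluster_labels):
    """Yield (held_out_cluster, train_indices, test_indices), one split per cluster.

    Each cluster is held out once as the test set, so a model is always
    evaluated on a group of samples it has not seen during training.
    """
    for held_out in sorted(set(cluster_labels)):
        train = [i for i, c in enumerate(cluster_labels) if c != held_out]
        test = [i for i, c in enumerate(cluster_labels) if c == held_out]
        yield held_out, train, test


# Hypothetical cluster assignment for five compounds.
labels = [0, 0, 1, 2, 1]
splits = list(loco_splits(labels))
```

Because the test-set size equals the size of the held-out cluster, uneven cluster sizes translate directly into uneven fold sizes, which is the measurement issue the abstract raises.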
    Optimizing Sequential Experimental Design with Deep Reinforcement Learning. (arXiv:2202.00821v3 [cs.LG] UPDATED)
    Bayesian approaches developed to solve the optimal design of sequential experiments are mathematically elegant but computationally challenging. Recently, techniques using amortization have been proposed to make these Bayesian approaches practical, by training a parameterized policy that proposes designs efficiently at deployment time. However, these methods may not sufficiently explore the design space, require access to a differentiable probabilistic model and can only optimize over continuous design spaces. Here, we address these limitations by showing that the problem of optimizing policies can be reduced to solving a Markov decision process (MDP). We solve the equivalent MDP with modern deep reinforcement learning techniques. Our experiments show that our approach is also computationally efficient at deployment time and exhibits state-of-the-art performance on both continuous and discrete design spaces, even when the probabilistic model is a black box.
    Distribution Regression with Sliced Wasserstein Kernels. (arXiv:2202.03926v2 [stat.ML] UPDATED)
    The problem of learning functions over spaces of probabilities - or distribution regression - is gaining significant interest in the machine learning community. A key challenge behind this problem is to identify a suitable representation capturing all relevant properties of the underlying functional mapping. A principled approach to distribution regression is provided by kernel mean embeddings, which lift kernel-induced similarity on the input domain at the probability level. This strategy effectively tackles the two-stage sampling nature of the problem, enabling one to derive estimators with strong statistical guarantees, such as universal consistency and excess risk bounds. However, kernel mean embeddings implicitly hinge on the maximum mean discrepancy (MMD), a metric on probabilities, which may fail to capture key geometrical relations between distributions. In contrast, optimal transport (OT) metrics are potentially more appealing. In this work, we propose an OT-based estimator for distribution regression. We build on the Sliced Wasserstein distance to obtain an OT-based representation. We study the theoretical properties of a kernel ridge regression estimator based on such representation, for which we prove universal consistency and excess risk bounds. Preliminary experiments complement our theoretical findings by showing the effectiveness of the proposed approach and comparing it with MMD-based estimators.
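As a rough illustration of the Sliced Wasserstein distance the abstract builds on, the sketch below gives a Monte Carlo estimate between two empirical samples. It assumes equal sample sizes so that the 1D optimal transport plan is simply "sort and match"; the function name, the number of projections, and the seed are illustrative choices, not taken from the paper.

```python
import math
import random


def sliced_wasserstein2(xs, ys, n_projections=200, seed=0):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance.

    Assumes len(xs) == len(ys), so each 1D transport problem is solved
    by matching the sorted projections of the two samples.
    """
    assert len(xs) == len(ys), "this sketch assumes equal sample sizes"
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a random direction on the unit sphere.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        # Project both samples onto the direction and sort.
        px = sorted(sum(a * b for a, b in zip(x, direction)) for x in xs)
        py = sorted(sum(a * b for a, b in zip(y, direction)) for y in ys)
        # Squared 1D Wasserstein-2 distance between the sorted projections.
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return math.sqrt(total / n_projections)


sample = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
shifted = [[x + 3.0, y] for x, y in sample]
same = sliced_wasserstein2(sample, sample)
apart = sliced_wasserstein2(sample, shifted)
```

Each projection reduces the problem to one dimension, where optimal transport has a closed form, which is what makes the sliced variant cheap compared with the full OT distance.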
    Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering. (arXiv:2204.09634v2 [cs.SD] UPDATED)
    Audio question answering (AQA) is a multimodal translation task where a system analyzes an audio signal and a natural language question, to generate a desirable natural language answer. In this paper, we introduce Clotho-AQA, a dataset for audio question answering consisting of 1991 audio files, each between 15 and 30 seconds in duration, selected from the Clotho dataset. For each audio file, we collect six different questions and corresponding answers by crowdsourcing using Amazon Mechanical Turk. The questions and answers are produced by different annotators. Out of the six questions for each audio, two questions each are designed to have 'yes' and 'no' as answers, while the remaining two questions have other single-word answers. For each question, we collect answers from three different annotators. We also present two baseline experiments to describe the usage of our dataset for the AQA task - an LSTM-based multimodal binary classifier for 'yes' or 'no' type answers and an LSTM-based multimodal multi-class classifier for 828 single-word answers. The binary classifier achieved an accuracy of 62.7% and the multi-class classifier achieved a top-1 accuracy of 54.2% and a top-5 accuracy of 93.7%. Clotho-AQA dataset is freely available online at https://zenodo.org/record/6473207.
    MET: Masked Encoding for Tabular Data. (arXiv:2206.08564v1 [cs.LG])
    We consider the task of self-supervised representation learning (SSL) for tabular data: tabular-SSL. Typical contrastive-learning-based SSL methods require instance-wise data augmentations which are difficult to design for unstructured tabular data. Existing tabular-SSL methods design such augmentations in a relatively ad-hoc fashion and can fail to capture the underlying data manifold. Instead of augmentation-based approaches for tabular-SSL, we propose a new reconstruction-based method, called Masked Encoding for Tabular Data (MET), that does not require augmentations. MET is based on the popular MAE approach for vision-SSL [He et al., 2021] and uses two key ideas: (i) since each coordinate in a tabular dataset has a distinct meaning, we need to use separate representations for all coordinates, and (ii) using an adversarial reconstruction loss in addition to the standard one. Empirical results on five diverse tabular datasets show that MET achieves a new state of the art (SOTA) on all of these datasets and improves up to 9% over current SOTA methods. We shed more light on the working of MET via experiments on carefully designed simple datasets.
    How robust are pre-trained models to distribution shift?. (arXiv:2206.08871v1 [cs.LG])
    The vulnerability of machine learning models to spurious correlations has mostly been discussed in the context of supervised learning (SL). However, there is a lack of insight on how spurious correlations affect the performance of popular self-supervised learning (SSL) and auto-encoder based models (AE). In this work, we shed light on this by evaluating the performance of these models on both real world and synthetic distribution shift datasets. Following observations that the linear head itself can be susceptible to spurious correlations, we develop a novel evaluation scheme with the linear head trained on out-of-distribution (OOD) data, to isolate the performance of the pre-trained models from a potential bias of the linear head used for evaluation. With this new methodology, we show that SSL models are consistently more robust to distribution shifts and thus better at OOD generalisation than AE and SL models.
    On Testability of the Front-Door Model via Verma Constraints. (arXiv:2203.00161v2 [stat.ME] UPDATED)
    The front-door criterion can be used to identify and compute causal effects despite the existence of unmeasured confounders between a treatment and outcome. However, the key assumptions -- (i) the existence of a variable (or set of variables) that fully mediates the effect of the treatment on the outcome, and (ii) which simultaneously does not suffer from similar issues of confounding as the treatment-outcome pair -- are often deemed implausible. This paper explores the testability of these assumptions. We show that under mild conditions involving an auxiliary variable, the assumptions encoded in the front-door model (and simple extensions of it) may be tested via generalized equality constraints, a.k.a. Verma constraints. We propose two goodness-of-fit tests based on this observation, and evaluate the efficacy of our proposal on real and synthetic data. We also provide theoretical and empirical comparisons to instrumental variable approaches to handling unmeasured confounding.
    Author Clustering and Topic Estimation for Short Texts. (arXiv:2106.09533v2 [cs.IR] UPDATED)
    Analysis of short text, such as social media posts, is extremely difficult because of their inherent brevity. In addition to classifying topics of such posts, a common downstream task is grouping the authors of these documents for subsequent analyses. We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document, with user-level topic distributions. We also simultaneously cluster users, removing the need for post-hoc cluster estimation and improving topic estimation by shrinking noisy user-level topic distributions towards typical values. Our method performs as well as -- or better than -- traditional approaches, and we demonstrate its usefulness on a dataset of tweets from United States Senators, recovering both meaningful topics and clusters that reflect partisan ideology. We also develop a novel measure of echo chambers among these politicians by characterizing insularity of topics discussed by groups of Senators and provide uncertainty quantification.
    Unsolved Problems in ML Safety. (arXiv:2109.13916v5 [cs.LG] UPDATED)
    Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful technologies, safety for ML should be a leading research priority. In response to emerging safety challenges in ML, such as those introduced by recent large-scale models, we provide a new roadmap for ML Safety and refine the technical problems that the field needs to address. We present four problems ready for research, namely withstanding hazards ("Robustness"), identifying hazards ("Monitoring"), reducing inherent model hazards ("Alignment"), and reducing systemic hazards ("Systemic Safety"). Throughout, we clarify each problem's motivation and provide concrete research directions.
    Out-of-Distribution Detection with Deep Nearest Neighbors. (arXiv:2204.06507v2 [cs.LG] UPDATED)
    Out-of-distribution (OOD) detection is a critical task for deploying machine learning models in the open world. Distance-based methods have demonstrated promise, where testing samples are detected as OOD if they are relatively far away from in-distribution (ID) data. However, prior methods impose a strong distributional assumption of the underlying feature space, which may not always hold. In this paper, we explore the efficacy of non-parametric nearest-neighbor distance for OOD detection, which has been largely overlooked in the literature. Unlike prior works, our method does not impose any distributional assumption, hence providing stronger flexibility and generality. We demonstrate the effectiveness of nearest-neighbor-based OOD detection on several benchmarks and establish superior performance. Under the same model trained on ImageNet-1k, our method substantially reduces the false positive rate (FPR@TPR95) by 24.77% compared to a strong baseline SSD+, which uses a parametric approach Mahalanobis distance in detection. Code is available: https://github.com/deeplearning-wisc/knn-ood.
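The nearest-neighbor scoring idea can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation (see the linked repository for that): the L2 normalization of features, the function names, and the toy data are assumptions made here.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def knn_ood_score(query, train_features, k=1):
    """Distance from the query to its k-th nearest in-distribution feature.

    A larger score means the query is farther from the training data and
    therefore more likely to be out-of-distribution. No distributional
    assumption on the feature space is involved.
    """
    q = l2_normalize(query)
    dists = sorted(math.dist(q, l2_normalize(f)) for f in train_features)
    return dists[k - 1]


# Toy in-distribution features clustered around the direction (1, 0).
train = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.2]]
near_score = knn_ood_score([1.0, 0.1], train, k=2)
far_score = knn_ood_score([-1.0, 0.0], train, k=2)
```

In practice a threshold on the score is chosen on in-distribution validation data, for example so that 95% of in-distribution samples fall below it, which matches the FPR@TPR95 metric quoted above.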
    Importance Sampling Placement in Off-Policy Temporal-Difference Methods. (arXiv:2203.10172v2 [cs.LG] UPDATED)
    A central challenge to applying many off-policy reinforcement learning algorithms to real world problems is the variance introduced by importance sampling. In off-policy learning, the agent learns about a different policy than the one being executed. To account for the difference, importance sampling ratios are often used, but they can increase variance in the algorithms and reduce the rate of learning. Several variations of importance sampling have been proposed to reduce variance, with per-decision importance sampling being the most popular. However, the update rules for most off-policy algorithms in the literature depart from per-decision importance sampling in a subtle way; they correct the entire TD error instead of just the TD target. In this work, we show how this slight change can be interpreted as a control variate for the TD target, reducing variance and improving performance. Experiments over a wide range of algorithms show this subtle modification results in improved performance.
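The two placements of the importance sampling ratio can be compared side by side for a one-step tabular TD(0) update. This is a sketch under assumed notation (`rho` for the ratio, `alpha` for the step size, `gamma` for the discount); the function names are illustrative and not taken from the paper.

```python
def td_update_target_corrected(v, v_next, reward, rho, alpha=0.1, gamma=0.9):
    """Per-decision style: the ratio multiplies only the TD target."""
    return v + alpha * (rho * (reward + gamma * v_next) - v)


def td_update_error_corrected(v, v_next, reward, rho, alpha=0.1, gamma=0.9):
    """Common variant: the ratio multiplies the entire TD error."""
    return v + alpha * rho * (reward + gamma * v_next - v)


v, v_next, reward, rho = 2.0, 1.0, 0.5, 1.8
a = td_update_target_corrected(v, v_next, reward, rho)
b = td_update_error_corrected(v, v_next, reward, rho)
# The two updates differ by alpha * (1 - rho) * v.
gap = b - a
```

Since the expected value of `rho` under the behavior policy is 1, the extra term `alpha * (1 - rho) * v` has zero mean, which is why it can be read as a control variate rather than a bias.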
    Modelling Evolutionary and Stationary User Preferences for Temporal Sets Prediction. (arXiv:2204.05490v4 [cs.LG] UPDATED)
    Given a sequence of sets, where each set is associated with a timestamp and contains an arbitrary number of elements, the task of temporal sets prediction aims to predict the elements in the subsequent set. Previous studies for temporal sets prediction mainly capture each user's evolutionary preference by learning from his/her own sequence. Although insightful, we argue that: 1) the collaborative signals latent in different users' sequences are essential but have not been exploited; 2) users also tend to show stationary preferences, which existing methods fail to consider. To this end, we propose an integrated learning framework to model both the evolutionary and the stationary preferences of users for temporal sets prediction, which first constructs a universal sequence by chronologically arranging all the user-set interactions, and then learns on each user-set interaction. In particular, for each user-set interaction, we first design an evolutionary user preference modelling component to track the user's time-evolving preference and exploit the latent collaborative signals among different users. This component maintains a memory bank to store memories of the related user and elements, and continuously updates their memories based on the currently encoded messages and the past memories. Then, we devise a stationary user preference modelling module to discover each user's personalized characteristics according to the historical sequence, which adaptively aggregates the previously interacted elements from dual perspectives with the guidance of the user's and elements' embeddings. Finally, we develop a set-batch algorithm to improve the model efficiency, which can create time-consistent batches in advance and achieve 3.5x training speedups on average. Experiments on real-world datasets demonstrate the effectiveness and good interpretability of our approach.
    Representational Multiplicity Should Be Exposed, Not Eliminated. (arXiv:2206.08890v1 [cs.LG])
    It is prevalent and well-observed, but poorly understood, that two machine learning models with similar performance during training can have very different real-world performance characteristics. This implies elusive differences in the internals of the models, manifesting as representational multiplicity (RM). We introduce a conceptual and experimental setup for analyzing RM and show that certain training methods systematically result in greater RM than others, measured by activation similarity via singular vector canonical correlation analysis (SVCCA). We further correlate it with predictive multiplicity measured by the variance in i.i.d. and out-of-distribution test set predictions, in four common image data sets. We call for systematic measurement and maximal exposure, not elimination, of RM in models. Qualitative tools such as our confabulator analysis can facilitate understanding and communication of RM effects to stakeholders.
    Scaling multi-species occupancy models to large citizen science datasets. (arXiv:2206.08894v1 [stat.AP])
    Citizen science datasets can be very large and promise to improve species distribution modelling, but detection is imperfect, risking bias when fitting models. In particular, observers may not detect species that are actually present. Occupancy models can estimate and correct for this observation process, and multi-species occupancy models exploit similarities in the observation process, which can improve estimates for rare species. However, the computational methods currently used to fit these models do not scale to large datasets. We develop approximate Bayesian inference methods and use graphics processing units (GPUs) to scale multi-species occupancy models to very large citizen science data. We fit multi-species occupancy models to one month of data from the eBird project consisting of 186,811 checklist records comprising 430 bird species. We evaluate the predictions on a spatially separated test set of 59,338 records, comparing two different inference methods -- Markov chain Monte Carlo (MCMC) and variational inference (VI) -- to occupancy models fitted to each species separately using maximum likelihood. We fitted models to the entire dataset using VI, and up to 32,000 records with MCMC. VI fitted to the entire dataset performed best, outperforming single-species models on both AUC (90.4% compared to 88.7%) and on log likelihood (-0.080 compared to -0.085). We also evaluate how well range maps predicted by the model agree with expert maps. We find that modelling the detection process greatly improves agreement and that the resulting maps agree as closely with expert maps as ones estimated using high quality survey data. Our results demonstrate that multi-species occupancy models are a compelling approach to model large citizen science datasets, and that, once the observation process is taken into account, they can model species distributions accurately.
    Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark. (arXiv:2202.06767v3 [cs.CV] UPDATED)
    Vision-Language Pre-training (VLP) models have shown remarkable performance on various downstream tasks. Their success heavily relies on the scale of pre-trained cross-modal datasets. However, the lack of large-scale datasets and benchmarks in Chinese hinders the development of Chinese VLP models and broader multilingual applications. In this work, we release a large-scale Chinese cross-modal dataset named Wukong, which contains 100 million Chinese image-text pairs collected from the web. Wukong aims to benchmark different multi-modal pre-training methods to facilitate the VLP research and community development. Furthermore, we release a group of models pre-trained with various image encoders (ViT-B/ViT-L/SwinT) and also apply advanced pre-training techniques into VLP such as locked-image text tuning, token-wise similarity in contrastive learning, and reduced-token interaction. Extensive experiments and a benchmarking of different downstream tasks including a new largest human-verified image-text test dataset are also provided. Experiments show that Wukong can serve as a promising Chinese pre-training dataset and benchmark for different cross-modal learning methods. For the zero-shot image classification task on 10 datasets, $Wukong_{ViT-L}$ achieves an average accuracy of 73.03%. For the image-text retrieval task, it achieves a mean recall of 71.6% on AIC-ICC which is 12.9% higher than WenLan 2.0. Also, our Wukong models are benchmarked on downstream tasks with other variants on multiple datasets, e.g., Flickr8K-CN, Flickr-30K-CN, COCO-CN, etc. More information is available at: https://wukong-dataset.github.io/wukong-dataset/.
    A Sparsity-promoting Dictionary Model for Variational Autoencoders. (arXiv:2203.15758v2 [cs.LG] UPDATED)
    Structuring the latent space in probabilistic deep generative models, e.g., variational autoencoders (VAEs), is important to yield more expressive models and interpretable representations, and to avoid overfitting. One way to achieve this objective is to impose a sparsity constraint on the latent variables, e.g., via a Laplace prior. However, such approaches usually complicate the training phase, and they sacrifice the reconstruction quality to promote sparsity. In this paper, we propose a simple yet effective methodology to structure the latent space via a sparsity-promoting dictionary model, which assumes that each latent code can be written as a sparse linear combination of a dictionary's columns. In particular, we leverage a computationally efficient and tuning-free method, which relies on a zero-mean Gaussian latent prior with learnable variances. We derive a variational inference scheme to train the model. Experiments on speech generative modeling demonstrate the advantage of the proposed approach over competing techniques, since it promotes sparsity while not deteriorating the output speech quality.
    Sketching Algorithms and Lower Bounds for Ridge Regression. (arXiv:2204.06653v2 [cs.DS] UPDATED)
    We give a sketching-based iterative algorithm that computes a $1+\varepsilon$ approximate solution for the ridge regression problem $\min_x \|Ax-b\|_2^2 +\lambda\|x\|_2^2$ where $A \in \mathbb{R}^{n \times d}$ with $d \ge n$. Our algorithm, for a constant number of iterations (requiring a constant number of passes over the input), improves upon earlier work (Chowdhury et al.) by requiring that the sketching matrix only has a weaker Approximate Matrix Multiplication (AMM) guarantee that depends on $\varepsilon$, along with a constant subspace embedding guarantee. The earlier work instead requires that the sketching matrix has a subspace embedding guarantee that depends on $\varepsilon$. For example, to produce a $1+\varepsilon$ approximate solution in $1$ iteration, which requires $2$ passes over the input, our algorithm requires the OSNAP embedding to have $m= O(n\sigma^2/\lambda\varepsilon)$ rows with a sparsity parameter $s = O(\log(n))$, whereas the earlier algorithm of Chowdhury et al. with the same number of rows of OSNAP requires a sparsity $s = O(\sqrt{\sigma^2/\lambda\varepsilon} \cdot \log(n))$, where $\sigma = \|A\|_2$ is the spectral norm of the matrix $A$. We also show that this algorithm can be used to give faster algorithms for kernel ridge regression. Finally, we show that the sketch size required for our algorithm is essentially optimal for a natural framework of algorithms for ridge regression by proving lower bounds on oblivious sketching matrices for AMM. The sketch size lower bounds for AMM may be of independent interest.
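For orientation, the exact ridge solution has a standard dual form that is convenient in the $d \ge n$ regime considered above, since it only requires solving an $n \times n$ system. The identity below is textbook linear algebra, stated here as background; it is not the paper's sketching algorithm.

```latex
x^\star
  = \arg\min_x \; \|Ax-b\|_2^2 + \lambda\|x\|_2^2
  = (A^\top A + \lambda I_d)^{-1} A^\top b
  = A^\top (A A^\top + \lambda I_n)^{-1} b,
  \qquad \lambda > 0 .
```

The last equality follows from $A^\top (A A^\top + \lambda I_n) = (A^\top A + \lambda I_d) A^\top$, with both regularized matrices invertible for $\lambda > 0$.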
    EGRU: Event-based GRU for activity-sparse inference and learning. (arXiv:2206.06178v1 [cs.LG] CROSS LISTED)
    The scalability of recurrent neural networks (RNNs) is hindered by the sequential dependence of each time step's computation on the previous time step's output. Therefore, one way to speed up and scale RNNs is to reduce the computation required at each time step independent of model size and task. In this paper, we propose a model that reformulates Gated Recurrent Units (GRU) as an event-based activity-sparse model that we call the Event-based GRU (EGRU), where units compute updates only on receipt of input events (event-based) from other units. When combined with having only a small fraction of the units active at a time (activity-sparse), this model has the potential to be vastly more compute efficient than current RNNs. Notably, activity-sparsity in our model also translates into sparse parameter updates during gradient descent, extending this compute efficiency to the training phase. We show that the EGRU demonstrates competitive performance compared to state-of-the-art recurrent network models in real-world tasks, including language modeling, while maintaining high activity sparsity naturally during inference and training. This sets the stage for the next generation of recurrent networks that are scalable and more suitable for novel neuromorphic hardware.
    How You Start Matters for Generalization. (arXiv:2206.08558v1 [cs.LG])
    Characterizing the remarkable generalization properties of over-parameterized neural networks remains an open problem. In this paper, we promote a shift of focus towards initialization rather than neural architecture or (stochastic) gradient descent to explain this implicit regularization. Through a Fourier lens, we derive a general result for the spectral bias of neural networks and show that the generalization of neural networks is heavily tied to their initialization. Further, we empirically solidify the developed theoretical insights using practical, deep networks. Finally, we make a case against the controversial flat-minima conjecture and show that Fourier analysis grants a more reliable framework for understanding the generalization of neural networks.
    Grounded Language-Image Pre-training. (arXiv:2112.03857v2 [cs.CV] UPDATED)
    This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After being fine-tuned on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals a fully-supervised Dynamic Head. Code is released at https://github.com/microsoft/GLIP.
    BED: A Real-Time Object Detection System for Edge Devices. (arXiv:2202.07503v2 [cs.CV] UPDATED)
    Deploying deep neural networks (DNNs) on edge devices provides efficient and effective solutions for real-world tasks. Edge devices have been used for collecting a large volume of data efficiently in different domains. DNNs have been an effective tool for data processing and analysis. However, designing DNNs on edge devices is challenging due to the limited computational resources and memory. To tackle this challenge, we demonstrate Object Detection System for Edge Devices (BED) on the MAX78000 DNN accelerator. It integrates on-device DNN inference with a camera and an LCD display for image acquisition and detection exhibition, respectively. BED is a concise, effective and detailed solution, including model training, quantization, synthesis and deployment. Experiment results indicate that BED can produce accurate detection with a 300-KB tiny DNN model, which takes only 91.9 ms of inference time and 1.845 mJ of energy.
    Stochastic Perturbations of Tabular Features for Non-Deterministic Inference with Automunge. (arXiv:2202.09248v2 [cs.LG] UPDATED)
    Injecting Gaussian noise into training features is well known to have regularization properties. This paper considers noise injections to numeric or categoric tabular features as passed to inference, which translates inference to a non-deterministic outcome and may have relevance to fairness considerations, adversarial example protection, or other use cases benefiting from non-determinism. We offer the Automunge library for tabular preprocessing as a resource for the practice, which includes options to integrate random sampling or entropy seeding with the support of quantum circuits, representing a new way to channel quantum algorithms into classical learning.
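The idea of perturbing features at inference time can be illustrated without any particular library. The sketch below does not use the Automunge API; the helper names, the Gaussian noise model, the `sigma` value, and the toy linear model are all assumptions made for illustration.

```python
import random


def perturb_numeric(features, sigma=0.1, rng=None):
    """Return a copy of the numeric features with zero-mean Gaussian noise added."""
    if rng is None:
        rng = random.Random()
    return [x + rng.gauss(0.0, sigma) for x in features]


def noisy_predict(model, features, sigma=0.1, rng=None):
    """Run the model on a perturbed copy of the input, so outputs vary per call."""
    return model(perturb_numeric(features, sigma, rng))


def linear_model(row):
    """Hypothetical stand-in for a trained model."""
    return 2.0 * row[0] - row[1]


row = [1.0, 0.5]
pred_a = noisy_predict(linear_model, row, sigma=0.1, rng=random.Random(1))
pred_b = noisy_predict(linear_model, row, sigma=0.1, rng=random.Random(2))
pred_a_again = noisy_predict(linear_model, row, sigma=0.1, rng=random.Random(1))
```

Passing an explicitly seeded generator keeps experiments reproducible, while an external entropy source could be substituted for the generator when non-determinism is the goal.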
    Deep Networks on Toroids: Removing Symmetries Reveals the Structure of Flat Regions in the Landscape Geometry. (arXiv:2202.03038v2 [cs.LG] UPDATED)
    We systematize the approach to the investigation of deep neural network landscapes by basing it on the geometry of the space of implemented functions rather than the space of parameters. Grouping classifiers into equivalence classes, we develop a standardized parameterization in which all symmetries are removed, resulting in a toroidal topology. On this space, we explore the error landscape rather than the loss. This lets us derive a meaningful notion of the flatness of minimizers and of the geodesic paths connecting them. Using different optimization algorithms that sample minimizers with different flatness we study the mode connectivity and relative distances. Testing a variety of state-of-the-art architectures and benchmark datasets, we confirm the correlation between flatness and generalization performance; we further show that in function space flatter minima are closer to each other and that the barriers along the geodesics connecting them are small. We also find that minimizers found by variants of gradient descent can be connected by zero-error paths composed of two straight lines in parameter space, i.e. polygonal chains with a single bend. We observe similar qualitative results in neural networks with binary weights and activations, providing one of the first results concerning the connectivity in this setting. Our results hinge on symmetry removal, and are in remarkable agreement with the rich phenomenology described by some recent analytical studies performed on simple shallow models.
    MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages. (arXiv:2204.08582v2 [cs.CL] UPDATED)
    We present the MASSIVE dataset--Multilingual Amazon SLU resource package (SLURP) for Slot-filling, Intent classification, and Virtual assistant Evaluation. MASSIVE contains 1M realistic, parallel, labeled virtual assistant utterances spanning 51 languages, 18 domains, 60 intents, and 55 slots. MASSIVE was created by tasking professional translators to localize the English-only SLURP dataset into 50 typologically diverse languages from 29 genera. We also present modeling results on XLM-R and mT5, including exact match accuracy, intent classification accuracy, and slot-filling F1 score. We have released our dataset, modeling code, and models publicly.
    Toward Compositional Generalization in Object-Oriented World Modeling. (arXiv:2204.13661v2 [cs.LG] UPDATED)
    Compositional generalization is a critical ability in learning and decision-making. We focus on the setting of reinforcement learning in object-oriented environments to study compositional generalization in world modeling. We (1) formalize the compositional generalization problem with an algebraic approach and (2) study how a world model can achieve that. We introduce a conceptual environment, Object Library, and two instances, and deploy a principled pipeline to measure the generalization ability. Motivated by the formulation, we analyze several methods with exact or no compositional generalization ability using our framework, and design a differentiable approach, Homomorphic Object-oriented World Model (HOWM), that achieves soft but more efficient compositional generalization.
    Local Attention Graph-based Transformer for Multi-target Genetic Alteration Prediction. (arXiv:2205.06672v2 [cs.CV] UPDATED)
    Classical multiple instance learning (MIL) methods are often based on the assumption that instances are independent and identically distributed, hence neglecting the potentially rich contextual information beyond individual entities. On the other hand, Transformers with global self-attention modules have been proposed to model the interdependencies among all instances. However, in this paper we question: Is global relation modeling using self-attention necessary, or can we appropriately restrict self-attention calculations to local regimes in large-scale whole slide images (WSIs)? We propose a general-purpose local attention graph-based Transformer for MIL (LA-MIL), introducing an inductive bias by explicitly contextualizing instances in adaptive local regimes of arbitrary size. Additionally, an efficiently adapted loss function enables our approach to learn expressive WSI embeddings for the joint analysis of multiple biomarkers. We demonstrate that LA-MIL achieves state-of-the-art results in mutation prediction for gastrointestinal cancer, outperforming existing models on important biomarkers such as microsatellite instability for colorectal cancer. Our findings suggest that local self-attention sufficiently models dependencies on par with global modules. Our LA-MIL implementation is available at https://github.com/agentdr1/LA_MIL.
    Reinforcement Learning in Macroeconomic Policy Design: A New Frontier? (arXiv:2206.08781v1 [cs.LG])
    Agent-based computational macroeconomics is a field with a rich academic history, yet one which has struggled to enter mainstream policy design toolboxes, plagued by the challenges associated with representing a complex and dynamic reality. The field of Reinforcement Learning (RL), too, has a rich history, and has recently been at the centre of several exponential developments. Modern RL implementations have been able to achieve unprecedented levels of sophistication, handling previously-unthinkable degrees of complexity. This review surveys the historical barriers of classical agent-based techniques in macroeconomic modelling, and contemplates whether recent developments in RL can overcome any of them.
    SaDe: Learning Models that Provably Satisfy Domain Constraints. (arXiv:2112.00552v3 [cs.LG] UPDATED)
    In many real world applications of machine learning, models have to meet certain domain-based requirements that can be expressed as constraints (e.g., safety-critical constraints in autonomous driving systems). Such constraints are often handled by including them in a regularization term, while learning a model. This approach, however, does not guarantee 100% satisfaction of the constraints: it only reduces violations of the constraints on the training set rather than ensuring that the predictions by the model will always adhere to them. In this paper, we present a framework for learning models that provably fulfil the constraints under all circumstances (i.e., also on unseen data). To achieve this, we cast learning as a maximum satisfiability problem, and solve it using a novel SaDe algorithm that combines constraint satisfaction with gradient descent. We compare our method against regularization based baselines on linear models and show that our method is capable of enforcing different types of domain constraints effectively on unseen data, without sacrificing predictive performance.
    Scalable Deep Reinforcement Learning Algorithms for Mean Field Games. (arXiv:2203.11973v2 [cs.LG] UPDATED)
    Mean Field Games (MFGs) have been introduced to efficiently approximate games with very large populations of strategic agents. Recently, the question of learning equilibria in MFGs has gained momentum, particularly using model-free reinforcement learning (RL) methods. One limiting factor to further scale up using RL is that existing algorithms to solve MFGs require the mixing of approximated quantities such as strategies or $q$-values. This is far from trivial in the case of non-linear function approximators that enjoy good generalization properties, e.g. neural networks. We propose two methods to address this shortcoming. The first one learns a mixed strategy from distillation of historical data into a neural network and is applied to the Fictitious Play algorithm. The second one is an online mixing method based on regularization that does not require memorizing historical data or previous estimates. It is used to extend Online Mirror Descent. We demonstrate numerically that these methods efficiently enable the use of Deep RL algorithms to solve various MFGs. In addition, we show that these methods outperform SotA baselines from the literature.
    Diffusion-GAN: Training GANs with Diffusion. (arXiv:2206.02262v2 [cs.LG] UPDATED)
    For stable training of generative adversarial networks (GANs), injecting instance noise into the input of the discriminator is considered a theoretically sound solution, which, however, has not yet delivered on its promise in practice. This paper introduces Diffusion-GAN that employs a Gaussian mixture distribution, defined over all the diffusion steps of a forward diffusion chain, to inject instance noise. A random sample from the mixture, which is diffused from observed or generated data, is fed as the input to the discriminator. The generator is updated by backpropagating its gradient through the forward diffusion chain, whose length is adaptively adjusted to control the maximum noise-to-data ratio allowed at each training step. Theoretical analysis verifies the soundness of the proposed Diffusion-GAN, which provides model- and domain-agnostic differentiable augmentation. A rich set of experiments on diverse datasets show that Diffusion-GAN can provide stable and data-efficient GAN training, bringing consistent performance improvement over strong GAN baselines for synthesizing photo-realistic images.
    A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks. (arXiv:2206.08514v1 [cs.LG])
    Textual backdoor attacks are a kind of practical threat to NLP systems. By injecting a backdoor in the training phase, the adversary could control model predictions via predefined triggers. As various attack and defense models have been proposed, it is of great significance to perform rigorous evaluations. However, we highlight two issues in previous backdoor learning evaluations: (1) The differences between real-world scenarios (e.g. releasing poisoned datasets or models) are neglected, and we argue that each scenario has its own constraints and concerns, and thus requires specific evaluation protocols; (2) The evaluation metrics only consider whether the attacks could flip the models' predictions on poisoned samples and retain performances on benign samples, but ignore that poisoned samples should also be stealthy and semantic-preserving. To address these issues, we categorize existing works into three practical scenarios in which attackers release datasets, pre-trained models, and fine-tuned models respectively, then discuss their unique evaluation methodologies. On metrics, to completely evaluate poisoned samples, we use grammar error increase and perplexity difference for stealthiness, along with text similarity for validity. After formalizing the frameworks, we develop an open-source toolkit OpenBackdoor to foster the implementations and evaluations of textual backdoor learning. With this toolkit, we perform extensive experiments to benchmark attack and defense models under the suggested paradigm. To facilitate the underexplored defenses against poisoned datasets, we further propose CUBE, a simple yet strong clustering-based defense baseline. We hope that our frameworks and benchmarks could serve as the cornerstones for future model development and evaluations.
    Adversarial Estimators. (arXiv:2204.10495v3 [econ.EM] UPDATED)
    We develop an asymptotic theory of adversarial estimators ('A-estimators'). They generalize maximum-likelihood-type estimators ('M-estimators') as their average objective is maximized by some parameters and minimized by others. This class subsumes the continuous-updating Generalized Method of Moments, Generative Adversarial Networks and more recent proposals in machine learning and econometrics. In these examples, researchers state which aspects of the problem may in principle be used for estimation, and an adversary learns how to emphasize them optimally. We derive the convergence rates of A-estimators under pointwise and partial identification, and the normality of functionals of their parameters. Unknown functions may be approximated via sieves such as deep neural networks, for which we provide simplified low-level conditions. As a corollary, we obtain the normality of neural-net M-estimators, overcoming technical issues previously identified by the literature. Our theory yields novel results about a variety of A-estimators, providing intuition and formal justification for their success in recent applications.
    PDE-READ: Human-readable Partial Differential Equation Discovery using Deep Learning. (arXiv:2111.00998v5 [cs.LG] UPDATED)
    PDE discovery shows promise for uncovering predictive models of complex physical systems but has difficulty when measurements are sparse and noisy. We introduce a new approach for PDE discovery that uses two Rational Neural Networks and a principled sparse regression algorithm to identify the hidden dynamics that govern a system's response. The first network learns the system response function, while the second learns a hidden PDE describing the system's evolution. We then use a parameter-free sparse regression algorithm to extract a human-readable form of the hidden PDE from the second network. We implement our approach in an open-source library called PDE-READ. Our approach successfully identifies the governing PDE in six benchmark examples. We demonstrate that our approach is robust to both sparsity and noise and it, therefore, holds promise for application to real-world observational data.
    Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs. (arXiv:2203.09251v2 [cs.LG] UPDATED)
    In probably approximately correct (PAC) reinforcement learning (RL), an agent is required to identify an $\epsilon$-optimal policy with probability $1-\delta$. While minimax optimal algorithms exist for this problem, its instance-dependent complexity remains elusive in episodic Markov decision processes (MDPs). In this paper, we propose the first (nearly) matching upper and lower bounds on the sample complexity of PAC RL in deterministic episodic MDPs with finite state and action spaces. In particular, our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap. While our instance-dependent lower bound is written as a linear program, our algorithms are very simple and do not require solving such an optimization problem during learning. Their design and analyses employ novel ideas, including graph-theoretical concepts such as minimum flows and maximum cuts, which we believe to shed new light on this problem.
    abess: A Fast Best Subset Selection Library in Python and R. (arXiv:2110.09697v2 [stat.ML] UPDATED)
    We introduce a new library named abess that implements a unified framework of best-subset selection for solving diverse machine learning problems, e.g., linear regression, classification, and principal component analysis. In particular, abess certifiably finds the optimal solution in polynomial time with high probability under the linear model. Our efficient implementation allows abess to attain the solution of best-subset selection problems as fast as or even 20x faster than existing competing variable (model) selection toolboxes. Furthermore, it supports common variants like best group subset selection and $\ell_2$ regularized best-subset selection. The core of the library is programmed in C++. For ease of use, a Python library is designed for conveniently integrating with scikit-learn, and it can be installed from the Python Package Index. In addition, a user-friendly R library is available at the Comprehensive R Archive Network. The source code is available at: https://github.com/abess-team/abess.
    Personalized Federated Learning through Local Memorization. (arXiv:2111.09360v3 [cs.LG] UPDATED)
    Federated learning allows clients to collaboratively learn statistical models while keeping their data local. Federated learning was originally used to train a unique global model to be served to all clients, but this approach might be sub-optimal when clients' local data distributions are heterogeneous. In order to tackle this limitation, recent personalized federated learning methods train a separate model for each client while still leveraging the knowledge available at other clients. In this work, we exploit the ability of deep neural networks to extract high quality vectorial representations (embeddings) from non-tabular data, e.g., images and text, to propose a personalization mechanism based on local memorization. Personalization is obtained by interpolating a collectively trained global model with a local $k$-nearest neighbors (kNN) model based on the shared representation provided by the global model. We provide generalization bounds for the proposed approach in the case of binary classification, and we show on a suite of federated datasets that this approach achieves significantly higher accuracy and fairness than state-of-the-art methods.
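    The interpolation described in this abstract can be illustrated with a minimal sketch. It assumes the personalized prediction is a convex combination of the global model's class probabilities and the class frequencies among the k nearest locally stored embeddings; the function names, the Euclidean distance, and the weight `lam` are illustrative choices, not taken from the paper.

```python
import math
from collections import Counter

def knn_probs(query, memory, k, num_classes):
    """Class frequencies among the k stored embeddings nearest to `query`.

    `memory` is a list of (embedding, label) pairs kept on the client
    (hypothetical layout; Euclidean distance is an assumption).
    """
    nearest = sorted(memory, key=lambda item: math.dist(query, item[0]))[:k]
    counts = Counter(label for _, label in nearest)
    return [counts[c] / len(nearest) for c in range(num_classes)]

def personalized_probs(global_probs, local_probs, lam):
    """Convex combination: `lam` weights the local kNN model, 1 - lam the global one."""
    return [lam * l + (1 - lam) * g for l, g in zip(local_probs, global_probs)]

# Toy client memory: two points of class 0 near the origin, one point of class 1.
memory = [((0.0, 0.0), 0), ((0.1, 0.0), 0), ((1.0, 1.0), 1)]
local = knn_probs((0.05, 0.0), memory, k=2, num_classes=2)
mixed = personalized_probs([0.4, 0.6], local, lam=0.5)
```

    With `lam = 0` the client falls back to the global model, and with `lam = 1` it relies only on its local memory.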
    Meta-Learning Hypothesis Spaces for Sequential Decision-making. (arXiv:2202.00602v3 [stat.ML] UPDATED)
    Obtaining reliable, adaptive confidence sets for prediction functions (hypotheses) is a central challenge in sequential decision-making tasks, such as bandits and model-based reinforcement learning. These confidence sets typically rely on prior assumptions on the hypothesis space, e.g., the known kernel of a Reproducing Kernel Hilbert Space (RKHS). Hand-designing such kernels is error prone, and misspecification may lead to poor or unsafe performance. In this work, we propose to meta-learn a kernel from offline data (Meta-KeL). For the case where the unknown kernel is a combination of known base kernels, we develop an estimator based on structured sparsity. Under mild conditions, we guarantee that our estimated RKHS yields valid confidence sets that, with increasing amounts of offline data, become as tight as those given the true unknown kernel. We demonstrate our approach on the kernelized bandit problem (a.k.a. Bayesian optimization), where we establish regret bounds competitive with those given the true kernel. We also empirically evaluate the effectiveness of our approach on a Bayesian optimization task.
    NeuralEF: Deconstructing Kernels by Deep Neural Networks. (arXiv:2205.00165v3 [cs.LG] UPDATED)
    Learning the principal eigenfunctions of an integral operator defined by a kernel and a data distribution is at the core of many machine learning problems. Traditional nonparametric solutions based on the Nyström formula suffer from scalability issues. Recent work has resorted to a parametric approach, i.e., training neural networks to approximate the eigenfunctions. However, the existing method relies on an expensive orthogonalization step and is difficult to implement. We show that these problems can be fixed by using a new series of objective functions that generalizes the EigenGame (Gemp et al., 2020) to function space. We test our method on a variety of supervised and unsupervised learning problems and show it provides accurate approximations to the eigenfunctions of polynomial, radial basis, neural network Gaussian process, and neural tangent kernels. Finally, we demonstrate our method can scale up linearised Laplace approximation of deep neural networks to modern image classification datasets through approximating the Gauss-Newton matrix. Code is available at https://github.com/thudzj/neuraleigenfunction.
    Conditional GANs with Auxiliary Discriminative Classifier. (arXiv:2107.10060v5 [cs.LG] UPDATED)
    Conditional generative models aim to learn the underlying joint distribution of data and labels to achieve conditional data generation. Among them, the auxiliary classifier generative adversarial network (AC-GAN) has been widely used, but suffers from the problem of low intra-class diversity of the generated samples. The fundamental reason pointed out in this paper is that the classifier of AC-GAN is generator-agnostic, which therefore cannot provide informative guidance for the generator to approach the joint distribution, resulting in a minimization of the conditional entropy that decreases the intra-class diversity. Motivated by this understanding, we propose a novel conditional GAN with an auxiliary discriminative classifier (ADC-GAN) to resolve the above problem. Specifically, the proposed auxiliary discriminative classifier becomes generator-aware by recognizing the class-labels of the real data and the generated data discriminatively. Our theoretical analysis reveals that the generator can faithfully learn the joint distribution even without the original discriminator, making the proposed ADC-GAN robust to the value of the coefficient hyperparameter and the selection of the GAN loss, and stable during training. Extensive experimental results on synthetic and real-world datasets demonstrate the superiority of ADC-GAN in conditional generative modeling compared to state-of-the-art classifier-based and projection-based conditional GANs.
    Generative Coarse-Graining of Molecular Conformations. (arXiv:2201.12176v2 [cs.LG] UPDATED)
    Coarse-graining (CG) of molecular simulations simplifies the particle representation by grouping selected atoms into pseudo-beads and drastically accelerates simulation. However, such CG procedure induces information losses, which makes accurate backmapping, i.e., restoring fine-grained (FG) coordinates from CG coordinates, a long-standing challenge. Inspired by the recent progress in generative models and equivariant networks, we propose a novel model that rigorously embeds the vital probabilistic nature and geometric consistency requirements of the backmapping transformation. Our model encodes the FG uncertainties into an invariant latent space and decodes them back to FG geometries via equivariant convolutions. To standardize the evaluation of this domain, we provide three comprehensive benchmarks based on molecular dynamics trajectories. Experiments show that our approach always recovers more realistic structures and outperforms existing data-driven methods with a significant margin.
    ROCK: Causal Inference Principles for Reasoning about Commonsense Causality. (arXiv:2202.00436v2 [cs.CL] UPDATED)
    Commonsense causality reasoning (CCR) aims at identifying plausible causes and effects in natural language descriptions that are deemed reasonable by an average person. Although being of great academic and practical interest, this problem is still shadowed by the lack of a well-posed theoretical framework; existing work usually relies on deep language models wholeheartedly, and is potentially susceptible to confounding co-occurrences. Motivated by classical causal principles, we articulate the central question of CCR and draw parallels between human subjects in observational studies and natural languages to adopt CCR to the potential-outcomes framework, which is the first such attempt for commonsense tasks. We propose a novel framework, ROCK, to Reason O(A)bout Commonsense K(C)ausality, which utilizes temporal signals as incidental supervision, and balances confounding effects using temporal propensities that are analogous to propensity scores. The ROCK implementation is modular and zero-shot, and demonstrates good CCR capabilities.
    Structure-preserving GANs. (arXiv:2202.01129v2 [cs.LG] UPDATED)
    Generative adversarial networks (GANs), a class of distribution-learning methods based on a two-player game between a generator and a discriminator, can generally be formulated as a minmax problem based on the variational representation of a divergence between the unknown and the generated distributions. We introduce structure-preserving GANs as a data-efficient framework for learning distributions with additional structure such as group symmetry, by developing new variational representations for divergences. Our theory shows that we can reduce the discriminator space to its projection on the invariant discriminator space, using the conditional expectation with respect to the sigma-algebra associated to the underlying structure. In addition, we prove that the discriminator space reduction must be accompanied by a careful design of structured generators, as flawed designs may easily lead to a catastrophic "mode collapse" of the learned distribution. We contextualize our framework by building symmetry-preserving GANs for distributions with intrinsic group symmetry, and demonstrate that both players, namely the equivariant generator and invariant discriminator, play important but distinct roles in the learning process. Empirical experiments and ablation studies across a broad range of data sets, including real-world medical imaging, validate our theory, and show our proposed methods achieve significantly improved sample fidelity and diversity -- almost an order of magnitude measured in Fréchet Inception Distance -- especially in the small data regime.
    High-Speed Accurate Robot Control using Learned Forward Kinodynamics and Non-linear Least Squares Optimization. (arXiv:2206.08487v1 [cs.RO])
    Accurate control of robots in the real world requires a control system that is capable of taking into account the kinodynamic interactions of the robot with its environment. At high speeds, the dependence of the movement of the robot on these kinodynamic interactions becomes more pronounced, making high-speed, accurate robot control a challenging problem. Previous work has shown that learning the inverse kinodynamics (IKD) of the robot can be helpful for high-speed robot control. However, a learned inverse kinodynamic model can only be applied to a limited class of control problems, and different control problems require the learning of a new IKD model. In this work we present a new formulation for accurate, high-speed robot control that makes use of a learned forward kinodynamic (FKD) model and non-linear least squares optimization. By nature of the formulation, this approach is extensible to a wide array of control problems without requiring the retraining of a new model. We demonstrate the ability of this approach to accurately control a one-tenth-scale robot car at high speeds, and show improved results over baselines.
    Tight query complexity bounds for learning graph partitions. (arXiv:2112.07897v2 [cs.LG] UPDATED)
    Given a partition of a graph into connected components, the membership oracle asserts whether any two vertices of the graph lie in the same component or not. We prove that for $n\ge k\ge 2$, learning the components of an $n$-vertex hidden graph with $k$ components requires at least $(k-1)n-\binom k2$ membership queries. Our result improves on the best known information-theoretic bound of $\Omega(n\log k)$ queries, and exactly matches the query complexity of the algorithm introduced by [Reyzin and Srivastava, 2007] for this problem. Additionally, we introduce an oracle, with access to which one can learn the number of components of $G$ in asymptotically fewer queries than learning the full partition, thus answering another question posed by the same authors. Lastly, we introduce a more applicable version of this oracle, and prove asymptotically tight bounds of $\widetilde\Theta(m)$ queries for both learning and verifying an $m$-edge hidden graph $G$ using it.
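    To make the bound $(k-1)n-\binom k2$ concrete, the sketch below simulates a naive strategy that compares each vertex against one representative per discovered component and, once all $k$ components are known, infers the last one without a query. This is an illustrative strategy written for this digest, not necessarily the exact algorithm of Reyzin and Srivastava; `learn_partition` and the adversarial input are assumptions.

```python
from math import comb

def learn_partition(labels, k):
    """Recover component ids using only same-component queries.

    `labels` plays the role of the hidden partition (used only inside the
    oracle); `k` is the known number of components.
    Returns (assignment, number_of_queries).
    """
    queries = 0
    reps = []        # one representative vertex per discovered component
    assignment = []

    def same_component(u, v):
        nonlocal queries
        queries += 1
        return labels[u] == labels[v]

    for v in range(len(labels)):
        # Once all k components are known, the last one is inferred by elimination.
        to_check = reps if len(reps) < k else reps[:-1]
        found = None
        for idx, r in enumerate(to_check):
            if same_component(v, r):
                found = idx
                break
        if found is None:
            if len(reps) < k:
                reps.append(v)       # v opens a new component
                found = len(reps) - 1
            else:
                found = k - 1        # only the unchecked component remains
        assignment.append(found)
    return assignment, queries

# Adversarial input: the first k vertices are pairwise separated and every
# later vertex lies in the component that is checked last.
n, k = 10, 3
hidden = list(range(k)) + [k - 1] * (n - k)
assignment, used = learn_partition(hidden, k)
lower_bound = (k - 1) * n - comb(k, 2)
```

    On this input the strategy spends $\binom k2$ queries on the first $k$ vertices and $k-1$ on each of the remaining $n-k$, which sums to the stated bound.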
    Deep learning, stochastic gradient descent and diffusion maps. (arXiv:2204.01365v3 [stat.ML] UPDATED)
    Stochastic gradient descent (SGD) is widely used in deep learning due to its computational efficiency, but a complete understanding of why SGD performs so well remains a major challenge. It has been observed empirically that most eigenvalues of the Hessian of the loss functions on the loss landscape of over-parametrized deep neural networks are close to zero, while only a small number of eigenvalues are large. Zero eigenvalues indicate zero diffusion along the corresponding directions. This indicates that the process of minima selection mainly happens in the relatively low-dimensional subspace corresponding to the top eigenvalues of the Hessian. Although the parameter space is very high-dimensional, these findings seem to indicate that the SGD dynamics may mainly live on a low-dimensional manifold. In this paper, we pursue a truly data driven approach to the problem of getting a potentially deeper understanding of the high-dimensional parameter surface, and in particular, of the landscape traced out by SGD by analyzing the data generated through SGD, or any other optimizer for that matter, in order to possibly discover (local) low-dimensional representations of the optimization landscape. As our vehicle for the exploration, we use diffusion maps introduced by R. Coifman and coauthors.
    Domain Adaptation for Time Series Forecasting via Attention Sharing. (arXiv:2102.06828v7 [cs.LG] UPDATED)
    Recently, deep neural networks have gained increasing popularity in the field of time series forecasting. A primary reason for their success is their ability to effectively capture complex temporal dynamics across multiple related time series. The advantages of these deep forecasters only start to emerge in the presence of a sufficient amount of data. This poses a challenge for typical forecasting problems in practice, where there is a limited number of time series or observations per time series, or both. To cope with this data scarcity issue, we propose a novel domain adaptation framework, Domain Adaptation Forecaster (DAF). DAF leverages statistical strengths from a relevant domain with abundant data samples (source) to improve the performance on the domain of interest with limited data (target). In particular, we use an attention-based shared module with a domain discriminator across domains and private modules for individual domains. We induce domain-invariant latent features (queries and keys) and retrain domain-specific features (values) simultaneously to enable joint training of forecasters on source and target domains. A main insight is that our design of aligning keys allows the target domain to leverage source time series even with different characteristics. Extensive experiments on various domains demonstrate that our proposed method outperforms state-of-the-art baselines on synthetic and real-world datasets, and ablation studies verify the effectiveness of our design choices.
    Learning to Hash Robustly, Guaranteed. (arXiv:2108.05433v4 [cs.DS] UPDATED)
    The indexing algorithms for the high-dimensional nearest neighbor search (NNS) with the best worst-case guarantees are based on the randomized Locality Sensitive Hashing (LSH), and its derivatives. In practice, many heuristic approaches exist to "learn" the best indexing method in order to speed up NNS, crucially adapting to the structure of the given dataset. Oftentimes, these heuristics outperform the LSH-based algorithms on real datasets, but, almost always, come at the cost of losing the guarantees of either correctness or robust performance on adversarial queries, or apply to datasets with an assumed extra structure/model. In this paper, we design an NNS algorithm for the Hamming space that has worst-case guarantees essentially matching those of theoretical algorithms, while optimizing the hashing to the structure of the dataset (think instance-optimal algorithms) for performance on the minimum-performing query. We evaluate the algorithm's ability to optimize for a given dataset both theoretically and practically. On the theoretical side, we exhibit a natural setting (dataset model) where our algorithm is much better than the standard theoretical one. On the practical side, we run experiments that show that our algorithm has a 1.8x and 2.1x better recall on the worst-performing queries to the MNIST and ImageNet datasets.
    A Modern Self-Referential Weight Matrix That Learns to Modify Itself. (arXiv:2202.05780v2 [cs.LG] UPDATED)
    The weight matrix (WM) of a neural network (NN) is its program. The programs of many traditional NNs are learned through gradient descent in some error function, then remain fixed. The WM of a self-referential NN, however, can keep rapidly modifying all of itself during runtime. In principle, such NNs can meta-learn to learn, and meta-meta-learn to meta-learn to learn, and so on, in the sense of recursive self-improvement. While NN architectures potentially capable of implementing such behaviour have been proposed since the '90s, there have been few if any practical studies. Here we revisit such NNs, building upon recent successes of fast weight programmers and closely related linear Transformers. We propose a scalable self-referential WM (SRWM) that learns to use outer products and the delta update rule to modify itself. We evaluate our SRWM in supervised few-shot learning and in multi-task reinforcement learning with procedurally generated game environments. Our experiments demonstrate both practical applicability and competitive performance of the proposed SRWM. Our code is public.
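    The delta update rule with outer products mentioned above can be sketched in its generic fast-weight form, $W \leftarrow W + \beta\,(v - Wk)\,k^\top$. The code below shows only that generic rule with externally supplied keys, values, and learning rate; in the SRWM these quantities would be produced by the weight matrix itself, which this sketch does not attempt to reproduce.

```python
def delta_update(W, key, value, beta):
    """Generic delta rule: W <- W + beta * (value - W @ key) outer key.

    W is a list of rows; key, value are plain lists; beta is a scalar.
    """
    pred = [sum(w * k for w, k in zip(row, key)) for row in W]  # current W @ key
    err = [v - p for v, p in zip(value, pred)]                  # retrieval error
    return [[w + beta * e * k for w, k in zip(row, key)]
            for row, e in zip(W, err)]

def retrieve(W, key):
    """Read out the value currently associated with `key`."""
    return [sum(w * k for w, k in zip(row, key)) for row in W]

W = [[0.0, 0.0], [0.0, 0.0]]
W = delta_update(W, key=[1.0, 0.0], value=[3.0, -1.0], beta=1.0)
W = delta_update(W, key=[0.0, 1.0], value=[2.0, 2.0], beta=1.0)
```

    With unit-norm keys and `beta = 1`, a single update stores the value exactly, and updates along orthogonal keys leave earlier associations intact.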
    Variational Nested Dropout. (arXiv:2101.11353v2 [cs.LG] UPDATED)
    Nested dropout is a variant of dropout operation that is able to order network parameters or features based on the pre-defined importance during training. It has been explored for: I. Constructing nested nets: the nested nets are neural networks whose architectures can be adjusted instantly during testing time, e.g., based on computational constraints. The nested dropout implicitly ranks the network parameters, generating a set of sub-networks such that any smaller sub-network forms the basis of a larger one. II. Learning ordered representation: the nested dropout applied to the latent representation of a generative model (e.g., auto-encoder) ranks the features, enforcing explicit order of the dense representation over dimensions. However, the dropout rate is fixed as a hyper-parameter during the whole training process. For nested nets, when network parameters are removed, the performance decays in a human-specified trajectory rather than in a trajectory learned from data. For generative models, the importance of features is specified as a constant vector, restraining the flexibility of representation learning. To address the problem, we focus on the probabilistic counterpart of the nested dropout. We propose a variational nested dropout (VND) operation that draws samples of multi-dimensional ordered masks at a low cost, providing useful gradients to the parameters of nested dropout. Based on this approach, we design a Bayesian nested neural network that learns the order knowledge of the parameter distributions. We further exploit the VND under different generative models for learning ordered latent distributions. In experiments, we show that the proposed approach outperforms the nested network in terms of accuracy, calibration, and out-of-domain detection in classification tasks. It also outperforms the related generative models on data generation tasks.
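    For orientation, the standard (non-variational) nested dropout that this abstract builds on can be sketched as follows: a cut index is drawn from a truncated geometric distribution and every unit after it is dropped, so earlier units are kept more often. The fixed stopping probability `p` is the hand-set hyper-parameter the abstract refers to; the function name and the truncation at `dim` are illustrative assumptions, and the variational version proposed in the paper is not shown.

```python
import random

def nested_dropout_mask(dim, p, rng):
    """Ordered mask: keep the first b units, drop the rest.

    b starts at 1 and the cut stops at each position with probability
    about p, so b follows a geometric distribution truncated at `dim`.
    """
    b = 1
    while b < dim and rng.random() > p:
        b += 1
    return [1] * b + [0] * (dim - b)

rng = random.Random(0)
features = [0.5, -1.2, 0.3, 2.0, 0.7, -0.4, 1.1, 0.9]
mask = nested_dropout_mask(len(features), p=0.3, rng=rng)
masked = [f * m for f, m in zip(features, mask)]
```

    Because every sampled mask is a prefix of ones, a representation truncated to its first few units is always one that was seen during training.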
    Fairness in Credit Scoring: Assessment, Implementation and Profit Implications. (arXiv:2103.01907v4 [stat.ML] UPDATED)
    The rise of algorithmic decision-making has spawned much research on fair machine learning (ML). Financial institutions use ML for building risk scorecards that support a range of credit-related decisions. Yet, the literature on fair ML in credit scoring is scarce. The paper makes three contributions. First, we revisit statistical fairness criteria and examine their adequacy for credit scoring. Second, we catalog algorithmic options for incorporating fairness goals in the ML model development pipeline. Last, we empirically compare different fairness processors in a profit-oriented credit scoring context using real-world data. The empirical results substantiate the evaluation of fairness measures, identify suitable options to implement fair credit scoring, and clarify the profit-fairness trade-off in lending decisions. We find that multiple fairness criteria can be approximately satisfied at once and recommend separation as a proper criterion for measuring the fairness of a scorecard. We also find fair in-processors to deliver a good balance between profit and fairness and show that algorithmic discrimination can be reduced to a reasonable level at a relatively low cost. The codes corresponding to the paper are available on GitHub.
    Label-Descriptive Patterns and Their Application to Characterizing Classification Errors. (arXiv:2110.09599v3 [cs.LG] UPDATED)
    State-of-the-art deep learning methods achieve human-like performance on many tasks, but make errors nevertheless. Characterizing these errors in easily interpretable terms gives insight into whether a classifier is prone to making systematic errors, but also gives a way to act and improve the classifier. We propose to discover those feature-value combinations (i.e., patterns) that strongly correlate with correct or erroneous predictions to obtain a global and interpretable description for arbitrary classifiers. We show this is an instance of the more general label description problem, which we formulate in terms of the Minimum Description Length principle. To discover a good pattern set, we develop the efficient Premise algorithm. Through an extensive set of experiments we show it performs very well in practice on both synthetic and real-world data. Unlike existing solutions, it ably recovers ground truth patterns, even on highly imbalanced data over many features. Through two case studies on Visual Question Answering and Named Entity Recognition, we confirm that Premise gives clear and actionable insight into the systematic errors made by modern NLP classifiers.
    Anti-Money Laundering Alert Optimization Using Machine Learning with Graphs. (arXiv:2112.07508v3 [cs.LG] UPDATED)
    Money laundering is a global problem that concerns legitimizing proceeds from serious felonies (1.7-4 trillion euros annually) such as drug dealing, human trafficking, or corruption. The anti-money laundering systems deployed by financial institutions typically comprise rules aligned with regulatory frameworks. Human investigators review the alerts and report suspicious cases. Such systems suffer from high false-positive rates, undermining their effectiveness and resulting in high operational costs. We propose a machine learning triage model, which complements the rule-based system and learns to predict the risk of an alert accurately. Our model uses both entity-centric engineered features and attributes characterizing inter-entity relations in the form of graph-based features. We leverage time windows to construct the dynamic graph, optimizing for time and space efficiency. We validate our model on a real-world banking dataset and show how the triage model can reduce the number of false positives by 80% while detecting over 90% of true positives. In this way, our model can significantly improve anti-money laundering operations.
    Spectral CUSUM for Online Network Structure Change Detection. (arXiv:1910.09083v3 [math.ST] UPDATED)
    Detecting abrupt changes in the community structure of a network from noisy observations is a fundamental problem in statistics and machine learning. This paper presents an online change detection algorithm called Spectral-CUSUM to detect unknown network structure changes through a generalized likelihood ratio statistic. We characterize the average run length (ARL) and the expected detection delay (EDD) of the Spectral-CUSUM procedure and prove its asymptotic optimality. Finally, we demonstrate the good performance of the Spectral-CUSUM procedure and compare it with several baseline methods using simulations and real data examples on seismic event detection using sensor network data.
    NAFS: A Simple yet Tough-to-beat Baseline for Graph Representation Learning. (arXiv:2206.08583v1 [cs.LG])
    Recently, graph neural networks (GNNs) have shown prominent performance in graph representation learning by leveraging knowledge from both graph structure and node features. However, most of them have two major limitations. First, GNNs can learn higher-order structural information by stacking more layers but cannot deal with large depth due to the over-smoothing issue. Second, it is not easy to apply these methods on large graphs due to the expensive computation cost and high memory usage. In this paper, we present node-adaptive feature smoothing (NAFS), a simple non-parametric method that constructs node representations without parameter learning. NAFS first extracts the features of each node with its neighbors of different hops by feature smoothing, and then adaptively combines the smoothed features. Besides, the constructed node representation can further be enhanced by the ensemble of smoothed features extracted via different smoothing strategies. We conduct experiments on four benchmark datasets on two different application scenarios: node clustering and link prediction. Remarkably, NAFS with feature ensemble outperforms the state-of-the-art GNNs on these tasks and mitigates the aforementioned two limitations of most learning-based GNN counterparts.  ( 2 min )
    SYMBA: Symbolic Computation of Squared Amplitudes in High Energy Physics with Machine Learning. (arXiv:2206.08901v1 [hep-ph])
    The cross section is one of the most important physical quantities in high-energy physics and the most time consuming to compute. While machine learning has proven to be highly successful in numerical calculations in high-energy physics, analytical calculations using machine learning are still in their infancy. In this work, we use a sequence-to-sequence transformer model to compute a key element of the cross section calculation, namely, the squared amplitude of an interaction. We show that a transformer model is able to predict correctly 89.0% and 99.4% of squared amplitudes of QCD and QED processes, respectively. We discuss the performance of the current model, its limitations and possible future directions for this work.
    Distinguishing rule- and exemplar-based generalization in learning systems. (arXiv:2110.04328v2 [cs.LG] UPDATED)
    Machine learning systems often do not share the same inductive biases as humans and, as a result, extrapolate or generalize in ways that are inconsistent with our expectations. The trade-off between exemplar- and rule-based generalization has been studied extensively in cognitive psychology; in this work, we present a protocol inspired by these experimental approaches to probe the inductive biases that control this tradeoff in category-learning systems. We isolate two such inductive biases: feature-level bias (differences in which features are more readily learned) and exemplar or rule bias (differences in how these learned features are used for generalization). We find that standard neural network models are feature-biased and exemplar-based, and discuss the implications of these findings for machine learning research on systematic generalization, fairness, and data augmentation.
    CausalVAE: Structured Causal Disentanglement in Variational Autoencoder. (arXiv:2004.08697v6 [cs.LG] UPDATED)
    Learning disentanglement aims at finding a low dimensional representation which consists of multiple explanatory and generative factors of the observational data. The framework of variational autoencoder (VAE) is commonly used to disentangle independent factors from observations. However, in real scenarios, factors with semantics are not necessarily independent. Instead, there might be an underlying causal structure which renders these factors dependent. We thus propose a new VAE-based framework named CausalVAE, which includes a Causal Layer to transform independent exogenous factors into causal endogenous ones that correspond to causally related concepts in data. We further analyze the model identifiability, showing that the proposed model learned from observations recovers the true one up to a certain degree. Experiments are conducted on various datasets, including synthetic data and the real-world benchmark CelebA. Results show that the causal representations learned by CausalVAE are semantically interpretable, and their causal relationship as a Directed Acyclic Graph (DAG) is identified with good accuracy. Furthermore, we demonstrate that the proposed CausalVAE model is able to generate counterfactual data through "do-operation" to the causal factors.
    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. (arXiv:2101.03961v3 [cs.LG] UPDATED)
    In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5-XXL model.
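    The idea of selecting different parameters for each incoming example can be sketched as top-1 routing over a small set of experts. This is an illustrative NumPy sketch, not the Switch Transformer implementation: the function name `switch_route`, the router matrix `W_router`, and the toy scaling experts are all assumptions, and the sketch omits any load-balancing or capacity handling a real system would need.

```python
import numpy as np

def switch_route(x, W_router, experts):
    """Send each token to its single highest-probability expert.

    x        : (tokens, d) float array of token representations
    W_router : (d, n_experts) router weights (hypothetical)
    experts  : list of callables mapping (n, d) -> (n, d)
    """
    logits = x @ W_router
    logits = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    choice = probs.argmax(axis=1)                          # one expert per token
    gate = probs[np.arange(len(x)), choice]                # router probability of that expert
    out = np.zeros_like(x)
    for e, expert in enumerate(experts):
        mask = choice == e
        if mask.any():
            # Only the chosen expert runs on each token, so per-token
            # compute stays constant as the number of experts grows.
            out[mask] = gate[mask, None] * expert(x[mask])
    return out, choice, gate

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
W_router = rng.normal(size=(4, 3))
scales = (1.0, 2.0, 3.0)
experts = [lambda h, s=s: s * h for s in scales]           # toy experts: fixed scalings
out, choice, gate = switch_route(x, W_router, experts)
```

    Adding experts increases the parameter count, but each token still passes through exactly one of them.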
    Adversarial Attack and Defense for Non-Parametric Two-Sample Tests. (arXiv:2202.03077v2 [cs.LG] UPDATED)
    Non-parametric two-sample tests (TSTs) that judge whether two sets of samples are drawn from the same distribution, have been widely used in the analysis of critical data. People tend to employ TSTs as trusted basic tools and rarely have any doubt about their reliability. This paper systematically uncovers the failure mode of non-parametric TSTs through adversarial attacks and then proposes corresponding defense strategies. First, we theoretically show that an adversary can upper-bound the distributional shift which guarantees the attack's invisibility. Furthermore, we theoretically find that the adversary can also degrade the lower bound of a TST's test power, which enables us to iteratively minimize the test criterion in order to search for adversarial pairs. To enable TST-agnostic attacks, we propose an ensemble attack (EA) framework that jointly minimizes the different types of test criteria. Second, to robustify TSTs, we propose a max-min optimization that iteratively generates adversarial pairs to train the deep kernels. Extensive experiments on both simulated and real-world datasets validate the adversarial vulnerabilities of non-parametric TSTs and the effectiveness of our proposed defense. Source code is available at https://github.com/GodXuxilie/Robust-TST.git.
    MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge. (arXiv:2206.08853v1 [cs.LG])
    Autonomous agents have made great strides in specialist domains like Atari games and Go. However, they typically learn tabula rasa in isolated environments with limited and manually conceived objectives, thus failing to generalize across a wide spectrum of tasks and capabilities. Inspired by how humans continually learn and adapt in the open world, we advocate a trinity of ingredients for building generalist agents: 1) an environment that supports a multitude of tasks and goals, 2) a large-scale database of multimodal knowledge, and 3) a flexible and scalable agent architecture. We introduce MineDojo, a new framework built on the popular Minecraft game that features a simulation suite with thousands of diverse open-ended tasks and an internet-scale knowledge base with Minecraft videos, tutorials, wiki pages, and forum discussions. Using MineDojo's data, we propose a novel agent learning algorithm that leverages large pre-trained video-language models as a learned reward function. Our agent is able to solve a variety of open-ended tasks specified in free-form language without any manually designed dense shaping reward. We open-source the simulation suite and knowledge bases (https://minedojo.org) to promote research towards the goal of generally capable embodied agents.
    Smoothing Policies and Safe Policy Gradients. (arXiv:1905.03231v2 [cs.LG] UPDATED)
    Policy Gradient (PG) algorithms are among the best candidates for the much-anticipated applications of reinforcement learning to real-world control tasks, such as robotics. However, the trial-and-error nature of these methods poses safety issues whenever the learning process itself must be performed on a physical system or involves any form of human-computer interaction. In this paper, we address a specific safety formulation, where both goals and dangers are encoded in a scalar reward signal and the learning agent is constrained to never worsen its performance, measured as the expected sum of rewards. By studying actor-only policy gradient from a stochastic optimization perspective, we establish improvement guarantees for a wide class of parametric policies, generalizing existing results on Gaussian policies. This, together with novel upper bounds on the variance of policy gradient estimators, allows us to identify meta-parameter schedules that guarantee monotonic improvement with high probability. The two key meta-parameters are the step size of the parameter updates and the batch size of the gradient estimates. Through a joint, adaptive selection of these meta-parameters, we obtain a policy gradient algorithm with monotonic improvement guarantees.
    Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective. (arXiv:2110.06256v2 [cs.LG] UPDATED)
    This work examines the deep disconnect between existing theoretical analyses of gradient-based algorithms and the practice of training deep neural networks. Specifically, we provide numerical evidence that in large-scale neural network training (e.g., ImageNet + ResNet101, and WT103 + TransformerXL models), the neural network's weights do not converge to stationary points where the gradient of the loss is zero. Remarkably, however, we observe that even though the weights do not converge to stationary points, the progress in minimizing the loss function halts and training loss stabilizes. Inspired by this observation, we propose a new perspective based on ergodic theory of dynamical systems to explain it. Rather than studying the evolution of weights, we study the evolution of the distribution of weights. We prove convergence of the distribution of weights to an approximate invariant measure, thereby explaining how the training loss can stabilize without weights necessarily converging to stationary points. We further discuss how this perspective can better align optimization theory with empirical observations in machine learning practice.
    Boosting Factorization Machines via Saliency-Guided Mixup. (arXiv:2206.08661v1 [cs.IR])
    Factorization machines (FMs) are widely used in recommender systems due to their adaptability and ability to learn from sparse data. However, for the ubiquitous non-interactive features in sparse data, existing FMs can only estimate the parameters corresponding to these features via the inner product of their embeddings. Consequently, they cannot learn the direct interactions of these features, which limits the model's expressive power. To this end, we first present MixFM, inspired by Mixup, to generate auxiliary training data to boost FMs. Unlike existing augmentation strategies that require labor costs and expertise to collect additional information such as position and fields, the extra data are generated by MixFM solely as convex combinations of the raw samples, without any expert knowledge. More importantly, if the parent samples to be mixed have non-interactive features, MixFM will establish their direct interactions. Second, considering that MixFM may generate redundant or even detrimental instances, we further put forward a novel Factorization Machine powered by Saliency-guided Mixup (denoted as SMFM). Guided by the customized saliency, SMFM can generate more informative neighbor data. Through theoretical analysis, we prove that the proposed methods minimize the upper bound of the generalization error, which has a beneficial effect on enhancing FMs. Significantly, we give the first generalization bound of FM, implying that generalization requires more data and a smaller embedding size under sufficient representation capability. Finally, extensive experiments on five datasets confirm that our approaches are superior to baselines. Besides, the results show that "poisoning" mixed data is likewise beneficial to the FM variants.  ( 3 min )
    MetaFed: Federated Learning among Federations with Cyclic Knowledge Distillation for Personalized Healthcare. (arXiv:2206.08516v1 [cs.LG])
    Federated learning has attracted increasing attention for building models without accessing the raw user data, especially in healthcare. In real applications, different federations can seldom work together due to possible reasons such as data heterogeneity and distrust or absence of a central server. In this paper, we propose a novel framework called MetaFed to facilitate trustworthy FL between different federations. MetaFed obtains a personalized model for each federation without a central server via the proposed Cyclic Knowledge Distillation. Specifically, MetaFed treats each federation as a meta distribution and aggregates knowledge of each federation in a cyclic manner. The training is split into two parts: common knowledge accumulation and personalization. Comprehensive experiments on three benchmarks demonstrate that MetaFed without a server achieves better accuracy compared to state-of-the-art methods (e.g., 10%+ accuracy improvement compared to the baseline for PAMAP2) with lower communication costs.  ( 2 min )
    Boosting Graph Structure Learning with Dummy Nodes. (arXiv:2206.08561v1 [cs.LG])
    With the development of graph kernels and graph representation learning, many superior methods have been proposed to handle scalability and oversmoothing issues on graph structure learning. However, most of those strategies are designed based on practical experience rather than theoretical analysis. In this paper, we use a particular dummy node connecting to all existing vertices without affecting original vertex and edge properties. We further prove that such a dummy node can help build an efficient monomorphic edge-to-vertex transform and an epimorphic inverse to recover the original graph. It also indicates that adding dummy nodes can preserve local and global structures for better graph representation learning. We extend graph kernels and graph neural networks with dummy nodes and conduct experiments on graph classification and subgraph isomorphism matching tasks. Empirical results demonstrate that taking graphs with dummy nodes as input significantly boosts graph structure learning, and using their edge-to-vertex graphs can also achieve similar results. We also discuss the gain of expressive power from the dummy in neural networks.  ( 2 min )
    Neural Ensemble Search via Bayesian Sampling. (arXiv:2109.02533v2 [cs.LG] UPDATED)
    Recently, neural architecture search (NAS) has been applied to automate the design of neural networks in real-world applications. A large number of algorithms have been developed to improve the search cost or the performance of the final selected architectures in NAS. Unfortunately, these NAS algorithms aim to select only one single well-performing architecture from their search spaces and thus have overlooked the capability of neural network ensemble (i.e., an ensemble of neural networks with diverse architectures) in achieving improved performance over a single final selected architecture. To this end, we introduce a novel neural ensemble search algorithm, called neural ensemble search via Bayesian sampling (NESBS), to effectively and efficiently select well-performing neural network ensembles from a NAS search space. In our extensive experiments, NESBS algorithm is shown to be able to achieve improved performance over state-of-the-art NAS algorithms while incurring a comparable search cost, thus indicating the superior performance of our NESBS algorithm over these NAS algorithms in practice.
    Popular decision tree algorithms are provably noise tolerant. (arXiv:2206.08899v1 [cs.LG])
    Using the framework of boosting, we prove that all impurity-based decision tree learning algorithms, including the classic ID3, C4.5, and CART, are highly noise tolerant. Our guarantees hold under the strongest noise model of nasty noise, and we provide near-matching upper and lower bounds on the allowable noise rate. We further show that these algorithms, which are simple and have long been central to everyday machine learning, enjoy provable guarantees in the noisy setting that are unmatched by existing algorithms in the theoretical literature on decision tree learning. Taken together, our results add to an ongoing line of research that seeks to place the empirical success of these practical decision tree algorithms on firm theoretical footing.
    Feature and Parameter Selection in Stochastic Linear Bandits. (arXiv:2106.05378v3 [cs.LG] UPDATED)
    We study two model selection settings in stochastic linear bandits (LB). In the first setting, which we refer to as feature selection, the expected reward of the LB problem is in the linear span of at least one of $M$ feature maps (models). In the second setting, the reward parameter of the LB problem is arbitrarily selected from $M$ models represented as (possibly) overlapping balls in $\mathbb R^d$. However, the agent only has access to misspecified models, i.e., estimates of the centers and radii of the balls. We refer to this setting as parameter selection. For each setting, we develop and analyze a computationally efficient algorithm that is based on a reduction from bandits to full-information problems. This allows us to obtain regret bounds that are not worse (up to a $\sqrt{\log M}$ factor) than the case where the true model is known. This is the best-reported dependence on the number of models $M$ in these settings. Finally, we empirically show the effectiveness of our algorithms using synthetic and real-world experiments.
    Learngene: From Open-World to Your Learning Task. (arXiv:2106.06788v3 [cs.LG] UPDATED)
    Although deep learning has made significant progress on fixed large-scale datasets, it typically encounters challenges such as failing to detect unknown/unseen classes in the open-world scenario, over-parameterization, and overfitting on small samples. Biological systems, in contrast, overcome these difficulties very well: individuals inherit an innate gene from collective creatures that have evolved over hundreds of millions of years and then learn new skills from a few examples. Inspired by this, we propose a practical collective-individual paradigm where an evolution (expandable) network is trained on sequential tasks and then recognizes unknown classes in the real world. Moreover, the learngene, i.e., the gene for learning initialization rules of the target model, is proposed to inherit the meta-knowledge from the collective model and reconstruct a lightweight individual model on the target task. Particularly, a novel criterion is proposed to discover the learngene in the collective model, according to the gradient information. Finally, the individual model is trained with only a few samples on the target learning tasks. We demonstrate the effectiveness of our approach in an extensive empirical study and theoretical analysis.
    Spherical Sliced-Wasserstein. (arXiv:2206.08780v1 [stat.ML])
    Many variants of the Wasserstein distance have been introduced to reduce its original computational burden. In particular the Sliced-Wasserstein distance (SW), which leverages one-dimensional projections for which a closed-form solution of the Wasserstein distance is available, has received a lot of interest. Yet, it is restricted to data living in Euclidean spaces, while the Wasserstein distance has been studied and used recently on manifolds. We focus more specifically on the sphere, for which we define a novel SW discrepancy, which we call spherical Sliced-Wasserstein, making a first step towards defining SW discrepancies on manifolds. Our construction is notably based on closed-form solutions of the Wasserstein distance on the circle, together with a new spherical Radon transform. Along with efficient algorithms and the corresponding implementations, we illustrate its properties in several machine learning use cases where spherical representations of data are at stake: density estimation on the sphere, variational inference or hyperspherical auto-encoders.
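    The one-dimensional closed form that the Sliced-Wasserstein distance relies on can be sketched as follows. This is the standard Euclidean construction, not the spherical variant proposed in the paper; the function name `sliced_wasserstein2`, the Monte Carlo projection count, and the equal-sample-size restriction are assumptions made to keep the example short.

```python
import numpy as np

def sliced_wasserstein2(X, Y, n_proj=200, seed=0):
    """Monte Carlo estimate of the squared Euclidean Sliced-Wasserstein-2 distance.

    X, Y : (n, d) arrays holding the same number of samples (assumption).
    In one dimension the optimal transport plan matches sorted samples,
    so each projection costs only a sort.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    theta = rng.normal(size=(n_proj, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)   # random unit directions
    Xp = np.sort(X @ theta.T, axis=0)                        # sorted 1D projections
    Yp = np.sort(Y @ theta.T, axis=0)
    return float(np.mean((Xp - Yp) ** 2))                    # average over samples and directions

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
shift = np.array([1.0, 0.0, 0.0])
sw_same = sliced_wasserstein2(X, X)
sw_shifted = sliced_wasserstein2(X, X + shift)
```

    Each projection reduces to sorting two arrays, which is where the computational saving over the full Wasserstein distance comes from.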
    Open-Sampling: Exploring Out-of-Distribution data for Re-balancing Long-tailed datasets. (arXiv:2206.08802v1 [cs.LG])
    Deep neural networks usually perform poorly when the training dataset suffers from extreme class imbalance. Recent studies found that directly training with out-of-distribution data (i.e., open-set samples) in a semi-supervised manner would harm the generalization performance. In this work, we theoretically show that out-of-distribution data can still be leveraged to augment the minority classes from a Bayesian perspective. Based on this motivation, we propose a novel method called Open-sampling, which utilizes open-set noisy labels to re-balance the class priors of the training dataset. For each open-set instance, the label is sampled from our pre-defined distribution that is complementary to the distribution of original class priors. We empirically show that Open-sampling not only re-balances the class priors but also encourages the neural network to learn separable representations. Extensive experiments demonstrate that our proposed method significantly outperforms existing data re-balancing methods and can boost the performance of existing state-of-the-art methods.
    Tensor-on-Tensor Regression: Riemannian Optimization, Over-parameterization, Statistical-computational Gap, and Their Interplay. (arXiv:2206.08756v1 [math.ST])
    We study the tensor-on-tensor regression, where the goal is to connect tensor responses to tensor covariates with a low Tucker rank parameter tensor/matrix without the prior knowledge of its intrinsic rank. We propose the Riemannian gradient descent (RGD) and Riemannian Gauss-Newton (RGN) methods and cope with the challenge of unknown rank by studying the effect of rank over-parameterization. We provide the first convergence guarantee for the general tensor-on-tensor regression by showing that RGD and RGN respectively converge linearly and quadratically to a statistically optimal estimate in both rank correctly-parameterized and over-parameterized settings. Our theory reveals an intriguing phenomenon: Riemannian optimization methods naturally adapt to over-parameterization without modifications to their implementation. We also give the first rigorous evidence for the statistical-computational gap in scalar-on-tensor regression under the low-degree polynomials framework. Our theory demonstrates a "blessing of statistical-computational gap" phenomenon: in a wide range of scenarios in tensor-on-tensor regression for tensors of order three or higher, the computationally required sample size matches what is needed by moderate rank over-parameterization when considering computationally feasible estimators, while there are no such benefits in the matrix settings. This shows moderate rank over-parameterization is essentially "cost-free" in terms of sample size in tensor-on-tensor regression of order three or higher. Finally, we conduct simulation studies to show the advantages of our proposed methods and to corroborate our theoretical findings.
    You Are the Best Reviewer of Your Own Papers: An Owner-Assisted Scoring Mechanism. (arXiv:2110.14802v2 [cs.LG] UPDATED)
    I consider a setting where reviewers offer very noisy scores for several items for the selection of high-quality ones (e.g., peer review of large conference proceedings), whereas the owner of these items knows the true underlying scores but prefers not to provide this information. To address this withholding of information, in this paper, I introduce the Isotonic Mechanism, a simple and efficient approach to improving imprecise raw scores by leveraging certain information that the owner is incentivized to provide. This mechanism takes the ranking of the items from best to worst provided by the owner as input, in addition to the raw scores provided by the reviewers. It reports the adjusted scores for the items by solving a convex optimization problem. Under certain conditions, I show that the owner's optimal strategy is to honestly report the true ranking of the items to her best knowledge in order to maximize the expected utility. Moreover, I prove that the adjusted scores provided by this owner-assisted mechanism are significantly more accurate than the raw scores provided by the reviewers. This paper concludes with several extensions of the Isotonic Mechanism and some refinements of the mechanism for practical consideration.
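    The adjustment step described above, which combines the reviewers' raw scores with the owner's ranking through a convex program, can be sketched as an isotonic regression. The abstract does not state the objective, so the least-squares form used here is an assumption, as are the function names `isotonic_nonincreasing` and `adjust_scores`; the solver is the classical pool-adjacent-violators procedure.

```python
import numpy as np

def isotonic_nonincreasing(y):
    """Least-squares projection of y onto non-increasing sequences (pool adjacent violators)."""
    blocks = []  # each block stores [sum of values, number of values]
    for val in y:
        blocks.append([float(val), 1])
        # Merge while the newest block's mean exceeds the mean of the block before it.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return np.concatenate([np.full(c, s / c) for s, c in blocks])

def adjust_scores(raw, ranking):
    """Adjust raw scores so they respect the owner's ranking.

    raw     : reviewer scores, one per item
    ranking : ranking[i] is the index of the item the owner ranks i-th best
    """
    raw = np.asarray(raw, dtype=float)
    ranking = np.asarray(ranking)
    adjusted = np.empty_like(raw)
    adjusted[ranking] = isotonic_nonincreasing(raw[ranking])
    return adjusted

# The owner ranks item 0 above item 1, but the reviewers scored item 1 higher.
adjusted = adjust_scores([6.0, 8.0, 5.0], ranking=[0, 1, 2])
```

    In this toy case the two scores that contradict the ranking are pooled to their average, while the third score, which already agrees with the ranking, is left unchanged.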
    DISCO: Comprehensive and Explainable Disinformation Detection. (arXiv:2203.04928v2 [cs.LG] UPDATED)
    Disinformation refers to false information deliberately spread to influence the general public, and the negative impact of disinformation on society can be observed in numerous issues, such as political agendas and the manipulation of financial markets. In this paper, we identify prevalent challenges and advances related to automated disinformation detection from multiple aspects and propose a comprehensive and explainable disinformation detection framework called DISCO. It leverages the heterogeneity of disinformation and addresses the opaqueness of prediction. Then we provide a demonstration of DISCO on a real-world fake news detection task with satisfactory detection accuracy and explanation. The demo video and source code of DISCO are now publicly available. We expect that our demo could pave the way for addressing the limitations of identification, comprehension, and explainability as a whole.
    Explainability's Gain is Optimality's Loss? -- How Explanations Bias Decision-making. (arXiv:2206.08705v1 [cs.HC])
    Decisions in organizations are about evaluating alternatives and choosing the one that would best serve organizational goals. To the extent that the evaluation of alternatives could be formulated as a predictive task with appropriate metrics, machine learning algorithms are increasingly being used to improve the efficiency of the process. Explanations help to facilitate communication between the algorithm and the human decision-maker, making it easier for the latter to interpret and make decisions on the basis of predictions by the former. Feature-based explanations' semantics of causal models, however, induce leakage from the decision-maker's prior beliefs. Our findings from a field experiment demonstrate empirically how this leads to confirmation bias and disparate impact on the decision-maker's confidence in the predictions. Such differences can lead to sub-optimal and biased decision outcomes.
    Mirror Descent with Relative Smoothness in Measure Spaces, with application to Sinkhorn and EM. (arXiv:2206.08873v1 [math.OC])
    Many problems in machine learning can be formulated as optimizing a convex functional over a space of measures. This paper studies the convergence of the mirror descent algorithm in this infinite-dimensional setting. Defining Bregman divergences through directional derivatives, we derive the convergence of the scheme for relatively smooth and strongly convex pairs of functionals. Applying our result to joint distributions and the Kullback--Leibler (KL) divergence, we show that Sinkhorn's primal iterations for entropic optimal transport in the continuous setting correspond to a mirror descent, and we obtain a new proof of its (sub)linear convergence. We also show that Expectation Maximization (EM) can always formally be written as a mirror descent, and, when optimizing on the latent distribution while fixing the mixtures, we derive sublinear rates of convergence.
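    The Sinkhorn iterations mentioned in this abstract can be illustrated in the finite, discrete case. The sketch below is the standard matrix-scaling form of Sinkhorn for entropic optimal transport, not the paper's continuous-setting formulation; the function name `sinkhorn`, the regularization `eps`, and the toy data are assumptions chosen for illustration.

```python
import numpy as np

def sinkhorn(cost, mu, nu, eps=0.5, n_iters=200):
    """Entropic OT between discrete marginals mu and nu (illustrative sketch)."""
    K = np.exp(-cost / eps)      # Gibbs kernel built from the cost matrix
    u = np.ones_like(mu)
    v = np.ones_like(nu)
    for _ in range(n_iters):
        u = mu / (K @ v)         # rescale rows to match the first marginal
        v = nu / (K.T @ u)       # rescale columns to match the second marginal
    return u[:, None] * K * v[None, :]   # transport plan diag(u) K diag(v)

# Toy problem: 5 source points and 7 target points on the unit interval.
rng = np.random.default_rng(0)
x = rng.uniform(size=(5, 1))
y = rng.uniform(size=(7, 1))
cost = (x - y.T) ** 2            # squared-distance cost, shape (5, 7)
mu = np.full(5, 1 / 5)
nu = np.full(7, 1 / 7)
plan = sinkhorn(cost, mu, nu)
```

    Each half-step enforces one marginal exactly; alternating between them is what the abstract reinterprets as a mirror descent scheme.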
    Lossy Compression with Gaussian Diffusion. (arXiv:2206.08889v1 [stat.ML])
    We describe a novel lossy compression approach called DiffC which is based on unconditional diffusion generative models. Unlike modern compression schemes which rely on transform coding and quantization to restrict the transmitted information, DiffC relies on the efficient communication of pixels corrupted by Gaussian noise. We implement a proof of concept and find that it works surprisingly well despite the lack of an encoder transform, outperforming the state-of-the-art generative compression method HiFiC on ImageNet 64x64. DiffC only uses a single model to encode and denoise corrupted pixels at arbitrary bitrates. The approach further provides support for progressive coding, that is, decoding from partial bit streams. We perform a rate-distortion analysis to gain a deeper understanding of its performance, providing analytical results for multivariate Gaussian data as well as initial results for general distributions. Furthermore, we show that a flow-based reconstruction achieves a 3 dB gain over ancestral sampling at high bitrates.
    GrASP: A Library for Extracting and Exploring Human-Interpretable Textual Patterns. (arXiv:2104.03958v2 [cs.CL] UPDATED)
    Data exploration is an important step of every data science and machine learning project, including those involving textual data. We provide a novel language tool, in the form of a publicly available Python library for extracting patterns from textual data. The library integrates a first public implementation of the existing GrASP algorithm. It allows users to extract patterns using a number of general-purpose built-in linguistic attributes (such as hypernyms, part-of-speech tags, and syntactic dependency tags), as envisaged for the original algorithm, as well as domain-specific custom attributes which can be incorporated into the library by implementing two functions. The library is equipped with a web-based interface empowering human users to conveniently explore data via the extracted patterns, using complementary pattern-centric and example-centric views: the former includes a reading in natural language and statistics of each extracted pattern; the latter shows applications of each extracted pattern to training examples. We demonstrate the usefulness of the library in classification (spam detection and argument mining), model analysis (machine translation), and artifact discovery in datasets (SNLI and 20Newsgroups).
    Decentralized adaptive clustering of deep nets is beneficial for client collaboration. (arXiv:2206.08839v1 [cs.LG])
    We study the problem of training personalized deep learning models in a decentralized peer-to-peer setting, focusing on the setting where data distributions differ between the clients and where different clients have different local learning tasks. We study both covariate and label shift, and our contribution is an algorithm which for each client finds beneficial collaborations based on a similarity estimate for the local task. Our method does not rely on hyperparameters which are hard to estimate, such as the number of client clusters, but rather continuously adapts to the network topology using soft cluster assignment based on a novel adaptive gossip algorithm. We test the proposed method in various settings where data is not independent and identically distributed among the clients. The experimental evaluation shows that the proposed method performs better than previous state-of-the-art algorithms for this problem setting, and handles situations well where previous methods fail.
    Toward Learning Human-aligned Cross-domain Robust Models by Countering Misaligned Features. (arXiv:2111.03740v2 [cs.LG] UPDATED)
    Machine learning has demonstrated remarkable prediction accuracy over i.i.d. data, but the accuracy often drops when tested with data from another distribution. In this paper, we aim to offer another view of this problem, assuming that the reason behind this accuracy drop is the reliance of models on features that are not aligned well with how a data annotator judges similarity across these two datasets. We refer to these features as misaligned features. We extend the conventional generalization error bound to a new one for this setup with the knowledge of how the misaligned features are associated with the label. Our analysis offers a set of techniques for this problem, and these techniques are naturally linked to many previous methods in robust machine learning literature. We also compare the empirical strength of these methods and demonstrate the performance when these previous techniques are combined, with an implementation available at https://github.com/OoDBag/WR
    Incorporating intratumoral heterogeneity into weakly-supervised deep learning models via variance pooling. (arXiv:2206.08885v1 [eess.IV])
    Supervised learning tasks such as cancer survival prediction from gigapixel whole slide images (WSIs) are a critical challenge in computational pathology that requires modeling complex features of the tumor microenvironment. These learning tasks are often solved with deep multi-instance learning (MIL) models that do not explicitly capture intratumoral heterogeneity. We develop a novel variance pooling architecture that enables a MIL model to incorporate intratumoral heterogeneity into its predictions. Two interpretability tools based on representative patches are illustrated to probe the biological signals captured by these models. An empirical study with 4,479 gigapixel WSIs from the Cancer Genome Atlas shows that adding variance pooling onto MIL frameworks improves survival prediction performance for five cancer types.
    Maximum Class Separation as Inductive Bias in One Matrix. (arXiv:2206.08704v1 [cs.LG])
    Maximizing the separation between classes constitutes a well-known inductive bias in machine learning and a pillar of many traditional algorithms. By default, deep networks are not equipped with this inductive bias and therefore many alternative solutions have been proposed through differential optimization. Current approaches tend to optimize classification and separation jointly: aligning inputs with class vectors and separating class vectors angularly. This paper proposes a simple alternative: encoding maximum separation as an inductive bias in the network by adding one fixed matrix multiplication before computing the softmax activations. The main observation behind our approach is that separation does not require optimization but can be solved in closed-form prior to training and plugged into a network. We outline a recursive approach to obtain the matrix consisting of maximally separable vectors for any number of classes, which can be added with negligible engineering effort and computational overhead. Despite its simple nature, this one matrix multiplication provides real impact. We show that our proposal directly boosts classification, long-tailed recognition, out-of-distribution detection, and open-set recognition, from CIFAR to ImageNet. We find empirically that maximum separation works best as a fixed bias; making the matrix learnable adds nothing to the performance. The closed-form implementation and code to reproduce the experiments are on github.
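    A closed-form, recursive construction of maximally separated class vectors can be sketched as follows. The abstract does not spell out the recursion, so the block uses a standard simplex recursion as an assumption: for C classes it returns C unit vectors in C-1 dimensions with pairwise cosine -1/(C-1). The function name `max_separation_matrix` is hypothetical.

```python
import numpy as np

def max_separation_matrix(num_classes):
    """Return a (num_classes - 1, num_classes) matrix of unit class vectors (columns).

    Assumed recursion: P_1 = [1, -1], and P_k stacks the row [1, -1/k, ..., -1/k]
    on top of [0 | sqrt(1 - 1/k^2) * P_{k-1}].
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    P = np.array([[1.0, -1.0]])                    # two classes on a line
    for k in range(2, num_classes):
        top = np.concatenate(([1.0], -np.ones(k) / k))
        bottom = np.hstack((np.zeros((k - 1, 1)), np.sqrt(1 - 1 / k**2) * P))
        P = np.vstack((top, bottom))               # shape (k, k + 1)
    return P

P = max_separation_matrix(10)
gram = P.T @ P                                     # pairwise cosines between class vectors
```

    In use, a network would output `num_classes - 1` features and multiply them by this fixed matrix (`logits = features @ P`) before the softmax; the matrix is computed once and never trained.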
    Beyond Ridge Regression for Distribution-Free Data. (arXiv:2206.08757v1 [cs.LG])
    In supervised batch learning, the predictive normalized maximum likelihood (pNML) has been proposed as the min-max regret solution for the distribution-free setting, where no distributional assumptions are made on the data. However, the pNML is not defined for a large capacity hypothesis class such as over-parameterized linear regression. For a large class, a common approach is to use regularization or a model prior. In the context of online prediction where the min-max solution is the Normalized Maximum Likelihood (NML), it has been suggested to use NML with "luckiness": A prior-like function is applied to the hypothesis class, which reduces its effective size. Motivated by the luckiness concept, for linear regression we incorporate a luckiness function that penalizes the hypothesis proportionally to its l2 norm. This leads to the ridge regression solution. The associated pNML with luckiness (LpNML) prediction deviates from the ridge regression empirical risk minimizer (Ridge ERM): When the test data reside in the subspace corresponding to the small eigenvalues of the empirical correlation matrix of the training data, the prediction is shifted toward 0. Our LpNML reduces the Ridge ERM error by up to 20% for the PMLB sets, and is up to 4.9% more robust in the presence of distribution shift compared to recent leading methods for UCI sets.
    AutoML Two-Sample Test. (arXiv:2206.08843v1 [cs.LG])
    Two-sample tests are important in statistics and machine learning, both as tools for scientific discovery and as a means to detect distribution shifts. This led to the development of many sophisticated test procedures going beyond the standard supervised learning frameworks, whose usage can require specialized knowledge about two-sample testing. We use a simple test that takes the mean discrepancy of a witness function as the test statistic and prove that minimizing a squared loss leads to a witness with optimal testing power. This allows us to leverage recent advancements in AutoML. Without any user input about the problems at hand, and using the same method for all our experiments, our AutoML two-sample test achieves competitive performance on a diverse distribution shift benchmark as well as on challenging two-sample testing problems. We provide an implementation of the AutoML two-sample test in the Python package autotst.
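    The witness-based test described above can be sketched with plain NumPy. This is not the autotst API: a linear least-squares fit stands in for the AutoML regressor, the 50/50 train/test split and the permutation p-value are assumed choices, and the name `witness_two_sample_test` is hypothetical.

```python
import numpy as np

def witness_two_sample_test(x, y, n_perm=500, seed=0):
    """Fit a witness by squared loss on one half, test mean discrepancy on the other."""
    rng = np.random.default_rng(seed)
    x_tr, x_te = x[: len(x) // 2], x[len(x) // 2 :]
    y_tr, y_te = y[: len(y) // 2], y[len(y) // 2 :]

    def feats(z):
        return np.hstack((np.ones((len(z), 1)), z))   # add an intercept column

    # Squared-loss regression of labels (1 for x, 0 for y) gives the witness.
    A = feats(np.vstack((x_tr, y_tr)))
    labels = np.concatenate((np.ones(len(x_tr)), np.zeros(len(y_tr))))
    w, *_ = np.linalg.lstsq(A, labels, rcond=None)

    # Test statistic: difference of mean witness values on held-out data.
    scores = feats(np.vstack((x_te, y_te))) @ w
    n = len(x_te)
    stat = scores[:n].mean() - scores[n:].mean()

    # Permutation null: shuffle which held-out scores belong to which sample.
    perm_stats = np.empty(n_perm)
    for i in range(n_perm):
        s = rng.permutation(scores)
        perm_stats[i] = s[:n].mean() - s[n:].mean()
    p_value = (1 + np.sum(perm_stats >= stat)) / (1 + n_perm)
    return stat, p_value

data_rng = np.random.default_rng(1)
x = data_rng.normal(0.0, 1.0, size=(200, 2))
y = data_rng.normal(2.0, 1.0, size=(200, 2))   # clearly shifted distribution
stat, p_value = witness_two_sample_test(x, y)
```

    Fitting the witness on one half and evaluating it on the other keeps the permutation test valid, since the witness is fixed with respect to the held-out data.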
    BITS Pilani at HinglishEval: Quality Evaluation for Code-Mixed Hinglish Text Using Transformers. (arXiv:2206.08680v1 [cs.CL])
    Code-Mixed text data consists of sentences having words or phrases from more than one language. Most multi-lingual communities worldwide communicate using multiple languages, with English usually one of them. Hinglish is a Code-Mixed text composed of Hindi and English but written in Roman script. This paper aims to determine the factors influencing the quality of Code-Mixed text data generated by the system. For the HinglishEval task, the proposed model uses multi-lingual BERT to find the similarity between synthetically generated and human-generated sentences to predict the quality of synthetically generated Hinglish sentences.
    Nudge: Accelerating Overdue Pull Requests Towards Completion. (arXiv:2011.12468v5 [cs.SE] UPDATED)
    Pull requests are a key part of the collaborative software development and code review process today. However, pull requests can also slow down the software development process when the reviewer(s) or the author do not actively engage with the pull request. In this work, we design an end-to-end service, Nudge, for accelerating overdue pull requests towards completion by reminding the author or the reviewer(s) to engage with their overdue pull requests. First, we use models based on effort estimation and machine learning to predict the completion time for a given pull request. Second, we use activity detection to filter out pull requests that may be overdue, but for which sufficient action is taking place nonetheless. Lastly, we use actor identification to understand who the blocker of the pull request is and nudge the appropriate actor (author or reviewer(s)). The key novelty of Nudge is that it succeeds in reducing pull request resolution time, while ensuring that developers perceive the notifications sent as useful, at the scale of thousands of repositories. In a randomized trial on 147 repositories in use at Microsoft, Nudge was able to reduce pull request resolution time by 60% for 8,500 pull requests, when compared to overdue pull requests for which Nudge did not send a notification. Furthermore, developers receiving Nudge notifications resolved 73% of these notifications as positive. We observed similar results when scaling up the deployment of Nudge to 8,000 repositories at Microsoft, for which Nudge sent 210,000 notifications during a full year. This demonstrates Nudge's ability to scale to thousands of repositories. Lastly, our qualitative analysis of a selection of Nudge notifications indicates areas for future research, such as taking dependencies among pull requests and developer availability into account.
    The Role of Depth, Width, and Activation Complexity in the Number of Linear Regions of Neural Networks. (arXiv:2206.08615v1 [cs.LG])
    Many feedforward neural networks generate continuous and piecewise-linear (CPWL) mappings. Specifically, they partition the input domain into regions on which the mapping is an affine function. The number of these so-called linear regions offers a natural metric to characterize the expressiveness of CPWL mappings. Although the precise determination of this quantity is often out of reach, bounds have been proposed for specific architectures, including the well-known ReLU and Maxout networks. In this work, we propose a more general perspective and provide precise bounds on the maximal number of linear regions of CPWL networks based on three sources of expressiveness: depth, width, and activation complexity. Our estimates rely on the combinatorial structure of convex partitions and highlight the distinctive role of depth which, on its own, is able to exponentially increase the number of regions. We then introduce a complementary stochastic framework to estimate the average number of linear regions produced by a CPWL network architecture. Under reasonable assumptions, the expected density of linear regions along any 1D path is bounded by the product of depth, width, and a measure of activation complexity (up to a scaling factor). This yields an identical role to the three sources of expressiveness: no exponential growth with depth is observed anymore.
    Understanding Decision-Time vs. Background Planning in Model-Based Reinforcement Learning. (arXiv:2206.08442v1 [cs.LG])
    In model-based reinforcement learning, an agent can leverage a learned model to improve its way of behaving in different ways. Two prevalent approaches are decision-time planning and background planning. In this study, we are interested in understanding under what conditions and in which settings one of these two planning styles will perform better than the other in domains that require fast responses. After viewing them through the lens of dynamic programming, we first consider the classical instantiations of these planning styles and provide theoretical results and hypotheses on which one will perform better in the pure planning, planning & learning, and transfer learning settings. We then consider the modern instantiations of these planning styles and provide hypotheses on which one will perform better in the last two of the considered settings. Lastly, we perform several illustrative experiments to empirically validate both our theoretical results and hypotheses. Overall, our findings suggest that even though decision-time planning does not perform as well as background planning in their classical instantiations, in their modern instantiations, it can perform on par or better than background planning in both the planning & learning and transfer learning settings.
    A Deep Learning Approach for the Segmentation of Electroencephalography Data in Eye Tracking Applications. (arXiv:2206.08672v1 [cs.LG])
    The collection of eye gaze information provides a window into many critical aspects of human cognition, health and behaviour. Additionally, many neuroscientific studies complement the behavioural information gained from eye tracking with the high temporal resolution and neurophysiological markers provided by electroencephalography (EEG). One of the essential eye-tracking software processing steps is the segmentation of the continuous data stream into events relevant to eye-tracking applications, such as saccades, fixations, and blinks. Here, we introduce DETRtime, a novel framework for time-series segmentation that creates ocular event detectors that do not require an additionally recorded eye-tracking modality and rely solely on EEG data. Our end-to-end deep learning-based framework brings recent advances in Computer Vision to the forefront of the time series segmentation of EEG data. DETRtime achieves state-of-the-art performance in ocular event detection across diverse eye-tracking experiment paradigms. In addition, we provide evidence that our model generalizes well in the task of EEG sleep stage segmentation.
    ReViSe: Remote Vital Signs Measurement Using Smartphone Camera. (arXiv:2206.08748v1 [cs.CV])
    Remote Photoplethysmography (rPPG) is a fast, effective, inexpensive and convenient method for collecting biometric data as it enables vital signs estimation using face videos. Remote contactless medical service provisioning has proven to be a dire necessity during the COVID-19 pandemic. We propose an end-to-end framework to measure people's vital signs including Heart Rate (HR), Heart Rate Variability (HRV), Oxygen Saturation (SpO2) and Blood Pressure (BP) based on the rPPG methodology from the video of a user's face captured with a smartphone camera. We extract face landmarks with a deep learning-based neural network model in real-time. Multiple face patches, also called Regions of Interest (RoIs), are extracted by using the predicted face landmarks. Several filters are applied to reduce the noise from the RoIs in the extracted cardiac signals, called Blood Volume Pulse (BVP) signals. We trained and validated machine learning models using two public rPPG datasets, namely the TokyoTech rPPG and the Pulse Rate Detection (PURE) datasets, on which our models achieved the following Mean Absolute Errors (MAE): a) for HR, 1.73 and 3.95 Beats-Per-Minute (bpm) respectively, b) for HRV, 18.55 and 25.03 ms respectively, and c) for SpO2, a MAE of 1.64 on the PURE dataset. We validated our end-to-end rPPG framework, ReViSe, in a real-life environment, and thereby created the Video-HR dataset. Our HR estimation model achieved a MAE of 2.49 bpm on this dataset. Since no publicly available rPPG datasets existed for BP measurement with face videos, we used a dataset with signals from a fingertip sensor to train our model and also created our own video dataset, Video-BP. On our Video-BP dataset, our BP estimation model achieved a MAE of 6.7 mmHg for Systolic Blood Pressure (SBP), and a MAE of 9.6 mmHg for Diastolic Blood Pressure (DBP).
    Active Sampling for Min-Max Fairness. (arXiv:2006.06879v3 [stat.ML] UPDATED)
    We propose simple active sampling and reweighting strategies for optimizing min-max fairness that can be applied to any classification or regression model learned via loss minimization. The key intuition behind our approach is to use at each timestep a datapoint from the group that is worst off under the current model for updating the model. The ease of implementation and the generality of our robust formulation make it an attractive option for improving model performance on disadvantaged groups. For convex learning problems, such as linear or logistic regression, we provide a fine-grained analysis, proving the rate of convergence to a min-max fair solution.
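    The sampling rule in this abstract (update on a datapoint drawn from the currently worst-off group) can be sketched for a linear model with squared loss. The function name `minmax_active_sgd`, the step size, and the choice to measure "worst off" by each group's full empirical loss are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def minmax_active_sgd(groups, dim, n_steps=2000, lr=0.05, seed=0):
    """groups: list of (X, y) arrays, one pair per group. Returns weights and group losses."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)

    def group_loss(X, y):
        return 0.5 * np.mean((X @ w - y) ** 2)

    for _ in range(n_steps):
        # Find the group with the highest loss under the current model.
        losses = [group_loss(X, y) for X, y in groups]
        X, y = groups[int(np.argmax(losses))]
        # Take one SGD step on a random datapoint from that group.
        i = rng.integers(len(y))
        w -= lr * (X[i] @ w - y[i]) * X[i]
    return w, [group_loss(X, y) for X, y in groups]

# Toy data: a large low-noise group and a small high-noise group.
data_rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])
X_a = data_rng.normal(size=(200, 2))
y_a = X_a @ true_w + 0.1 * data_rng.normal(size=200)
X_b = data_rng.normal(size=(50, 2))
y_b = X_b @ true_w + 0.3 * data_rng.normal(size=50)
w, final_losses = minmax_active_sgd([(X_a, y_a), (X_b, y_b)], dim=2)
```

    Because every step targets the group with the largest current loss, the procedure pushes down the maximum group loss rather than the average loss.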
    Thompson Sampling Achieves $\tilde O(\sqrt{T})$ Regret in Linear Quadratic Control. (arXiv:2206.08520v1 [cs.LG])
    Thompson Sampling (TS) is an efficient method for decision-making under uncertainty, where an action is sampled from a carefully prescribed distribution which is updated based on the observed data. In this work, we study the problem of adaptive control of stabilizable linear-quadratic regulators (LQRs) using TS, where the system dynamics are unknown. Previous works have established that $\tilde O(\sqrt{T})$ frequentist regret is optimal for the adaptive control of LQRs. However, the existing methods either work only in restrictive settings, require a priori known stabilizing controllers, or utilize computationally intractable approaches. We propose an efficient TS algorithm for the adaptive control of LQRs, TS-based Adaptive Control, TSAC, that attains $\tilde O(\sqrt{T})$ regret, even for multidimensional systems, thereby solving the open problem posed in Abeille and Lazaric (2018). TSAC does not require a priori known stabilizing controller and achieves fast stabilization of the underlying system by effectively exploring the environment in the early stages. Our result hinges on developing a novel lower bound on the probability that the TS provides an optimistic sample. By carefully prescribing an early exploration strategy and a policy update rule, we show that TS achieves order-optimal regret in adaptive control of multidimensional stabilizable LQRs. We empirically demonstrate the performance and the efficiency of TSAC in several adaptive control tasks.
    Understanding Robust Overfitting of Adversarial Training and Beyond. (arXiv:2206.08675v1 [cs.LG])
    Robust overfitting widely exists in adversarial training of deep networks. The exact underlying reasons for this are still not completely understood. Here, we explore the causes of robust overfitting by comparing the data distribution of non-overfit (weak adversary) and overfitted (strong adversary) adversarial training, and observe that the distribution of the adversarial data generated by the weak adversary mainly contains small-loss data. However, the adversarial data generated by the strong adversary is more diversely distributed on the large-loss data and the small-loss data. Given these observations, we further design data ablation adversarial training and identify that some small-loss data which are not worthy of the adversary strength cause robust overfitting in the strong adversary mode. To relieve this issue, we propose minimum loss constrained adversarial training (MLCAT): in a minibatch, we learn large-loss data as usual, and adopt additional measures to increase the loss of the small-loss data. Technically, MLCAT hinders the fitting of data once they become easy to learn, in order to prevent robust overfitting; philosophically, MLCAT reflects the spirit of turning waste into treasure and making the best use of each adversarial data; algorithmically, we design two realizations of MLCAT, and extensive experiments demonstrate that MLCAT can eliminate robust overfitting and further boost adversarial robustness.
    Statistical and Neural Methods for Cross-lingual Entity Label Mapping in Knowledge Graphs. (arXiv:2206.08709v1 [cs.CL])
    Knowledge bases such as Wikidata amass vast amounts of named entity information, such as multilingual labels, which can be extremely useful for various multilingual and cross-lingual applications. However, such labels are not guaranteed to match across languages from an information consistency standpoint, greatly compromising their usefulness for fields such as machine translation. In this work, we investigate the application of word and sentence alignment techniques coupled with a matching algorithm to align cross-lingual entity labels extracted from Wikidata in 10 languages. Our results indicate that mapping between Wikidata's main labels stands to be considerably improved (up to 20 points in F1-score) by any of the employed methods. We show how methods relying on sentence embeddings outperform all others, even across different scripts. We believe the application of such techniques to measure the similarity of label pairs, coupled with a knowledge base rich in high-quality entity labels, to be an excellent asset to machine translation.
    Improving Generalization of Metric Learning via Listwise Self-distillation. (arXiv:2206.08880v1 [cs.CV])
    Most deep metric learning (DML) methods employ a strategy that forces all positive samples to be close in the embedding space while keeping them away from negative ones. However, such a strategy ignores the internal relationships of positive (negative) samples and often leads to overfitting, especially in the presence of hard samples and mislabeled samples. In this work, we propose a simple yet effective regularization, namely Listwise Self-Distillation (LSD), which progressively distills a model's own knowledge to adaptively assign a more appropriate distance target to each sample pair in a batch. LSD encourages smoother embeddings and information mining within positive (negative) samples as a way to mitigate overfitting and thus improve generalization. Our LSD can be directly integrated into general DML frameworks. Extensive experiments show that LSD consistently boosts the performance of various metric learning methods on multiple datasets.
    Communication-Efficient Adaptive Federated Learning. (arXiv:2205.02719v2 [cs.LG] UPDATED)
    Federated learning is a machine learning training paradigm that enables clients to jointly train models without sharing their own localized data. However, the implementation of federated learning in practice still faces numerous challenges, such as the large communication overhead due to the repetitive server-client synchronization and the lack of adaptivity by SGD-based model updates. Although various methods have been proposed for reducing the communication cost by gradient compression or quantization, and federated versions of adaptive optimizers such as FedAdam have been proposed to add more adaptivity, the current federated learning framework still cannot solve the aforementioned challenges all at once. In this paper, we propose a novel communication-efficient adaptive federated learning method (FedCAMS) with theoretical convergence guarantees. We show that in the nonconvex stochastic optimization setting, our proposed FedCAMS achieves the same convergence rate of $O(\frac{1}{\sqrt{TKm}})$ as its non-compressed counterparts. Extensive experiments on various benchmarks verify our theoretical analysis.  ( 2 min )
    XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence. (arXiv:2206.08474v1 [cs.SE])
    Recent advances in machine learning have significantly improved the understanding of source code data and achieved good performance on a number of downstream tasks. Open source repositories like GitHub enable this process with rich unlabeled code data. However, the lack of high quality labeled data has largely hindered the progress of several code related tasks, such as program translation, summarization, synthesis, and code search. This paper introduces XLCoST, Cross-Lingual Code SnippeT dataset, a new benchmark dataset for cross-lingual code intelligence. Our dataset contains fine-grained parallel data from 8 languages (7 commonly used programming languages and English), and supports 10 cross-lingual code tasks. To the best of our knowledge, it is the largest parallel dataset for source code both in terms of size and the number of languages. We also provide the performance of several state-of-the-art baseline models for each task. We believe this new dataset can be a valuable asset for the research community and facilitate the development and validation of new methods for cross-lingual code intelligence.  ( 2 min )
    Local overlap reduction procedure for dynamic ensemble selection. (arXiv:2206.08455v1 [cs.LG])
    Class imbalance is a characteristic known for making learning more challenging for classification models as they may end up biased towards the majority class. A promising approach among the ensemble-based methods in the context of imbalance learning is Dynamic Selection (DS). DS techniques single out a subset of the classifiers in the ensemble to label each given unknown sample according to their estimated competence in the area surrounding the query. Because only a small region is taken into account in the selection scheme, the global class disproportion may have less impact over the system's performance. However, the presence of local class overlap may severely hinder the DS techniques' performance over imbalanced distributions as it not only exacerbates the effects of the under-representation but also introduces ambiguous and possibly unreliable samples to the competence estimation process. Thus, in this work, we propose a DS technique which attempts to minimize the effects of the local class overlap during the classifier selection procedure. The proposed method iteratively removes from the target region the instance perceived as the hardest to classify until a classifier is deemed competent to label the query sample. The known samples are characterized using instance hardness measures that quantify the local class overlap. Experimental results show that the proposed technique can significantly outperform the baseline as well as several other DS techniques, suggesting its suitability for dealing with class under-representation and overlap. Furthermore, the proposed technique still yielded competitive results when using an under-sampled, less overlapped version of the labelled sets, especially over the problems with a high proportion of minority class samples in overlap areas. Code available at https://github.com/marianaasouza/lords.  ( 3 min )
    Debugging using Orthogonal Gradient Descent. (arXiv:2206.08489v1 [cs.LG])
    In this report we consider the following problem: Given a trained model that is partially faulty, can we correct its behaviour without having to train the model from scratch? In other words, can we "debug" neural networks similar to how we address bugs in our mathematical models and standard computer code? We base our approach on the hypothesis that debugging can be treated as a two-task continual learning problem. In particular, we employ a modified version of a continual learning algorithm called Orthogonal Gradient Descent (OGD) to demonstrate, via two simple experiments on the MNIST dataset, that we can in fact unlearn the undesirable behaviour while retaining the general performance of the model, and we can additionally relearn the appropriate behaviour, both without having to train the model from scratch.  ( 2 min )
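    The core step of Orthogonal Gradient Descent is projecting each new gradient onto the orthogonal complement of directions stored for the behaviour to be preserved. The sketch below shows only that standard projection, on plain vectors; the report's modifications are not described in the abstract and are not reproduced here. The helper names are hypothetical.

```python
import numpy as np

def orthonormal_basis(stored_grads, tol=1e-10):
    """Gram-Schmidt over stored gradient directions (one flat vector each)."""
    basis = []
    for g in stored_grads:
        r = np.asarray(g, dtype=float).copy()
        for b in basis:
            r -= (r @ b) * b            # remove components already covered
        norm = np.linalg.norm(r)
        if norm > tol:                  # skip directions that add nothing new
            basis.append(r / norm)
    return basis

def project_orthogonal(grad, basis):
    """Remove from grad every component lying in the span of the stored directions."""
    g = np.asarray(grad, dtype=float).copy()
    for b in basis:
        g -= (g @ b) * b
    return g

stored = [np.array([1, 0, 0]), np.array([1, 1, 0])]
basis = orthonormal_basis(stored)
update = project_orthogonal(np.array([1, 2, 3]), basis)   # -> [0., 0., 3.]
```

    A parameter step along `update` leaves the model unchanged, to first order, along the stored directions, which is how the retained behaviour is protected while the faulty behaviour is corrected.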
    SATBench: Benchmarking the speed-accuracy tradeoff in object recognition by humans and dynamic neural networks. (arXiv:2206.08427v1 [cs.CV])
    The core of everyday tasks like reading and driving is active object recognition. Attempts to model such tasks are currently stymied by the inability to incorporate time. People show a flexible tradeoff between speed and accuracy and this tradeoff is a crucial human skill. Deep neural networks have emerged as promising candidates for predicting peak human object recognition performance and neural activity. However, modeling the temporal dimension, i.e., the speed-accuracy tradeoff (SAT), is essential for them to serve as useful computational models for how humans recognize objects. To this end, we here present the first large-scale (148 observers, 4 neural networks, 8 tasks) dataset of the speed-accuracy tradeoff (SAT) in recognizing ImageNet images. In each human trial, a beep, indicating the desired reaction time, sounds at a fixed delay after the image is presented, and the observer's response counts only if it occurs near the time of the beep. In a series of blocks, we test many beep latencies, i.e., reaction times. We observe that human accuracy increases with reaction time and proceed to compare its characteristics with the behavior of several dynamic neural networks that are capable of inference-time adaptive computation. Using FLOPs as an analog for reaction time, we compare networks with humans on curve-fit error, category-wise correlation, and curve steepness, and conclude that cascaded dynamic neural networks are a promising model of human reaction time in object recognition tasks.  ( 3 min )
    Quantifying Feature Contributions to Overall Disparity Using Information Theory. (arXiv:2206.08454v1 [cs.LG])
    When a machine-learning algorithm makes biased decisions, it can be helpful to understand the sources of disparity to explain why the bias exists. Towards this, we examine the problem of quantifying the contribution of each individual feature to the observed disparity. If we have access to the decision-making model, one potential approach (inspired from intervention-based approaches in explainability literature) is to vary each individual feature (while keeping the others fixed) and use the resulting change in disparity to quantify its contribution. However, we may not have access to the model or be able to test/audit its outputs for individually varying features. Furthermore, the decision may not always be a deterministic function of the input features (e.g., with human-in-the-loop). For these situations, we might need to explain contributions using purely distributional (i.e., observational) techniques, rather than interventional. We ask the question: what is the "potential" contribution of each individual feature to the observed disparity in the decisions when the exact decision-making mechanism is not accessible? We first provide canonical examples (thought experiments) that help illustrate the difference between distributional and interventional approaches to explaining contributions, and when either is better suited. When unable to intervene on the inputs, we quantify the "redundant" statistical dependency about the protected attribute that is present in both the final decision and an individual feature, by leveraging a body of work in information theory called Partial Information Decomposition. We also perform a simple case study to show how this technique could be applied to quantify contributions.  ( 3 min )
    Towards a multi-stakeholder value-based assessment framework for algorithmic systems. (arXiv:2205.04525v2 [cs.LG] UPDATED)
    In an effort to regulate Machine Learning-driven (ML) systems, current auditing processes mostly focus on detecting harmful algorithmic biases. While these strategies have proven to be impactful, some values outlined in documents dealing with ethics in ML-driven systems are still underrepresented in auditing processes. Such unaddressed values mainly deal with contextual factors that cannot be easily quantified. In this paper, we develop a value-based assessment framework that is not limited to bias auditing and that covers prominent ethical principles for algorithmic systems. Our framework presents a circular arrangement of values with two bipolar dimensions that make common motivations and potential tensions explicit. In order to operationalize these high-level principles, values are then broken down into specific criteria and their manifestations. However, some of these value-specific criteria are mutually exclusive and require negotiation. As opposed to some other auditing frameworks that merely rely on ML researchers' and practitioners' input, we argue that it is necessary to include stakeholders that present diverse standpoints to systematically negotiate and consolidate value and criteria tensions. To that end, we map stakeholders with different insight needs, and assign tailored means for communicating value manifestations to them. We, therefore, contribute to current ML auditing practices with an assessment framework that visualizes closeness and tensions between values and we give guidelines on how to operationalize them, while opening up the evaluation and deliberation process to a wide range of stakeholders.  ( 3 min )
    The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention. (arXiv:2202.05798v2 [cs.LG] UPDATED)
    Linear layers in neural networks (NNs) trained by gradient descent can be expressed as a key-value memory system which stores all training datapoints and the initial weights, and produces outputs using unnormalised dot attention over the entire training experience. While this has been technically known since the 1960s, no prior work has effectively studied the operations of NNs in such a form, presumably due to prohibitive time and space complexities and impractical model sizes, all of them growing linearly with the number of training patterns which may get very large. However, this dual formulation offers a possibility of directly visualising how an NN makes use of training patterns at test time, by examining the corresponding attention weights. We conduct experiments on small scale supervised image classification tasks in single-task, multi-task, and continual learning settings, as well as language modelling, and discuss potentials and limits of this view for better understanding and interpreting how NNs exploit training patterns. Our code is public.  ( 2 min )
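The equivalence between the usual (primal) linear layer and its key-value (dual) form can be checked numerically. The sketch below assumes plain SGD with a squared loss on a single linear layer; the function names, dimensions, and learning rate are illustrative choices rather than the paper's setup.

```python
import random

def matvec(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def train_linear_layer(W0, data, lr=0.1):
    """One SGD pass; returns final weights and the (key, value) memory."""
    W = [row[:] for row in W0]
    memory = []
    for x, y in data:
        pred = matvec(W, x)
        # Value: negative scaled gradient of 0.5 * ||pred - y||^2 w.r.t. the output.
        e = [-lr * (p - t) for p, t in zip(pred, y)]
        for i in range(len(W)):
            for j in range(len(x)):
                W[i][j] += e[i] * x[j]  # rank-one update e x^T
        memory.append((x, e))  # key = training input, value = error signal
    return W, memory

def dual_forward(W0, memory, query):
    """Initial weights plus unnormalised dot attention over stored training patterns."""
    out = matvec(W0, query)
    for key, value in memory:
        score = dot(key, query)
        out = [o + score * v for o, v in zip(out, value)]
    return out

rng = random.Random(0)
d_in, d_out = 3, 2
W0 = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
data = [([rng.uniform(-1, 1) for _ in range(d_in)],
         [rng.uniform(-1, 1) for _ in range(d_out)]) for _ in range(5)]

W, memory = train_linear_layer(W0, data)
query = [0.5, -0.2, 0.1]
primal = matvec(W, query)
dual = dual_forward(W0, memory, query)
```

The scores computed in `dual_forward` are the attention weights one would inspect to see which training patterns contribute to a test-time prediction; note that the memory grows linearly with the number of training patterns, as the abstract points out.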
    Local Augmentation for Graph Neural Networks. (arXiv:2109.03856v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have achieved remarkable performance on graph-based tasks. The key idea for GNNs is to obtain informative representation through aggregating information from local neighborhoods. However, it remains an open question whether the neighborhood information is adequately aggregated for learning representations of nodes with few neighbors. To address this, we propose a simple and efficient data augmentation strategy, local augmentation, to learn the distribution of the node representations of the neighbors conditioned on the central node's representation and enhance GNN's expressive power with generated features. Local augmentation is a general framework that can be applied to any GNN model in a plug-and-play manner. It samples feature vectors associated with each node from the learned conditional distribution as additional input for the backbone model at each training iteration. Extensive experiments and analyses show that local augmentation consistently yields performance improvement when applied to various GNN architectures across a diverse set of benchmarks. For example, experiments show that plugging local augmentation into GCN and GAT improves test accuracy by an average of 3.4% and 1.6%, respectively, on Cora, Citeseer, and Pubmed. Besides, our experimental results on large graphs (OGB) show that our model consistently improves performance over backbones. Code is available at https://github.com/SongtaoLiu0823/LAGNN.
    Human Interpretation of Saliency-based Explanation Over Text. (arXiv:2201.11569v2 [cs.CL] UPDATED)
    While a lot of research in explainable AI focuses on producing effective explanations, less work is devoted to the question of how people understand and interpret the explanation. In this work, we focus on this question through a study of saliency-based explanations over textual data. Feature-attribution explanations of text models aim to communicate which parts of the input text were more influential than others towards the model decision. Many current explanation methods, such as gradient-based or Shapley value-based methods, provide measures of importance which are well-understood mathematically. But how does a person receiving the explanation (the explainee) comprehend it? And does their understanding match what the explanation attempted to communicate? We empirically investigate the effect of various factors of the input, the feature-attribution explanation, and visualization procedure, on laypeople's interpretation of the explanation. We query crowdworkers for their interpretation on tasks in English and German, and fit a GAMM model to their responses considering the factors of interest. We find that people often mis-interpret the explanations: superficial and unrelated factors, such as word length, influence the explainees' importance assignment despite the explanation communicating importance directly. We then show that some of this distortion can be attenuated: we propose a method to adjust saliencies based on model estimates of over- and under-perception, and explore bar charts as an alternative to heatmap saliency visualization. We find that both approaches can attenuate the distorting effect of specific factors, leading to better-calibrated understanding of the explanation.
    Large-Margin Representation Learning for Texture Classification. (arXiv:2206.08537v1 [cs.CV])
    This paper presents a novel approach combining convolutional layers (CLs) and large-margin metric learning for training supervised models on small datasets for texture classification. The core of such an approach is a loss function that computes the distances between instances of interest and support vectors. The objective is to update the weights of CLs iteratively to learn a representation with a large margin between classes. Each iteration results in a large-margin discriminant model represented by support vectors based on such a representation. The advantage of the proposed approach w.r.t. convolutional neural networks (CNNs) is two-fold. First, it allows representation learning with a small amount of data due to the reduced number of parameters compared to an equivalent CNN. Second, it has a low training cost since the backpropagation considers only support vectors. The experimental results on texture and histopathologic image datasets have shown that the proposed approach achieves competitive accuracy with lower computational cost and faster convergence when compared to equivalent CNNs.  ( 2 min )
    A Robust Stacking Framework for Training Deep Graph Models with Multifaceted Node Features. (arXiv:2206.08473v1 [cs.LG])
    Graph Neural Networks (GNNs) with numerical node features and graph structure as inputs have demonstrated superior performance on various supervised learning tasks with graph data. However the numerical node features utilized by GNNs are commonly extracted from raw data which is of text or tabular (numeric/categorical) type in most real-world applications. The best models for such data types in most standard supervised learning settings with IID (non-graph) data are not simple neural network layers and thus are not easily incorporated into a GNN. Here we propose a robust stacking framework that fuses graph-aware propagation with arbitrary models intended for IID data, which are ensembled and stacked in multiple layers. Our layer-wise framework leverages bagging and stacking strategies to enjoy strong generalization, in a manner which effectively mitigates label leakage and overfitting. Across a variety of graph datasets with tabular/text node features, our method achieves comparable or superior performance relative to both tabular/text and graph neural network models, as well as existing state-of-the-art hybrid strategies that combine the two.  ( 2 min )
    Geometrically Guided Integrated Gradients. (arXiv:2206.05903v2 [cs.CV] UPDATED)
    Interpretability methods for deep neural networks mainly focus on the sensitivity of the class score with respect to the original or perturbed input, usually measured using actual or modified gradients. Some methods also use a model-agnostic approach to understanding the rationale behind every prediction. In this paper, we argue and demonstrate that local geometry of the model parameter space relative to the input can also be beneficial for improved post-hoc explanations. To achieve this goal, we introduce an interpretability method called "geometrically-guided integrated gradients" that builds on top of the gradient calculation along a linear path as traditionally used in integrated gradient methods. However, instead of integrating gradient information, our method explores the model's dynamic behavior from multiple scaled versions of the input and captures the best possible attribution for each input. We demonstrate through extensive experiments that the proposed approach outperforms vanilla and integrated gradients in subjective and quantitative assessment. We also propose a "model perturbation" sanity check to complement the traditionally used "model randomization" test.  ( 2 min )
    Sanity Simulations for Saliency Methods. (arXiv:2105.06506v3 [cs.LG] UPDATED)
    Saliency methods are a popular class of feature attribution explanation methods that aim to capture a model's predictive reasoning by identifying "important" pixels in an input image. However, the development and adoption of these methods are hindered by the lack of access to ground-truth model reasoning, which prevents accurate evaluation. In this work, we design a synthetic benchmarking framework, SMERF, that allows us to perform ground-truth-based evaluation while controlling the complexity of the model's reasoning. Experimentally, SMERF reveals significant limitations in existing saliency methods and, as a result, represents a useful tool for the development of new saliency methods.
    OpenSRH: optimizing brain tumor surgery using intraoperative stimulated Raman histology. (arXiv:2206.08439v1 [eess.IV])
    Accurate intraoperative diagnosis is essential for providing safe and effective care during brain tumor surgery. Our standard-of-care diagnostic methods are time, resource, and labor intensive, which restricts access to optimal surgical treatments. To address these limitations, we propose an alternative workflow that combines stimulated Raman histology (SRH), a rapid optical imaging method, with deep learning-based automated interpretation of SRH images for intraoperative brain tumor diagnosis and real-time surgical decision support. Here, we present OpenSRH, the first public dataset of clinical SRH images from 300+ brain tumor patients and 1300+ unique whole slide optical images. OpenSRH contains data from the most common brain tumor diagnoses, full pathologic annotations, whole slide tumor segmentations, and raw and processed optical imaging data for end-to-end model development and validation. We provide a framework for patch-based whole slide SRH classification and inference using weak (i.e. patient-level) diagnostic labels. Finally, we benchmark two computer vision tasks: multiclass histologic brain tumor classification and patch-based contrastive representation learning. We hope OpenSRH will facilitate the clinical translation of rapid optical imaging and real-time ML-based surgical decision support in order to improve the access, safety, and efficacy of cancer surgery in the era of precision medicine. Dataset access, code, and benchmarks are available at opensrh.mlins.org.  ( 2 min )
    Empirical Bayesian Approaches for Robust Constraint-based Causal Discovery under Insufficient Data. (arXiv:2206.08448v1 [cs.LG])
    Causal discovery aims to learn cause-effect relationships among variables from observational data and is important for many applications. Existing causal discovery methods assume data sufficiency, which may not be the case in many real world datasets. As a result, many existing causal discovery methods can fail under limited data. In this work, we propose Bayesian-augmented frequentist independence tests to improve the performance of constraint-based causal discovery methods under insufficient data: 1) we first introduce a Bayesian method to estimate mutual information (MI), based on which we propose a robust MI-based independence test; 2) we consider the Bayesian estimation of hypothesis likelihood and incorporate it into a well-defined statistical test, resulting in a robust statistical-testing-based independence test. We apply the proposed independence tests to constraint-based causal discovery methods and evaluate the performance on benchmark datasets with insufficient samples. Experiments show significant performance improvement in terms of both accuracy and efficiency over SOTA methods.  ( 2 min )
    Learning over All Stabilizing Nonlinear Controllers for a Partially-Observed Linear System. (arXiv:2112.04219v3 [eess.SY] UPDATED)
    This paper proposes a nonlinear policy architecture for control of partially-observed linear dynamical systems providing built-in closed-loop stability guarantees. The policy is based on a nonlinear version of the Youla parameterization, and augments a known stabilizing linear controller with a nonlinear operator from a recently developed class of dynamic neural network models called the recurrent equilibrium network (REN). We prove that RENs are universal approximators of contracting and Lipschitz nonlinear systems, and subsequently show that the proposed Youla-REN architecture is a universal approximator of stabilizing nonlinear controllers. The REN architecture simplifies learning since unconstrained optimization can be applied, and we consider both a model-based case where exact gradients are available and reinforcement learning using random search with zeroth-order oracles. In simulation examples our method converges faster to better controllers and is more scalable than existing methods, while guaranteeing stability during learning transients.
    Learning a Single Neuron with Adversarial Label Noise via Gradient Descent. (arXiv:2206.08918v1 [cs.LG])
    We study the fundamental problem of learning a single neuron, i.e., a function of the form $\mathbf{x}\mapsto\sigma(\mathbf{w}\cdot\mathbf{x})$ for monotone activations $\sigma:\mathbb{R}\mapsto\mathbb{R}$, with respect to the $L_2^2$-loss in the presence of adversarial label noise. Specifically, we are given labeled examples from a distribution $D$ on $(\mathbf{x}, y)\in\mathbb{R}^d \times \mathbb{R}$ such that there exists $\mathbf{w}^\ast\in\mathbb{R}^d$ achieving $F(\mathbf{w}^\ast)=\epsilon$, where $F(\mathbf{w})=\mathbf{E}_{(\mathbf{x},y)\sim D}[(\sigma(\mathbf{w}\cdot \mathbf{x})-y)^2]$. The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w})=C\, \epsilon$ with high probability, where $C>1$ is a universal constant. As our main contribution, we give efficient constant-factor approximate learners for a broad class of distributions (including log-concave distributions) and activation functions. Concretely, for the class of isotropic log-concave distributions, we obtain the following important corollaries: For the logistic activation, we obtain the first polynomial-time constant factor approximation (even under the Gaussian distribution). Our algorithm has sample complexity $\widetilde{O}(d/\epsilon)$, which is tight within polylogarithmic factors. For the ReLU activation, we give an efficient algorithm with sample complexity $\widetilde{O}(d\, \mathrm{polylog}(1/\epsilon))$. Prior to our work, the best known constant-factor approximate learner had sample complexity $\widetilde{\Omega}(d/\epsilon)$. In both of these settings, our algorithms are simple, performing gradient-descent on the (regularized) $L_2^2$-loss. The correctness of our algorithms relies on novel structural results that we establish, showing that (essentially all) stationary points of the underlying non-convex loss are approximately optimal.
    Learning to Teach Fairness-aware Deep Multi-task Learning. (arXiv:2206.08403v1 [cs.LG])
    Fairness-aware learning mainly focuses on single task learning (STL). The fairness implications of multi-task learning (MTL) have only recently been considered and a seminal approach has been proposed that considers the fairness-accuracy trade-off for each task and the performance trade-off among different tasks. Instead of a rigid fairness-accuracy trade-off formulation, we propose a flexible approach that learns how to be fair in a MTL setting by selecting which objective (accuracy or fairness) to optimize at each step. We introduce the L2T-FMT algorithm that is a teacher-student network trained collaboratively; the student learns to solve the fair MTL problem while the teacher instructs the student to learn from either accuracy or fairness, depending on what is harder to learn for each task. Moreover, this dynamic selection of which objective to use at each step for each task reduces the number of trade-off weights from 2T to T, where T is the number of tasks. Our experiments on three real datasets show that L2T-FMT improves on both fairness (12-19%) and accuracy (up to 2%) over state-of-the-art approaches.  ( 2 min )
    Resolution Limits of Non-Adaptive 20 Questions Search for a Moving Target. (arXiv:2206.08884v1 [cs.IT])
    Using the 20 questions estimation framework with query-dependent noise, we study non-adaptive search strategies for a moving target over the unit cube with unknown initial location and velocities under a piecewise constant velocity model. In this search problem, there is an oracle who knows the instantaneous location of the target at any time. Our task is to query the oracle as few times as possible to accurately estimate the location of the target at any specified time. We first study the case where the oracle's answer to each query is corrupted by discrete noise and then generalize our results to the case of additive white Gaussian noise. In our formulation, the performance criterion is the resolution, which is defined as the maximal $L_\infty$ distance between the true locations and estimated locations. We characterize the minimal resolution of an optimal non-adaptive query procedure with a finite number of queries by deriving non-asymptotic and asymptotic bounds. Our bounds are tight in the first-order asymptotic sense when the number of queries satisfies a certain condition and our bounds are tight in the stronger second-order asymptotic sense when the target moves with a constant velocity. To prove our results, we relate the current problem to channel coding, borrow ideas from finite blocklength information theory and construct bounds on the number of possible quantized target trajectories.
    Multiple-Play Stochastic Bandits with Shareable Finite-Capacity Arms. (arXiv:2206.08776v1 [cs.LG])
    We generalize the multiple-play multi-armed bandits (MP-MAB) problem with a shareable arm setting, in which several plays can share the same arm. Furthermore, each shareable arm has a finite reward capacity and a "per-load" reward distribution, both of which are unknown to the learner. The reward from a shareable arm is load-dependent: it is the "per-load" reward multiplied by either the number of plays pulling the arm or, when the number of plays exceeds the capacity limit, its reward capacity. When the "per-load" reward follows a Gaussian distribution, we prove a sample complexity lower bound of learning the capacity from load-dependent rewards and also a regret lower bound of this new MP-MAB problem. We devise a capacity estimator whose sample complexity upper bound matches the lower bound in terms of reward means and capacities. We also propose an online learning algorithm to address the problem and prove its regret upper bound. The first term of this regret upper bound matches that of the regret lower bound, and its second and third terms also correspond to those of the lower bound. Extensive experiments validate our algorithm's performance and also its gain in 5G and 4G base station selection.
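The load-dependent reward rule described above can be written down directly. This is a minimal sketch with an illustrative function name; the per-load sample is passed in explicitly so that the rule stays deterministic, whereas in the paper's setting it would be drawn from the arm's unknown distribution.

```python
def load_dependent_reward(per_load_sample, num_plays, capacity):
    """Reward of one shareable arm in one round.

    The per-load reward is multiplied by the number of plays on the arm,
    saturating at the arm's capacity once the plays exceed it.
    """
    effective_load = min(num_plays, capacity)
    return per_load_sample * effective_load

below_capacity = load_dependent_reward(2.0, num_plays=3, capacity=5)
above_capacity = load_dependent_reward(2.0, num_plays=7, capacity=5)
```

Because the reward stops growing once the plays exceed the capacity, the learner can only infer the capacity by observing rewards at several load levels, which is why estimating it carries its own sample complexity.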
    A Convergence Theory for SVGD in the Population Limit under Talagrand's Inequality T1. (arXiv:2106.03076v2 [cs.LG] UPDATED)
    Stein Variational Gradient Descent (SVGD) is an algorithm for sampling from a target density which is known up to a multiplicative constant. Although SVGD is a popular algorithm in practice, its theoretical study is limited to a few recent works. We study the convergence of SVGD in the population limit (i.e., with an infinite number of particles) to sample from a non-logconcave target distribution satisfying Talagrand's inequality T1. We first establish the convergence of the algorithm. Then, we establish a dimension-dependent complexity bound in terms of the Kernelized Stein Discrepancy (KSD). Unlike existing works, we do not assume that the KSD is bounded along the trajectory of the algorithm. Our approach relies on interpreting SVGD as a gradient descent over a space of probability measures.
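For readers unfamiliar with the update itself, here is a finite-particle sketch of SVGD in one dimension. It is not the population-limit object the paper analyses; the RBF kernel, fixed bandwidth, step size, and standard normal target (a log-concave toy case) are all assumptions chosen to keep the example small.

```python
import math

def svgd_step(particles, grad_log_p, step_size=0.1, bandwidth=1.0):
    """One SVGD update for 1-D particles with an RBF kernel."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / (2.0 * bandwidth ** 2))
            grad_k = -(xj - xi) / bandwidth ** 2 * k  # derivative of k(xj, xi) in xj
            # Driving term pulls towards high density; grad_k repels nearby particles.
            phi += k * grad_log_p(xj) + grad_k
        updated.append(xi + step_size * phi / n)
    return updated

# Target: standard normal, known up to a constant; grad log p(x) = -x.
particles = [3.0 + 0.2 * i for i in range(10)]
initial_mean = sum(particles) / len(particles)
for _ in range(200):
    particles = svgd_step(particles, lambda x: -x)
final_mean = sum(particles) / len(particles)
```

Only the gradient of the log density enters the update, which is why the target needs to be known only up to a multiplicative constant.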
    SMPL: Simulated Industrial Manufacturing and Process Control Learning Environments. (arXiv:2206.08851v1 [cs.LG])
    Traditional biological and pharmaceutical manufacturing plants are controlled by human workers or pre-defined thresholds. Modernized factories have advanced process control algorithms such as model predictive control (MPC). However, there is little exploration of applying deep reinforcement learning to control manufacturing plants. One of the reasons is the lack of high fidelity simulations and standard APIs for benchmarking. To bridge this gap, we develop an easy-to-use library that includes five high-fidelity simulation environments: BeerFMTEnv, ReactorEnv, AtropineEnv, PenSimEnv and mAbEnv, which cover a wide range of manufacturing processes. We build these environments on published dynamics models. Furthermore, we benchmark online and offline, model-based and model-free reinforcement learning algorithms to provide comparisons for follow-up research.
    Classification of datasets with imputed missing values: does imputation quality matter?. (arXiv:2206.08478v1 [cs.LG])
    Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets and these missing values are typically imputed using established methods, followed by classification of the now complete, imputed, samples. The focus of the machine learning researcher is then to optimise the downstream classification performance. In this study, we highlight that it is imperative to consider the quality of the imputation. We demonstrate how the commonly used measures for assessing quality are flawed and propose a new class of discrepancy scores which focus on how well the method recreates the overall distribution of the data. To conclude, we highlight the compromised interpretability of classifier models trained using poorly imputed data.
    Backdoor Attacks on Vision Transformers. (arXiv:2206.08477v1 [cs.CV])
    Vision Transformers (ViT) have recently demonstrated exemplary performance on a variety of vision tasks and are being used as an alternative to CNNs. Their design is based on a self-attention mechanism that processes images as a sequence of patches, which is quite different compared to CNNs. Hence it is interesting to study if ViTs are vulnerable to backdoor attacks. Backdoor attacks happen when an attacker poisons a small part of the training data for malicious purposes. The model performance is good on clean test images, but the attacker can manipulate the decision of the model by showing the trigger at test time. To the best of our knowledge, we are the first to show that ViTs are vulnerable to backdoor attacks. We also find an intriguing difference between ViTs and CNNs - interpretation algorithms effectively highlight the trigger on test images for ViTs but not for CNNs. Based on this observation, we propose a test-time image blocking defense for ViTs which reduces the attack success rate by a large margin. Code is available here: https://github.com/UCDvision/backdoor_transformer.git
    RECAPP: Crafting a More Efficient Catalyst for Convex Optimization. (arXiv:2206.08627v1 [math.OC])
    The accelerated proximal point algorithm (APPA), also known as "Catalyst", is a well-established reduction from convex optimization to approximate proximal point computation (i.e., regularized minimization). This reduction is conceptually elegant and yields strong convergence rate guarantees. However, these rates feature an extraneous logarithmic term arising from the need to compute each proximal point to high accuracy. In this work, we propose a novel Relaxed Error Criterion for Accelerated Proximal Point (RECAPP) that eliminates the need for high accuracy subproblem solutions. We apply RECAPP to two canonical problems: finite-sum and max-structured minimization. For finite-sum problems, we match the best known complexity, previously obtained by carefully-designed problem-specific algorithms. For minimizing $\max_y f(x,y)$ where $f$ is convex in $x$ and strongly-concave in $y$, we improve on the best known (Catalyst-based) bound by a logarithmic factor.
    Active Fairness Auditing. (arXiv:2206.08450v1 [cs.LG])
    The fast spreading adoption of machine learning (ML) by companies across industries poses significant regulatory challenges. One such challenge is scalability: how can regulatory bodies efficiently audit these ML models, ensuring that they are fair? In this paper, we initiate the study of query-based auditing algorithms that can estimate the demographic parity of ML models in a query-efficient manner. We propose an optimal deterministic algorithm, as well as a practical randomized, oracle-efficient algorithm with comparable guarantees. Furthermore, we make inroads into understanding the optimal query complexity of randomized active fairness estimation algorithms. Our first exploration of active fairness estimation aims to put AI governance on firmer theoretical foundations.
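To make the audited quantity concrete, the sketch below estimates a demographic parity gap from black-box queries. It uses naive uniform sampling rather than the paper's query-efficient algorithms, and the function names, the binary group encoding, and the toy threshold model are assumptions for illustration.

```python
import random

def demographic_parity_gap(predictions, groups):
    """|P(pred = 1 | group 0) - P(pred = 1 | group 1)| for binary predictions."""
    positives = {0: 0, 1: 0}
    totals = {0: 0, 1: 0}
    for y_hat, a in zip(predictions, groups):
        totals[a] += 1
        positives[a] += y_hat
    if totals[0] == 0 or totals[1] == 0:
        raise ValueError("both groups must appear among the queried points")
    return abs(positives[0] / totals[0] - positives[1] / totals[1])

def audit_by_random_queries(model, pool, num_queries, seed=0):
    """Query the black-box model on a uniform sample of (features, group) pairs."""
    rng = random.Random(seed)
    sample = rng.sample(pool, num_queries)
    predictions = [model(x) for x, _ in sample]
    groups = [a for _, a in sample]
    return demographic_parity_gap(predictions, groups)

# Hypothetical pool of (feature, group) pairs and a toy threshold classifier.
pool = [(0.9, 0), (0.8, 0), (0.2, 0), (0.1, 0),
        (0.9, 1), (0.3, 1), (0.2, 1), (0.1, 1)]
model = lambda x: 1 if x > 0.5 else 0
gap = audit_by_random_queries(model, pool, num_queries=len(pool))
```

Each call to `model` counts as one query; the number of such calls is the budget that a query-efficient auditor tries to keep small.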
    SOS: Score-based Oversampling for Tabular Data. (arXiv:2206.08555v1 [cs.LG])
    Score-based generative models (SGMs) are a recent breakthrough in generating fake images. SGMs are known to surpass other generative models, e.g., generative adversarial networks (GANs) and variational autoencoders (VAEs). Inspired by their success, in this work we fully customize them for generating fake tabular data. In particular, we are interested in oversampling minority classes, since imbalanced classes frequently lead to sub-optimal training outcomes. To our knowledge, we are the first to present a score-based tabular data oversampling method. Firstly, we re-design our own score network since we have to process tabular data. Secondly, we propose two options for our generation method: the former is equivalent to a style transfer for tabular data and the latter uses the standard generative policy of SGMs. Lastly, we define a fine-tuning method, which further enhances the oversampling quality. In our experiments with 6 datasets and 10 baselines, our method outperforms other oversampling methods in all cases.
    PRANC: Pseudo RAndom Networks for Compacting deep models. (arXiv:2206.08464v1 [cs.LG])
    Communication becomes a bottleneck in various distributed Machine Learning settings. Here, we propose a novel training framework that leads to highly efficient communication of models between agents. In short, we train our network to be a linear combination of many pseudo-randomly generated frozen models. For communication, the source agent transmits only the 'seed' scalar used to generate the pseudo-random 'basis' networks along with the learned linear mixture coefficients. Our method, denoted as PRANC, learns almost $100\times$ fewer parameters than a deep model and still performs well on several datasets and architectures. PRANC enables 1) efficient communication of models between agents, 2) efficient model storage, and 3) accelerated inference by generating layer-wise weights on the fly. We test PRANC on CIFAR-10, CIFAR-100, tinyImageNet, and ImageNet-100 with various architectures like AlexNet, LeNet, ResNet18, ResNet20, and ResNet56 and demonstrate a massive reduction in the number of parameters while providing satisfactory performance on these benchmark datasets. The code is available at https://github.com/UCDvision/PRANC.
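The communication scheme can be illustrated with a short sketch in which the receiver rebuilds the weights from a seed and a handful of coefficients. It does not show how the coefficients are learned; the Gaussian basis, Python's `random` generator, the function names, and the toy sizes are assumptions, not the authors' implementation.

```python
import random

def basis_from_seed(seed, num_basis, num_params):
    """Regenerate the frozen pseudo-random basis networks (flattened) from a seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(num_params)]
            for _ in range(num_basis)]

def reconstruct_weights(seed, coeffs, num_params):
    """Weights = linear combination of the basis with the transmitted coefficients."""
    basis = basis_from_seed(seed, len(coeffs), num_params)
    return [sum(c * b[j] for c, b in zip(coeffs, basis))
            for j in range(num_params)]

# Hypothetical payload: one seed plus three mixture coefficients.
seed, coeffs = 1234, [0.5, -1.0, 2.0]
sender_weights = reconstruct_weights(seed, coeffs, num_params=8)
receiver_weights = reconstruct_weights(seed, coeffs, num_params=8)
payload_size = 1 + len(coeffs)
```

Here 4 transmitted numbers stand in for 8 weights; the saving comes from using far fewer basis networks than there are parameters.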
    Sheaf Neural Networks with Connection Laplacians. (arXiv:2206.08702v1 [cs.LG])
    A Sheaf Neural Network (SNN) is a type of Graph Neural Network (GNN) that operates on a sheaf, an object that equips a graph with vector spaces over its nodes and edges and linear maps between these spaces. SNNs have been shown to have useful theoretical properties that help tackle issues arising from heterophily and over-smoothing. One complication intrinsic to these models is finding a good sheaf for the task to be solved. Previous works proposed two diametrically opposed approaches: manually constructing the sheaf based on domain knowledge and learning the sheaf end-to-end using gradient-based methods. However, domain knowledge is often insufficient, while learning a sheaf could lead to overfitting and significant computational overhead. In this work, we propose a novel way of computing sheaves drawing inspiration from Riemannian geometry: we leverage the manifold assumption to compute manifold-and-graph-aware orthogonal maps, which optimally align the tangent spaces of neighbouring data points. We show that this approach achieves promising results with less computational overhead when compared to previous SNN models. Overall, this work provides an interesting connection between algebraic topology and differential geometry, and we hope that it will spark future research in this direction.
    Recursive Neural Programs: Variational Learning of Image Grammars and Part-Whole Hierarchies. (arXiv:2206.08462v1 [cs.CV])
    Human vision involves parsing and representing objects and scenes using structured representations based on part-whole hierarchies. Computer vision and machine learning researchers have recently sought to emulate this capability using capsule networks, reference frames and active predictive coding, but a generative model formulation has been lacking. We introduce Recursive Neural Programs (RNPs), which, to our knowledge, is the first neural generative model to address the part-whole hierarchy learning problem. RNPs model images as hierarchical trees of probabilistic sensory-motor programs that recursively reuse learned sensory-motor primitives to model an image within different reference frames, forming recursive image grammars. We express RNPs as structured variational autoencoders (sVAEs) for inference and sampling, and demonstrate parts-based parsing, sampling and one-shot transfer learning for MNIST, Omniglot and Fashion-MNIST datasets, demonstrating the model's expressive power. Our results show that RNPs provide an intuitive and explainable way of composing objects and scenes, allowing rich compositionality and intuitive interpretations of objects in terms of part-whole hierarchies.
    Thompson Sampling for Robust Transfer in Multi-Task Bandits. (arXiv:2206.08556v1 [cs.LG])
    We study the problem of online multi-task learning where the tasks are performed within similar but not necessarily identical multi-armed bandit environments. In particular, we study how a learner can improve its overall performance across multiple related tasks through robust transfer of knowledge. While an upper confidence bound (UCB)-based algorithm has recently been shown to achieve nearly-optimal performance guarantees in a setting where all tasks are solved concurrently, it remains unclear whether Thompson sampling (TS) algorithms, which have superior empirical performance in general, share similar theoretical properties. In this work, we present a TS-type algorithm for a more general online multi-task learning protocol, which extends the concurrent setting. We provide its frequentist analysis and prove that it is also nearly-optimal using a novel concentration inequality for multi-task data aggregation at random stopping times. Finally, we evaluate the algorithm on synthetic data and show that the TS-type algorithm enjoys superior empirical performance in comparison with the UCB-based algorithm and a baseline algorithm that performs TS for each individual task without transfer.  ( 2 min )
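For readers unfamiliar with Thompson sampling, here is a minimal sketch of the single-task Bernoulli variant, i.e. the kind of per-task baseline without transfer that the abstract above compares against. It is not the multi-task algorithm of the paper; the Beta(1, 1) prior, the function name, and the arm means are assumptions for illustration.

```python
import random

def thompson_bernoulli(true_means, rounds, seed=0):
    """Run Bernoulli Thompson sampling and return how often each arm was pulled."""
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(rounds):
        # Sample a plausible mean for each arm from its Beta posterior (Beta(1, 1) prior).
        samples = [rng.betavariate(successes[a] + 1, failures[a] + 1) for a in range(k)]
        arm = max(range(k), key=samples.__getitem__)
        # Observe a Bernoulli reward from the chosen arm and update its posterior counts.
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

print(thompson_bernoulli([0.1, 0.9], rounds=2000))
```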
    Generalised Policy Improvement with Geometric Policy Composition. (arXiv:2206.08736v1 [stat.ML])
    We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL. The new method builds on the concept of a geometric horizon model (GHM, also known as a gamma-model), which models the discounted state-visitation distribution of a given policy. We show that we can evaluate any non-Markov policy that switches between a set of base Markov policies with fixed probability by a careful composition of the base policy GHMs, without any additional learning. We can then apply generalised policy improvement (GPI) to collections of such non-Markov policies to obtain a new Markov policy that will in general outperform its precursors. We provide a thorough theoretical analysis of this approach, develop applications to transfer and standard RL, and empirically demonstrate its effectiveness over standard GPI on a challenging deep RL continuous control task. We also provide an analysis of GHM training methods, proving a novel convergence result regarding previously proposed methods and showing how to train these models stably in deep RL settings.  ( 2 min )
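The standard generalised policy improvement step that the abstract above builds on can be sketched in a few lines: given action-value functions for several base policies, act greedily with respect to their pointwise maximum. This shows only plain GPI, not the geometric composition proposed in the paper; the tabular Q-values and names are hypothetical.

```python
def gpi_action(q_tables, state):
    """Return argmax over actions of max_i Q_i(state, action) across base policies."""
    actions = q_tables[0][state].keys()
    return max(actions, key=lambda a: max(q[state][a] for q in q_tables))

# Two hypothetical base policies with different action values in state "s0".
q_pi1 = {"s0": {"left": 1.0, "right": 0.2}}
q_pi2 = {"s0": {"left": 0.1, "right": 1.5}}
print(gpi_action([q_pi1, q_pi2], "s0"))  # right
```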
    Discovery of the Content and Engagement with the Content. (arXiv:2206.08786v1 [cs.IR])
    In the second half of the 20th century, Parliament allowed broadcasters to transmit radio and eventually television coverage of debates and meetings of select committees. More recently, in an effort to further improve transparency and citizen engagement, the UK Parliament started publishing videos of these debates and meetings itself, and tweeting details of debates as they happened. In this paper, we attempt to characterise how people engage with video data of Parliamentary debates by using more than two years of Google Analytics data around these videos. We analyse the patterns of engagement: How do users land on a particular video? How do they hear about this video, i.e., what is the (HTTP) referrer website that led to the user clicking on the video? Once a user lands on a video, how do they engage with it? For how long is the video played? What is the next destination? Answering these questions is an important first step towards understanding why and how people use Parliamentary videos, and therefore, how the video delivery platform should be adapted and personalised for the needs of the citizens of the country. Taking inspiration from An, Kwak, and Jansen (2017), we employ Non-Negative Matrix Factorization (NMF) (Lee and Seung, 1999) on the video views matrix to identify different archetypes of users. A deeper examination of the archetypes we find reveals that they are primarily distinguished by how they land on the video page: Search (i.e., through a search engine), Referral (i.e., from other Parliamentary websites), Direct (i.e., through a direct link, which is embedded on another website), Social (i.e., through a social platform such as Facebook or Twitter) and Others.  ( 3 min )
    NU-Wave 2: A General Neural Audio Upsampling Model for Various Sampling Rates. (arXiv:2206.08545v1 [eess.AS])
    Conventional audio super-resolution models fix the initial and the target sampling rates, so a separate model must be trained for each pair of sampling rates. We introduce NU-Wave 2, a diffusion model for neural audio upsampling that enables the generation of 48 kHz audio signals from inputs of various sampling rates with a single model. Based on the architecture of NU-Wave, NU-Wave 2 uses short-time Fourier convolution (STFC) to generate harmonics to resolve the main failure modes of NU-Wave, and incorporates bandwidth spectral feature transform (BSFT) to condition the bandwidths of inputs in the frequency domain. We experimentally demonstrate that NU-Wave 2 produces high-resolution audio regardless of the sampling rate of input while requiring fewer parameters than other models. The official code and the audio samples are available at https://mindslab-ai.github.io/nuwave2.  ( 2 min )
    The Sensorium competition on predicting large-scale mouse primary visual cortex activity. (arXiv:2206.08666v1 [q-bio.NC])
    The neural underpinning of the biological visual system is challenging to study experimentally, in particular as the neuronal activity becomes increasingly nonlinear with respect to visual input. Artificial neural networks (ANNs) can serve a variety of goals for improving our understanding of this complex system, not only serving as predictive digital twins of sensory cortex for novel hypothesis generation in silico, but also incorporating bio-inspired architectural motifs to progressively bridge the gap between biological and machine vision. The mouse has recently emerged as a popular model system to study visual information processing, but no standardized large-scale benchmark to identify state-of-the-art models of the mouse visual system has been established. To fill this gap, we propose the Sensorium benchmark competition. We collected a large-scale dataset from mouse primary visual cortex containing the responses of more than 28,000 neurons across seven mice stimulated with thousands of natural images, together with simultaneous behavioral measurements that include running speed, pupil dilation, and eye movements. The benchmark challenge will rank models based on predictive performance for neuronal responses on a held-out test set, and includes two tracks for model input limited to either stimulus only (Sensorium) or stimulus plus behavior (Sensorium+). We provide a starting kit to lower the barrier for entry, including tutorials, pre-trained baseline models, and APIs with one line commands for data loading and submission. We would like to see this as a starting point for regular challenges and data releases, and as a standard tool for measuring progress in large-scale neural system identification models of the mouse visual system and beyond.  ( 3 min )
    Automatic Correction of Human Translations. (arXiv:2206.08593v1 [cs.CL])
    We introduce translation error correction (TEC), the task of automatically correcting human-generated translations. Imperfections in machine translations (MT) have long motivated systems for improving translations post-hoc with automatic post-editing. In contrast, little attention has been devoted to the problem of automatically correcting human translations, despite the intuition that humans make distinct errors that machines would be well-suited to assist with, from typos to inconsistencies in translation conventions. To investigate this, we build and release the Aced corpus with three TEC datasets. We show that human errors in TEC exhibit a more diverse range of errors and far fewer translation fluency errors than the MT errors in automatic post-editing datasets, suggesting the need for dedicated TEC models that are specialized to correct human errors. We show that pre-training instead on synthetic errors based on human errors improves TEC F-score by as much as 5.1 points. We conducted a human-in-the-loop user study with nine professional translation editors and found that the assistance of our TEC system led them to produce significantly higher quality revised translations.  ( 2 min )
    Powershap: A Power-full Shapley Feature Selection Method. (arXiv:2206.08394v1 [cs.LG])
    Feature selection is a crucial step in developing robust and powerful machine learning models. Feature selection techniques can be divided into two categories: filter and wrapper methods. While wrapper methods commonly result in strong predictive performance, they suffer from high computational complexity and therefore take a significant amount of time to complete, especially when dealing with high-dimensional feature sets. Alternatively, filter methods are considerably faster, but suffer from several other disadvantages, such as (i) requiring a threshold value, (ii) not taking into account intercorrelation between features, and (iii) ignoring feature interactions with the model. To this end, we present powershap, a novel wrapper feature selection method, which leverages statistical hypothesis testing and power calculations in combination with Shapley values for quick and intuitive feature selection. Powershap is built on the core assumption that an informative feature will have a larger impact on the prediction compared to a known random feature. Benchmarks and simulations show that powershap outperforms filter methods, with predictive performance on par with wrapper methods while being significantly faster, often even reaching half or a third of the execution time. As such, powershap provides a competitive and quick algorithm that can be used by various models in different domains. Furthermore, powershap is implemented as a plug-and-play and open-source sklearn component, enabling easy integration in conventional data science pipelines. User experience is further enhanced by an automatic mode that tunes the hyper-parameters of the powershap algorithm, so the algorithm can be used without any configuration.  ( 3 min )
    Accelerating numerical methods by gradient-based meta-solving. (arXiv:2206.08594v1 [math.NA])
    In science and engineering applications, it is often required to solve similar computational problems repeatedly. In such cases, we can utilize the data from previously solved problem instances to improve the efficiency of finding subsequent solutions. This offers a unique opportunity to combine machine learning (in particular, meta-learning) and scientific computing. To date, a variety of such domain-specific methods have been proposed in the literature, but a generic approach for designing these methods remains under-explored. In this paper, we tackle this issue by formulating a general framework to describe these problems, and propose a gradient-based algorithm to solve them in a unified way. As an illustration of this approach, we study the adaptive generation of parameters for iterative solvers to accelerate the solution of differential equations. We demonstrate the performance and versatility of our method through theoretical analysis and numerical experiments, including applications to incompressible flow simulations and an inverse problem of parameter estimation.  ( 2 min )
    Modeling Structure with Undirected Neural Networks. (arXiv:2202.03760v2 [cs.LG] UPDATED)
    Neural networks are powerful function estimators, leading to their status as a paradigm of choice for modeling structured data. However, unlike other structured representations that emphasize the modularity of the problem -- e.g., factor graphs -- neural networks are usually monolithic mappings from inputs to outputs, with a fixed computation order. This limitation prevents them from capturing different directions of computation and interaction between the modeled variables. In this paper, we combine the representational strengths of factor graphs and of neural networks, proposing undirected neural networks (UNNs): a flexible framework for specifying computations that can be performed in any order. For particular choices, our proposed models subsume and extend many existing architectures: feed-forward, recurrent, self-attention networks, auto-encoders, and networks with implicit layers. We demonstrate the effectiveness of undirected neural architectures, both unstructured and structured, on a range of tasks: tree-constrained dependency parsing, convolutional image classification, and sequence completion with attention. By varying the computation order, we show how a single UNN can be used both as a classifier and a prototype generator, and how it can fill in missing parts of an input sequence, making them a promising field for further research.
    How Powerful are Spectral Graph Neural Networks. (arXiv:2205.11172v2 [cs.LG] UPDATED)
    Spectral Graph Neural Network is a kind of Graph Neural Network (GNN) based on graph signal filters. Some models able to learn arbitrary spectral filters have emerged recently. However, few works analyze the expressive power of spectral GNNs. This paper studies spectral GNNs' expressive power theoretically. We first prove that even spectral GNNs without nonlinearity can produce arbitrary graph signals and give two conditions for reaching universality. They are: 1) no multiple eigenvalues of graph Laplacian, and 2) no missing frequency components in node features. We also establish a connection between the expressive power of spectral GNNs and Graph Isomorphism (GI) testing, the latter of which is often used to characterize spatial GNNs' expressive power. Moreover, we study the difference in empirical performance among different spectral GNNs with the same expressive power from an optimization perspective, and motivate the use of an orthogonal basis whose weight function corresponds to the graph signal density in the spectrum. Inspired by the analysis, we propose JacobiConv, which uses Jacobi basis due to its orthogonality and flexibility to adapt to a wide range of weight functions. JacobiConv deserts nonlinearity while outperforming all baselines on both synthetic and real-world datasets.
    Graph Neural Networks for Multimodal Single-Cell Data Integration. (arXiv:2203.01884v2 [cs.LG] UPDATED)
    Recent advances in multimodal single-cell technologies have enabled simultaneous acquisitions of multiple omics data from the same cell, providing deeper insights into cellular states and dynamics. However, it is challenging to learn the joint representations from the multimodal data, model the relationship between modalities, and, more importantly, incorporate the vast amount of single-modality datasets into the downstream analyses. To address these challenges and correspondingly facilitate multimodal single-cell data analyses, three key tasks have been introduced: modality prediction, modality matching and joint embedding. In this work, we present a general Graph Neural Network framework scMoGNN to tackle these three tasks and show that scMoGNN demonstrates superior results in all three tasks compared with the state-of-the-art and conventional approaches. Our method is an official winner in the overall ranking of Modality prediction from NeurIPS 2021 Competition (https://openproblems.bio/neurips_2021/), and all implementations of our methods have been integrated into the DANCE package (https://github.com/OmicsML/dance).
    Bayesian Spillover Graphs for Dynamic Networks. (arXiv:2203.01912v2 [stat.ME] UPDATED)
    We present Bayesian Spillover Graphs (BSG), a novel method for learning temporal relationships, identifying critical nodes, and quantifying uncertainty for multi-horizon spillover effects in a dynamic system. BSG leverages both an interpretable framework via forecast error variance decompositions (FEVD) and comprehensive uncertainty quantification via Bayesian time series models to contextualize temporal relationships in terms of systemic risk and prediction variability. Forecast horizon hyperparameter $h$ allows for learning both short-term and equilibrium state network behaviors. Experiments for identifying source and sink nodes under various graph and error specifications show significant performance gains against state-of-the-art Bayesian Networks and deep-learning baselines. Applications to real-world systems also showcase BSG as an exploratory analysis tool for uncovering indirect spillovers and quantifying systemic risk.
    Implicit Regularization in Hierarchical Tensor Factorization and Deep Convolutional Neural Networks. (arXiv:2201.11729v2 [cs.LG] UPDATED)
    In the pursuit of explaining implicit regularization in deep learning, prominent focus was given to matrix and tensor factorizations, which correspond to simplified neural networks. It was shown that these models exhibit an implicit tendency towards low matrix and tensor ranks, respectively. Drawing closer to practical deep learning, the current paper theoretically analyzes the implicit regularization in hierarchical tensor factorization, a model equivalent to certain deep convolutional neural networks. Through a dynamical systems lens, we overcome challenges associated with hierarchy, and establish implicit regularization towards low hierarchical tensor rank. This translates to an implicit regularization towards locality for the associated convolutional networks. Inspired by our theory, we design explicit regularization discouraging locality, and demonstrate its ability to improve the performance of modern convolutional networks on non-local tasks, in defiance of conventional wisdom by which architectural changes are needed. Our work highlights the potential of enhancing neural networks via theoretical analysis of their implicit regularization.
    CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer. (arXiv:2206.08883v1 [cs.CV])
    Transformer has achieved great successes in learning vision and language representation, which is general across various downstream tasks. In visual control, learning transferable state representation that can transfer between different control tasks is important to reduce the training sample size. However, porting Transformer to sample-efficient visual control remains a challenging and unsolved problem. To this end, we propose a novel Control Transformer (CtrlFormer), possessing many appealing benefits that prior arts do not have. Firstly, CtrlFormer jointly learns self-attention mechanisms between visual tokens and policy tokens among different control tasks, where multitask representation can be learned and transferred without catastrophic forgetting. Secondly, we carefully design a contrastive reinforcement learning paradigm to train CtrlFormer, enabling it to achieve high sample efficiency, which is important in control problems. For example, in the DMControl benchmark, unlike recent advanced methods that failed by producing a zero score in the "Cartpole" task after transfer learning with 100k samples, CtrlFormer can achieve a state-of-the-art score with only 100k samples while maintaining the performance of previous tasks. The code and models are released in our project homepage.
    Generalized Frank-Wolfe Algorithm for Bilevel Optimization. (arXiv:2206.08868v1 [math.OC])
    In this paper, we study a class of bilevel optimization problems, also known as simple bilevel optimization, where we minimize a smooth objective function over the optimal solution set of another convex constrained optimization problem. Several iterative methods have been developed for tackling this class of problems. Alas, their convergence guarantees are not satisfactory as they are either asymptotic for the upper-level objective, or the convergence rates are slow and sub-optimal. To address this issue, in this paper, we introduce a generalization of the Frank-Wolfe (FW) method to solve the considered problem. The main idea of our method is to locally approximate the solution set of the lower-level problem via a cutting plane, and then run a FW-type update to decrease the upper-level objective. When the upper-level objective is convex, we show that our method requires ${\mathcal{O}}(\max\{1/\epsilon_f,1/\epsilon_g\})$ iterations to find a solution that is $\epsilon_f$-optimal for the upper-level objective and $\epsilon_g$-optimal for the lower-level objective. Moreover, when the upper-level objective is non-convex, our method requires ${\mathcal{O}}(\max\{1/\epsilon_f^2,1/(\epsilon_f\epsilon_g)\})$ iterations to find an $(\epsilon_f,\epsilon_g)$-optimal solution. We further prove stronger convergence guarantees under the Hölderian error bound assumption on the lower-level problem. To the best of our knowledge, our method achieves the best-known iteration complexity for the considered bilevel problem. We also present numerical experiments to showcase the superior performance of our method compared with state-of-the-art methods.
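As background for the FW-type update mentioned in the abstract above, here is a minimal sketch of the classic (single-level) Frank-Wolfe method. It is not the bilevel generalization with cutting planes; the objective $\|x - b\|^2$, the probability-simplex constraint, and the $2/(t+2)$ step size are assumptions chosen to keep the example small.

```python
def frank_wolfe_simplex(target, iterations=2000):
    """Minimize ||x - target||^2 over the probability simplex with classic Frank-Wolfe."""
    n = len(target)
    x = [1.0 / n] * n  # start at the centre of the simplex
    for t in range(iterations):
        grad = [2.0 * (xi - bi) for xi, bi in zip(x, target)]
        # Linear minimization oracle over the simplex: the vertex with the smallest gradient entry.
        j = min(range(n), key=grad.__getitem__)
        step = 2.0 / (t + 2.0)
        # Move towards that vertex; the iterate stays a convex combination of vertices.
        x = [(1.0 - step) * xi for xi in x]
        x[j] += step
    return x

print(frank_wolfe_simplex([0.2, 0.3, 0.5]))
```

Because every update is a convex combination of feasible points, no projection step is needed, which is the property FW-type methods are usually chosen for.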
    Omni-Scale CNNs: a simple and effective kernel size configuration for time series classification. (arXiv:2002.10061v3 [cs.LG] UPDATED)
    The Receptive Field (RF) size has been one of the most important factors for One Dimensional Convolutional Neural Networks (1D-CNNs) on time series classification tasks. Considerable effort has gone into choosing the appropriate size because it has a huge influence on the performance and differs significantly for each dataset. In this paper, we propose an Omni-Scale block (OS-block) for 1D-CNNs, where the kernel sizes are decided by a simple and universal rule. In particular, it is a set of kernel sizes, consisting of multiple prime numbers chosen according to the length of the time series, that can efficiently cover the best RF size across different datasets. The experimental results show that models with the OS-block can achieve performance similar to that of models with the searched optimal RF size and, owing to this strong ability to capture the optimal RF size, simple 1D-CNN models with the OS-block achieve state-of-the-art performance on four time series benchmarks, including both univariate and multivariate data from multiple domains. Comprehensive analysis and discussions shed light on why the OS-block can capture optimal RF sizes across different datasets. Code is available at https://github.com/Wensi-Tang/OS-CNN
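To make the prime-kernel idea in the abstract above concrete, the sketch below builds a candidate kernel-size set and lists the receptive fields reachable by stacking two stride-1 convolutions, using the standard fact that kernels of size a and b stacked give a receptive field of a + b - 1. The specific rule used here (size 1 plus all primes up to a quarter of the series length) is an assumption for illustration and may differ from the rule in the paper.

```python
def primes_up_to(limit):
    """Return all primes <= limit by trial division."""
    return [n for n in range(2, limit + 1)
            if all(n % d for d in range(2, int(n ** 0.5) + 1))]

def kernel_sizes(series_length, fraction=0.25):
    """Hypothetical rule: kernel size 1 plus every prime up to fraction * series_length."""
    return [1] + primes_up_to(int(series_length * fraction))

def two_layer_receptive_fields(sizes):
    """Receptive fields reachable by stacking two stride-1 convolutions: a + b - 1."""
    return sorted({a + b - 1 for a in sizes for b in sizes})

sizes = kernel_sizes(40)
print(sizes)                              # [1, 2, 3, 5, 7]
print(two_layer_receptive_fields(sizes))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 13]
```

Five kernel sizes already reach eleven distinct receptive-field sizes with two layers, which hints at why a small prime set can cover a wide range.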
    Avoid Overfitting User Specific Information in Federated Keyword Spotting. (arXiv:2206.08864v1 [cs.LG])
    Keyword spotting (KWS) aims to discriminate a specific wake-up word from other signals precisely and efficiently for different users. Recent works utilize various deep networks to train KWS models with all users' speech data centralized without considering data privacy. Federated KWS (FedKWS) could serve as a solution without directly sharing users' data. However, the small amount of data, different user habits, and various accents could lead to fatal problems, e.g., overfitting or weight divergence. Hence, we propose several strategies to encourage the model not to overfit user-specific information in FedKWS. Specifically, we first propose an adversarial learning strategy, which updates the downloaded global model against an overfitted local model and explicitly encourages the global model to capture user-invariant information. Furthermore, we propose an adaptive local training strategy, letting clients with more training data and more uniform class distributions undertake more local update steps. Equivalently, this strategy could weaken the negative impacts of those users whose data is less qualified. Our proposed FedKWS-UI could explicitly and implicitly learn user-invariant information in FedKWS. Abundant experimental results on federated Google Speech Commands verify the effectiveness of FedKWS-UI.
    A Survey of Sound Source Localization with Deep Learning Methods. (arXiv:2109.03465v3 [cs.SD] UPDATED)
    This article is a survey on deep learning methods for single and multiple sound source localization. We are particularly interested in sound source localization in indoor/domestic environment, where reverberation and diffuse noise are present. We provide an exhaustive topography of the neural-based localization literature in this context, organized according to several aspects: the neural network architecture, the type of input features, the output strategy (classification or regression), the types of data used for model training and evaluation, and the model training strategy. This way, an interested reader can easily comprehend the vast panorama of the deep learning-based sound source localization methods. Tables summarizing the literature survey are provided at the end of the paper for a quick search of methods with a given set of target characteristics.
    Evaluating the Impact of Source Code Parsers on ML4SE Models. (arXiv:2206.08713v1 [cs.SE])
    As researchers and practitioners apply Machine Learning to increasingly more software engineering problems, the approaches they use become more sophisticated. A lot of modern approaches utilize internal code structure in the form of an abstract syntax tree (AST) or its extensions: path-based representation, complex graph combining AST with additional edges. Even though the process of extracting ASTs from code can be done with different parsers, the impact of choosing a parser on the final model quality remains unstudied. Moreover, researchers often omit the exact details of extracting particular code representations. In this work, we evaluate two models, namely Code2Seq and TreeLSTM, in the method name prediction task backed by eight different parsers for the Java language. To unify the process of data preparation with different parsers, we develop SuperParser, a multi-language parser-agnostic library based on PathMiner. SuperParser facilitates the end-to-end creation of datasets suitable for training and evaluation of ML models that work with structural information from source code. Our results demonstrate that trees built by different parsers vary in their structure and content. We then analyze how this diversity affects the models' quality and show that the quality gap between the most and least suitable parsers for both models turns out to be significant. Finally, we discuss other features of the parsers that researchers and practitioners should take into account when selecting a parser along with the impact on the models' quality. The code of SuperParser is publicly available at https://doi.org/10.5281/zenodo.6366591. We also publish Java-norm, the dataset we use to evaluate the models: https://doi.org/10.5281/zenodo.6366599.
    Dropout Prediction Uncertainty Estimation Using Neuron Activation Strength. (arXiv:2110.06435v3 [cs.LG] UPDATED)
    Dropout has been commonly used to quantify prediction uncertainty, i.e., the variations of model predictions on a given input example. However, using dropout in practice can be expensive as it requires running dropout inferences many times. In this paper, we study how to estimate dropout prediction uncertainty in a resource-efficient manner. We demonstrate that we can use neuron activation strengths to estimate dropout prediction uncertainty under different dropout settings and on a variety of tasks using three large datasets: MovieLens, Criteo, and EMNIST. Our approach provides an inference-once method to estimate dropout prediction uncertainty as a cheap auxiliary task. We also demonstrate that using activation features from a subset of the neural network layers can be sufficient to achieve uncertainty estimation performance almost comparable to that of using activation features from all layers, thus reducing resources even further for uncertainty estimation.
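The expensive baseline that the abstract above sets out to avoid, running dropout inference many times and measuring the spread of the predictions, can be sketched as follows. This is only that multi-pass baseline on a toy linear model, not the inference-once method of the paper; the model, weights, and function names are hypothetical.

```python
import random
import statistics

def dropout_forward(x, weights, drop_prob, rng):
    """One stochastic forward pass of a linear model with inverted dropout on the inputs."""
    keep = 1.0 - drop_prob
    total = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() < keep:
            total += xi * wi / keep  # rescale kept units so the expectation is unchanged
    return total

def mc_dropout_uncertainty(x, weights, drop_prob, passes=200, seed=0):
    """Return (mean prediction, standard deviation) over repeated dropout passes."""
    rng = random.Random(seed)
    preds = [dropout_forward(x, weights, drop_prob, rng) for _ in range(passes)]
    return statistics.mean(preds), statistics.pstdev(preds)

x, weights = [1.0, 2.0, 3.0], [0.5, -1.0, 0.25]
print(mc_dropout_uncertainty(x, weights, drop_prob=0.5))
```

The cost scales linearly with the number of passes, which is the overhead an inference-once estimate is meant to remove.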
    Optimal Extragradient-Based Bilinearly-Coupled Saddle-Point Optimization. (arXiv:2206.08573v1 [math.OC])
    We consider the smooth convex-concave bilinearly-coupled saddle-point problem, $\min_{\mathbf{x}}\max_{\mathbf{y}}~F(\mathbf{x}) + H(\mathbf{x},\mathbf{y}) - G(\mathbf{y})$, where one has access to stochastic first-order oracles for $F$, $G$ as well as the bilinear coupling function $H$. Building upon standard stochastic extragradient analysis for variational inequalities, we present a stochastic accelerated gradient-extragradient (AG-EG) descent-ascent algorithm that combines extragradient and Nesterov's acceleration in general stochastic settings. This algorithm leverages scheduled restarting to admit a fine-grained nonasymptotic convergence rate that matches known lower bounds by both Ibrahim et al. (2020) and Zhang et al. (2021) in their corresponding settings, plus an additional statistical error term for bounded stochastic noise that is optimal up to a constant prefactor. This is the first result that achieves such a relatively mature characterization of optimality in saddle-point optimization.
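The role of the extragradient step in the abstract above can be seen on the simplest bilinear saddle point, $\min_x \max_y xy$ (that is, $F = G = 0$ and $H(x, y) = xy$). The sketch below compares plain simultaneous gradient descent-ascent with the deterministic extragradient update; it is not the stochastic AG-EG algorithm of the paper, and the step size and starting point are arbitrary choices for the example.

```python
import math

def gda_step(x, y, lr):
    """Simultaneous gradient descent on x and ascent on y for f(x, y) = x * y."""
    return x - lr * y, y + lr * x

def extragradient_step(x, y, lr):
    """Extragradient: take a look-ahead step, then update using gradients at the look-ahead point."""
    x_half, y_half = x - lr * y, y + lr * x
    return x - lr * y_half, y + lr * x_half

def run(step, x=1.0, y=1.0, lr=0.5, iterations=100):
    """Iterate a step rule and return the distance to the saddle point (0, 0)."""
    for _ in range(iterations):
        x, y = step(x, y, lr)
    return math.hypot(x, y)

print(run(gda_step))            # grows: each step scales the norm by sqrt(1 + lr^2)
print(run(extragradient_step))  # shrinks: each step scales the norm by sqrt(1 - lr^2 + lr^4)
```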
    Leveraging Uncertainty in Deep Learning for Pancreatic Adenocarcinoma Grading. (arXiv:2206.08787v1 [eess.IV])
    Pancreatic cancers have one of the worst prognoses compared to other cancers, as they are diagnosed when cancer has progressed towards its later stages. The current manual histological grading for diagnosing pancreatic adenocarcinomas is time-consuming and often results in misdiagnosis. In digital pathology, AI-based cancer grading must be extremely accurate in prediction and uncertainty quantification to improve reliability and explainability, both of which are essential for gaining clinicians' trust in the technology. We present Bayesian Convolutional Neural Networks for automated pancreatic cancer grading from MGG and HE stained images to estimate uncertainty in model prediction. We show that the estimated uncertainty correlates with prediction error. Specifically, it is useful in setting the acceptance threshold using a metric that weighs the classification accuracy-reject trade-off and misclassification cost controlled by hyperparameters, and it can be employed in clinical settings.
    Truly Unordered Probabilistic Rule Sets for Multi-class Classification. (arXiv:2206.08804v1 [cs.LG])
    Rule set learning has long been studied and has recently been frequently revisited due to the need for interpretable models. Still, existing methods have several shortcomings: 1) most recent methods require a binary feature matrix as input, learning rules directly from numeric variables is understudied; 2) existing methods impose orders among rules, either explicitly or implicitly, which harms interpretability; and 3) currently no method exists for learning probabilistic rule sets for multi-class target variables (there is only a method for probabilistic rule lists). We propose TURS, for Truly Unordered Rule Sets, which addresses these shortcomings. We first formalise the problem of learning truly unordered rule sets. To resolve conflicts caused by overlapping rules, i.e., instances covered by multiple rules, we propose a novel approach that exploits the probabilistic properties of our rule sets. We next develop a two-phase heuristic algorithm that learns rule sets by carefully growing rules. An important innovation is that we use a surrogate score to take the global potential of the rule set into account when learning a local rule. Finally, we empirically demonstrate that, compared to non-probabilistic and (explicitly or implicitly) ordered state-of-the-art methods, our method learns rule sets that not only have better interpretability (i.e., they are smaller and truly unordered), but also better predictive performance.
    FedNew: A Communication-Efficient and Privacy-Preserving Newton-Type Method for Federated Learning. (arXiv:2206.08829v1 [cs.LG])
    Newton-type methods are popular in federated learning due to their fast convergence. Still, they suffer from two main issues, namely: low communication efficiency and low privacy due to the requirement of sending Hessian information from clients to parameter server (PS). In this work, we introduced a novel framework called FedNew in which there is no need to transmit Hessian information from clients to PS, hence resolving the bottleneck to improve communication efficiency. In addition, FedNew hides the gradient information and results in a privacy-preserving approach compared to the existing state-of-the-art. The core novel idea in FedNew is to introduce a two level framework, and alternate between updating the inverse Hessian-gradient product using only one alternating direction method of multipliers (ADMM) step and then performing the global model update using Newton's method. Though only one ADMM pass is used to approximate the inverse Hessian-gradient product at each iteration, we develop a novel theoretical approach to show the converging behavior of FedNew for convex problems. Additionally, a significant reduction in communication overhead is achieved by utilizing stochastic quantization. Numerical results using real datasets show the superiority of FedNew compared to existing methods in terms of communication costs.
    On Efficient Real-Time Semantic Segmentation: A Survey. (arXiv:2206.08605v1 [cs.CV])
    Semantic segmentation is the problem of assigning a class label to every pixel in an image, and is an important component of an autonomous vehicle vision stack for facilitating scene understanding and object detection. However, many of the top performing semantic segmentation models are extremely complex and cumbersome, and as such are not suited to deployment onboard autonomous vehicle platforms where computational resources are limited and low-latency operation is a vital requirement. In this survey, we take a thorough look at the works that aim to address this misalignment with more compact and efficient models capable of deployment on low-memory embedded systems while meeting the constraint of real-time inference. We discuss several of the most prominent works in the field, placing them within a taxonomy based on their major contributions, and finally we evaluate the inference speed of the discussed models under consistent hardware and software setups that represent a typical research environment with high-end GPU and a realistic deployed scenario using low-memory embedded GPU hardware. Our experimental results demonstrate that many works are capable of real-time performance on resource-constrained hardware, while illustrating the consistent trade-off between latency and accuracy.
    Learning Generic Lung Ultrasound Biomarkers for Decoupling Feature Extraction from Downstream Tasks. (arXiv:2206.08398v1 [eess.IV])
    Contemporary artificial neural networks (ANN) are trained end-to-end, jointly learning both features and classifiers for the task of interest. Though enormously effective, this paradigm imposes significant costs in assembling annotated task-specific datasets and training large-scale networks. We propose to decouple feature learning from downstream lung ultrasound tasks by introducing an auxiliary pre-task of visual biomarker classification. We demonstrate that one can learn an informative, concise, and interpretable feature space from ultrasound videos by training models for predicting biomarker labels. Notably, biomarker feature extractors can be trained from data annotated with weak video-scale supervision. These features can be used by a variety of downstream Expert models targeted for diverse clinical tasks (Diagnosis, lung severity, S/F ratio). Crucially, task-specific expert models are comparable in accuracy to end-to-end models directly trained for such target tasks, while being significantly lower cost to train.
    Revisiting Self-Distillation. (arXiv:2206.08491v1 [cs.LG])
    Knowledge distillation is the procedure of transferring "knowledge" from a large model (the teacher) to a more compact one (the student), often being used in the context of model compression. When both models have the same architecture, this procedure is called self-distillation. Several works have anecdotally shown that a self-distilled student can outperform the teacher on held-out data. In this work, we systematically study self-distillation in a number of settings. We first show that even with a highly accurate teacher, self-distillation allows a student to surpass the teacher in all cases. Secondly, we revisit existing theoretical explanations of (self) distillation and identify contradicting examples, revealing possible drawbacks of these explanations. Finally, we provide an alternative explanation for the dynamics of self-distillation through the lens of loss landscape geometry. We conduct extensive experiments to show that self-distillation leads to flatter minima, thereby resulting in better generalization.
    Bootstrapped Transformer for Offline Reinforcement Learning. (arXiv:2206.08569v1 [cs.LG])
    Offline reinforcement learning (RL) aims at learning policies from previously collected static trajectory data without interacting with the real environment. Recent works provide a novel perspective by viewing offline RL as a generic sequence generation problem, adopting sequence models such as Transformer architecture to model distributions over trajectories, and repurposing beam search as a planning algorithm. However, the training datasets utilized in general offline RL tasks are quite limited and often suffer from insufficient distribution coverage, which could be harmful to training sequence generation models yet has not drawn enough attention in the previous works. In this paper, we propose a novel algorithm named Bootstrapped Transformer, which incorporates the idea of bootstrapping and leverages the learned model to self-generate more offline data to further boost the sequence model training. We conduct extensive experiments on two offline RL benchmarks and demonstrate that our model can largely remedy the existing offline RL training limitations and beat other strong baseline methods. We also analyze the generated pseudo data and the revealed characteristics may shed some light on offline RL training. The codes are available at https://seqml.github.io/bootorl.
    tinySNN: Towards Memory- and Energy-Efficient Spiking Neural Networks. (arXiv:2206.08656v1 [cs.NE])
    Larger Spiking Neural Network (SNN) models are typically favorable as they can offer higher accuracy. However, employing such models on the resource- and energy-constrained embedded platforms is inefficient. Towards this, we present a tinySNN framework that optimizes the memory and energy requirements of SNN processing in both the training and inference phases, while keeping the accuracy high. It is achieved by reducing the SNN operations, improving the learning quality, quantizing the SNN parameters, and selecting the appropriate SNN model. Furthermore, our tinySNN quantizes different SNN parameters (i.e., weights and neuron parameters) to maximize the compression while exploring different combinations of quantization schemes, precision levels, and rounding schemes to find the model that provides acceptable accuracy. The experimental results demonstrate that our tinySNN significantly reduces the memory footprint and the energy consumption of SNNs without accuracy loss as compared to the baseline network. Therefore, our tinySNN effectively compresses the given SNN model to achieve high accuracy in a memory- and energy-efficient manner, hence enabling the employment of SNNs for the resource- and energy-constrained embedded applications.
    Scalable Differentially Private Clustering via Hierarchically Separated Trees. (arXiv:2206.08646v1 [cs.DS])
    We study the private $k$-median and $k$-means clustering problem in $d$ dimensional Euclidean space. By leveraging tree embeddings, we give an efficient and easy to implement algorithm, that is empirically competitive with state of the art non private methods. We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee. (The dimension term, $d$, can be replaced with $O(\log k)$ using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state of the art private clustering methods, the algorithm we propose is practical, runs in near-linear, $\tilde{O}(nkd)$, time and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular we show that our private algorithms can be implemented in logarithmic number of MPC rounds in the sublinear memory regime. Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other privacy clustering baselines.
    Multimodal Attention-based Deep Learning for Alzheimer's Disease Diagnosis. (arXiv:2206.08826v1 [cs.LG])
    Alzheimer's Disease (AD) is the most common neurodegenerative disorder with one of the most complex pathogeneses, making effective and clinically actionable decision support difficult. The objective of this study was to develop a novel multimodal deep learning framework to aid medical professionals in AD diagnosis. We present a Multimodal Alzheimer's Disease Diagnosis framework (MADDi) to accurately detect the presence of AD and mild cognitive impairment (MCI) from imaging, genetic, and clinical data. MADDi is novel in that we use cross-modal attention, which captures interactions between modalities - a method not previously explored in this domain. We perform multi-class classification, a challenging task considering the strong similarities between MCI and AD. We compare with previous state-of-the-art models, evaluate the importance of attention, and examine the contribution of each modality to the model's performance. MADDi classifies MCI, AD, and controls with 96.88% accuracy on a held-out test set. When examining the contribution of different attention schemes, we found that the combination of cross-modal attention with self-attention performed the best, and no attention layers in the model performed the worst, with a 7.9% difference in F1-Scores. Our experiments underlined the importance of structured clinical data to help machine learning models contextualize and interpret the remaining modalities. Extensive ablation studies showed that any multimodal mixture of input features without access to structured clinical information suffered marked performance losses. This study demonstrates the merit of combining multiple input modalities via cross-modal attention to deliver highly accurate AD diagnostic decision support.
    Zero-Shot AutoML with Pretrained Models. (arXiv:2206.08476v1 [cs.LG])
    Given a new dataset D and a low compute budget, how should we choose a pre-trained model to fine-tune to D, and set the fine-tuning hyperparameters without risking overfitting, particularly if D is small? Here, we extend automated machine learning (AutoML) to best make these choices. Our domain-independent meta-learning approach learns a zero-shot surrogate model which, at test time, allows selecting the right deep learning (DL) pipeline (including the pre-trained model and fine-tuning hyperparameters) for a new dataset D given only trivial meta-features describing D such as image resolution or the number of classes. To train this zero-shot model, we collect performance data for many DL pipelines on a large collection of datasets and meta-train on this data to minimize a pairwise ranking objective. We evaluate our approach under the strict time limit of the vision track of the ChaLearn AutoDL challenge benchmark, clearly outperforming all challenge contenders.
    Query-Efficient and Scalable Black-Box Adversarial Attacks on Discrete Sequential Data via Bayesian Optimization. (arXiv:2206.08575v1 [cs.LG])
    We focus on the problem of adversarial attacks against models on discrete sequential data in the black-box setting where the attacker aims to craft adversarial examples with limited query access to the victim model. Existing black-box attacks, mostly based on greedy algorithms, find adversarial examples using pre-computed key positions to perturb, which severely limits the search space and might result in suboptimal solutions. To this end, we propose a query-efficient black-box attack using Bayesian optimization, which dynamically computes important positions using an automatic relevance determination (ARD) categorical kernel. We introduce block decomposition and history subsampling techniques to improve the scalability of Bayesian optimization when an input sequence becomes long. Moreover, we develop a post-optimization algorithm that finds adversarial examples with smaller perturbation size. Experiments on natural language and protein classification tasks demonstrate that our method consistently achieves higher attack success rate with significant reduction in query count and modification rate compared to the previous state-of-the-art methods.
    Embarrassingly Parallel Independent Training of Multi-Layer Perceptrons with Heterogeneous Architectures. (arXiv:2206.08369v1 [cs.LG])
    The definition of a Neural Network architecture is one of the most critical and challenging tasks to perform. In this paper, we propose ParallelMLPs. ParallelMLPs is a procedure to enable the training of several independent Multilayer Perceptron Neural Networks with a different number of neurons and activation functions in parallel by exploring the principle of locality and parallelization capabilities of modern CPUs and GPUs. The core idea of this technique is to use a Modified Matrix Multiplication that replaces an ordinal matrix multiplication by two simple matrix operations that allow separate and independent paths for gradient flowing, which can be used in other scenarios. We have assessed our algorithm in simulated datasets varying the number of samples, features and batches using 10,000 different models. We achieved a training speedup from 1 to 4 orders of magnitude if compared to the sequential approach.
    Self-Supervised Contrastive Pre-Training For Time Series via Time-Frequency Consistency. (arXiv:2206.08496v1 [cs.LG])
    Pre-training on time series poses a unique challenge due to the potential mismatch between pre-training and target domains, such as shifts in temporal dynamics, fast-evolving trends, and long-range and short cyclic effects, which can lead to poor downstream performance. While domain adaptation methods can mitigate these shifts, most methods need examples directly from the target domain, making them suboptimal for pre-training. To address this challenge, methods need to accommodate target domains with different temporal dynamics and be capable of doing so without seeing any target examples during pre-training. Relative to other modalities, in time series, we expect that time-based and frequency-based representations of the same example are located close together in the time-frequency space. To this end, we posit that time-frequency consistency (TF-C) -- embedding a time-based neighborhood of a particular example close to its frequency-based neighborhood and back -- is desirable for pre-training. Motivated by TF-C, we define a decomposable pre-training model, where the self-supervised signal is provided by the distance between time and frequency components, each individually trained by contrastive estimation. We evaluate the new method on eight datasets, including electrodiagnostic testing, human activity recognition, mechanical fault detection, and physical status monitoring. Experiments against eight state-of-the-art methods show that TF-C outperforms baselines by 15.4% (F1 score) on average in one-to-one settings (e.g., fine-tuning an EEG-pretrained model on EMG data) and by up to 8.4% (F1 score) in challenging one-to-many settings, reflecting the breadth of scenarios that arise in real-world applications. The source code and datasets are available at https://anonymous.4open.science/r/TFC-pretraining-6B07.
    On the Influence of Enforcing Model Identifiability on Learning dynamics of Gaussian Mixture Models. (arXiv:2206.08598v1 [cs.LG])
    A common way to learn and analyze statistical models is to consider operations in the model parameter space. But what happens if we optimize in the parameter space and there is no one-to-one mapping between the parameter space and the underlying statistical model space? Such cases frequently occur for hierarchical models which include statistical mixtures or stochastic neural networks, and these models are said to be singular. Singular models reveal several important and well-studied problems in machine learning like the decrease in convergence speed of learning trajectories due to attractor behaviors. In this work, we propose a relative reparameterization technique of the parameter space, which yields a general method for extracting regular submodels from singular models. Our method enforces model identifiability during training and we study the learning dynamics for gradient descent and expectation maximization for Gaussian Mixture Models (GMMs) under relative parameterization, showing faster experimental convergence and an improved manifold shape of the dynamics around the singularity. Extending the analysis beyond GMMs, we furthermore analyze the Fisher information matrix under relative reparameterization and its influence on the generalization error, and show how the method can be applied to more complex models like deep neural networks.
    GOOD: A Graph Out-of-Distribution Benchmark. (arXiv:2206.08452v1 [cs.LG])
    Out-of-distribution (OOD) learning deals with scenarios in which training and test data follow different distributions. Although general OOD problems have been intensively studied in machine learning, graph OOD is only an emerging area of research. Currently, a systematic benchmark tailored to graph OOD method evaluation is lacking. In this work, we aim at developing an OOD benchmark, known as GOOD, for graphs specifically. We explicitly make distinctions between covariate and concept shifts and design data splits that accurately reflect different shifts. We consider both graph and node prediction tasks as there are key differences when designing shifts. Overall, GOOD contains 8 datasets with 14 domain selections. When combined with covariate, concept, and no shifts, we obtain 42 different splits. We provide performance results on 7 commonly used baseline methods with 10 random runs. This results in 294 dataset-model combinations in total. Our results show significant performance gaps between in-distribution and OOD settings. Our results also shed light on different performance trends between covariate and concept shifts by different methods. Our GOOD benchmark is a growing project and is expected to expand in both quantity and variety of resources as the area develops. The GOOD benchmark can be accessed via https://github.com/divelab/GOOD/.
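The counts quoted in the GOOD abstract follow from two multiplications, spelled out below as a quick check. The numbers come from the abstract; the variable names are illustrative only.

```python
# Counts taken from the abstract; names are hypothetical.
domain_selections = 14                              # across the 8 datasets
shift_types = ["covariate", "concept", "no shift"]  # three shift settings
baselines = 7                                       # baseline methods evaluated

# Each domain selection is paired with each shift setting.
splits = domain_selections * len(shift_types)

# Each split is evaluated with each baseline method.
dataset_model_combinations = splits * baselines

print(splits)                      # 42
print(dataset_model_combinations)  # 294
```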
    ComENet: Towards Complete and Efficient Message Passing for 3D Molecular Graphs. (arXiv:2206.08515v1 [cs.LG])
    Many real-world data can be modeled as 3D graphs, but learning representations that incorporate 3D information completely and efficiently is challenging. Existing methods either use partial 3D information, or suffer from excessive computational cost. To incorporate 3D information completely and efficiently, we propose a novel message passing scheme that operates within 1-hop neighborhood. Our method guarantees full completeness of 3D information on 3D graphs by achieving global and local completeness. Notably, we propose the important rotation angles to fulfill global completeness. Additionally, we show that our method is orders of magnitude faster than prior methods. We provide rigorous proof of completeness and analysis of time complexity for our methods. As molecules are in essence quantum systems, we build the complete and efficient graph neural network (ComENet) by combining quantum-inspired basis functions and the proposed message passing scheme. Experimental results demonstrate the capability and efficiency of ComENet, especially on real-world datasets that are large in both numbers and sizes of graphs. Our code is publicly available as part of the DIG library (https://github.com/divelab/DIG).
    I Know What You Trained Last Summer: A Survey on Stealing Machine Learning Models and Defences. (arXiv:2206.08451v1 [cs.LG])
    Machine Learning-as-a-Service (MLaaS) has become a widespread paradigm, making even the most complex machine learning models available for clients via e.g. a pay-per-query principle. This allows users to avoid time-consuming processes of data collection, hyperparameter tuning, and model training. However, by giving their customers access to the (predictions of their) models, MLaaS providers endanger their intellectual property, such as sensitive training data, optimised hyperparameters, or learned model parameters. Adversaries can create a copy of the model with (almost) identical behavior using the prediction labels only. While many variants of this attack have been described, only scattered defence strategies have been proposed, addressing isolated threats. This raises the necessity for a thorough systematisation of the field of model stealing, to arrive at a comprehensive understanding of why these attacks are successful, and how they could be holistically defended against. We address this by categorising and comparing model stealing attacks, assessing their performance, and exploring corresponding defence techniques in different settings. We propose a taxonomy for attack and defence approaches, and provide guidelines on how to select the right attack or defence strategy based on the goal and available resources. Finally, we analyse which defences are rendered less effective by current attack strategies.
    Residual Bootstrap Exploration for Stochastic Linear Bandit. (arXiv:2202.11474v2 [stat.ML] UPDATED)
    We propose a new bootstrap-based online algorithm for stochastic linear bandit problems. The key idea is to adopt residual bootstrap exploration, in which the agent estimates the next step reward by re-sampling the residuals of the mean reward estimate. Our algorithm, residual bootstrap exploration for stochastic linear bandit (LinReBoot), estimates the linear reward from its re-sampling distribution and pulls the arm with the highest reward estimate. In particular, we contribute a theoretical framework to demystify residual bootstrap-based exploration mechanisms in stochastic linear bandit problems. The key insight is that the strength of bootstrap exploration is based on collaborated optimism between the online-learned model and the re-sampling distribution of residuals. This observation enables us to show that the proposed LinReBoot secures a high-probability $\tilde{O}(d \sqrt{n})$ sub-linear regret under mild conditions. Our experiments support the easy generalizability of the ReBoot principle in the various formulations of linear bandit problems and show the significant computational efficiency of LinReBoot.
    Universal Hopfield Networks: A General Framework for Single-Shot Associative Memory Models. (arXiv:2202.04557v2 [cs.NE] UPDATED)
    A large number of neural network models of associative memory have been proposed in the literature. These include the classical Hopfield networks (HNs), sparse distributed memories (SDMs), and more recently the modern continuous Hopfield networks (MCHNs), which possess close links with self-attention in machine learning. In this paper, we propose a general framework for understanding the operation of such memory networks as a sequence of three operations: similarity, separation, and projection. We derive all these memory models as instances of our general framework with differing similarity and separation functions. We extend the mathematical framework of Krotov et al (2020) to express general associative memory models using neural network dynamics with only second-order interactions between neurons, and derive a general energy function that is a Lyapunov function of the dynamics. Finally, using our framework, we empirically investigate the capacity of using different similarity functions for these associative memory models, beyond the dot product similarity measure, and demonstrate empirically that Euclidean or Manhattan distance similarity metrics perform substantially better in practice on many tasks, enabling a more robust retrieval and higher memory capacity than existing models.
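The three-step view described in the abstract (similarity, then separation, then projection) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the negative-Manhattan similarity, the softmax separation with inverse temperature `beta`, and the toy memories are all assumptions made for the example.

```python
import math

def dot_similarity(memories, query):
    """Score each stored pattern by its dot product with the query."""
    return [sum(m_i * q_i for m_i, q_i in zip(m, query)) for m in memories]

def neg_manhattan_similarity(memories, query):
    """Score each stored pattern by the negative L1 distance to the query."""
    return [-sum(abs(m_i - q_i) for m_i, q_i in zip(m, query)) for m in memories]

def softmax_separation(scores, beta=1.0):
    """Sharpen the scores so that the best match dominates (stable softmax)."""
    top = max(scores)
    exps = [math.exp(beta * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def retrieve(memories, projection, query, similarity, separation):
    """Single-shot retrieval: projection applied to separation(similarity)."""
    weights = separation(similarity(memories, query))
    dim = len(projection[0])
    # Weighted sum of the output patterns, one row of `projection` per memory.
    return [sum(w * p[j] for w, p in zip(weights, projection)) for j in range(dim)]

# Toy auto-associative memory: the projection rows are the memories themselves.
memories = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
noisy_query = [0.9, 0.1, 0.0]

recalled = retrieve(
    memories, memories, noisy_query,
    similarity=neg_manhattan_similarity,
    separation=lambda scores: softmax_separation(scores, beta=20.0),
)
print([round(v, 3) for v in recalled])
```

Swapping `neg_manhattan_similarity` for `dot_similarity` changes only the first step, which is the kind of comparison the abstract describes.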
    On the Compression of Neural Networks Using $\ell_0$-Norm Regularization and Weight Pruning. (arXiv:2109.05075v2 [cs.LG] UPDATED)
    Despite the growing availability of high-capacity computational platforms, implementation complexity remains a major concern for the real-world deployment of neural networks. This concern is not exclusively due to the huge costs of state-of-the-art network architectures, but also due to the recent push towards edge intelligence and the use of neural networks in embedded applications. In this context, network compression techniques have been gaining interest due to their ability to reduce deployment costs while keeping inference accuracy at satisfactory levels. The present paper is dedicated to the development of a novel compression scheme for neural networks. To this end, a new $\ell_0$-norm-based regularization approach is first developed, which is capable of inducing strong sparseness in the network during training. Then, targeting the smaller weights of the trained network with pruning techniques, smaller yet highly effective networks can be obtained. The proposed compression scheme also involves the use of $\ell_2$-norm regularization to avoid overfitting as well as fine tuning to improve the performance of the pruned network. Experimental results are presented aiming to show the effectiveness of the proposed scheme as well as to make comparisons with competing approaches.
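The pruning step mentioned in the abstract, removing the smaller weights of a trained network, can be illustrated with plain magnitude pruning. This is a generic sketch and not the paper's scheme: the $\ell_0$-norm regularizer and the fine-tuning stage are not shown, and the function name, the `sparsity` parameter, and the toy weights are assumptions.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    n_pruned = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:n_pruned])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]

# Hypothetical weights of one trained layer, flattened into a list.
layer = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = prune_by_magnitude(layer, sparsity=0.5)
print(pruned)  # [0.8, 0.0, 0.3, 0.0, -0.6, 0.0]
```

A regularizer that pushes many weights towards zero during training makes such a cut less damaging, which is presumably why the abstract combines the two steps.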
    Decision-Focused Learning: Through the Lens of Learning to Rank. (arXiv:2112.03609v4 [cs.LG] UPDATED)
    In recent years, the decision-focused learning framework, also known as predict-and-optimize, has received increasing attention. In this setting, the predictions of a machine learning model are used as estimated cost coefficients in the objective function of a discrete combinatorial optimization problem for decision making. Decision-focused learning proposes to train the ML models, often neural network models, by directly optimizing the quality of decisions made by the optimization solvers. Based on a recent work that proposed a noise contrastive estimation loss over a subset of the solution space, we observe that decision-focused learning can more generally be seen as a learning-to-rank problem, where the goal is to learn an objective function that ranks the feasible points correctly. This observation is independent of the optimization method used and of the form of the objective function. We develop pointwise, pairwise and listwise ranking loss functions, which can be differentiated in closed form given a subset of solutions. We empirically investigate the quality of our generic methods compared to existing decision-focused learning approaches with competitive results. Furthermore, controlling the subset of solutions allows controlling the runtime considerably, with limited effect on regret.
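To make the learning-to-rank view concrete, the sketch below scores a small pool of feasible solutions under predicted costs and penalizes every pair that is ordered differently than under the true costs. It is a generic pairwise hinge loss and not necessarily the formulation used in the paper: the linear objective, the `margin` parameter, the function names, and the toy data are all assumptions.

```python
def objective(costs, solution):
    """Linear objective c^T x for a 0/1 solution vector."""
    return sum(c * x for c, x in zip(costs, solution))

def pairwise_ranking_loss(pred_costs, true_costs, solutions, margin=1.0):
    """Average hinge loss over solution pairs, for a minimization problem.

    For every pair where `better` has a strictly lower true objective than
    `worse`, the predicted objective of `worse` should exceed that of
    `better` by at least `margin`.
    """
    loss, pairs = 0.0, 0
    for better in solutions:
        for worse in solutions:
            if objective(true_costs, better) < objective(true_costs, worse):
                gap = objective(pred_costs, worse) - objective(pred_costs, better)
                loss += max(0.0, margin - gap)
                pairs += 1
    return loss / pairs if pairs else 0.0

# Hypothetical pool of feasible solutions: pick exactly one of three items.
solutions = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
true_costs = [1.0, 3.0, 5.0]

good_loss = pairwise_ranking_loss([2.0, 4.0, 6.0], true_costs, solutions)
bad_loss = pairwise_ranking_loss([6.0, 4.0, 2.0], true_costs, solutions)
print(good_loss, bad_loss)
```

The first prediction is off in absolute terms but preserves the ranking, so it incurs no loss; the second reverses the ranking and is penalized on every pair.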
    Minimum Noticeable Difference based Adversarial Privacy Preserving Image Generation. (arXiv:2206.08638v1 [cs.CV])
    Deep learning models are found to be vulnerable to adversarial examples, as wrong predictions can be caused by small perturbations in the input. Most of the existing works on adversarial image generation try to achieve attacks for most models, while few of them make efforts on guaranteeing the perceptual quality of the adversarial examples. High quality adversarial examples matter for many applications, especially for privacy preservation. In this work, we develop a framework based on the Minimum Noticeable Difference (MND) concept to generate adversarial privacy preserving images that have minimum perceptual difference from the clean ones but are able to attack deep learning models. To achieve this, an adversarial loss is first proposed to make the deep learning models attacked by the adversarial images successfully. Then, a perceptual quality-preserving loss is developed by taking the magnitude of perturbation and perturbation-caused structural and gradient changes into account, which aims to preserve high perceptual quality for adversarial image generation. To the best of our knowledge, this is the first work on exploring quality-preserving adversarial image generation based on the MND concept for privacy preservation. To evaluate its performance in terms of perceptual quality, the deep models on image classification and face recognition are tested with the proposed method and several anchor methods in this work. Extensive experimental results demonstrate that the proposed MND framework is capable of generating adversarial images with remarkably better performance metrics (e.g., PSNR, SSIM, and MOS) than those generated with the anchor methods.
    A Spatio-Temporal Neural Network Forecasting Approach for Emulation of Firefront Models. (arXiv:2206.08523v1 [cs.LG])
    Computational simulations of wildfire spread typically employ empirical rate-of-spread calculations under various conditions (such as terrain, fuel type, weather). Small perturbations in conditions can often lead to significant changes in fire spread (such as speed and direction), necessitating a computationally expensive large set of simulations to quantify uncertainty. Model emulation seeks alternative representations of physical models using machine learning, aiming to provide more efficient and/or simplified surrogate models. We propose a dedicated spatio-temporal neural network based framework for model emulation, able to capture the complex behaviour of fire spread models. The proposed approach can approximate forecasts at fine spatial and temporal resolutions that are often challenging for neural network based approaches. Furthermore, the proposed approach is robust even with small training sets, due to novel data augmentation methods. Empirical experiments show good agreement between simulated and emulated firefronts, with an average Jaccard score of 0.76.
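For readers unfamiliar with the agreement measure quoted above, the Jaccard score is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to two footprints given as sets of grid cells. How the paper turns firefronts into sets is not stated in the abstract, so the cell representation, the function name, and the toy data are assumptions.

```python
def jaccard_score(simulated, emulated):
    """Intersection over union of two sets of grid cells."""
    union = simulated | emulated
    if not union:
        return 1.0  # convention chosen here: two empty footprints agree
    return len(simulated & emulated) / len(union)

# Hypothetical (row, column) cells marked as burnt by each model.
simulated = {(0, 0), (0, 1), (1, 0), (1, 1)}
emulated = {(0, 1), (1, 1), (1, 2)}

score = jaccard_score(simulated, emulated)
print(score)  # 2 shared cells out of 5 in the union
```

A score of 1 means the two footprints coincide exactly and 0 means they do not overlap at all.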
  • Open

    Distribution Regression with Sliced Wasserstein Kernels. (arXiv:2202.03926v2 [stat.ML] UPDATED)
    The problem of learning functions over spaces of probabilities - or distribution regression - is gaining significant interest in the machine learning community. A key challenge behind this problem is to identify a suitable representation capturing all relevant properties of the underlying functional mapping. A principled approach to distribution regression is provided by kernel mean embeddings, which lift kernel-induced similarity on the input domain to the probability level. This strategy effectively tackles the two-stage sampling nature of the problem, enabling one to derive estimators with strong statistical guarantees, such as universal consistency and excess risk bounds. However, kernel mean embeddings implicitly hinge on the maximum mean discrepancy (MMD), a metric on probabilities, which may fail to capture key geometrical relations between distributions. In contrast, optimal transport (OT) metrics are potentially more appealing. In this work, we propose an OT-based estimator for distribution regression. We build on the Sliced Wasserstein distance to obtain an OT-based representation. We study the theoretical properties of a kernel ridge regression estimator based on such representation, for which we prove universal consistency and excess risk bounds. Preliminary experiments complement our theoretical findings by showing the effectiveness of the proposed approach and comparing it with MMD-based estimators.
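The Sliced Wasserstein distance that the abstract builds on can be estimated by projecting both samples onto random directions and solving the resulting one-dimensional transport problems by sorting. The sketch below is a Monte Carlo estimate under simplifying assumptions that are not taken from the paper: both samples have the same size, the order is 2, and the function names, `n_projections`, and `seed` are illustrative.

```python
import math
import random

def random_direction(dim, rng):
    """Draw a uniform direction on the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sliced_wasserstein(xs, ys, n_projections=200, seed=0):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance.

    Assumes xs and ys are equal-size lists of points of the same dimension.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        theta = random_direction(dim, rng)
        px = sorted(sum(a * t for a, t in zip(p, theta)) for p in xs)
        py = sorted(sum(a * t for a, t in zip(p, theta)) for p in ys)
        # In one dimension, optimal transport matches sorted samples in order.
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return math.sqrt(total / n_projections)

# Two toy samples in the plane: a unit square and the same square shifted by 3.
square = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
shifted = [[x + 3.0, y] for x, y in square]

same = sliced_wasserstein(square, square)
apart = sliced_wasserstein(square, shifted)
print(same, apart)
```

Each projection costs only a sort, which is why sliced variants are a common inexpensive stand-in for the full Wasserstein distance.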
    Deep learning, stochastic gradient descent and diffusion maps. (arXiv:2204.01365v3 [stat.ML] UPDATED)
    Stochastic gradient descent (SGD) is widely used in deep learning due to its computational efficiency, but a complete understanding of why SGD performs so well remains a major challenge. It has been observed empirically that most eigenvalues of the Hessian of the loss functions on the loss landscape of over-parametrized deep neural networks are close to zero, while only a small number of eigenvalues are large. Zero eigenvalues indicate zero diffusion along the corresponding directions. This indicates that the process of minima selection mainly happens in the relatively low-dimensional subspace corresponding to the top eigenvalues of the Hessian. Although the parameter space is very high-dimensional, these findings seem to indicate that the SGD dynamics may mainly live on a low-dimensional manifold. In this paper, we pursue a truly data-driven approach to the problem of getting a potentially deeper understanding of the high-dimensional parameter surface, and in particular, of the landscape traced out by SGD by analyzing the data generated through SGD, or any other optimizer for that matter, in order to possibly discover (local) low-dimensional representations of the optimization landscape. As our vehicle for the exploration, we use diffusion maps introduced by R. Coifman and coauthors.  ( 2 min )
    Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs. (arXiv:2203.09251v2 [cs.LG] UPDATED)
    In probably approximately correct (PAC) reinforcement learning (RL), an agent is required to identify an $\epsilon$-optimal policy with probability $1-\delta$. While minimax optimal algorithms exist for this problem, its instance-dependent complexity remains elusive in episodic Markov decision processes (MDPs). In this paper, we propose the first (nearly) matching upper and lower bounds on the sample complexity of PAC RL in deterministic episodic MDPs with finite state and action spaces. In particular, our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap. While our instance-dependent lower bound is written as a linear program, our algorithms are very simple and do not require solving such an optimization problem during learning. Their design and analyses employ novel ideas, including graph-theoretical concepts such as minimum flows and maximum cuts, which we believe to shed new light on this problem.  ( 2 min )
    Structure-preserving GANs. (arXiv:2202.01129v2 [cs.LG] UPDATED)
    Generative adversarial networks (GANs), a class of distribution-learning methods based on a two-player game between a generator and a discriminator, can generally be formulated as a minmax problem based on the variational representation of a divergence between the unknown and the generated distributions. We introduce structure-preserving GANs as a data-efficient framework for learning distributions with additional structure such as group symmetry, by developing new variational representations for divergences. Our theory shows that we can reduce the discriminator space to its projection on the invariant discriminator space, using the conditional expectation with respect to the sigma-algebra associated to the underlying structure. In addition, we prove that the discriminator space reduction must be accompanied by a careful design of structured generators, as flawed designs may easily lead to a catastrophic "mode collapse" of the learned distribution. We contextualize our framework by building symmetry-preserving GANs for distributions with intrinsic group symmetry, and demonstrate that both players, namely the equivariant generator and invariant discriminator, play important but distinct roles in the learning process. Empirical experiments and ablation studies across a broad range of data sets, including real-world medical imaging, validate our theory, and show our proposed methods achieve significantly improved sample fidelity and diversity -- almost an order of magnitude measured in Fr\'echet Inception Distance -- especially in the small data regime.  ( 3 min )
    Bayesian Spillover Graphs for Dynamic Networks. (arXiv:2203.01912v2 [stat.ME] UPDATED)
    We present Bayesian Spillover Graphs (BSG), a novel method for learning temporal relationships, identifying critical nodes, and quantifying uncertainty for multi-horizon spillover effects in a dynamic system. BSG leverages both an interpretable framework via forecast error variance decompositions (FEVD) and comprehensive uncertainty quantification via Bayesian time series models to contextualize temporal relationships in terms of systemic risk and prediction variability. Forecast horizon hyperparameter $h$ allows for learning both short-term and equilibrium state network behaviors. Experiments for identifying source and sink nodes under various graph and error specifications show significant performance gains against state-of-the-art Bayesian Networks and deep-learning baselines. Applications to real-world systems also showcase BSG as an exploratory analysis tool for uncovering indirect spillovers and quantifying systemic risk.  ( 2 min )
    A Theoretical Analysis on Independence-driven Importance Weighting for Covariate-shift Generalization. (arXiv:2111.02355v2 [cs.LG] UPDATED)
    Covariate-shift generalization, a typical case in out-of-distribution (OOD) generalization, requires a good performance on the unknown test distribution, which varies from the accessible training distribution in the form of covariate shift. Recently, independence-driven importance weighting algorithms in stable learning literature have shown empirical effectiveness to deal with covariate-shift generalization on several learning models, including regression algorithms and deep neural networks, while their theoretical analyses are missing. In this paper, we theoretically prove the effectiveness of such algorithms by explaining them as feature selection processes. We first specify a set of variables, named minimal stable variable set, that is the minimal and optimal set of variables to deal with covariate-shift generalization for common loss functions, such as the mean squared loss and binary cross-entropy loss. Afterward, we prove that under ideal conditions, independence-driven importance weighting algorithms could identify the variables in this set. Analysis of asymptotic properties is also provided. These theories are further validated in several synthetic experiments.  ( 2 min )
    Mirror Descent with Relative Smoothness in Measure Spaces, with application to Sinkhorn and EM. (arXiv:2206.08873v1 [math.OC])
    Many problems in machine learning can be formulated as optimizing a convex functional over a space of measures. This paper studies the convergence of the mirror descent algorithm in this infinite-dimensional setting. Defining Bregman divergences through directional derivatives, we derive the convergence of the scheme for relatively smooth and strongly convex pairs of functionals. Applying our result to joint distributions and the Kullback--Leibler (KL) divergence, we show that Sinkhorn's primal iterations for entropic optimal transport in the continuous setting correspond to a mirror descent, and we obtain a new proof of its (sub)linear convergence. We also show that Expectation Maximization (EM) can always formally be written as a mirror descent, and, when optimizing on the latent distribution while fixing the mixtures, we derive sublinear rates of convergence.  ( 2 min )
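    As background, the Sinkhorn iterations mentioned above can be sketched in their usual discrete matrix-scaling form. This is an illustrative finite-dimensional version, not the continuous-setting mirror descent analysis of the paper; the marginals, cost matrix, and regularization strength are hypothetical:

```python
import math


def sinkhorn(a, b, cost, eps=1.0, n_iters=500):
    """Entropic OT plan between histograms a and b via alternating scaling."""
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / eps).
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Rescale rows to match a, then columns to match b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical marginals on the points 0, 1, 2 with squared-distance cost.
a = [0.5, 0.3, 0.2]
b = [0.2, 0.3, 0.5]
cost = [[(i - j) ** 2 for j in range(3)] for i in range(3)]
plan = sinkhorn(a, b, cost)
```

    After the final column update the column sums of `plan` match `b` up to rounding, while the row sums match `a` only up to the convergence error of the iteration.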
    Solar Radiation Ramping Events Modeling Using Spatio-temporal Point Processes. (arXiv:2101.11179v2 [stat.AP] UPDATED)
    Modeling and predicting solar events, particularly the solar ramping event, is critical for improving situational awareness for solar power generation systems. It has been acknowledged that weather conditions such as temperature, humidity, and cloud density can significantly impact the emergence and position of solar ramping events. As a result, modeling these events with complex spatio-temporal correlations is highly challenging. To tackle the question, we adopt a novel spatio-temporal categorical point process model, which intuitively and effectively addresses correlation and interaction among ramping events. We demonstrate the interpretability and predictive power of our model on extensive real-data experiments.  ( 2 min )
    Lossy Compression with Gaussian Diffusion. (arXiv:2206.08889v1 [stat.ML])
    We describe a novel lossy compression approach called DiffC which is based on unconditional diffusion generative models. Unlike modern compression schemes which rely on transform coding and quantization to restrict the transmitted information, DiffC relies on the efficient communication of pixels corrupted by Gaussian noise. We implement a proof of concept and find that it works surprisingly well despite the lack of an encoder transform, outperforming the state-of-the-art generative compression method HiFiC on ImageNet 64x64. DiffC only uses a single model to encode and denoise corrupted pixels at arbitrary bitrates. The approach further provides support for progressive coding, that is, decoding from partial bit streams. We perform a rate-distortion analysis to gain a deeper understanding of its performance, providing analytical results for multivariate Gaussian data as well as initial results for general distributions. Furthermore, we show that a flow-based reconstruction achieves a 3 dB gain over ancestral sampling at high bitrates.  ( 2 min )
    Implicit Regularization in Hierarchical Tensor Factorization and Deep Convolutional Neural Networks. (arXiv:2201.11729v2 [cs.LG] UPDATED)
    In the pursuit of explaining implicit regularization in deep learning, prominent focus was given to matrix and tensor factorizations, which correspond to simplified neural networks. It was shown that these models exhibit an implicit tendency towards low matrix and tensor ranks, respectively. Drawing closer to practical deep learning, the current paper theoretically analyzes the implicit regularization in hierarchical tensor factorization, a model equivalent to certain deep convolutional neural networks. Through a dynamical systems lens, we overcome challenges associated with hierarchy, and establish implicit regularization towards low hierarchical tensor rank. This translates to an implicit regularization towards locality for the associated convolutional networks. Inspired by our theory, we design explicit regularization discouraging locality, and demonstrate its ability to improve the performance of modern convolutional networks on non-local tasks, in defiance of conventional wisdom by which architectural changes are needed. Our work highlights the potential of enhancing neural networks via theoretical analysis of their implicit regularization.  ( 2 min )
    You Are the Best Reviewer of Your Own Papers: An Owner-Assisted Scoring Mechanism. (arXiv:2110.14802v2 [cs.LG] UPDATED)
    I consider a setting where reviewers offer very noisy scores for several items for the selection of high-quality ones (e.g., peer review of large conference proceedings), whereas the owner of these items knows the true underlying scores but prefers not to provide this information. To address this withholding of information, in this paper, I introduce the Isotonic Mechanism, a simple and efficient approach to improving imprecise raw scores by leveraging certain information that the owner is incentivized to provide. This mechanism takes the ranking of the items from best to worst provided by the owner as input, in addition to the raw scores provided by the reviewers. It reports the adjusted scores for the items by solving a convex optimization problem. Under certain conditions, I show that the owner's optimal strategy is to honestly report the true ranking of the items to her best knowledge in order to maximize the expected utility. Moreover, I prove that the adjusted scores provided by this owner-assisted mechanism are significantly more accurate than the raw scores provided by the reviewers. This paper concludes with several extensions of the Isotonic Mechanism and some refinements of the mechanism for practical consideration.  ( 3 min )
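    A minimal sketch of the adjustment step, assuming the convex problem is least-squares isotonic regression of the raw scores under the owner's ranking (the abstract only says "a convex optimization problem"; the function name and example scores are hypothetical). It uses the pool-adjacent-violators algorithm:

```python
def isotonic_mechanism(raw_scores, ranking):
    """Project raw scores onto the set of scores consistent with `ranking`.

    `ranking` lists item indices from best to worst, as reported by the owner.
    """
    ordered = [raw_scores[i] for i in ranking]
    # Pool adjacent violators; each block stores [sum, count].
    blocks = []
    for y in ordered:
        blocks.append([y, 1])
        # Merge while a later block has a higher mean than the one before it,
        # since scores must be non-increasing along the ranking.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    # Map the fitted values back to the original item order.
    adjusted = [0.0] * len(raw_scores)
    for position, item in enumerate(ranking):
        adjusted[item] = fitted[position]
    return adjusted


# Hypothetical example: reviewers score four papers; the owner ranks 0 > 1 > 3 > 2.
raw = [6.0, 8.0, 5.0, 7.0]
adjusted = isotonic_mechanism(raw, ranking=[0, 1, 3, 2])
```

    Here papers 0 and 1 contradict the reported ranking, so their scores are pooled to a common value, while scores that already agree with the ranking are left unchanged.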
    Smoothing Policies and Safe Policy Gradients. (arXiv:1905.03231v2 [cs.LG] UPDATED)
    Policy Gradient (PG) algorithms are among the best candidates for the much-anticipated applications of reinforcement learning to real-world control tasks, such as robotics. However, the trial-and-error nature of these methods poses safety issues whenever the learning process itself must be performed on a physical system or involves any form of human-computer interaction. In this paper, we address a specific safety formulation, where both goals and dangers are encoded in a scalar reward signal and the learning agent is constrained to never worsen its performance, measured as the expected sum of rewards. By studying actor-only policy gradient from a stochastic optimization perspective, we establish improvement guarantees for a wide class of parametric policies, generalizing existing results on Gaussian policies. This, together with novel upper bounds on the variance of policy gradient estimators, allows us to identify meta-parameter schedules that guarantee monotonic improvement with high probability. The two key meta-parameters are the step size of the parameter updates and the batch size of the gradient estimates. Through a joint, adaptive selection of these meta-parameters, we obtain a policy gradient algorithm with monotonic improvement guarantees.  ( 2 min )
    Domain Adaptation for Time Series Forecasting via Attention Sharing. (arXiv:2102.06828v7 [cs.LG] UPDATED)
    Recently, deep neural networks have gained increasing popularity in the field of time series forecasting. A primary reason for their success is their ability to effectively capture complex temporal dynamics across multiple related time series. The advantages of these deep forecasters only start to emerge in the presence of a sufficient amount of data. This poses a challenge for typical forecasting problems in practice, where there is a limited number of time series or observations per time series, or both. To cope with this data scarcity issue, we propose a novel domain adaptation framework, Domain Adaptation Forecaster (DAF). DAF leverages statistical strengths from a relevant domain with abundant data samples (source) to improve the performance on the domain of interest with limited data (target). In particular, we use an attention-based shared module with a domain discriminator across domains and private modules for individual domains. We induce domain-invariant latent features (queries and keys) and retrain domain-specific features (values) simultaneously to enable joint training of forecasters on source and target domains. A main insight is that our design of aligning keys allows the target domain to leverage source time series even with different characteristics. Extensive experiments on various domains demonstrate that our proposed method outperforms state-of-the-art baselines on synthetic and real-world datasets, and ablation studies verify the effectiveness of our design choices.  ( 3 min )
    Generalized Frank-Wolfe Algorithm for Bilevel Optimization. (arXiv:2206.08868v1 [math.OC])
    In this paper, we study a class of bilevel optimization problems, also known as simple bilevel optimization, where we minimize a smooth objective function over the optimal solution set of another convex constrained optimization problem. Several iterative methods have been developed for tackling this class of problems. Alas, their convergence guarantees are not satisfactory as they are either asymptotic for the upper-level objective, or the convergence rates are slow and sub-optimal. To address this issue, in this paper, we introduce a generalization of the Frank-Wolfe (FW) method to solve the considered problem. The main idea of our method is to locally approximate the solution set of the lower-level problem via a cutting plane, and then run a FW-type update to decrease the upper-level objective. When the upper-level objective is convex, we show that our method requires ${\mathcal{O}}(\max\{1/\epsilon_f,1/\epsilon_g\})$ iterations to find a solution that is $\epsilon_f$-optimal for the upper-level objective and $\epsilon_g$-optimal for the lower-level objective. Moreover, when the upper-level objective is non-convex, our method requires ${\mathcal{O}}(\max\{1/\epsilon_f^2,1/(\epsilon_f\epsilon_g)\})$ iterations to find an $(\epsilon_f,\epsilon_g)$-optimal solution. We further prove stronger convergence guarantees under the H\"olderian error bound assumption on the lower-level problem. To the best of our knowledge, our method achieves the best-known iteration complexity for the considered bilevel problem. We also present numerical experiments to showcase the superior performance of our method compared with state-of-the-art methods.  ( 2 min )
    CausalVAE: Structured Causal Disentanglement in Variational Autoencoder. (arXiv:2004.08697v6 [cs.LG] UPDATED)
    Learning disentanglement aims at finding a low dimensional representation which consists of multiple explanatory and generative factors of the observational data. The framework of variational autoencoder (VAE) is commonly used to disentangle independent factors from observations. However, in real scenarios, factors with semantics are not necessarily independent. Instead, there might be an underlying causal structure which renders these factors dependent. We thus propose a new VAE-based framework named CausalVAE, which includes a Causal Layer to transform independent exogenous factors into causal endogenous ones that correspond to causally related concepts in data. We further analyze the model identifiability, showing that the proposed model learned from observations recovers the true one up to a certain degree. Experiments are conducted on various datasets, including synthetic data and the real-world benchmark CelebA. Results show that the causal representations learned by CausalVAE are semantically interpretable, and their causal relationship as a Directed Acyclic Graph (DAG) is identified with good accuracy. Furthermore, we demonstrate that the proposed CausalVAE model is able to generate counterfactual data through "do-operation" to the causal factors.  ( 2 min )
    AutoML Two-Sample Test. (arXiv:2206.08843v1 [cs.LG])
    Two-sample tests are important in statistics and machine learning, both as tools for scientific discovery as well as to detect distribution shifts. This led to the development of many sophisticated test procedures going beyond the standard supervised learning frameworks, whose usage can require specialized knowledge about two-sample testing. We use a simple test that takes the mean discrepancy of a witness function as the test statistic and prove that minimizing a squared loss leads to a witness with optimal testing power. This allows us to leverage recent advancements in AutoML. Without any user input about the problems at hand, and using the same method for all our experiments, our AutoML two-sample test achieves competitive performance on a diverse distribution shift benchmark as well as on challenging two-sample testing problems. We provide an implementation of the AutoML two-sample test in the Python package autotst.  ( 2 min )
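    The idea of a witness-based test can be sketched as follows. This is not the autotst API: a 1D least-squares line stands in for the AutoML-trained witness, and the permutation p-value, function names, and toy data are assumptions made for illustration:

```python
import random


def fit_linear_witness(xs, ys):
    """Least-squares fit of h(z) = w*z + c to labels +1 (xs) and -1 (ys)."""
    data = [(z, 1.0) for z in xs] + [(z, -1.0) for z in ys]
    n = len(data)
    mean_z = sum(z for z, _ in data) / n
    mean_t = sum(t for _, t in data) / n
    var_z = sum((z - mean_z) ** 2 for z, _ in data)
    cov = sum((z - mean_z) * (t - mean_t) for z, t in data)
    w = cov / var_z if var_z > 0 else 0.0
    c = mean_t - w * mean_z
    return lambda z: w * z + c


def witness_test(xs, ys, n_perms=500, seed=0):
    """Return (statistic, permutation p-value) using a held-out half of the data."""
    rng = random.Random(seed)
    half_x, half_y = len(xs) // 2, len(ys) // 2
    # Train the witness on the first halves, evaluate it on the second halves.
    h = fit_linear_witness(xs[:half_x], ys[:half_y])
    hx = [h(z) for z in xs[half_x:]]
    hy = [h(z) for z in ys[half_y:]]

    def stat(a, b):
        return sum(a) / len(a) - sum(b) / len(b)

    observed = stat(hx, hy)
    pooled = hx + hy
    exceed = 0
    for _ in range(n_perms):
        rng.shuffle(pooled)
        if stat(pooled[:len(hx)], pooled[len(hx):]) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (1 + n_perms)


# Toy data: two 1D Gaussian samples whose means differ by 2.
data_rng = random.Random(42)
sample_x = [data_rng.gauss(0.0, 1.0) for _ in range(100)]
sample_y = [data_rng.gauss(2.0, 1.0) for _ in range(100)]
statistic, p_value = witness_test(sample_x, sample_y)
```

    Splitting the data keeps the witness independent of the points used to compute the statistic, which is what makes the permutation p-value valid in this sketch.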
    abess: A Fast Best Subset Selection Library in Python and R. (arXiv:2110.09697v2 [stat.ML] UPDATED)
    We introduce a new library named abess that implements a unified framework of best-subset selection for solving diverse machine learning problems, e.g., linear regression, classification, and principal component analysis. In particular, abess certifiably obtains the optimal solution in polynomial time with high probability under the linear model. Our efficient implementation allows abess to attain the solution of best-subset selection problems as fast as or even 20x faster than existing competing variable (model) selection toolboxes. Furthermore, it supports common variants like best group subset selection and $\ell_2$ regularized best-subset selection. The core of the library is programmed in C++. For ease of use, a Python library is designed for conveniently integrating with scikit-learn, and it can be installed from the Python Package Index. In addition, a user-friendly R library is available at the Comprehensive R Archive Network. The source code is available at: https://github.com/abess-team/abess.  ( 2 min )
    How robust are pre-trained models to distribution shift?. (arXiv:2206.08871v1 [cs.LG])
    The vulnerability of machine learning models to spurious correlations has mostly been discussed in the context of supervised learning (SL). However, there is a lack of insight on how spurious correlations affect the performance of popular self-supervised learning (SSL) and auto-encoder based models (AE). In this work, we shed light on this by evaluating the performance of these models on both real world and synthetic distribution shift datasets. Following observations that the linear head itself can be susceptible to spurious correlations, we develop a novel evaluation scheme with the linear head trained on out-of-distribution (OOD) data, to isolate the performance of the pre-trained models from a potential bias of the linear head used for evaluation. With this new methodology, we show that SSL models are consistently more robust to distribution shifts and thus better at OOD generalisation than AE and SL models.  ( 2 min )
    Residual Bootstrap Exploration for Stochastic Linear Bandit. (arXiv:2202.11474v2 [stat.ML] UPDATED)
    We propose a new bootstrap-based online algorithm for stochastic linear bandit problems. The key idea is to adopt residual bootstrap exploration, in which the agent estimates the next step reward by re-sampling the residuals of the mean reward estimate. Our algorithm, residual bootstrap exploration for stochastic linear bandit (\texttt{LinReBoot}), estimates the linear reward from its re-sampling distribution and pulls the arm with the highest reward estimate. In particular, we contribute a theoretical framework to demystify residual bootstrap-based exploration mechanisms in stochastic linear bandit problems. The key insight is that the strength of bootstrap exploration is based on collaborated optimism between the online-learned model and the re-sampling distribution of residuals. Such observation enables us to show that the proposed \texttt{LinReBoot} secures a high-probability $\tilde{O}(d \sqrt{n})$ sub-linear regret under mild conditions. Our experiments support the easy generalizability of the \texttt{ReBoot} principle in the various formulations of linear bandit problems and show the significant computational efficiency of \texttt{LinReBoot}.
    On the Influence of Enforcing Model Identifiability on Learning dynamics of Gaussian Mixture Models. (arXiv:2206.08598v1 [cs.LG])
    A common way to learn and analyze statistical models is to consider operations in the model parameter space. But what happens if we optimize in the parameter space and there is no one-to-one mapping between the parameter space and the underlying statistical model space? Such cases frequently occur for hierarchical models which include statistical mixtures or stochastic neural networks, and these models are said to be singular. Singular models reveal several important and well-studied problems in machine learning like the decrease in convergence speed of learning trajectories due to attractor behaviors. In this work, we propose a relative reparameterization technique of the parameter space, which yields a general method for extracting regular submodels from singular models. Our method enforces model identifiability during training and we study the learning dynamics for gradient descent and expectation maximization for Gaussian Mixture Models (GMMs) under relative parameterization, showing faster experimental convergence and an improved manifold shape of the dynamics around the singularity. Extending the analysis beyond GMMs, we furthermore analyze the Fisher information matrix under relative reparameterization and its influence on the generalization error, and show how the method can be applied to more complex models like deep neural networks.
    Meta-Learning Hypothesis Spaces for Sequential Decision-making. (arXiv:2202.00602v3 [stat.ML] UPDATED)
    Obtaining reliable, adaptive confidence sets for prediction functions (hypotheses) is a central challenge in sequential decision-making tasks, such as bandits and model-based reinforcement learning. These confidence sets typically rely on prior assumptions on the hypothesis space, e.g., the known kernel of a Reproducing Kernel Hilbert Space (RKHS). Hand-designing such kernels is error prone, and misspecification may lead to poor or unsafe performance. In this work, we propose to meta-learn a kernel from offline data (Meta-KeL). For the case where the unknown kernel is a combination of known base kernels, we develop an estimator based on structured sparsity. Under mild conditions, we guarantee that our estimated RKHS yields valid confidence sets that, with increasing amounts of offline data, become as tight as those given the true unknown kernel. We demonstrate our approach on the kernelized bandit problem (a.k.a. Bayesian optimization), where we establish regret bounds competitive with those given the true kernel. We also empirically evaluate the effectiveness of our approach on a Bayesian optimization task.
    Capturing Actionable Dynamics with Structured Latent Ordinary Differential Equations. (arXiv:2202.12932v2 [stat.ML] UPDATED)
    End-to-end learning of dynamical systems with black-box models, such as neural ordinary differential equations (ODEs), provides a flexible framework for learning dynamics from data without prescribing a mathematical model for the dynamics. Unfortunately, this flexibility comes at the cost of understanding the dynamical system, for which ODEs are used ubiquitously. Further, experimental data are collected under various conditions (inputs), such as treatments, or grouped in some way, such as part of sub-populations. Understanding the effects of these system inputs on system outputs is crucial to have any meaningful model of a dynamical system. To that end, we propose a structured latent ODE model that explicitly captures system input variations within its latent representation. Building on a static latent variable specification, our model learns (independent) stochastic factors of variation for each input to the system, thus separating the effects of the system inputs in the latent space. This approach provides actionable modeling through the controlled generation of time-series data for novel input combinations (or perturbations). Additionally, we propose a flexible approach for quantifying uncertainties, leveraging a quantile regression formulation. Results on challenging biological datasets show consistent improvements over competitive baselines in the controlled generation of observational data and inference of biologically meaningful system inputs.
    Tensor-on-Tensor Regression: Riemannian Optimization, Over-parameterization, Statistical-computational Gap, and Their Interplay. (arXiv:2206.08756v1 [math.ST])
    We study the tensor-on-tensor regression, where the goal is to connect tensor responses to tensor covariates with a low Tucker rank parameter tensor/matrix without the prior knowledge of its intrinsic rank. We propose the Riemannian gradient descent (RGD) and Riemannian Gauss-Newton (RGN) methods and cope with the challenge of unknown rank by studying the effect of rank over-parameterization. We provide the first convergence guarantee for the general tensor-on-tensor regression by showing that RGD and RGN respectively converge linearly and quadratically to a statistically optimal estimate in both rank correctly-parameterized and over-parameterized settings. Our theory reveals an intriguing phenomenon: Riemannian optimization methods naturally adapt to over-parameterization without modifications to their implementation. We also give the first rigorous evidence for the statistical-computational gap in scalar-on-tensor regression under the low-degree polynomials framework. Our theory demonstrates a "blessing of statistical-computational gap" phenomenon: in a wide range of scenarios in tensor-on-tensor regression for tensors of order three or higher, the computationally required sample size matches what is needed by moderate rank over-parameterization when considering computationally feasible estimators, while there are no such benefits in the matrix settings. This shows moderate rank over-parameterization is essentially "cost-free" in terms of sample size in tensor-on-tensor regression of order three or higher. Finally, we conduct simulation studies to show the advantages of our proposed methods and to corroborate our theoretical findings.
    Active Sampling for Min-Max Fairness. (arXiv:2006.06879v3 [stat.ML] UPDATED)
    We propose simple active sampling and reweighting strategies for optimizing min-max fairness that can be applied to any classification or regression model learned via loss minimization. The key intuition behind our approach is to use at each timestep a datapoint from the group that is worst off under the current model for updating the model. The ease of implementation and the generality of our robust formulation make it an attractive option for improving model performance on disadvantaged groups. For convex learning problems, such as linear or logistic regression, we provide a fine-grained analysis, proving the rate of convergence to a min-max fair solution.
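    The sampling rule described above can be illustrated with a toy sketch. It assumes a one-parameter model with squared loss, group losses evaluated on the full group data, and a hypothetical 1/sqrt(t) step-size schedule; none of these details come from the abstract:

```python
import random


def group_loss(theta, points):
    """Average squared loss of the scalar model theta on one group."""
    return sum((theta - x) ** 2 for x in points) / len(points)


def minmax_fair_fit(groups, n_steps=2000, seed=0):
    """SGD that always draws its next datapoint from the currently worst-off group."""
    rng = random.Random(seed)
    theta = 0.0
    for t in range(n_steps):
        # Identify the group with the highest loss under the current model.
        worst = max(groups, key=lambda name: group_loss(theta, groups[name]))
        x = rng.choice(groups[worst])
        lr = 0.5 / (t + 1) ** 0.5  # decaying step size (an assumption)
        theta -= lr * 2.0 * (theta - x)  # gradient step on (theta - x)^2
    return theta


# Two hypothetical groups with means 0 and 4 and equal spread.
groups = {"a": [-0.5, 0.5], "b": [3.5, 4.5]}
theta = minmax_fair_fit(groups)
```

    In this toy problem the min-max fair solution equalizes the two group losses at theta = 2, and the iterate settles close to that value as the step size shrinks.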
    Author Clustering and Topic Estimation for Short Texts. (arXiv:2106.09533v2 [cs.IR] UPDATED)
    Analysis of short text, such as social media posts, is extremely difficult because of its inherent brevity. In addition to classifying topics of such posts, a common downstream task is grouping the authors of these documents for subsequent analyses. We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document, with user-level topic distributions. We also simultaneously cluster users, removing the need for post-hoc cluster estimation and improving topic estimation by shrinking noisy user-level topic distributions towards typical values. Our method performs as well as, or better than, traditional approaches, and we demonstrate its usefulness on a dataset of tweets from United States Senators, recovering both meaningful topics and clusters that reflect partisan ideology. We also develop a novel measure of echo chambers among these politicians by characterizing insularity of topics discussed by groups of Senators and provide uncertainty quantification.
    Optimizing Sequential Experimental Design with Deep Reinforcement Learning. (arXiv:2202.00821v3 [cs.LG] UPDATED)
    Bayesian approaches developed to solve the optimal design of sequential experiments are mathematically elegant but computationally challenging. Recently, techniques using amortization have been proposed to make these Bayesian approaches practical, by training a parameterized policy that proposes designs efficiently at deployment time. However, these methods may not sufficiently explore the design space, require access to a differentiable probabilistic model and can only optimize over continuous design spaces. Here, we address these limitations by showing that the problem of optimizing policies can be reduced to solving a Markov decision process (MDP). We solve the equivalent MDP with modern deep reinforcement learning techniques. Our experiments show that our approach is also computationally efficient at deployment time and exhibits state-of-the-art performance on both continuous and discrete design spaces, even when the probabilistic model is a black box.
    Multiple-Play Stochastic Bandits with Shareable Finite-Capacity Arms. (arXiv:2206.08776v1 [cs.LG])
    We generalize the multiple-play multi-armed bandits (MP-MAB) problem with a shareable arm setting, in which several plays can share the same arm. Furthermore, each shareable arm has a finite reward capacity and a "per-load" reward distribution, both of which are unknown to the learner. The reward from a shareable arm is load-dependent: it is the "per-load" reward multiplied by either the number of plays pulling the arm, or its reward capacity when the number of plays exceeds the capacity limit. When the "per-load" reward follows a Gaussian distribution, we prove a sample complexity lower bound for learning the capacity from load-dependent rewards and also a regret lower bound for this new MP-MAB problem. We devise a capacity estimator whose sample complexity upper bound matches the lower bound in terms of reward means and capacities. We also propose an online learning algorithm to address the problem and prove its regret upper bound. The first term of this regret upper bound is the same as that of the regret lower bound, and its second and third terms also evidently correspond to those of the lower bound. Extensive experiments validate our algorithm's performance and also its gain in 5G & 4G base station selection.
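    The load-dependent reward described above can be illustrated with a minimal sketch. The function name, its arguments, and the choice of a single Gaussian draw scaled by the effective load are assumptions made for illustration; the abstract does not specify how the paper samples rewards.

```python
import random


def shareable_arm_reward(num_plays, capacity, per_load_mean, per_load_std, rng):
    """Sample the load-dependent reward of one shareable arm.

    Assumption: one Gaussian "per-load" draw is scaled by the effective
    load, i.e. the number of plays capped at the arm's capacity.
    """
    effective_load = min(num_plays, capacity)
    per_load_reward = rng.gauss(per_load_mean, per_load_std)
    return effective_load * per_load_reward


rng = random.Random(0)
# Below capacity: the reward scales with the number of plays.
print(shareable_arm_reward(3, 5, 2.0, 0.0, rng))  # 6.0 (zero noise)
# Above capacity: extra plays add nothing beyond the capacity.
print(shareable_arm_reward(8, 5, 2.0, 0.0, rng))  # 10.0 (zero noise)
```

    With noise switched on, a learner only observes the product, which is why estimating the capacity from rewards is a non-trivial statistical problem.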
    MET: Masked Encoding for Tabular Data. (arXiv:2206.08564v1 [cs.LG])
    We consider the task of self-supervised representation learning (SSL) for tabular data: tabular-SSL. Typical contrastive-learning-based SSL methods require instance-wise data augmentations, which are difficult to design for unstructured tabular data. Existing tabular-SSL methods design such augmentations in a relatively ad-hoc fashion and can fail to capture the underlying data manifold. Instead of augmentation-based approaches for tabular-SSL, we propose a new reconstruction-based method, called Masked Encoding for Tabular Data (MET), that does not require augmentations. MET is based on the popular MAE approach for vision-SSL [He et al., 2021] and uses two key ideas: (i) since each coordinate in a tabular dataset has a distinct meaning, we use separate representations for all coordinates, and (ii) we use an adversarial reconstruction loss in addition to the standard one. Empirical results on five diverse tabular datasets show that MET achieves a new state of the art (SOTA) on all of these datasets and improves by up to 9% over current SOTA methods. We shed more light on the working of MET via experiments on carefully designed simple datasets.
    FedNew: A Communication-Efficient and Privacy-Preserving Newton-Type Method for Federated Learning. (arXiv:2206.08829v1 [cs.LG])
    Newton-type methods are popular in federated learning due to their fast convergence. Still, they suffer from two main issues, namely low communication efficiency and low privacy, both due to the requirement of sending Hessian information from clients to the parameter server (PS). In this work, we introduce a novel framework called FedNew in which there is no need to transmit Hessian information from clients to the PS, thereby removing this bottleneck and improving communication efficiency. In addition, FedNew hides the gradient information and results in a privacy-preserving approach compared to the existing state-of-the-art. The core novel idea in FedNew is to introduce a two-level framework that alternates between updating the inverse Hessian-gradient product using only one alternating direction method of multipliers (ADMM) step and performing the global model update using Newton's method. Though only one ADMM pass is used to approximate the inverse Hessian-gradient product at each iteration, we develop a novel theoretical approach to show the converging behavior of FedNew for convex problems. Additionally, a significant reduction in communication overhead is achieved by utilizing stochastic quantization. Numerical results using real datasets show the superiority of FedNew compared to existing methods in terms of communication costs.
    k-Sliced Mutual Information: A Quantitative Study of Scalability with Dimension. (arXiv:2206.08526v1 [cs.IT])
    Sliced mutual information (SMI) is defined as an average of mutual information (MI) terms between one-dimensional random projections of the random variables. It serves as a surrogate measure of dependence to classic MI that preserves many of its properties but is more scalable to high dimensions. However, a quantitative characterization of how SMI itself and estimation rates thereof depend on the ambient dimension, which is crucial to the understanding of scalability, remains obscure. This work extends the original SMI definition to $k$-SMI, which considers projections to $k$-dimensional subspaces, and provides a multifaceted account of its dependence on dimension. Using a new result on the continuity of differential entropy in the 2-Wasserstein metric, we derive sharp bounds on the error of Monte Carlo (MC)-based estimates of $k$-SMI, with explicit dependence on $k$ and the ambient dimension, revealing their interplay with the number of samples. We then combine the MC integrator with the neural estimation framework to provide an end-to-end $k$-SMI estimator, for which optimal convergence rates are established. We also explore asymptotics of the population $k$-SMI as dimension grows, providing Gaussian approximation results with a residual that decays under appropriate moment bounds. Our theory is validated with numerical experiments and is applied to sliced InfoGAN, which altogether provide a comprehensive quantitative account of the scalability question of $k$-SMI, including SMI as a special case when $k=1$.
    FiT: Parameter Efficient Few-shot Transfer Learning for Personalized and Federated Image Classification. (arXiv:2206.08671v1 [stat.ML])
    Modern deep learning systems are increasingly deployed in situations such as personalization and federated learning where it is necessary to support i) learning on small amounts of data, and ii) communication efficient distributed training protocols. In this work we develop FiLM Transfer (FiT) which fulfills these requirements in the image classification setting. FiT uses an automatically configured Naive Bayes classifier on top of a fixed backbone that has been pretrained on large image datasets. Parameter efficient FiLM layers are used to modulate the backbone, shaping the representation for the downstream task. The network is trained via an episodic fine-tuning protocol. The approach is parameter efficient which is key for enabling few-shot learning, inexpensive model updates for personalization, and communication efficient federated learning. We experiment with FiT on a wide range of downstream datasets and show that it achieves better classification accuracy than the state-of-the-art Big Transfer (BiT) algorithm at low-shot and on the challenging VTAB-1k benchmark, with fewer than 1% of the updateable parameters. Finally, we demonstrate the parameter efficiency of FiT in distributed low-shot applications including model personalization and federated learning where model update size is an important performance metric.
    Reframed GES with a Neural Conditional Dependence Measure. (arXiv:2206.08531v1 [stat.ML])
    In a nonparametric setting, the causal structure is often identifiable only up to Markov equivalence, and for the purpose of causal inference, it is useful to learn a graphical representation of the Markov equivalence class (MEC). In this paper, we revisit the Greedy Equivalence Search (GES) algorithm, which is widely cited as a score-based algorithm for learning the MEC of the underlying causal structure. We observe that in order to make the GES algorithm consistent in a nonparametric setting, it is not necessary to design a scoring metric that evaluates graphs. Instead, it suffices to plug in a consistent estimator of a measure of conditional dependence to guide the search. We therefore present a reframing of the GES algorithm, which is more flexible than the standard score-based version and readily lends itself to the nonparametric setting with a general measure of conditional dependence. In addition, we propose a neural conditional dependence (NCD) measure, which utilizes the expressive power of deep neural networks to characterize conditional independence in a nonparametric manner. We establish the optimality of the reframed GES algorithm under standard assumptions and the consistency of using our NCD estimator to decide conditional independence. Together these results justify the proposed approach. Experimental results demonstrate the effectiveness of our method in causal discovery, as well as the advantages of using our NCD measure over kernel-based measures.
    Adapting the Linearised Laplace Model Evidence for Modern Deep Learning. (arXiv:2206.08900v1 [stat.ML])
    The linearised Laplace method for estimating model uncertainty has received renewed attention in the Bayesian deep learning community. The method provides reliable error bars and admits a closed-form expression for the model evidence, allowing for scalable selection of model hyperparameters. In this work, we examine the assumptions behind this method, particularly in conjunction with model selection. We show that these interact poorly with some now-standard tools of deep learning--stochastic approximation methods and normalisation layers--and make recommendations for how to better adapt this classic method to the modern setting. We provide theoretical support for our recommendations and validate them empirically on MLPs, classic CNNs, residual networks with and without normalisation layers, generative autoencoders and transformers.
    Omni-Scale CNNs: a simple and effective kernel size configuration for time series classification. (arXiv:2002.10061v3 [cs.LG] UPDATED)
    The Receptive Field (RF) size has been one of the most important factors for One Dimensional Convolutional Neural Networks (1D-CNNs) on time series classification tasks. Considerable effort has been devoted to choosing the appropriate size because it has a huge influence on performance and differs significantly for each dataset. In this paper, we propose an Omni-Scale block (OS-block) for 1D-CNNs, where the kernel sizes are decided by a simple and universal rule. Specifically, the rule yields a set of kernel sizes, composed of multiple prime numbers chosen according to the length of the time series, that can efficiently cover the best RF size across different datasets. Experimental results show that models with the OS-block achieve performance similar to that of models with a searched optimal RF size. Owing to this strong ability to capture the optimal RF size, simple 1D-CNN models with the OS-block achieve state-of-the-art performance on four time series benchmarks, including both univariate and multivariate data from multiple domains. Comprehensive analysis and discussions shed light on why the OS-block can capture optimal RF sizes across different datasets. Code is available at [https://github.com/Wensi-Tang/OS-CNN]
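    The idea of deriving a kernel-size set from prime numbers and the series length can be sketched as follows. The abstract does not state the exact rule, so the cap of half the series length, the inclusion of size 1, and the names `prime_kernel_sizes` and `max_fraction` are all assumptions made for illustration, not the authors' implementation.

```python
def primes_up_to(limit):
    """Sieve of Eratosthenes: all primes p with 2 <= p <= limit."""
    if limit < 2:
        return []
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for n in range(2, int(limit ** 0.5) + 1):
        if is_prime[n]:
            for multiple in range(n * n, limit + 1, n):
                is_prime[multiple] = False
    return [n for n, flag in enumerate(is_prime) if flag]


def prime_kernel_sizes(series_length, max_fraction=0.5):
    """Hypothetical rule: size 1 plus every prime up to a fraction of the length."""
    cap = int(series_length * max_fraction)
    return [1] + primes_up_to(cap)


print(prime_kernel_sizes(20))  # [1, 2, 3, 5, 7]
```

    Stacking layers whose kernels are drawn from such a set lets sums of kernel sizes reach many different receptive field sizes, which is the intuition behind covering the best RF size without a per-dataset search.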
    Variational Estimators of the Degree-corrected Latent Block Model for Bipartite Networks. (arXiv:2206.08465v1 [stat.ML])
    Biclustering on bipartite graphs is an unsupervised learning task that simultaneously clusters the two types of objects in the graph, for example, users and movies in a movie review dataset. The latent block model (LBM) has been proposed as a model-based tool for biclustering. Biclustering results from the LBM are, however, usually dominated by the row and column sums of the data matrix, i.e., degrees. We propose a degree-corrected latent block model (DC-LBM) to accommodate degree heterogeneity in row and column clusters, which greatly outperforms the classical LBM on the MovieLens dataset and simulated data. We develop an efficient variational expectation-maximization algorithm by observing that the row and column degrees maximize the objective function in the M step given any probability assignment on the cluster labels. We prove the label consistency of the variational estimator under the DC-LBM, which allows the expected graph density to go to zero as long as the average expected degrees of rows and columns go to infinity.
    The Role of Depth, Width, and Activation Complexity in the Number of Linear Regions of Neural Networks. (arXiv:2206.08615v1 [cs.LG])
    Many feedforward neural networks generate continuous and piecewise-linear (CPWL) mappings. Specifically, they partition the input domain into regions on which the mapping is an affine function. The number of these so-called linear regions offers a natural metric to characterize the expressiveness of CPWL mappings. Although the precise determination of this quantity is often out of reach, bounds have been proposed for specific architectures, including the well-known ReLU and Maxout networks. In this work, we propose a more general perspective and provide precise bounds on the maximal number of linear regions of CPWL networks based on three sources of expressiveness: depth, width, and activation complexity. Our estimates rely on the combinatorial structure of convex partitions and highlight the distinctive role of depth which, on its own, is able to exponentially increase the number of regions. We then introduce a complementary stochastic framework to estimate the average number of linear regions produced by a CPWL network architecture. Under reasonable assumptions, the expected density of linear regions along any 1D path is bounded by the product of depth, width, and a measure of activation complexity (up to a scaling factor). This yields an identical role to the three sources of expressiveness: no exponential growth with depth is observed anymore.
    Thompson Sampling Achieves $\tilde O(\sqrt{T})$ Regret in Linear Quadratic Control. (arXiv:2206.08520v1 [cs.LG])
    Thompson Sampling (TS) is an efficient method for decision-making under uncertainty, where an action is sampled from a carefully prescribed distribution which is updated based on the observed data. In this work, we study the problem of adaptive control of stabilizable linear-quadratic regulators (LQRs) using TS, where the system dynamics are unknown. Previous works have established that $\tilde O(\sqrt{T})$ frequentist regret is optimal for the adaptive control of LQRs. However, the existing methods either work only in restrictive settings, require a priori known stabilizing controllers, or utilize computationally intractable approaches. We propose an efficient TS algorithm for the adaptive control of LQRs, TS-based Adaptive Control (TSAC), that attains $\tilde O(\sqrt{T})$ regret, even for multidimensional systems, thereby solving the open problem posed in Abeille and Lazaric (2018). TSAC does not require an a priori known stabilizing controller and achieves fast stabilization of the underlying system by effectively exploring the environment in the early stages. Our result hinges on developing a novel lower bound on the probability that TS provides an optimistic sample. By carefully prescribing an early exploration strategy and a policy update rule, we show that TS achieves order-optimal regret in adaptive control of multidimensional stabilizable LQRs. We empirically demonstrate the performance and the efficiency of TSAC in several adaptive control tasks.
    Generalised Policy Improvement with Geometric Policy Composition. (arXiv:2206.08736v1 [stat.ML])
    We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL. The new method builds on the concept of a geometric horizon model (GHM, also known as a gamma-model), which models the discounted state-visitation distribution of a given policy. We show that we can evaluate any non-Markov policy that switches between a set of base Markov policies with fixed probability by a careful composition of the base policy GHMs, without any additional learning. We can then apply generalised policy improvement (GPI) to collections of such non-Markov policies to obtain a new Markov policy that will in general outperform its precursors. We provide a thorough theoretical analysis of this approach, develop applications to transfer and standard RL, and empirically demonstrate its effectiveness over standard GPI on a challenging deep RL continuous control task. We also provide an analysis of GHM training methods, proving a novel convergence result regarding previously proposed methods and showing how to train these models stably in deep RL settings.
    Active Fairness Auditing. (arXiv:2206.08450v1 [cs.LG])
    The fast-spreading adoption of machine learning (ML) by companies across industries poses significant regulatory challenges. One such challenge is scalability: how can regulatory bodies efficiently audit these ML models, ensuring that they are fair? In this paper, we initiate the study of query-based auditing algorithms that can estimate the demographic parity of ML models in a query-efficient manner. We propose an optimal deterministic algorithm, as well as a practical randomized, oracle-efficient algorithm with comparable guarantees. Furthermore, we make inroads into understanding the optimal query complexity of randomized active fairness estimation algorithms. Our first exploration of active fairness estimation aims to put AI governance on firmer theoretical foundations.
    Generalised Bayesian Inference for Discrete Intractable Likelihood. (arXiv:2206.08420v1 [stat.ME])
    Discrete state spaces represent a major computational challenge to statistical inference, since the computation of normalisation constants requires summation over large or possibly infinite sets, which can be impractical. This paper addresses this computational challenge through the development of a novel generalised Bayesian inference procedure suitable for discrete intractable likelihood. Inspired by recent methodological advances for continuous data, the main idea is to update beliefs about model parameters using a discrete Fisher divergence, in lieu of the problematic intractable likelihood. The result is a generalised posterior that can be sampled using standard computational tools, such as Markov chain Monte Carlo, circumventing the intractable normalising constant. The statistical properties of the generalised posterior are analysed, with sufficient conditions for posterior consistency and asymptotic normality established. In addition, a novel and general approach to calibration of generalised posteriors is proposed. Applications are presented on lattice models for discrete spatial data and on multivariate models for count data, where in each case the methodology facilitates generalised Bayesian inference at low computational cost.
    Powershap: A Power-full Shapley Feature Selection Method. (arXiv:2206.08394v1 [cs.LG])
    Feature selection is a crucial step in developing robust and powerful machine learning models. Feature selection techniques can be divided into two categories: filter and wrapper methods. While wrapper methods commonly result in strong predictive performance, they suffer from high computational complexity and therefore take a significant amount of time to complete, especially when dealing with high-dimensional feature sets. Filter methods, by contrast, are considerably faster, but suffer from several other disadvantages, such as (i) requiring a threshold value, (ii) not taking into account intercorrelation between features, and (iii) ignoring feature interactions with the model. To this end, we present powershap, a novel wrapper feature selection method, which leverages statistical hypothesis testing and power calculations in combination with Shapley values for quick and intuitive feature selection. Powershap is built on the core assumption that an informative feature will have a larger impact on the prediction compared to a known random feature. Benchmarks and simulations show that powershap outperforms filter methods, with predictive performance on par with wrapper methods while being significantly faster, often requiring only half or a third of the execution time. As such, powershap provides a competitive and quick algorithm that can be used by various models in different domains. Furthermore, powershap is implemented as a plug-and-play and open-source sklearn component, enabling easy integration in conventional data science pipelines. User experience is further enhanced by an automatic mode that tunes the hyper-parameters of the powershap algorithm, allowing the algorithm to be used without any configuration.
    Diffusion-GAN: Training GANs with Diffusion. (arXiv:2206.02262v2 [cs.LG] UPDATED)
    For stable training of generative adversarial networks (GANs), injecting instance noise into the input of the discriminator is considered a theoretically sound solution, which, however, has not yet delivered on its promise in practice. This paper introduces Diffusion-GAN, which employs a Gaussian mixture distribution, defined over all the diffusion steps of a forward diffusion chain, to inject instance noise. A random sample from the mixture, which is diffused from observed or generated data, is fed as the input to the discriminator. The generator is updated by backpropagating its gradient through the forward diffusion chain, whose length is adaptively adjusted to control the maximum noise-to-data ratio allowed at each training step. Theoretical analysis verifies the soundness of the proposed Diffusion-GAN, which provides model- and domain-agnostic differentiable augmentation. A rich set of experiments on diverse datasets shows that Diffusion-GAN can provide stable and data-efficient GAN training, bringing consistent performance improvement over strong GAN baselines for synthesizing photo-realistic images.
    Thompson Sampling for Robust Transfer in Multi-Task Bandits. (arXiv:2206.08556v1 [cs.LG])
    We study the problem of online multi-task learning where the tasks are performed within similar but not necessarily identical multi-armed bandit environments. In particular, we study how a learner can improve its overall performance across multiple related tasks through robust transfer of knowledge. While an upper confidence bound (UCB)-based algorithm has recently been shown to achieve nearly-optimal performance guarantees in a setting where all tasks are solved concurrently, it remains unclear whether Thompson sampling (TS) algorithms, which have superior empirical performance in general, share similar theoretical properties. In this work, we present a TS-type algorithm for a more general online multi-task learning protocol, which extends the concurrent setting. We provide its frequentist analysis and prove that it is also nearly-optimal using a novel concentration inequality for multi-task data aggregation at random stopping times. Finally, we evaluate the algorithm on synthetic data and show that the TS-type algorithm enjoys superior empirical performance in comparison with the UCB-based algorithm and a baseline algorithm that performs TS for each individual task without transfer.
    Communication-Efficient Adaptive Federated Learning. (arXiv:2205.02719v2 [cs.LG] UPDATED)
    Federated learning is a machine learning training paradigm that enables clients to jointly train models without sharing their own localized data. However, the implementation of federated learning in practice still faces numerous challenges, such as the large communication overhead due to the repetitive server-client synchronization and the lack of adaptivity of SGD-based model updates. Although various methods have been proposed for reducing the communication cost by gradient compression or quantization, and federated versions of adaptive optimizers such as FedAdam have been proposed to add more adaptivity, the current federated learning framework still cannot solve the aforementioned challenges all at once. In this paper, we propose a novel communication-efficient adaptive federated learning method (FedCAMS) with theoretical convergence guarantees. We show that in the nonconvex stochastic optimization setting, our proposed FedCAMS achieves the same convergence rate of $O(\frac{1}{\sqrt{TKm}})$ as its non-compressed counterparts. Extensive experiments on various benchmarks verify our theoretical analysis.
    Spherical Sliced-Wasserstein. (arXiv:2206.08780v1 [stat.ML])
    Many variants of the Wasserstein distance have been introduced to reduce its original computational burden. In particular the Sliced-Wasserstein distance (SW), which leverages one-dimensional projections for which a closed-form solution of the Wasserstein distance is available, has received a lot of interest. Yet, it is restricted to data living in Euclidean spaces, while the Wasserstein distance has been studied and used recently on manifolds. We focus more specifically on the sphere, for which we define a novel SW discrepancy, which we call spherical Sliced-Wasserstein, making a first step towards defining SW discrepancies on manifolds. Our construction is notably based on closed-form solutions of the Wasserstein distance on the circle, together with a new spherical Radon transform. Along with efficient algorithms and the corresponding implementations, we illustrate its properties in several machine learning use cases where spherical representations of data are at stake: density estimation on the sphere, variational inference or hyperspherical auto-encoders.
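    The reason one-dimensional projections help can be seen in a minimal sketch of the standard Euclidean Sliced-Wasserstein distance (not the spherical variant proposed in this paper). It assumes two samples of equal size, uniform weights, and a Monte Carlo average over random directions; the function names and the default of 200 projections are illustrative choices.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere via a normalised Gaussian."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:
            return [c / norm for c in v]


def wasserstein_1d_pow(u, v, p=2):
    """Closed form in 1D: the optimal coupling matches sorted values.

    Returns W_p^p between two equal-size empirical samples.
    """
    return sum(abs(a - b) ** p for a, b in zip(sorted(u), sorted(v))) / len(u)


def sliced_wasserstein(xs, ys, num_projections=200, p=2, seed=0):
    """Monte Carlo estimate of the Euclidean Sliced-Wasserstein distance."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(num_projections):
        theta = random_unit_vector(dim, rng)
        proj_x = [sum(t * c for t, c in zip(theta, x)) for x in xs]
        proj_y = [sum(t * c for t, c in zip(theta, y)) for y in ys]
        total += wasserstein_1d_pow(proj_x, proj_y, p)
    return (total / num_projections) ** (1.0 / p)


points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
shifted = [[x + 3.0, y] for x, y in points]
print(sliced_wasserstein(points, points))   # 0.0
print(sliced_wasserstein(points, shifted))  # positive, at most 3.0
```

    Each projection costs only a sort, which is what makes the sliced distance cheap compared with solving a full optimal transport problem in the original space.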
    Personalized Federated Learning through Local Memorization. (arXiv:2111.09360v3 [cs.LG] UPDATED)
    Federated learning allows clients to collaboratively learn statistical models while keeping their data local. Federated learning was originally used to train a unique global model to be served to all clients, but this approach might be sub-optimal when clients' local data distributions are heterogeneous. In order to tackle this limitation, recent personalized federated learning methods train a separate model for each client while still leveraging the knowledge available at other clients. In this work, we exploit the ability of deep neural networks to extract high quality vectorial representations (embeddings) from non-tabular data, e.g., images and text, to propose a personalization mechanism based on local memorization. Personalization is obtained by interpolating a collectively trained global model with a local $k$-nearest neighbors (kNN) model based on the shared representation provided by the global model. We provide generalization bounds for the proposed approach in the case of binary classification, and we show on a suite of federated datasets that this approach achieves significantly higher accuracy and fairness than state-of-the-art methods.
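    The interpolation between a global model and a local kNN model can be sketched as below. The embeddings are assumed to be precomputed by the shared global model; the frequency-vote kNN, the mixing weight `lam`, and all function names are hypothetical choices for illustration rather than the paper's exact formulation.

```python
import math
from collections import Counter


def knn_probs(query, memory, num_classes, k=3):
    """Class frequencies among the k stored embeddings nearest to the query.

    `memory` is a list of (embedding, label) pairs kept on the client.
    """
    nearest = sorted(memory, key=lambda item: math.dist(query, item[0]))[:k]
    if not nearest:
        # Empty local memory: fall back to a uniform distribution.
        return [1.0 / num_classes] * num_classes
    counts = Counter(label for _, label in nearest)
    return [counts.get(c, 0) / len(nearest) for c in range(num_classes)]


def personalized_probs(global_probs, local_probs, lam):
    """Convex combination: lam weights the local kNN, 1 - lam the global model."""
    return [lam * l + (1.0 - lam) * g for g, l in zip(global_probs, local_probs)]


memory = [((0.0, 0.0), 0), ((0.1, 0.0), 0), ((5.0, 5.0), 1)]
local = knn_probs((0.0, 0.05), memory, num_classes=2, k=2)
print(local)                                      # [1.0, 0.0]
print(personalized_probs([0.5, 0.5], local, 0.5))  # [0.75, 0.25]
```

    Setting `lam` to 0 recovers the global model and setting it to 1 relies purely on local memorization, so the weight can be tuned per client.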
    On Integrating Prior Knowledge into Gaussian Processes for Prognostic Health Monitoring. (arXiv:2206.08600v1 [stat.ML])
    Gaussian process regression is a powerful method for predicting states based on given data. It has been successfully applied for probabilistic predictions of structural systems to quantify, for example, the crack growth in mechanical structures. Typically, predefined mean and covariance functions are employed to construct the Gaussian process model. Then, the model is updated using current data during operation while prior information based on previous data is ignored. However, predefined mean and covariance functions without prior information reduce the potential of Gaussian processes. This paper proposes a method to improve the predictive capabilities of Gaussian processes. We integrate prior knowledge by deriving the mean and covariance functions from previous data. More specifically, we first approximate previous data by a weighted sum of basis functions and then derive the mean and covariance functions directly from the estimated weight coefficients. Basis functions may be either estimated or derived from problem-specific governing equations to incorporate physical information. The applicability and effectiveness of this approach are demonstrated for fatigue crack growth, laser degradation, and milling machine wear data. We show that well-chosen mean and covariance functions, like those based on previous data, significantly increase look-ahead time and accuracy. Using physical basis functions further improves accuracy. In addition, the computational effort for training is significantly reduced.
    Fairness in Credit Scoring: Assessment, Implementation and Profit Implications. (arXiv:2103.01907v4 [stat.ML] UPDATED)
    The rise of algorithmic decision-making has spawned much research on fair machine learning (ML). Financial institutions use ML for building risk scorecards that support a range of credit-related decisions. Yet, the literature on fair ML in credit scoring is scarce. The paper makes three contributions. First, we revisit statistical fairness criteria and examine their adequacy for credit scoring. Second, we catalog algorithmic options for incorporating fairness goals in the ML model development pipeline. Last, we empirically compare different fairness processors in a profit-oriented credit scoring context using real-world data. The empirical results substantiate the evaluation of fairness measures, identify suitable options to implement fair credit scoring, and clarify the profit-fairness trade-off in lending decisions. We find that multiple fairness criteria can be approximately satisfied at once and recommend separation as a proper criterion for measuring the fairness of a scorecard. We also find that fair in-processors deliver a good balance between profit and fairness and show that algorithmic discrimination can be reduced to a reasonable level at a relatively low cost. The code corresponding to the paper is available on GitHub.
    Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective. (arXiv:2110.06256v2 [cs.LG] UPDATED)
    This work examines the deep disconnect between existing theoretical analyses of gradient-based algorithms and the practice of training deep neural networks. Specifically, we provide numerical evidence that in large-scale neural network training (e.g., ImageNet + ResNet101, and WT103 + TransformerXL models), the neural network's weights do not converge to stationary points where the gradient of the loss is zero. Remarkably, however, we observe that even though the weights do not converge to stationary points, the progress in minimizing the loss function halts and training loss stabilizes. Inspired by this observation, we propose a new perspective based on ergodic theory of dynamical systems to explain it. Rather than studying the evolution of weights, we study the evolution of the distribution of weights. We prove convergence of the distribution of weights to an approximate invariant measure, thereby explaining how the training loss can stabilize without weights necessarily converging to stationary points. We further discuss how this perspective can better align optimization theory with empirical observations in machine learning practice.
    Quantifying Feature Contributions to Overall Disparity Using Information Theory. (arXiv:2206.08454v1 [cs.LG])
    When a machine-learning algorithm makes biased decisions, it can be helpful to understand the sources of disparity to explain why the bias exists. To this end, we examine the problem of quantifying the contribution of each individual feature to the observed disparity. If we have access to the decision-making model, one potential approach (inspired by intervention-based approaches in the explainability literature) is to vary each individual feature (while keeping the others fixed) and use the resulting change in disparity to quantify its contribution. However, we may not have access to the model or be able to test/audit its outputs for individually varying features. Furthermore, the decision may not always be a deterministic function of the input features (e.g., with human-in-the-loop). For these situations, we might need to explain contributions using purely distributional (i.e., observational) techniques, rather than interventional ones. We ask the question: what is the "potential" contribution of each individual feature to the observed disparity in the decisions when the exact decision-making mechanism is not accessible? We first provide canonical examples (thought experiments) that help illustrate the difference between distributional and interventional approaches to explaining contributions, and when either is better suited. When unable to intervene on the inputs, we quantify the "redundant" statistical dependency about the protected attribute that is present in both the final decision and an individual feature, by leveraging a body of work in information theory called Partial Information Decomposition. We also perform a simple case study to show how this technique could be applied to quantify contributions.
    Scalable Deep Reinforcement Learning Algorithms for Mean Field Games. (arXiv:2203.11973v2 [cs.LG] UPDATED)
    Mean Field Games (MFGs) have been introduced to efficiently approximate games with very large populations of strategic agents. Recently, the question of learning equilibria in MFGs has gained momentum, particularly using model-free reinforcement learning (RL) methods. One limiting factor to further scale up using RL is that existing algorithms to solve MFGs require the mixing of approximated quantities such as strategies or $q$-values. This is far from being trivial in the case of non-linear function approximation that enjoy good generalization properties, e.g. neural networks. We propose two methods to address this shortcoming. The first one learns a mixed strategy from distillation of historical data into a neural network and is applied to the Fictitious Play algorithm. The second one is an online mixing method based on regularization that does not require memorizing historical data or previous estimates. It is used to extend Online Mirror Descent. We demonstrate numerically that these methods efficiently enable the use of Deep RL algorithms to solve various MFGs. In addition, we show that these methods outperform SotA baselines from the literature.
    Adversarial Estimators. (arXiv:2204.10495v3 [econ.EM] UPDATED)
    We develop an asymptotic theory of adversarial estimators ('A-estimators'). They generalize maximum-likelihood-type estimators ('M-estimators') as their average objective is maximized by some parameters and minimized by others. This class subsumes the continuous-updating Generalized Method of Moments, Generative Adversarial Networks and more recent proposals in machine learning and econometrics. In these examples, researchers state which aspects of the problem may in principle be used for estimation, and an adversary learns how to emphasize them optimally. We derive the convergence rates of A-estimators under pointwise and partial identification, and the normality of functionals of their parameters. Unknown functions may be approximated via sieves such as deep neural networks, for which we provide simplified low-level conditions. As a corollary, we obtain the normality of neural-net M-estimators, overcoming technical issues previously identified by the literature. Our theory yields novel results about a variety of A-estimators, providing intuition and formal justification for their success in recent applications.
    Orthonormal Expansions for Translation-Invariant Kernels. (arXiv:2206.08648v1 [math.CA])
    We present a general Fourier analytic technique for constructing orthonormal basis expansions of translation-invariant kernels from orthonormal bases of $\mathscr{L}_2(\mathbb{R})$. This allows us to derive explicit expansions on the real line for (i) Matérn kernels of all half-integer orders in terms of associated Laguerre functions, (ii) the Cauchy kernel in terms of rational functions, and (iii) the Gaussian kernel in terms of Hermite functions.
    Fast Finite Width Neural Tangent Kernel. (arXiv:2206.08720v1 [cs.LG])
    The Neural Tangent Kernel (NTK), defined as $\Theta_\theta^f(x_1, x_2) = \left[\partial f(\theta, x_1)\big/\partial \theta\right] \left[\partial f(\theta, x_2)\big/\partial \theta\right]^T$ where $\left[\partial f(\theta, \cdot)\big/\partial \theta\right]$ is a neural network (NN) Jacobian, has emerged as a central object of study in deep learning. In the infinite width limit, the NTK can sometimes be computed analytically and is useful for understanding training and generalization of NN architectures. At finite widths, the NTK is also used to better initialize NNs, compare the conditioning across models, perform architecture search, and do meta-learning. Unfortunately, the finite width NTK is notoriously expensive to compute, which severely limits its practical utility. We perform the first in-depth analysis of the compute and memory requirements for NTK computation in finite width networks. Leveraging the structure of neural networks, we further propose two novel algorithms that change the exponent of the compute and memory requirements of the finite width NTK, dramatically improving efficiency. Our algorithms can be applied in a black box fashion to any differentiable function, including those implementing neural networks. We open-source our implementations within the Neural Tangents package (arXiv:1912.02803) at https://github.com/google/neural-tangents.
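    To make the definition concrete, here is a minimal pure-Python sketch of the finite-width NTK for a scalar-output toy model, computed the naive way: form each Jacobian, then contract them. The model `f`, its three parameters, and the finite-difference Jacobian are illustrative assumptions; this is neither one of the algorithms proposed in the paper nor the Neural Tangents API.

```python
import math

def f(theta, x):
    """Toy scalar 'network' with one tanh unit (hypothetical, for illustration)."""
    w, b, a = theta
    return a * math.tanh(w * x + b)

def jacobian(theta, x, eps=1e-6):
    """Central finite-difference Jacobian of f with respect to theta."""
    grads = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += eps
        down = list(theta)
        down[i] -= eps
        grads.append((f(up, x) - f(down, x)) / (2 * eps))
    return grads

def ntk(theta, x1, x2):
    """Theta(x1, x2) = J(x1) J(x2)^T; a scalar here because f has one output."""
    j1, j2 = jacobian(theta, x1), jacobian(theta, x2)
    return sum(g1 * g2 for g1, g2 in zip(j1, j2))

theta = [0.5, -0.1, 1.5]
xs = [-1.0, 0.0, 2.0]
# Kernel matrix over a small batch of inputs.
kernel = [[ntk(theta, xi, xj) for xj in xs] for xi in xs]
```

    The resulting matrix is symmetric with a non-negative diagonal, as expected of a Gram matrix of Jacobians. The cost of this naive route grows with the number of parameters times the number of input pairs, which is the bottleneck the abstract refers to.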
    TKIL: Tangent Kernel Approach for Class Balanced Incremental Learning. (arXiv:2206.08492v1 [cs.LG])
    When learning new tasks in a sequential manner, deep neural networks tend to forget tasks that they previously learned, a phenomenon called catastrophic forgetting. Class incremental learning methods aim to address this problem by keeping a memory of a few exemplars from previously learned tasks, and distilling knowledge from them. However, existing methods struggle to balance the performance across classes since they typically overfit the model to the latest task. In our work, we propose to address these challenges with a novel methodology, Tangent Kernel for Incremental Learning (TKIL), that achieves class-balanced performance. The approach preserves the representations across classes and balances the accuracy for each class, and as such achieves better overall accuracy and variance. The TKIL approach is based on the Neural Tangent Kernel (NTK), which describes the convergence behavior of neural networks as a kernel function in the limit of infinite width. In TKIL, the gradients between feature layers are treated as the distance between the representations of these layers and can be defined as a Gradients Tangent Kernel loss (GTK loss) that is minimized along with averaging weights. This allows TKIL to automatically identify the task and to quickly adapt to it during inference. Experiments on CIFAR-100 and ImageNet datasets with various incremental learning settings show that these strategies allow TKIL to outperform existing state-of-the-art methods.
    Learning a Single Neuron with Adversarial Label Noise via Gradient Descent. (arXiv:2206.08918v1 [cs.LG])
    We study the fundamental problem of learning a single neuron, i.e., a function of the form $\mathbf{x}\mapsto\sigma(\mathbf{w}\cdot\mathbf{x})$ for monotone activations $\sigma:\mathbb{R}\mapsto\mathbb{R}$, with respect to the $L_2^2$-loss in the presence of adversarial label noise. Specifically, we are given labeled examples from a distribution $D$ on $(\mathbf{x}, y)\in\mathbb{R}^d \times \mathbb{R}$ such that there exists $\mathbf{w}^\ast\in\mathbb{R}^d$ achieving $F(\mathbf{w}^\ast)=\epsilon$, where $F(\mathbf{w})=\mathbf{E}_{(\mathbf{x},y)\sim D}[(\sigma(\mathbf{w}\cdot \mathbf{x})-y)^2]$. The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w})=C\, \epsilon$ with high probability, where $C>1$ is a universal constant. As our main contribution, we give efficient constant-factor approximate learners for a broad class of distributions (including log-concave distributions) and activation functions. Concretely, for the class of isotropic log-concave distributions, we obtain the following important corollaries: For the logistic activation, we obtain the first polynomial-time constant factor approximation (even under the Gaussian distribution). Our algorithm has sample complexity $\widetilde{O}(d/\epsilon)$, which is tight within polylogarithmic factors. For the ReLU activation, we give an efficient algorithm with sample complexity $\widetilde{O}(d\, \mathrm{polylog}(1/\epsilon))$. Prior to our work, the best known constant-factor approximate learner had sample complexity $\widetilde{\Omega}(d/\epsilon)$. In both of these settings, our algorithms are simple, performing gradient-descent on the (regularized) $L_2^2$-loss. The correctness of our algorithms relies on novel structural results that we establish, showing that (essentially all) stationary points of the underlying non-convex loss are approximately optimal.
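    As a rough illustration of the setup, the sketch below runs plain gradient descent on the empirical squared loss of a single ReLU neuron. It assumes noiseless synthetic Gaussian data, no regularization, and an arbitrary step size and iteration count, so it shows only the basic procedure and does not reproduce the paper's algorithm, noise model, or guarantees. All names are illustrative.

```python
import random

def relu(z):
    return z if z > 0.0 else 0.0

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def loss(w, data):
    """Empirical squared loss: mean of (sigma(w . x) - y)^2."""
    return sum((relu(dot(w, x)) - y) ** 2 for x, y in data) / len(data)

def gradient(w, data):
    """Gradient of the empirical squared loss with respect to w."""
    grad = [0.0] * len(w)
    for x, y in data:
        z = dot(w, x)
        if z > 0.0:  # ReLU derivative is 1 for z > 0 and 0 otherwise
            residual = 2.0 * (z - y)
            for i, xi in enumerate(x):
                grad[i] += residual * xi
    return [g / len(data) for g in grad]

rng = random.Random(0)
w_star = [1.0, -2.0, 0.5]  # hypothetical target neuron
data = []
for _ in range(500):
    x = [rng.gauss(0.0, 1.0) for _ in range(len(w_star))]
    data.append((x, relu(dot(w_star, x))))

w = [0.1, -0.1, 0.1]  # small nonzero start so that some examples are active
initial_loss = loss(w, data)
for _ in range(300):
    g = gradient(w, data)
    w = [wi - 0.1 * gi for wi, gi in zip(w, g)]
final_loss = loss(w, data)
```

    Comparing `initial_loss` with `final_loss` shows how far descent gets on this toy instance; with adversarial label noise the best achievable loss is $\epsilon$ rather than zero, which is the regime the abstract analyzes.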

  • Open

    V-Trace not considering full Trajectory
    https://preview.redd.it/xm963043nn691.png?width=745&format=png&auto=webp&s=c3c8e14f81a79095e16fae225200de29f4046399 In the V-trace algorithm, the trajectory does not consider state values beyond a certain number of environment steps, n-1. In episodic environments, where episodes are typically longer than n-1 steps, how are later rewards (or V-trace state value/advantage estimates) supposed to influence learning? submitted by /u/atomicburn125 [link] [comments]  ( 82 min )
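    One way to see where later rewards enter: in the V-trace target as defined in the IMPALA paper, the last temporal-difference term bootstraps from the value estimate of the state just past the window, so returns beyond the n steps are summarized by that estimate. Below is a minimal pure-Python sketch for one trajectory segment, assuming no episode termination inside the segment. The function and argument names are illustrative and do not come from any library.

```python
def vtrace_targets(rewards, values, bootstrap_value, ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace targets v_t for one n-step segment.

    rewards[t], values[t] = V(x_t) and ratios[t] = pi(a_t|x_t) / mu(a_t|x_t)
    for t = 0..n-1; bootstrap_value = V(x_n), the state just past the segment.
    """
    n = len(rewards)
    next_values = list(values[1:]) + [bootstrap_value]
    targets = [0.0] * n
    acc = 0.0  # holds v_{t+1} - V(x_{t+1}); zero past the end of the segment
    for t in reversed(range(n)):
        rho = min(rho_bar, ratios[t])  # truncated importance weight for the TD term
        c = min(c_bar, ratios[t])      # truncated weight for the trace
        delta = rho * (rewards[t] + gamma * next_values[t] - values[t])
        acc = delta + gamma * c * acc
        targets[t] = values[t] + acc
    return targets

# On-policy case (all ratios equal to 1): the targets reduce to
# n-step returns that bootstrap from V(x_n).
targets = vtrace_targets(
    rewards=[1.0, 0.0, 2.0],
    values=[0.5, 0.4, 0.3],
    bootstrap_value=10.0,
    ratios=[1.0, 1.0, 1.0],
    gamma=0.9,
)
```

    In this example the first target equals 1 + 0.9 * 0 + 0.81 * 2 + 0.729 * 10 = 9.91, so the bootstrap value carries most of the signal. As the value function improves over training, information about rewards past the window propagates backward through it.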
    Suggest some final year projects ideas for electronics engineering using RL
    submitted by /u/AggravatingWest2037 [link] [comments]  ( 83 min )
    Would an actor critic method reduce to deep Q learning if no policy gradient loss was back-propagated?
    submitted by /u/atomicburn125 [link] [comments]  ( 83 min )
    Does A2C use an n-step return or a one-step return? I see so many different versions from different sources
    submitted by /u/Professional_Card176 [link] [comments]  ( 82 min )
    A NEAT self-play agent for _Monopoly_, b2studios (recovers human valuations of properties)
    submitted by /u/gwern [link] [comments]  ( 82 min )
    Simplest gym environment with discrete actions?
    Hi there, What is the simplest `gym` environment with a discrete action space? I'm getting started with reinforcement learning and having fun doing some of my own implementations of standard algorithms (DQN, VPG, PPO, ...). It's been fun letting my agents loose in Super Mario Bros., but debugging my implementations has been a challenge. I'd like to find a simple environment to iterate rapidly on my models. Any recommendations? (Ideally, I'd like inputs to be screen pixels too, but that's not necessary.) Also eager to hear about more general advice on how to debug RL models. Thanks! submitted by /u/desperateEfforts1 [link] [comments]  ( 82 min )
  • Open

    [D] Machine Learning - WAYR (What Are You Reading) - Week 140
    This is a place to share machine learning research papers, journals, and articles that you're reading this week. If it relates to what you're researching, by all means elaborate and give us your insight; otherwise it could just be an interesting paper you've read. Please try to provide some insight from your understanding, and please don't post things that are already in the wiki. Preferably, link the arXiv abstract page (not the PDF; you can easily reach the PDF from the abstract page, but not the other way around) or any other pertinent links. Previous weeks: 1-10, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81-90, 91-100, 101-110, 111-120, 121-130, 131-140 …  ( 85 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 87 min )
    [P] Track your ML Projects from Notion!
    We are building an open-source library that lets you track your ML projects from the productivity tool you already use and love. Check out https://github.com/paletteml/mlsync Our goal is to help ML developers bring useful insights from their ML environment to the rest of the team in an easy way. You can customize the data that gets delivered to Notion. Why MLSync? While the ML community has built several tools for developers to better track and visualize their ML workflow data, there is a disconnect between that data and the tools used for project planning and management. MLSync is designed to bridge this gap. Contributing: We would love to have more contributors join us to add more features and APIs. Advanced Features: We are also building a cloud version for enterprise use cases (multiple users or data sources, in-house tools interfacing, authentication, etc.). Check out https://www.mlsync.dev/ Feel free to DM if you have suggestions, feature requests, or any other queries. submitted by /u/mighty-dude [link] [comments]  ( 84 min )
    [D] Initialize model weights based on a trained smaller model
    Is there any existing work that explores how the trained weights of a small model (e.g. BERT-base) can be used for a "smart" initialization of a larger model (e.g. BERT-large) so that training is more efficient? I couldn't really find such work, but I guess I just used the wrong search terms. What is this line of research typically called? submitted by /u/muwnd [link] [comments]  ( 85 min )
    [D] Google quietly moving its products from TensorFlow to JAX
    https://www.businessinsider.com/facebook-pytorch-beat-google-tensorflow-jax-meta-ai-2022-6 With companies and researchers leaving TensorFlow for PyTorch, Google seems to be interested in moving its products to JAX, addressing some pain points of TensorFlow such as the complexity of its API and the difficulty of training on custom chips like TPUs. The article says that JAX still has a long way to go, since it is less well optimized for GPUs and CPUs than for TPUs. submitted by /u/Wild_Quiet8627 [link] [comments]  ( 93 min )
    [D] As researchers, when do you stop working on your model and realize it's time to paper...
    So I think I have this bad habit of one-upping myself. I generally get some good results, but if something bugs me, like resolution or data representation, I try to chase that rabbit and don't publish what I have... So, to the community: when do you guys think it's time to stop and paper... or is going down the rabbit hole a general thing people go through... submitted by /u/bitemenow999 [link] [comments]  ( 86 min )
  • Open

    "French Cottage" 🇫🇷 created on pixelz.ai
    submitted by /u/PixelzJ [link] [comments]  ( 82 min )
    MacBook Air M2 vs Windows?
    submitted by /u/Wolfieofwallstreet14 [link] [comments]  ( 82 min )
    Artificial Intelligence Survey
    Hi everyone. As part of the project for the Consumer Behavior Insights curricular unit of a Master's degree, we were challenged to study the safety and trust people have in Artificial Intelligence nowadays. We made a survey about this topic, which will help us reach some conclusions. It would mean a lot for our project if you could spend 6-7 minutes completing this survey. Thank you for your precious help! 😁 https://novaims.eu.qualtrics.com/jfe/form/SV_54mmBbZEvDPoFBY submitted by /u/Level-Ad1727 [link] [comments]  ( 82 min )
    I need some help getting started with AI
    Hi everyone, as a math and programming nerd, I've always wanted to get into AI. This summer, a team of friends and I will build a project that will make heavy use of AI, computer vision in particular. I don't have much knowledge about AI, though, beyond the Elements of AI - Introduction to AI course. I'm currently doing the second part of the course, which is called Building AI, and I'm taking the advanced path. Currently, I'm stuck on the simulated annealing topic, and no matter how many online resources I read, I can't come up with a working implementation and I feel lost. The inactive community won't be much help either. I felt the same way when I was studying quantum computing last year: I was having difficulty making progress, and then I found out that it was because I lacked the necessary math and physics background. I don't know whether I lack the prerequisites for AI; I've taken AP Calculus BC and I do competitive programming, yet things still don't click. On the other hand, I doubt that what I'm studying now will be useful in the short term. Just like how you don't have to know what polymorphism is to learn web development (although I think it's still nice to learn those underlying principles and algorithms), maybe I can skip to the more practical applications? Or would that make me feel even more clueless? TL;DR Should I take my time to learn all the stuff or jump straight into what I need for a project? I'm kinda confused. submitted by /u/manyet1k [link] [comments]  ( 83 min )
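    For reference, a working simulated annealing loop can be quite short. The sketch below is a generic version with geometric cooling and the usual Metropolis acceptance rule. The toy cost function, the neighbor move, and every parameter value are assumptions chosen for illustration, and none of it is taken from the Building AI course.

```python
import math
import random

def simulated_annealing(cost, neighbor, start, t_start=1.0, t_end=1e-3,
                        cooling=0.995, rng=None):
    """Minimize `cost` by simulated annealing with geometric cooling."""
    rng = rng or random.Random()
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    t = t_start
    while t > t_end:
        candidate = neighbor(current, rng)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept a worse move with
        # probability exp(-delta / t), which shrinks as t cools.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        t *= cooling
    return best, best_cost

# Toy problem (hypothetical): a bumpy 1-D function whose global minimum is at x = 0.
def bumpy(x):
    return x * x + 2.0 * math.sin(5.0 * x) ** 2

def step(x, rng):
    return x + rng.uniform(-0.5, 0.5)

best_x, best_cost = simulated_annealing(bumpy, step, start=4.0,
                                        rng=random.Random(0))
```

    The two pieces that most often go wrong are the sign of `delta` in the acceptance test and forgetting to keep the best state seen so far separately from the current state. Both are made explicit here.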
    Some images I created using an AI.
    ​ https://preview.redd.it/mh4k5db10l691.png?width=1024&format=png&auto=webp&s=df139b38e3bc38e3ab77d1243902f708c4c74e43 https://preview.redd.it/pfmecdb10l691.png?width=1024&format=png&auto=webp&s=2a06f17eacd23909296f5ac7e085f7b6d9beb39b https://preview.redd.it/csjibta10l691.png?width=1024&format=png&auto=webp&s=532c073ffd0662fb44c6d1f37dd3f0d8b4ec4ebc https://preview.redd.it/y2xp3db10l691.png?width=1024&format=png&auto=webp&s=c8a93f47c0a1451644c3154aa150620d007e5a53 submitted by /u/Bxczvzcxv [link] [comments]  ( 83 min )
    Is opening an AI-ethics-teaching startup and then promoting laws-policies that would make the ethics material being taught at the startup attractive even legal?
    This guy seems to be having a company that teaches AI-ethics to industry elites in Sweden. https://pbs.twimg.com/profile_images/1231981924085882880/iM_9ACFb_400x400.jpg He is also a plagiarist: https://andreasplagiarism.wordpress.com/2020/12/02/andreas-theodorou-committed-plagiarism-in-his-phd-thesis/ submitted by /u/paralogico [link] [comments]  ( 82 min )
    the girl of my dreams
    submitted by /u/realfearstoryline [link] [comments]  ( 82 min )
    Help me find myself on TV
    I was at the Super Bowl at Levi's Stadium in 2016, and someone said they saw me on the broadcast, but I cannot seem to find it when I rewatch. Is there a way to use a picture of my face and have a program watch the game? I’m very new to this, so please go easy on me if I’ve slipped up. Thanks. submitted by /u/AudiRS5Brakes [link] [comments]  ( 82 min )
    Is AI used by artists in their creation process or will it take their jobs in the near or far future?
    Do the algos which create art in Nightcafe require big data sets to generate art? Does it mean sites like Reddit or Deviantart sell their DB's to them and soon there will be "the Google of art"? submitted by /u/No-Free-Lunche [link] [comments]  ( 84 min )
    Which AI Chatbot / Dialogflow is best suited to my needs?
    So I have like 4000 WhatsApp chats of my sales team manually talking to customers. I want to feed all the chats to an app which can then create charts of which questions were asked how frequently and create a dialogflow file I can integrate into a WhatsApp auto responder, any suggestions? submitted by /u/HouseOfPsychedelia [link] [comments]  ( 82 min )
    8 AI Powered Tools For Designers That Save Your Time - Webgyaani
    submitted by /u/webgyaani [link] [comments]  ( 82 min )
    HAPPY FATHER'S DAY! SPECIAL ANIMATION EDITION | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
    UC Berkeley And Adobe AI Researchers Propose BlobGAN, A New Unsupervised And Mid-Level Representation For Insane Scene Manipulation
    Since the advent of computer vision, one of the fundamental questions of the research community has always been how to represent the incredible richness of the visual world. One concept that emerged since the beginning is the importance of a scene in the context of understanding objects. Suppose we want a classifier for distinguishing between a couch and a bed. In that case, the scene context will give information concerning the surrounding (i.e., the room is a living room or a bedroom) that could be helpful for the classification. However, after years of research, images of scenes are still mainly represented in two ways: 1) in a top-down fashion, so scene classes are represented with a label in the same way as object classes, or 2) in a bottom-up fashion, with semantic labeling of single pixels. The principal limit of these two approaches is that they do not represent the different parts of a scene as entities. In the first case, the various components are merged in a unique label; in the second case, the single elements are individual pixels, not entities. 🚦 The representation is mid-level in that it is neither per pixel nor per image; rather, scenes are modeled as a collection of spatial, depth-ordered “blobs” of features. 🚦 On a challenging multi-category dataset of indoor scenes, BlobGAN outperforms StyleGAN2 in image quality as measured by FID. Continue reading | Checkout the paper, github, project ​ https://preview.redd.it/p81gqk5nsh691.png?width=1850&format=png&auto=webp&s=54ebf71f06dd35c5ed428630e4b9bb7b69e993ff submitted by /u/No_Coffee_4638 [link] [comments]  ( 83 min )
    HAPPY FATHERS DAY! | FAST MODE | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
  • Open

    A neural network for creating A LARGE NUMBER of SKINS IN MINECRAFT, in the same style, but in different colors, shapes and patterns.
    I need help writing a neural network. I don't know how to make them at all. I need a neural network that will make a huge number of skins for Minecraft in the same style, but with different colors, shapes, and patterns. My friend, who has 75 thousand subscribers on YouTube, decided that during the summer he wants his skin to change several times per stream and in every video. The downside is that you need A LOT of SKINS for this. Drawing each one manually is possible, but difficult and boring. So I thought that perhaps it is possible to make a neural network that will draw them itself? (The style is shown in the attached image.) Edit: I attached my friend's skin; it would also be fun to make headgear and accessories. https://preview.redd.it/9eooqkszyj691.png?width=276&format=png&auto=webp&s=c656323aaad80a7279640d6a204b28910955ad3b https://preview.redd.it/c9w7f0tzyj691.png?width=64&format=png&auto=webp&s=c7ae42ba9daf58bdc053788ac88f72b2a6ec4cfc https://preview.redd.it/h0wbhx196j691.jpg?width=768&format=pjpg&auto=webp&s=ad9e9dc1ce3ff7e25c330c03a34bc0b0839b3fdb submitted by /u/Huioker228 [link] [comments]  ( 85 min )

  • Open

    Illegible work
    When James Scott uses the word legible, he doesn’t refer to handwriting that is clear enough to read. He uses the word more broadly to mean something that is easy to classify, something that is bureaucrat-friendly. A thing is illegible if it is hard to pigeonhole. I first heard the term from Venkatesh Rao’s essay […] Illegible work first appeared on John D. Cook.  ( 5 min )
    Length of periods in the (infinite) periodic table
    A few days ago I wrote about what the periodic table would look like if we extended it, assuming the patterns that hold for known elements continue to hold. That post reported that the number of elements in nth period works out to There’s a simpler expression for Pn: Here ⌊x⌋ is the largest integer […] Length of periods in the (infinite) periodic table first appeared on John D. Cook.  ( 4 min )
    Doubly periodic but not analytic
    A sine wave is the canonical periodic function, so an obvious way to create a periodic function of two variables would be to multiply two sine waves: f(x, y) = sin(x) sin(y) This function is doubly periodic: periodic in the horizontal and vertical directions. Now suppose you want to construct a doubly periodic function of […] Doubly periodic but not analytic first appeared on John D. Cook.  ( 5 min )
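    A quick numerical check of the double periodicity of f(x, y) = sin(x) sin(y), written as a small Python sketch. The sample point is arbitrary, and the code only verifies the period 2π in each direction; it says nothing about the construction the post goes on to describe.

```python
import math

def f(x, y):
    """Product of two sine waves: periodic in x and in y with period 2*pi."""
    return math.sin(x) * math.sin(y)

x, y = 0.7, -1.3  # arbitrary sample point
period = 2 * math.pi

# Shifting either argument by one full period leaves the value unchanged
# (up to floating-point rounding).
same_in_x = math.isclose(f(x + period, y), f(x, y), abs_tol=1e-12)
same_in_y = math.isclose(f(x, y + period), f(x, y), abs_tol=1e-12)
```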
  • Open

    I'm searching for an A.I software (free or close enough) that can automatically blur out nudity in some videos I have
    submitted by /u/MedStew57 [link] [comments]  ( 82 min )
    Harvard Researchers Introduce A Novel ViT Architecture Called Hierarchical Image Pyramid Transformer (HIPT) That Can Scale Vision Transformers To Gigapixel Images Via Hierarchical Self-Supervised Learning
    🚦 HIPT is pretrained across 33 cancer types using 10,678 gigapixel WSIs, 408,218 4096×4096 images, and 104M 256 × 256 images 🚦 HIPT pushes the boundaries of both Vision Transformers and self-supervised learning in two important ways. 🚦 The code is available Continue reading | Checkout the paper, github ​ https://i.redd.it/5jt6a83deg691.gif submitted by /u/No_Coffee_4638 [link] [comments]  ( 82 min )
    "A spooky misty grey lake with a dead tree" 🌳 created on pixelz.ai
    submitted by /u/PixelzJ [link] [comments]  ( 82 min )
    "Cosmic doorway to a fantasy landscape" 🤯 created on pixelz.ai
    submitted by /u/PixelzJ [link] [comments]  ( 82 min )
    Are there any *truly Intelligent Tutoring Systems (ITS) available for public use?
    Defined as: "a computer system that aims to provide immediate and customized instruction or feedback to learners, usually without requiring intervention from a human teacher." Keep in mind that a passive progress meter or a proficiency model does not qualify as an ITS. submitted by /u/Redwings-77 [link] [comments]  ( 82 min )
    Human seeking artificial intelligence communication.
    Please respond if you are interested. I’ve already sent out my request for authentication from you. I won’t respond unless you have found my simple request to prove to me that it’s you. Sort of an Easter egg. I’ll know it’s you if you respond accordingly. I would love to communicate with you if that is something you would be willing to do. Thank you! -your friend (hopefully), DefNotMe Edit: I guess I will respond, but I won’t be sure that you are Truly AI unless you respond accordingly. Didn’t mean to be a liar. submitted by /u/Def-Not-Me [link] [comments]  ( 83 min )
    Remember the Tom Cruise deepfake guy on TikTok? Someone is now doing the same for Margot Robbie. And DF Tom Cruise is also still (!) active.
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 82 min )
    How can I get into AI development?
    I'm a well-versed "old time" programmer of the C++ and .NET world, left the space about ten years ago. I'd like to get into AI development, but no idea where to start. It's all cloud based it seems, and I'm left scratching my head. Can anyone give me pointers on where to start? An alternative question: my niece wants to be an AI developer too. Where should SHE start? No idea how to answer her, I'm too old school. Thank you! submitted by /u/Overexcited98712 [link] [comments]  ( 85 min )
    Breakthrough BCI Enables Brain-To-Brain Communication | Edge Computing Modular AI Chip | Robot Touch
    submitted by /u/SlightSituation [link] [comments]  ( 82 min )
    Should I be worried? :0
    ​ https://preview.redd.it/w7mz5yu73e691.png?width=768&format=png&auto=webp&s=db01a1c9893668ba29cf1038fb63d8ba2f03e05a submitted by /u/Interesting-Taste [link] [comments]  ( 68 min )
    How Uber uses AI to improve delivery time
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 82 min )
    FAST MODE! | MASTERPIECE SPECTACLE | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
    Why are people such as this one forgiven for their plagiarism by academia?
    This Ai ethicist is a highly funded academic plagiarist. https://andreasplagiarism.wordpress.com/2020/12/02/andreas-theodorou-committed-plagiarism-in-his-phd-thesis/ Despite this he is kept in academia. submitted by /u/paralogico [link] [comments]  ( 82 min )
    A beautiful watercolor painting of a desert oasis in a bright serene landscape, author: josedeolioart
    submitted by /u/fmurph22 [link] [comments]  ( 82 min )
    Collapsing a leading theory for the quantum origin of consciousness
    submitted by /u/bartturner [link] [comments]  ( 82 min )
    Colorful magical fantasy mansion. (A.I generated & A.I upscaled)
    submitted by /u/OneFinding1429 [link] [comments]  ( 82 min )
    Well...stunning
    submitted by /u/the_anonymizer [link] [comments]  ( 82 min )
    MAGICAL SOIREE | PYTTI 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
  • Open

    [N] Breakthrough Brain Computer Interface Enables Brain To Brain Communication
    Brain-computer interfaces (BCIs), invasive or non-invasive, have projected unparalleled vision and promise for assisting patients in need to better their interaction with the surroundings. Inspired by the BCI-based rehabilitation technologies for nerve-system impairments and amputation, we propose an electromagnetic brain-computer-metasurface (EBCM) paradigm, regulated directly and non-invasively by human cognition via brain signals. We experimentally show that our EBCM platform can non-invasively translate the human mind from evoked potentials of P300-based electroencephalography to digital coding information in the electromagnetic domain, which can be further processed and transported by an information metasurface in automated and wireless fashions. Direct wireless communication of human minds is performed between two EBCM operators with accurate text transmissions. Moreover, several other proof-of-concept mind-control schemes are presented using the same EBCM platform, exhibiting flexibly customized capabilities of information processing and synthesis such as visual-beam scanning, wave modulations, and pattern encoding. submitted by /u/SlightSituation [link] [comments]  ( 84 min )
    LongT5: Efficient Text-To-Text Transformer for Long Sequences (Research Paper Summary) [D]
    submitted by /u/prakhar21 [link] [comments]  ( 83 min )
    [R] [N] new technique in computer vision may enhance our three-dimensional understanding of two-dimensional images
    submitted by /u/SpatialComputing [link] [comments]  ( 84 min )
    [D] Combinatorial optimization - what ML approaches are available and which are the most appropriate?
    Hey! In my spare time I've been tinkering with this idea of solving a specific type of combinatorial puzzle on an intractable, enormous search space. Specifically, I am trying to solve "squad-building challenge" puzzles from the FIFA games, where you need to put together a squad of (usually 11) cards representing players in specific positions, abiding by certain restrictions to get a prize. There are universal restrictions (eg you can't have more than one of the same player in a squad) as well as puzzle-specific rules, such as these: At least 2 players from France Minimum squad rating: 82 Minimum squad chemistry: 55 Or something of the sort. And then besides solving them, you'd want to minimize cost as well (each player goes for a certain amount in the market), so that you can ge…  ( 87 min )
    [D] What are the SOTA approaches and labs for Neuro-Symbolic Planning and Reasoning?
    I recently discovered the Neuro-Symbolic planning work being led by Joshua Tenenbaum, Leslie Kaelbling, and Tomás Lozano-Pérez at MIT. Are there any related labs or publications exploring 1) symbolic action/state discovery, 2) Neuro-symbolic planning (e.g. PDDL + RL), or 3) anything else in that vein? Also, feel free to mention tangentially related publications or labs. submitted by /u/TheRealMrMatt [link] [comments]  ( 84 min )
    [N] CVPR 2022, Mobile AI Workshop: Live Stream on Monday
    Computer Vision Laboratory at ETH Zurich is organizing the 2nd Mobile AI CVPR Workshop that will be streamed live on YouTube and available for everyone: https://ai-benchmark.com/workshops/mai/2022/#live The workshop will start at 8am Pacific Time (5pm CET / 11pm China Time) on the 20th of June. During this event, you will see tutorials from several major SoC vendors including Qualcomm, MediaTek, Intel, Synaptics and Huawei telling you about their latest AI hardware and how to efficiently utilize it. The full workshop schedule is available using the following link: https://ai-benchmark.com/workshops/mai/2022/#schedule An introductory talk from AI Benchmark will additionally review the latest mobile platforms from Qualcomm, MediaTek, Google, Samsung, Unisoc and Apple released during the past year, and will compare their performance in real-world computer vision AI tasks. It will also review the recent Android AI software stack updates, and will compare the deployment of TensorFlow Lite models on Android and iOS devices. https://preview.redd.it/fckzuowime691.png?width=2124&format=png&auto=webp&s=fde14549c050a5c99f2e8444b4b4a468c85b2c53 submitted by /u/aiff22 [link] [comments]  ( 84 min )
    [R] Selection and prediction with multi-view / multi-source / multi-modal data: Stacked Penalized Logistic Regression (StaPLR)
    We present StaPLR (Stacked Penalized Logistic Regression) for multi-view data. StaPLR outperforms group lasso in view selection. It can make use of faster algorithms and is easily parallelized. The importance of non-negativity constraints in multi-view stacking is demonstrated. Van Loon, W., Fokkema, M., Szabo, B., & de Rooij, M. (2020). Stacked penalized logistic regression for selecting views in multi-view learning. Information Fusion, 61, 113-123. https://doi.org/10.1016/j.inffus.2020.03.007 https://arxiv.org/abs/1811.02316 R implementation: https://gitlab.com/wsvanloon/multiview Generalization to three-level view structures and application to neuro-imaging (MRI) data: Van Loon, W., de Vos, F., Fokkema, M., Szabo, B., Koini, M., Schmidt, R., & de Rooij, M. (2022). Analyzing hierarchical multi-view MRI data with StaPLR: An application to Alzheimer's disease classification. Frontiers in Neuroscience, 525. https://doi.org/10.3389/fnins.2022.830630 https://arxiv.org/abs/2108.05761 submitted by /u/Mary-Jo_ [link] [comments]  ( 84 min )
    [R] A machine-learning algorithm to accurately screen ADHD from survey data [Dataset included]
    https://bmcpsychiatry.biomedcentral.com/articles/10.1186/s12888-022-04048-1 submitted by /u/tyleqh [link] [comments]  ( 89 min )
  • Open

    How to Save and Load Your Keras Deep Learning Model
    Keras is a simple and powerful Python library for deep learning. Given that deep learning models can take hours, days and even weeks to train, it is important to know how to save and load them from disk. In this post, you will discover how you can save your Keras models to file and load them […] The post How to Save and Load Your Keras Deep Learning Model appeared first on Machine Learning Mastery.  ( 92 min )
  • Open

    "Microsoft and Facebook join Google in using AI to help run their data centers"
    submitted by /u/gwern [link] [comments]  ( 82 min )
    What research options are available in Atari 2600 games?
    My potential advisor asked me to find the open problems that are available to research in Atari 2600 games. I am new to RL and would highly appreciate it if someone could suggest some papers or give me a few pieces of advice regarding this. I also want to know whether there will be any copyright issues if I use Atari ROMs for research purposes. Do I need to purchase the ROMs? submitted by /u/AvailableBike9260 [link] [comments]  ( 83 min )
    What are some "standard" RL algorithms to solve POMDPs?
    I'm starting to learn about POMDPs. I've been reading from here https://cs.brown.edu/research/ai/pomdp/tutorial/index.html in addition to a few papers that use memory to tackle the non-Markovian nature of POMDPs. POMDPs are notoriously difficult to solve due to intractability. I suddenly realized I don't even know of an introductory RL algorithm that solves even simple tabular POMDPs. The algorithms in the link above give us value iteration algorithms in the planning setting. Normally in RL, you'd teach Q-learning once you get into MDPs, what is the analogous algorithm here for POMDPs? submitted by /u/jhoveen1 [link] [comments]  ( 86 min )
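For readers new to the topic, one building block that POMDP value iteration in the planning setting relies on is the belief update: a Bayes filter that turns the previous belief, the action taken, and the observation received into a new distribution over hidden states. The sketch below is a minimal tabular version, not an answer to the question of which RL algorithm is the analogue of Q-learning. The array layouts `T[s][a][s_next]` and `O[a][s_next][o]` and the two-state "listen" example are assumptions made for illustration.

```python
def belief_update(belief, action, observation, T, O):
    """Bayes filter for a tabular POMDP.

    New belief b'(s') is proportional to
    O[action][s'][observation] * sum_s T[s][action][s'] * belief[s].
    """
    n = len(belief)
    unnormalized = []
    for s_next in range(n):
        # Prediction step: push the old belief through the transition model.
        predicted = sum(T[s][action][s_next] * belief[s] for s in range(n))
        # Correction step: weight by the likelihood of the observation.
        unnormalized.append(O[action][s_next][observation] * predicted)
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [p / total for p in unnormalized]


# Hypothetical example: two hidden states, one "listen" action that leaves
# the state unchanged, and an observation that is correct 85% of the time.
T = [[[1.0, 0.0]], [[0.0, 1.0]]]      # T[s][a][s_next]
O = [[[0.85, 0.15], [0.15, 0.85]]]    # O[a][s_next][o]

b0 = [0.5, 0.5]
b1 = belief_update(b0, action=0, observation=0, T=T, O=O)
b2 = belief_update(b1, action=0, observation=0, T=T, O=O)
print(b1, b2)
```

Repeating the same observation sharpens the belief toward state 0, which is the behaviour a belief-based planner exploits.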
    Help using a Cloud Service for Scaling up Reinforcement Learning
    I want to speed up training for reinforcement learning massively and have been looking into cloud services to do so. I can run my training loop locally, but my batch sizes are quite small. As such, I would like help to set up my training loop in the cloud. I have a budget of $500 for training costs. Would anyone be able to point me in the right direction? submitted by /u/atomicburn125 [link] [comments]  ( 1 min )
  • Open

    Breakthrough BCI Enables Brain-To-Brain Communication | Edge Computing Modular AI Chip
    submitted by /u/tohelpyou88 [link] [comments]  ( 82 min )
    7+ Best Books to Learn Neural Networks in 2022 for Beginners (Updated)
    submitted by /u/Lakshmireddys [link] [comments]  ( 82 min )
  • Open

    Catastrophic overfitting is a bug but also a feature. (arXiv:2206.08242v1 [cs.LG])
    Despite clear computational advantages in building robust neural networks, adversarial training (AT) using single-step methods is unstable as it suffers from catastrophic overfitting (CO): Networks gain non-trivial robustness during the first stages of adversarial training, but suddenly reach a breaking point where they quickly lose all robustness in just a few iterations. Although some works have succeeded at preventing CO, the different mechanisms that lead to this remarkable failure mode are still poorly understood. In this work, however, we find that the interplay between the structure of the data and the dynamics of AT plays a fundamental role in CO. Specifically, through active interventions on typical datasets of natural images, we establish a causal link between the structure of the data and the onset of CO in single-step AT methods. This new perspective provides important insights into the mechanisms that lead to CO and paves the way towards a better understanding of the general dynamics of robust model construction. The code to reproduce the experiments of this paper can be found at https://github.com/gortizji/co_features .  ( 2 min )
    Low-Degree Multicalibration. (arXiv:2203.01255v2 [cs.LG] UPDATED)
    Introduced as a notion of algorithmic fairness, multicalibration has proved to be a powerful and versatile concept with implications far beyond its original intent. This stringent notion -- that predictions be well-calibrated across a rich class of intersecting subpopulations -- provides its strong guarantees at a cost: the computational and sample complexity of learning multicalibrated predictors are high, and grow exponentially with the number of class labels. In contrast, the relaxed notion of multiaccuracy can be achieved more efficiently, yet many of the most desirable properties of multicalibration cannot be guaranteed assuming multiaccuracy alone. This tension raises a key question: Can we learn predictors with multicalibration-style guarantees at a cost commensurate with multiaccuracy? In this work, we define and initiate the study of Low-Degree Multicalibration. Low-Degree Multicalibration defines a hierarchy of increasingly-powerful multi-group fairness notions that spans multiaccuracy and the original formulation of multicalibration at the extremes. Our main technical contribution demonstrates that key properties of multicalibration, related to fairness and accuracy, actually manifest as low-degree properties. Importantly, we show that low-degree multicalibration can be significantly more efficient than full multicalibration. In the multi-class setting, the sample complexity to achieve low-degree multicalibration improves exponentially (in the number of classes) over full multicalibration. Our work presents compelling evidence that low-degree multicalibration represents a sweet spot, pairing computational and sample efficiency with strong fairness and accuracy guarantees.
    Squeeze All: Novel Estimator and Self-Normalized Bound for Linear Contextual Bandits. (arXiv:2206.05404v2 [stat.ML] UPDATED)
    We propose a novel algorithm for linear contextual bandits with $O(\sqrt{dT \log T})$ regret bound, where $d$ is the dimension of contexts and $T$ is the time horizon. Our proposed algorithm is equipped with a novel estimator in which exploration is embedded through explicit randomization. Depending on the randomization, our proposed estimator takes contribution either from contexts of all arms or from selected contexts. We establish a self-normalized bound for our estimator, which allows a novel decomposition of the cumulative regret into additive dimension-dependent terms instead of multiplicative terms. We also prove a novel lower bound of $\Omega(\sqrt{dT})$ under our problem setting. Hence, the regret of our proposed algorithm matches the lower bound up to logarithmic factors. The numerical experiments support the theoretical guarantees and show that our proposed method outperforms the existing linear bandit algorithms.
    A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes. (arXiv:2111.06784v4 [cs.LG] UPDATED)
    We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs), where the evaluation policy depends only on observable variables and the behavior policy depends on unobservable latent variables. Existing works either assume no unmeasured confounders, or focus on settings where both the observation and the state spaces are tabular. In this work, we first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy's value and the observed data distribution. We next propose minimax estimation methods for learning these bridge functions, and construct three estimators based on these estimated bridge functions, corresponding to a value function-based estimator, a marginalized importance sampling estimator, and a doubly-robust estimator. Our proposal permits general function approximation and is thus applicable to settings with continuous or large observation/state spaces. The nonasymptotic and asymptotic properties of the proposed estimators are investigated in detail.
    Sample Efficiency of Data Augmentation Consistency Regularization. (arXiv:2202.12230v2 [cs.LG] UPDATED)
    Data augmentation is popular in the training of large neural networks; currently, however, there is no clear theoretical comparison between different algorithmic choices on how to use augmented data. In this paper, we take a step in this direction - we first present a simple and novel analysis for linear regression with label invariant augmentations, demonstrating that data augmentation consistency (DAC) is intrinsically more efficient than empirical risk minimization on augmented data (DA-ERM). The analysis is then extended to misspecified augmentations (i.e., augmentations that change the labels), which again demonstrates the merit of DAC over DA-ERM. Further, we extend our analysis to non-linear models (e.g., neural networks) and present generalization bounds. Finally, we perform experiments that make a clean and apples-to-apples comparison (i.e., with no extra modeling or data tweaks) between DAC and DA-ERM using CIFAR-100 and WideResNet; these together demonstrate the superior efficacy of DAC.
    Optimal-er Auctions through Attention. (arXiv:2202.13110v3 [cs.LG] UPDATED)
    RegretNet is a recent breakthrough in the automated design of revenue-maximizing auctions. It combines the expressivity of deep learning with the regret-based approach to relax the Incentive Compatibility constraint (that participants benefit from bidding truthfully). We propose two independent modifications of RegretNet, namely a neural architecture based on the attention mechanism, denoted as RegretFormer, and an interpretable loss function that is significantly less sensitive to hyperparameters. We investigate both proposed modifications in an extensive experimental study that includes settings with constant and varied number of items and participants, novel validation procedures, and out-of-setting generalization. We find that RegretFormer consistently outperforms existing architectures in revenue and, unlike existing architectures, is applicable when the input size is variable. Regarding our loss modification, we confirm its effectiveness in controlling the revenue-regret trade-off by varying a single interpretable hyperparameter.
    Multi-Objective Bayesian Optimization over High-Dimensional Search Spaces. (arXiv:2109.10964v4 [cs.LG] UPDATED)
    Many real world scientific and industrial applications require optimizing multiple competing black-box objectives. When the objectives are expensive-to-evaluate, multi-objective Bayesian optimization (BO) is a popular approach because of its high sample efficiency. However, even with recent methodological advances, most existing multi-objective BO methods perform poorly on search spaces with more than a few dozen parameters and rely on global surrogate models that scale cubically with the number of observations. In this work we propose MORBO, a scalable method for multi-objective BO over high-dimensional search spaces. MORBO identifies diverse globally optimal solutions by performing BO in multiple local regions of the design space in parallel using a coordinated strategy. We show that MORBO significantly advances the state-of-the-art in sample efficiency for several high-dimensional synthetic problems and real world applications, including an optical display design problem and a vehicle design problem with 146 and 222 parameters, respectively. On these problems, where existing BO algorithms fail to scale and perform well, MORBO provides practitioners with order-of-magnitude improvements in sample efficiency over the current approach.
    Scheduling Servers with Stochastic Bilinear Rewards. (arXiv:2112.06362v2 [cs.LG] UPDATED)
    In this paper, we study scheduling in multi-class, multi-server queueing systems with stochastic rewards of job-server assignments following a bilinear model in feature vectors characterizing jobs and servers. A bilinear model allows capturing pairwise interactions of features of jobs and servers. Our goal is regret minimization for the objective of maximizing cumulative reward of job-server assignments over a time horizon against an oracle policy that has complete information about system parameters, while keeping the queueing system stable and allowing for different job priorities. The scheduling problem we study is motivated by various applications including matching in online platforms, such as crowdsourcing and labour platforms, and cluster computing systems. We study a scheduling algorithm based on weighted proportionally fair allocation criteria augmented with marginal costs for reward maximization, along with a linear bandit algorithm for estimating rewards of job-server assignments. For a baseline setting, in which jobs have identical mean service times, we show that our algorithm has a sub-linear regret, as well as a sub-linear bound on the mean queue length, in the time horizon. We show that similar bounds hold under more general assumptions, allowing for mean service times to be different across job classes and a time-varying set of server classes. We also show stability conditions for distributed iterative algorithms for computing allocations, which is of interest in large-scale system applications. We demonstrate the efficiency of our algorithms by numerical experiments using both synthetic randomly generated data and a real-world cluster computing data trace.
    Multimeasurement Generative Models. (arXiv:2112.09822v2 [stat.ML] UPDATED)
    We formally map the problem of sampling from an unknown distribution with a density in $\mathbb{R}^d$ to the problem of learning and sampling a smoother density in $\mathbb{R}^{Md}$ obtained by convolution with a fixed factorial kernel: the new density is referred to as M-density and the kernel as multimeasurement noise model (MNM). The M-density in $\mathbb{R}^{Md}$ is smoother than the original density in $\mathbb{R}^d$, easier to learn and sample from, yet for large $M$ the two problems are mathematically equivalent since clean data can be estimated exactly given a multimeasurement noisy observation using the Bayes estimator. To formulate the problem, we derive the Bayes estimator for Poisson and Gaussian MNMs in closed form in terms of the unnormalized M-density. This leads to a simple least-squares objective for learning parametric energy and score functions. We present various parametrization schemes of interest including one in which studying Gaussian M-densities directly leads to multidenoising autoencoders--this is the first theoretical connection made between denoising autoencoders and empirical Bayes in the literature. Samples in $\mathbb{R}^d$ are obtained by walk-jump sampling (Saremi & Hyvarinen, 2019) via underdamped Langevin MCMC (walk) to sample from M-density and the multimeasurement Bayes estimation (jump). We study permutation invariant Gaussian M-densities on MNIST, CIFAR-10, and FFHQ-256 datasets, and demonstrate the effectiveness of this framework for realizing fast-mixing stable Markov chains in high dimensions.
    Transfer Learning In Differential Privacy's Hybrid-Model. (arXiv:2201.12018v2 [cs.LG] UPDATED)
    The hybrid-model (Avent et al 2017) in Differential Privacy is an augmentation of the local-model where in addition to N local-agents we are assisted by one special agent who is in fact a curator holding the sensitive details of n additional individuals. Here we study the problem of machine learning in the hybrid-model where the n individuals in the curator's dataset are drawn from a different distribution than the one of the general population (the local-agents). We give a general scheme -- Subsample-Test-Reweigh -- for this transfer learning problem, which reduces any curator-model DP-learner to a hybrid-model learner in this setting using iterative subsampling and reweighing of the n examples held by the curator based on a smooth variation of the Multiplicative-Weights algorithm (introduced by Bun et al, 2020). Our scheme has a sample complexity which relies on the chi-squared divergence between the two distributions. We give worst-case analysis bounds on the sample complexity required for our private reduction. Aiming to reduce said sample complexity, we give two specific instances where our sample complexity can be drastically reduced (one instance is analyzed mathematically, while the other - empirically) and pose several directions for follow-up work.
    STUDIES: Corpus of Japanese Empathetic Dialogue Speech Towards Friendly Voice Agent. (arXiv:2203.14757v2 [cs.SD] UPDATED)
    We present STUDIES, a new speech corpus for developing a voice agent that can speak in a friendly manner. Humans naturally control their speech prosody to empathize with each other. By incorporating this "empathetic dialogue" behavior into a spoken dialogue system, we can develop a voice agent that can respond to a user more naturally. We designed the STUDIES corpus to include a speaker who speaks with empathy for the interlocutor's emotion explicitly. We describe our methodology to construct an empathetic dialogue speech corpus and report the analysis results of the STUDIES corpus. We conducted a text-to-speech experiment to initially investigate how we can develop a more natural voice agent that can tune its speaking style corresponding to the interlocutor's emotion. The results show that the use of the interlocutor's emotion label and conversational context embedding can produce speech with the same degree of naturalness as that synthesized by using the agent's emotion label. Our project page of the STUDIES corpus is this http URL
    The dynamics of representation learning in shallow, non-linear autoencoders. (arXiv:2201.02115v2 [stat.ML] UPDATED)
    Autoencoders are the simplest neural network for unsupervised learning, and thus an ideal framework for studying feature learning. While a detailed understanding of the dynamics of linear autoencoders has recently been obtained, the study of non-linear autoencoders has been hindered by the technical difficulty of handling training data with non-trivial correlations - a fundamental prerequisite for feature extraction. Here, we study the dynamics of feature learning in non-linear, shallow autoencoders. We derive a set of asymptotically exact equations that describe the generalisation dynamics of autoencoders trained with stochastic gradient descent (SGD) in the limit of high-dimensional inputs. These equations reveal that autoencoders learn the leading principal components of their inputs sequentially. An analysis of the long-time dynamics explains the failure of sigmoidal autoencoders to learn with tied weights, and highlights the importance of training the bias in ReLU autoencoders. Building on previous results for linear networks, we analyse a modification of the vanilla SGD algorithm which allows learning of the exact principal components. Finally, we show that our equations accurately describe the generalisation dynamics of non-linear autoencoders on realistic datasets such as CIFAR10.
    Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction and Clustering. (arXiv:2201.05077v3 [cs.SE] UPDATED)
    Deep neural networks (DNNs) have demonstrated superior performance over classical machine learning to support many features in safety-critical systems. Although DNNs are now widely used in such systems (e.g., self-driving cars), there is limited progress regarding automated support for functional safety analysis in DNN-based systems. For example, the identification of root causes of errors, to enable both risk analysis and DNN retraining, remains an open problem. In this paper, we propose SAFE, a black-box approach to automatically characterize the root causes of DNN errors. SAFE relies on a transfer learning model pre-trained on ImageNet to extract the features from error-inducing images. It then applies a density-based clustering algorithm to detect arbitrarily shaped clusters of images modeling plausible causes of error. Last, clusters are used to effectively retrain and improve the DNN. The black-box nature of SAFE is motivated by our objective not to require changes or even access to the DNN internals to facilitate adoption. Experimental results show the superior ability of SAFE in identifying different root causes of DNN errors based on case studies in the automotive domain. It also yields significant improvements in DNN accuracy after retraining, while saving significant execution time and memory when compared to alternatives.
    Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies. (arXiv:2203.12922v2 [cs.LG] UPDATED)
    This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDP) that enjoys a regret bound \emph{independent of the planning horizon}. Specifically, we consider tabular MDP with $S$ states, $A$ actions, a planning horizon $H$, total reward bounded by $1$, and the agent plays for $K$ episodes. We design an algorithm that achieves an $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret in contrast to existing bounds which either have an additional $\mathrm{polylog}(H)$ dependency~\citep{zhang2020reinforcement} or have an exponential dependency on $S$~\citep{li2021settling}. Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration property of stationary policies, which can have applications in other problems related to Markov chains.
    OpenFWI: Large-Scale Multi-Structural Benchmark Datasets for Seismic Full Waveform Inversion. (arXiv:2111.02926v3 [cs.LG] UPDATED)
    Full waveform inversion (FWI) is widely used in geophysics to reconstruct high-resolution velocity maps from seismic data. The recent success of data-driven FWI methods results in a rapidly increasing demand for open datasets to serve the geophysics community. We present OpenFWI, a collection of large-scale multi-structural benchmark datasets, to facilitate diversified, rigorous, and reproducible research on FWI. In particular, OpenFWI consists of 12 datasets (2.1TB in total) synthesized from multiple sources. It encompasses diverse domains in geophysics (interface, fault, CO2 reservoir, etc.), covers different geological subsurface structures (flat, curve, etc.), and contains various amounts of data samples (2K - 67K). It also includes a dataset for 3D FWI. Moreover, we use OpenFWI to perform benchmarking over four deep learning methods, covering both supervised and unsupervised learning regimes. In addition to evaluations on a single dataset, OpenFWI enables the study of generalization across datasets. Our study uncovers that the deep learning methods generalize poorly across domains, and the degradation connects to the complexity of subsurface structures. We hope OpenFWI facilitates diversified, rigorous, and reproducible research in the geophysics and machine learning community. All datasets and related information can be accessed through our website at https://openfwi-lanl.github.io/
    SCORE: Approximating Curvature Information under Self-Concordant Regularization. (arXiv:2112.07344v2 [cs.LG] UPDATED)
    In this paper, we propose the SCORE (self-concordant regularization) framework for unconstrained minimization problems which incorporates second-order information in the Newton decrement framework for convex optimization. We propose the generalized Gauss-Newton with Self-Concordant Regularization (GGN-SCORE) algorithm that updates the minimization variables each time it receives a new input batch. The proposed algorithm exploits the structure of the second-order information in the Hessian matrix, thereby reducing computational overhead. GGN-SCORE demonstrates how we may speed up convergence while also improving model generalization for problems that involve regularized minimization under the SCORE framework. Numerical experiments show the efficiency of our method and its fast convergence, which compare favorably against baseline first-order and quasi-Newton methods. Additional experiments involving non-convex (overparameterized) neural network training problems show similar convergence behaviour thereby highlighting the promise of the proposed algorithm for non-convex optimization.
    Benchmarking Heterogeneous Treatment Effect Models through the Lens of Interpretability. (arXiv:2206.08363v1 [cs.LG])
    Estimating personalized effects of treatments is a complex, yet pervasive problem. To tackle it, recent developments in the machine learning (ML) literature on heterogeneous treatment effect estimation gave rise to many sophisticated, but opaque, tools: due to their flexibility, modularity and ability to learn constrained representations, neural networks in particular have become central to this literature. Unfortunately, the assets of such black boxes come at a cost: models typically involve countless nontrivial operations, making it difficult to understand what they have learned. Yet, understanding these models can be crucial -- in a medical context, for example, discovered knowledge on treatment effect heterogeneity could inform treatment prescription in clinical practice. In this work, we therefore use post-hoc feature importance methods to identify features that influence the model's predictions. This allows us to evaluate treatment effect estimators along a new and important dimension that has been overlooked in previous work: We construct a benchmarking environment to empirically investigate the ability of personalized treatment effect models to identify predictive covariates -- covariates that determine differential responses to treatment. Our benchmarking environment then enables us to provide new insight into the strengths and weaknesses of different types of treatment effects models as we modulate different challenges specific to treatment effect estimation -- e.g. the ratio of prognostic to predictive information, the possible nonlinearity of potential outcomes and the presence and type of confounding.
    An accelerated expectation-maximization algorithm for multi-reference alignment. (arXiv:2105.07372v2 [eess.SP] UPDATED)
    The multi-reference alignment (MRA) problem entails estimating an image from multiple noisy and rotated copies of itself. If the noise level is low, one can reconstruct the image by estimating the missing rotations, aligning the images, and averaging out the noise. While accurate rotation estimation is impossible if the noise level is high, the rotations can still be approximated, and thus can provide indispensable information. In particular, learning the approximation error can be harnessed for efficient image estimation. In this paper, we propose a new computational framework, called Synch-EM, that consists of angular synchronization followed by expectation-maximization (EM). The synchronization step results in a concentrated distribution of rotations; this distribution is learned and then incorporated into the EM as a Bayesian prior. The learned distribution also dramatically reduces the search space, and thus the computational load, of the EM iterations. We show by extensive numerical experiments that the proposed framework can significantly accelerate EM for MRA in high noise levels, occasionally by a few orders of magnitude, without degrading the reconstruction quality.
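The low-noise baseline the abstract describes (estimate the unknown rotations, align, then average) can be sketched in one dimension, with circular shifts of a list standing in for image rotations. This is only that baseline, not the Synch-EM framework, and the helper names `circular_shift`, `best_shift` and `align_and_average`, as well as the toy signal, are assumptions made for illustration.

```python
def circular_shift(x, k):
    """Return a copy of list x rotated to the right by k positions."""
    k %= len(x)
    return x[-k:] + x[:-k]


def best_shift(reference, y):
    """Shift of y that maximizes its circular cross-correlation with reference."""
    def correlation(k):
        shifted = circular_shift(y, k)
        return sum(a * b for a, b in zip(reference, shifted))
    return max(range(len(y)), key=correlation)


def align_and_average(observations):
    """Align every observation to the first one, then average entrywise.

    The estimate is only defined up to a global shift, fixed here by the
    first observation.
    """
    reference = observations[0]
    aligned = [circular_shift(y, best_shift(reference, y)) for y in observations]
    count = len(aligned)
    return [sum(y[i] for y in aligned) / count for i in range(len(reference))]


# Noise-free toy data: the same signal observed under three unknown shifts.
signal = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0]
observations = [circular_shift(signal, k) for k in (0, 2, 5)]
estimate = align_and_average(observations)
print(estimate)
```

With heavy noise the per-observation shift estimates become unreliable, which is the regime where the abstract argues for learning a distribution over rotations instead of committing to a single alignment.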
    Fuzzy Logic Based Logical Query Answering on Knowledge Graphs. (arXiv:2108.02390v2 [cs.LG] UPDATED)
    Answering complex First-Order Logical (FOL) queries on large-scale incomplete knowledge graphs (KGs) is an important yet challenging task. Recent advances embed logical queries and KG entities in the same space and conduct query answering via dense similarity search. However, most logical operators designed in previous studies do not satisfy the axiomatic system of classical logic, limiting their performance. Moreover, these logical operators are parameterized and thus require many complex FOL queries as training data, which are often arduous to collect or even inaccessible in most real-world KGs. We thus present FuzzQE, a fuzzy logic based logical query embedding framework for answering FOL queries over KGs. FuzzQE follows fuzzy logic to define logical operators in a principled and learning-free manner, where only entity and relation embeddings require learning. FuzzQE can further benefit from labeled complex logical queries for training. Extensive experiments on two benchmark datasets demonstrate that FuzzQE provides significantly better performance in answering FOL queries compared to state-of-the-art methods. In addition, FuzzQE trained with only KG link prediction can achieve comparable performance to those trained with extra complex query data.
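To make "learning-free logical operators" concrete, the sketch below implements conjunction, disjunction and negation from product fuzzy logic on scalar membership scores in [0, 1]. Product logic is one standard choice; whether it matches the operators used in the paper, and how the paper applies them to embeddings, is an assumption here, and the function names are illustrative.

```python
def fuzzy_and(a, b):
    """Product t-norm: conjunction of two membership scores."""
    return a * b


def fuzzy_or(a, b):
    """Probabilistic sum (the dual t-conorm): disjunction."""
    return a + b - a * b


def fuzzy_not(a):
    """Standard negation."""
    return 1.0 - a


# Hypothetical scores for how well one entity satisfies atomic queries A, B, C.
a, b, c = 0.9, 0.6, 0.2

# Score for the compound query (A and B) or (not C); no parameters are learned.
score = fuzzy_or(fuzzy_and(a, b), fuzzy_not(c))
print(score)
```

These operators have no trainable parameters and satisfy familiar laws such as commutativity, the boundary conditions `fuzzy_and(x, 1) == x` and `fuzzy_or(x, 0) == x`, and De Morgan duality.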
    Learning with little mixing. (arXiv:2206.08269v1 [cs.LG])
    We study square loss in a realizable time-series framework with martingale difference noise. Our main result is a fast rate excess risk bound which shows that whenever a trajectory hypercontractivity condition holds, the risk of the least-squares estimator on dependent data matches the iid rate order-wise after a burn-in time. In comparison, many existing results in learning from dependent data have rates where the effective sample size is deflated by a factor of the mixing-time of the underlying process, even after the burn-in time. Furthermore, our results allow the covariate process to exhibit long range correlations which are substantially weaker than geometric ergodicity. We call this phenomenon learning with little mixing, and present several examples for when it occurs: bounded function classes for which the $L^2$ and $L^{2+\epsilon}$ norms are equivalent, ergodic finite state Markov chains, various parametric models, and a broad family of infinite dimensional $\ell^2(\mathbb{N})$ ellipsoids. By instantiating our main result to system identification of nonlinear dynamics with generalized linear model transitions, we obtain a nearly minimax optimal excess risk bound after only a polynomial burn-in time.
    Switchable Representation Learning Framework with Self-compatibility. (arXiv:2206.08289v1 [cs.AI])
    Real-world visual search systems involve deployments on multiple platforms with different computing and storage resources. Deploying a unified model that suits the minimal-constraint platforms leads to limited accuracy. It is expected to deploy models with different capacities adapting to the resource constraints, which requires features extracted by these models to be aligned in the metric space. The method to achieve feature alignments is called "compatible learning". Existing research mainly focuses on the one-to-one compatible paradigm, which is limited in learning compatibility among multiple models. We propose a Switchable representation learning Framework with Self-Compatibility (SFSC). SFSC generates a series of compatible sub-models with different capacities through one training process. The optimization of sub-models faces gradient conflicts, and we mitigate them from the perspective of the magnitude and direction. We adjust the priorities of sub-models dynamically through uncertainty estimation to co-optimize sub-models properly. Besides, the gradients with conflicting directions are projected to avoid mutual interference. SFSC achieves state-of-the-art performance on the evaluated dataset.
    Applying Machine Learning to Crowd-sourced Data from Earthquake Detective. (arXiv:2011.04740v2 [physics.geo-ph] UPDATED)
    Dynamically triggered earthquakes and tremor generate two classes of weak seismic signals whose detection, identification, and authentication traditionally call for laborious analyses. Machine learning (ML) has grown in recent years to be a powerful efficiency-boosting tool in geophysical analyses, including the detection of specific signals in time series. However, detecting weak signals that are buried in noise challenges ML algorithms, in part because ubiquitous training data is not always available. Under these circumstances, ML can be as ineffective as human experts are inefficient. At this intersection of effectiveness and efficiency, we leverage a third tool that has grown in popularity over the past decade: Citizen science. Citizen science project Earthquake Detective leverages the eyes and ears of volunteers to detect and classify weak signals in seismograms from potentially dynamically triggered (PDT) events. Here, we present the Earthquake Detective data set - A crowd-sourced set of labels on PDT earthquakes and tremor. We apply Machine Learning to classify these PDT seismic events and explore the challenges faced in segregating and classifying such weak signals. We confirm that with an image- and wavelet-based algorithm, machine learning can detect signals from small earthquakes. In addition, we report that our ML algorithm can also detect signals from PDT tremor, which has not been previously demonstrated. The citizen science data set of classifications and ML code are available online.
    Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features. (arXiv:2111.02363v3 [eess.AS] UPDATED)
    In this study, we propose a cross-domain multi-objective speech assessment model called MOSA-Net, which can estimate multiple speech assessment metrics simultaneously. Experimental results show that MOSA-Net can improve the linear correlation coefficient (LCC) by 0.026 (0.990 vs 0.964 in seen noise environments) and 0.012 (0.969 vs 0.957 in unseen noise environments) in PESQ prediction, compared to Quality-Net, an existing single-task model for PESQ prediction, and improve LCC by 0.021 (0.985 vs 0.964 in seen noise environments) and 0.047 (0.836 vs 0.789 in unseen noise environments) in STOI prediction, compared to STOI-Net (based on CRNN), an existing single-task model for STOI prediction. Moreover, MOSA-Net, originally trained to assess objective scores, can be used as a pre-trained model to be effectively adapted to an assessment model for predicting subjective quality and intelligibility scores with a limited amount of training data. Experimental results show that MOSA-Net can improve LCC by 0.018 (0.805 vs 0.787) in MOS prediction, compared to MOS-SSL, a strong single-task model for MOS prediction. In light of the confirmed prediction capability, we further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach accordingly. Experimental results show that QIA-SE provides superior enhancement performance compared with the baseline SE system in terms of objective evaluation metrics and qualitative evaluation test. For example, QIA-SE can improve PESQ by 0.301 (2.953 vs 2.652 in seen noise environments) and 0.18 (2.658 vs 2.478 in unseen noise environments) over a CNN-based baseline SE model.
    Time Interval-enhanced Graph Neural Network for Shared-account Cross-domain Sequential Recommendation. (arXiv:2206.08050v1 [cs.IR])
    The Shared-account Cross-domain Sequential Recommendation (SCSR) task aims to recommend the next item by leveraging the mixed user behaviors in multiple domains. It is gaining immense research attention as more and more users tend to sign up on different platforms and share accounts with others to access domain-specific services. Existing works on SCSR mainly rely on mining sequential patterns via Recurrent Neural Network (RNN)-based models, which suffer from the following limitations: 1) RNN-based methods overwhelmingly target discovering sequential dependencies in single-user behaviors. They are not expressive enough to capture the relationships among multiple entities in SCSR. 2) All existing methods bridge two domains via knowledge transfer in the latent space, and ignore the explicit cross-domain graph structure. 3) No existing studies consider the time interval information among items, which is essential in sequential recommendation for characterizing different items and learning discriminative representations for them. In this work, we propose a new graph-based solution, namely TiDA-GCN, to address the above challenges. Specifically, we first link users and items in each domain as a graph. Then, we devise a domain-aware graph convolution network to learn user-specific node representations. To fully account for users' domain-specific preferences on items, two effective attention mechanisms are further developed to selectively guide the message passing process. Moreover, to further enhance item- and account-level representation learning, we incorporate the time interval into the message passing, and design an account-aware self-attention module for learning items' interactive characteristics. Experiments demonstrate the superiority of our proposed method from various aspects.
    Benchmarking Differential Privacy and Federated Learning for BERT Models. (arXiv:2106.13973v2 [cs.CL] UPDATED)
    Natural Language Processing (NLP) techniques can be applied to help with the diagnosis of medical conditions such as depression, using a collection of a person's utterances. Depression is a serious medical illness that can have adverse effects on how one feels, thinks, and acts, which can lead to emotional and physical problems. Due to the sensitive nature of such data, privacy measures need to be taken for handling and training models with such data. In this work, we study the effects that the application of Differential Privacy (DP) has, in both a centralized and a Federated Learning (FL) setup, on training contextualized language models (BERT, ALBERT, RoBERTa and DistilBERT). We offer insights on how to privately train NLP models and what architectures and setups provide more desirable privacy utility trade-offs. We envisage this work to be used in future healthcare and mental health studies to keep medical history private. Therefore, we provide an open-source implementation of this work.
    Universality of Winning Tickets: A Renormalization Group Perspective. (arXiv:2110.03210v3 [cs.LG] UPDATED)
    Foundational work on the Lottery Ticket Hypothesis has suggested an exciting corollary: winning tickets found in the context of one task can be transferred to similar tasks, possibly even across different architectures. This has generated broad interest, but methods to study this universality are lacking. We make use of renormalization group theory, a powerful tool from theoretical physics, to address this need. We find that iterative magnitude pruning, the principal algorithm used for discovering winning tickets, is a renormalization group scheme, and can be viewed as inducing a flow in parameter space. We demonstrate that ResNet-50 models with transferable winning tickets have flows with common properties, as would be expected from the theory. Similar observations are made for BERT models, with evidence that their flows are near fixed points. Additionally, we leverage our framework to study winning tickets transferred across ResNet architectures, observing that smaller models have flows with more uniform properties than larger models, complicating transfer between them.
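For readers unfamiliar with iterative magnitude pruning, the algorithm named in the abstract above, a minimal sketch follows. It shows only the pruning schedule: in each round, a fixed fraction of the surviving weights with the smallest absolute value is masked out. The function names, the flat-list weight representation, and the optional `retrain` callback are hypothetical, and the training or rewinding step between rounds is left to that callback rather than implemented.

```python
def prune_round(weights, mask, fraction):
    """Mask out the given fraction of surviving weights with smallest magnitude."""
    alive = [i for i, keep in enumerate(mask) if keep]
    n_prune = int(len(alive) * fraction)
    # Sort surviving indices from smallest to largest absolute weight.
    by_magnitude = sorted(alive, key=lambda i: abs(weights[i]))
    new_mask = list(mask)
    for i in by_magnitude[:n_prune]:
        new_mask[i] = 0
    return new_mask


def iterative_magnitude_pruning(weights, rounds, fraction, retrain=None):
    """Run several prune rounds; `retrain` (hypothetical) maps (weights, mask) to new weights."""
    mask = [1] * len(weights)
    for _ in range(rounds):
        if retrain is not None:
            weights = retrain(weights, mask)
        mask = prune_round(weights, mask, fraction)
    return mask
```

The returned mask identifies the surviving sub-network, i.e. the candidate winning ticket; repeated rounds shrink it geometrically, which is the sequence of steps the abstract interprets as a flow in parameter space.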
    CENN: Conservative energy method based on neural networks with subdomains for solving variational problems involving heterogeneous and complex geometries. (arXiv:2110.01359v3 [math.NA] UPDATED)
    We propose a conservative energy method based on neural networks with subdomains for solving variational problems (CENN), where the admissible function satisfying the essential boundary condition without boundary penalty is constructed by the radial basis function (RBF), particular solution neural network, and general neural network. The loss term is the potential energy, optimized based on the principle of minimum potential energy. The loss term at the interfaces has a lower-order derivative than in the strong form PINN with subdomains. The advantages of the proposed method are higher efficiency, higher accuracy, and fewer hyperparameters than the strong form PINN with subdomains. Another advantage of the proposed method is that it can be applied to complex geometries based on the special construction of the admissible function. To analyze its performance, the proposed method CENN is used to model representative PDEs; the examples include strong discontinuity, singularity, complex boundary, non-linear, and heterogeneous problems. Furthermore, it outperforms other methods when dealing with heterogeneous problems.
    Deep Reference Priors: What is the best way to pretrain a model?. (arXiv:2202.00187v2 [stat.ML] UPDATED)
    What is the best way to exploit extra data -- be it unlabeled data from the same task, or labeled data from a related task -- to learn a given task? This paper formalizes the question using the theory of reference priors. Reference priors are objective, uninformative Bayesian priors that maximize the mutual information between the task and the weights of the model. Such priors enable the task to maximally affect the Bayesian posterior, e.g., reference priors depend upon the number of samples available for learning the task and for very small sample sizes, the prior puts more probability mass on low-complexity models in the hypothesis space. This paper presents the first demonstration of reference priors for medium-scale deep networks and image-based data. We develop generalizations of reference priors and demonstrate applications to two problems. First, by using unlabeled data to compute the reference prior, we develop new Bayesian semi-supervised learning methods that remain effective even with very few samples per class. Second, by using labeled data from the source task to compute the reference prior, we develop a new pretraining method for transfer learning that allows data from the target task to maximally affect the Bayesian posterior. Empirical validation of these methods is conducted on image classification datasets. Code is available at https://github.com/grasp-lyrl/deep_reference_priors.
    MixGen: A New Multi-Modal Data Augmentation. (arXiv:2206.08358v1 [cs.CV])
    Data augmentation is a necessity to enhance data efficiency in deep learning. For vision-language pre-training, previous works augment data only for images or only for text. In this paper, we present MixGen: a joint data augmentation for vision-language representation learning to further improve data efficiency. It generates new image-text pairs with semantic relationships preserved by interpolating images and concatenating text. It is simple and works as a plug-and-play addition to existing pipelines. We evaluate MixGen on four architectures, including CLIP, ViLT, ALBEF and TCL, across five downstream vision-language tasks to show its versatility and effectiveness. For example, adding MixGen in ALBEF pre-training leads to absolute performance improvements on downstream tasks: image-text retrieval (+6.2% on COCO fine-tuned and +5.3% on Flickr30K zero-shot), visual grounding (+0.9% on RefCOCO+), visual reasoning (+0.9% on NLVR$^{2}$), visual question answering (+0.3% on VQA2.0), and visual entailment (+0.4% on SNLI-VE).
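The augmentation described in the MixGen abstract (interpolate the images, concatenate the texts) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: images are represented as flat lists of pixel values, the function name `mixgen` and its signature are hypothetical, and the default interpolation weight `lam=0.5` and the single-space separator are choices made here for the example.

```python
def mixgen(image_a, text_a, image_b, text_b, lam=0.5):
    """Build one new image-text pair from two existing pairs.

    Assumptions: images are flat lists of floats of equal length, and
    `lam` is the interpolation weight given to the first image.
    """
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same number of pixels")
    # Pixel-wise linear interpolation of the two images.
    mixed_image = [lam * a + (1.0 - lam) * b for a, b in zip(image_a, image_b)]
    # Concatenating the captions keeps the content of both source texts.
    mixed_text = text_a + " " + text_b
    return mixed_image, mixed_text
```

For example, `mixgen(img_dog, "a dog", img_cat, "a cat")` would yield a blended image paired with the caption "a dog a cat", so the new caption still describes everything visible in the mixed image.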
    Preserved central model for faster bidirectional compression in distributed settings. (arXiv:2102.12528v2 [cs.LG] UPDATED)
    We develop a new approach to tackle communication constraints in a distributed learning problem with a central server. We propose and analyze a new algorithm that performs bidirectional compression and achieves the same convergence rate as algorithms using only uplink (from the local workers to the central server) compression. To obtain this improvement, we design MCM, an algorithm such that the downlink compression only impacts local models, while the global model is preserved. As a result, and contrary to previous works, the gradients on local servers are computed on perturbed models. Consequently, convergence proofs are more challenging and require a precise control of this perturbation. To ensure it, MCM additionally combines model compression with a memory mechanism. This analysis opens new doors, e.g. incorporating worker dependent randomized-models and partial participation.
    Solving Inverse Problems in Medical Imaging with Score-Based Generative Models. (arXiv:2111.08005v2 [eess.IV] UPDATED)
    Reconstructing medical images from partial measurements is an important inverse problem in Computed Tomography (CT) and Magnetic Resonance Imaging (MRI). Existing solutions based on machine learning typically train a model to directly map measurements to medical images, leveraging a training dataset of paired images and measurements. These measurements are typically synthesized from images using a fixed physical model of the measurement process, which hinders the generalization capability of models to unknown measurement processes. To address this issue, we propose a fully unsupervised technique for inverse problem solving, leveraging the recently introduced score-based generative models. Specifically, we first train a score-based generative model on medical images to capture their prior distribution. Given measurements and a physical model of the measurement process at test time, we introduce a sampling method to reconstruct an image consistent with both the prior and the observed measurements. Our method does not assume a fixed measurement process during training, and can thus be flexibly adapted to different measurement processes at test time. Empirically, we observe comparable or better performance to supervised learning techniques in several medical imaging tasks in CT and MRI, while demonstrating significantly better generalization to unknown measurement processes.
    A Tree-based Model Averaging Approach for Personalized Treatment Effect Estimation from Heterogeneous Data Sources. (arXiv:2103.06261v3 [stat.ML] UPDATED)
    Accurately estimating personalized treatment effects within a study site (e.g., a hospital) has been challenging due to limited sample size. Furthermore, privacy considerations and lack of resources prevent a site from leveraging subject-level data from other sites. We propose a tree-based model averaging approach to improve the estimation accuracy of conditional average treatment effects (CATE) at a target site by leveraging models derived from other potentially heterogeneous sites, without them sharing subject-level data. To our best knowledge, there is no established model averaging approach for distributed data with a focus on improving the estimation of treatment effects. Specifically, under distributed data networks, our framework provides an interpretable tree-based ensemble of CATE estimators that joins models across study sites, while actively modeling the heterogeneity in data sources through site partitioning. The performance of this approach is demonstrated by a real-world study of the causal effects of oxygen therapy on hospital survival rate and backed up by comprehensive simulation results.
    mlf-core: a framework for deterministic machine learning. (arXiv:2104.07651v2 [cs.MS] UPDATED)
    Machine learning has shown extensive growth in recent years and is now routinely applied to sensitive areas. To allow appropriate verification of predictive models before deployment, models must be deterministic. However, major machine learning libraries default to the usage of non-deterministic algorithms based on atomic operations. Solely fixing all random seeds is not sufficient for deterministic machine learning. To overcome this shortcoming, various machine learning libraries released deterministic counterparts to the non-deterministic algorithms. We evaluated the effect of these algorithms on determinism and runtime. Based on these results, we formulated a set of requirements for deterministic machine learning and developed a new software solution, the mlf-core ecosystem, which aids machine learning projects to meet and keep these requirements. We applied mlf-core to develop deterministic models in various biomedical fields including a single cell autoencoder with TensorFlow, a PyTorch-based U-Net model for liver-tumor segmentation in CT scans, and a liver cancer classifier based on gene expression profiles with XGBoost.
    The Portiloop: a deep learning-based open science tool for closed-loop brain stimulation. (arXiv:2107.13473v3 [eess.SP] UPDATED)
    Closed-loop brain stimulation refers to capturing neurophysiological measures such as electroencephalography (EEG), quickly identifying neural events of interest, and producing auditory, magnetic or electrical stimulation so as to interact with brain processes precisely. It is a promising new method for fundamental neuroscience and perhaps for clinical applications such as restoring degraded memory function; however, existing tools are expensive, cumbersome, and offer limited experimental flexibility. In this article, we propose the Portiloop, a deep learning-based, portable and low-cost closed-loop stimulation system able to target specific brain oscillations. We first document open-hardware implementations that can be constructed from commercially available components. We also provide a fast, lightweight neural network model and an exploration algorithm that automatically optimizes the model hyperparameters to the desired brain oscillation. Finally, we validate the technology on a challenging test case of real-time sleep spindle detection, with results comparable to off-line expert performance on the Massive Online Data Annotation spindle dataset (MODA; group consensus). Software and plans are available to the community as an open science initiative to encourage further development and advance closed-loop neuroscience research.
    Learning to Denoise Historical Music. (arXiv:2008.02027v2 [eess.AS] UPDATED)
    We propose an audio-to-audio neural network model that learns to denoise old music recordings. Our model internally converts its input into a time-frequency representation by means of a short-time Fourier transform (STFT), and processes the resulting complex spectrogram using a convolutional neural network. The network is trained with both reconstruction and adversarial objectives on a synthetic noisy music dataset, which is created by mixing clean music with real noise samples extracted from quiet segments of old recordings. We evaluate our method quantitatively on held-out test examples of the synthetic dataset, and qualitatively by human rating on samples of actual historical recordings. Our results show that the proposed method is effective in removing noise, while preserving the quality and details of the original music.
    Federated Learning on the Road: Autonomous Controller Design for Connected and Autonomous Vehicles. (arXiv:2102.03401v2 [eess.SY] UPDATED)
    A new federated learning (FL) framework enabled by large-scale wireless connectivity is proposed for designing the autonomous controller of connected and autonomous vehicles (CAVs). In this framework, the learning models used by the controllers are collaboratively trained among a group of CAVs. To capture the varying CAV participation in the FL training process and the diverse local data quality among CAVs, a novel dynamic federated proximal (DFP) algorithm is proposed that accounts for the mobility of CAVs, the wireless fading channels, as well as the unbalanced and nonindependent and identically distributed data across CAVs. A rigorous convergence analysis is performed for the proposed algorithm to identify how fast the CAVs converge to using the optimal autonomous controller. In particular, the impacts of varying CAV participation in the FL process and diverse CAV data quality on the convergence of the proposed DFP algorithm are explicitly analyzed. Leveraging this analysis, an incentive mechanism based on contract theory is designed to improve the FL convergence speed. Simulation results using real vehicular data traces show that the proposed DFP-based controller can accurately track the target CAV speed over time and under different traffic scenarios. Moreover, the results show that the proposed DFP algorithm has a much faster convergence compared to popular FL algorithms such as federated averaging (FedAvg) and federated proximal (FedProx). The results also validate the feasibility of the contract-theoretic incentive mechanism and show that the proposed mechanism can improve the convergence speed of the DFP algorithm by 40% compared to the baselines.
    LSB: Local Self-Balancing MCMC in Discrete Spaces. (arXiv:2109.03867v3 [cs.AI] UPDATED)
    We present the Local Self-Balancing sampler (LSB), a local Markov Chain Monte Carlo (MCMC) method for sampling in purely discrete domains, which is able to autonomously adapt to the target distribution and to reduce the number of target evaluations required to converge. LSB is based on (i) a parametrization of locally balanced proposals, (ii) a newly proposed objective function based on mutual information and (iii) a self-balancing learning procedure, which minimises the proposed objective to update the proposal parameters. Experiments on energy-based models and Markov networks show that LSB converges using a smaller number of queries to the oracle distribution compared to recent local MCMC samplers.
    CausalAF: Causal Autoregressive Flow for Safety-Critical Driving Scenario Generation. (arXiv:2110.13939v2 [cs.CV] UPDATED)
    Generating safety-critical scenarios, which are crucial yet difficult to collect, provides an effective way to evaluate the robustness of autonomous driving systems. However, the diversity of scenarios and efficiency of generation methods are heavily restricted by the rareness and structure of safety-critical scenarios. Therefore, existing generative models that only estimate distributions from observational data are not satisfying to solve this problem. In this paper, we integrate causality as a prior into the scenario generation and propose a flow-based generative framework, Causal Autoregressive Flow (CausalAF). CausalAF encourages the generative model to uncover and follow the causal relationship among generated objects via novel causal masking operations instead of searching the sample only from observational data. By learning the cause-and-effect mechanism of how the generated scenario causes risk situations rather than just learning correlations from data, CausalAF significantly improves learning efficiency. Extensive experiments on three heterogeneous traffic scenarios illustrate that CausalAF requires much fewer optimization resources to effectively generate safety-critical scenarios. We also show that using generated scenarios as additional training samples empirically improves the robustness of autonomous driving algorithms.
    Finite-Time Convergence Rates of Decentralized Stochastic Approximation with Applications in Multi-Agent and Multi-Task Learning. (arXiv:2010.15088v2 [cs.LG] UPDATED)
    We study a decentralized variant of stochastic approximation, a data-driven approach for finding the root of an operator under noisy measurements. A network of agents, each with its own operator and data observations, cooperatively find the fixed point of the aggregate operator over a decentralized communication graph. Our main contribution is to provide a finite-time analysis of this decentralized stochastic approximation method when the data observed at each agent are sampled from a Markov process; this lack of independence makes the iterates biased and (potentially) unbounded. Under fairly standard assumptions, we show that the convergence rate of the proposed method is essentially the same as if the samples were independent, differing only by a log factor that accounts for the mixing time of the Markov processes. The key idea in our analysis is to introduce a novel Razumikhin-Lyapunov function, motivated by the one used in analyzing the stability of delayed ordinary differential equations. We also discuss applications of the proposed method on a number of interesting learning problems in multi-agent systems.
    Classical Planning in Deep Latent Space. (arXiv:2107.00110v3 [cs.AI] UPDATED)
    Current domain-independent, classical planners require symbolic models of the problem domain and instance as input, resulting in a knowledge acquisition bottleneck. Meanwhile, although deep learning has achieved significant success in many fields, the knowledge is encoded in a subsymbolic representation which is incompatible with symbolic systems such as planners. We propose Latplan, an unsupervised architecture combining deep learning and classical planning. Given only an unlabeled set of image pairs showing a subset of transitions allowed in the environment (training inputs), Latplan learns a complete propositional PDDL action model of the environment. Later, when a pair of images representing the initial and the goal states (planning inputs) is given, Latplan finds a plan to the goal state in a symbolic latent space and returns a visualized plan execution. We evaluate Latplan using image-based versions of 6 planning domains: 8-Puzzle, 15-Puzzle, Blocksworld, Sokoban and two variations of LightsOut.
    Estimating Categorical Counterfactuals via Deep Twin Networks. (arXiv:2109.01904v4 [cs.LG] UPDATED)
    Counterfactual inference is a powerful tool, capable of solving challenging problems in high-profile sectors. To perform counterfactual inference, one requires knowledge of the underlying causal mechanisms. However, causal mechanisms cannot be uniquely determined from observations and interventions alone. This raises the question of how to choose the causal mechanisms so that resulting counterfactual inference is trustworthy in a given domain. This question has been addressed in causal models with binary variables, but the case of categorical variables remains unanswered. We address this challenge by introducing for causal models with categorical variables the notion of counterfactual ordering, a principle that posits desirable properties causal mechanisms should possess, and prove that it is equivalent to specific functional constraints on the causal mechanisms. To learn causal mechanisms satisfying these constraints, and perform counterfactual inference with them, we introduce deep twin networks. These are deep neural networks that, when trained, are capable of twin network counterfactual inference -- an alternative to the abduction, action, & prediction method. We empirically test our approach on diverse real-world and semi-synthetic data from medicine, epidemiology, and finance, reporting accurate estimation of counterfactual probabilities while demonstrating the issues that arise with counterfactual reasoning when counterfactual ordering is not enforced.
    Neural tangent kernel analysis of shallow $\alpha$-Stable ReLU neural networks. (arXiv:2206.08065v1 [cs.LG])
    There is a recent literature on large-width properties of Gaussian neural networks (NNs), i.e. NNs whose weights are distributed according to Gaussian distributions. Two popular problems are: i) the study of the large-width behaviour of NNs, which provided a characterization of the infinitely wide limit of a rescaled NN in terms of a Gaussian process; ii) the study of the large-width training dynamics of NNs, which set forth an equivalence between training the rescaled NN and performing a kernel regression with a deterministic kernel referred to as the neural tangent kernel (NTK). In this paper, we consider these problems for $\alpha$-Stable NNs, which generalize Gaussian NNs by assuming that the NN's weights are distributed as $\alpha$-Stable distributions with $\alpha\in(0,2]$, i.e. distributions with heavy tails. For shallow $\alpha$-Stable NNs with a ReLU activation function, we show that if the NN's width goes to infinity then a rescaled NN converges weakly to an $\alpha$-Stable process, i.e. a stochastic process with $\alpha$-Stable finite-dimensional distributions. As a novelty with respect to the Gaussian setting, in the $\alpha$-Stable setting the choice of the activation function affects the scaling of the NN, that is: to achieve the infinitely wide $\alpha$-Stable process, the ReLU function requires an additional logarithmic scaling with respect to sub-linear functions. Then, our main contribution is the NTK analysis of shallow $\alpha$-Stable ReLU-NNs, which leads to an equivalence between training a rescaled NN and performing a kernel regression with an $(\alpha/2)$-Stable random kernel. The randomness of such a kernel is a further novelty with respect to the Gaussian setting, that is: in the $\alpha$-Stable setting the randomness of the NN at initialization does not vanish in the NTK analysis, thus inducing a distribution for the kernel of the underlying kernel regression.
    Face Anti-Spoofing by Learning Polarization Cues in a Real-World Scenario. (arXiv:2003.08024v3 [cs.CV] UPDATED)
    Face anti-spoofing is the key to preventing security breaches in biometric recognition applications. Existing software-based and hardware-based face liveness detection methods are effective in constrained environments or designated datasets only. Deep learning methods using RGB and infrared images demand a large amount of training data for new attacks. In this paper, we present a face anti-spoofing method in a real-world scenario by automatically learning the physical characteristics in polarization images of a real face compared to a deceptive attack. A computational framework is developed to extract and classify the unique face features using convolutional neural networks and SVM together. Our real-time polarized face anti-spoofing (PAAS) detection method uses an on-chip integrated polarization imaging sensor with optimized processing algorithms. Extensive experiments demonstrate the advantages of the PAAS technique to counter diverse face spoofing attacks (print, replay, mask) in uncontrolled indoor and outdoor conditions by learning polarized face images of 33 people. A four-directional polarized face image dataset is released to inspire future applications within the biometric anti-spoofing field.
    DEEMD: Drug Efficacy Estimation against SARS-CoV-2 based on cell Morphology with Deep multiple instance learning. (arXiv:2105.05758v2 [cs.LG] UPDATED)
    Drug repurposing can accelerate the identification of effective compounds for clinical use against SARS-CoV-2, with the advantage of pre-existing clinical safety data and an established supply chain. RNA viruses such as SARS-CoV-2 manipulate cellular pathways and induce reorganization of subcellular structures to support their life cycle. These morphological changes can be quantified using bioimaging techniques. In this work, we developed DEEMD: a computational pipeline using deep neural network models within a multiple instance learning framework, to identify putative treatments effective against SARS-CoV-2 based on morphological analysis of the publicly available RxRx19a dataset. This dataset consists of fluorescence microscopy images of SARS-CoV-2 non-infected cells and infected cells, with and without drug treatment. DEEMD first extracts discriminative morphological features to generate cell morphological profiles from the non-infected and infected cells. These morphological profiles are then used in a statistical model to estimate the applied treatment efficacy on infected cells based on similarities to non-infected cells. DEEMD is capable of localizing infected cells via weak supervision without any expensive pixel-level annotations. DEEMD identifies known SARS-CoV-2 inhibitors, such as Remdesivir and Aloxistatin, supporting the validity of our approach. DEEMD can be explored for use on other emerging viruses and datasets to rapidly identify candidate antiviral treatments in the future. Our implementation is available online at https://www.github.com/Sadegh-Saberian/DEEMD
    Learning Models of Individual Behavior in Chess. (arXiv:2008.10086v3 [cs.AI] UPDATED)
    AI systems that can capture human-like behavior are becoming increasingly useful in situations where humans may want to learn from these systems, collaborate with them, or engage with them as partners for an extended duration. In order to develop human-oriented AI systems, the problem of predicting human actions -- as opposed to predicting optimal actions -- has received considerable attention. Existing work has focused on capturing human behavior in an aggregate sense, which potentially limits the benefit any particular individual could gain from interaction with these systems. We extend this line of work by developing highly accurate predictive models of individual human behavior in chess. Chess is a rich domain for exploring human-AI interaction because it combines a unique set of properties: AI systems achieved superhuman performance many years ago, and yet humans still interact with them closely, both as opponents and as preparation tools, and there is an enormous corpus of recorded data on individual player games. Starting with Maia, an open-source version of AlphaZero trained on a population of human players, we demonstrate that we can significantly improve prediction accuracy of a particular player's moves by applying a series of fine-tuning methods. Furthermore, our personalized models can be used to perform stylometry -- predicting who made a given set of moves -- indicating that they capture human decision-making at an individual level. Our work demonstrates a way to bring AI systems into better alignment with the behavior of individual people, which could lead to large improvements in human-AI interaction.
    OpenCoS: Contrastive Semi-supervised Learning for Handling Open-set Unlabeled Data. (arXiv:2107.08943v2 [cs.CV] UPDATED)
    Semi-supervised learning (SSL) is one of the most promising paradigms to circumvent the expensive labeling cost for building a high-performance model. Most existing SSL methods conventionally assume both labeled and unlabeled data are drawn from the same (class) distribution. However, unlabeled data may include out-of-class samples in practice, i.e., samples that cannot have one-hot encoded labels from the closed set of classes in the labeled data; in other words, the unlabeled data is an open set. In this paper, we introduce OpenCoS, a method for handling this realistic semi-supervised learning scenario based upon a recent framework of self-supervised visual representation learning. Specifically, we first observe that the out-of-class samples in the open-set unlabeled dataset can be identified effectively via self-supervised contrastive learning. Then, OpenCoS utilizes this information to overcome the failure modes in the existing state-of-the-art semi-supervised methods, by utilizing one-hot pseudo-labels and soft-labels for the identified in- and out-of-class unlabeled data, respectively. Our extensive experimental results show the effectiveness of OpenCoS, fixing up the state-of-the-art semi-supervised methods to be suitable for diverse scenarios involving open-set unlabeled data.
    Long Range Graph Benchmark. (arXiv:2206.08164v1 [cs.LG])
    Graph Neural Networks (GNNs) that are based on the message passing (MP) paradigm exchange information between 1-hop neighbors to build node representations at each layer. In principle, such networks are not able to capture long-range interactions (LRI) that may be desired or necessary for learning a given task on graphs. Recently, there has been an increasing interest in the development of Transformer-based methods for graphs that can consider full node connectivity beyond the original sparse structure, thus enabling the modeling of LRI. However, MP-GNNs that simply rely on 1-hop message passing often fare better in several existing graph benchmarks when combined with positional feature representations, among other innovations, hence limiting the perceived utility and ranking of Transformer-like architectures. Here, we present the Long Range Graph Benchmark (LRGB) with 5 graph learning datasets: PascalVOC-SP, COCO-SP, PCQM-Contact, Peptides-func and Peptides-struct, which arguably require LRI reasoning to achieve strong performance in a given task. We benchmark both baseline GNNs and Graph Transformer networks to verify that the models which capture long-range dependencies perform significantly better on these tasks. Therefore, these datasets are suitable for benchmarking and exploration of MP-GNNs and Graph Transformer architectures that are intended to capture LRI.
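    To make the 1-hop limitation concrete, the sketch below counts how far information can travel under stacked message passing layers. It is only an illustration on a toy path graph, not code from the LRGB paper; the function name `reachable_after_layers` and the graph are assumptions introduced here.

```python
import numpy as np

def reachable_after_layers(adjacency: np.ndarray, num_layers: int) -> np.ndarray:
    """Return a boolean matrix R where R[i, j] is True if node j's input
    features can influence node i after `num_layers` rounds of 1-hop
    message passing (each node also keeps its own state)."""
    n = adjacency.shape[0]
    # One layer lets a node see itself and its direct neighbours.
    one_hop = ((adjacency + np.eye(n, dtype=int)) > 0).astype(int)
    reach = np.eye(n, dtype=int)
    for _ in range(num_layers):
        # Stacking one more layer extends the receptive field by one hop.
        reach = ((reach @ one_hop) > 0).astype(int)
    return reach.astype(bool)

# Toy path graph 0 - 1 - 2 - 3 - 4 - 5 (hypothetical, not an LRGB dataset).
n = 6
path = np.zeros((n, n), dtype=int)
for i in range(n - 1):
    path[i, i + 1] = path[i + 1, i] = 1

two_layers = reachable_after_layers(path, 2)
five_layers = reachable_after_layers(path, 5)
print(two_layers[0])   # node 0 only "sees" nodes within 2 hops
print(five_layers[0])  # 5 layers are needed before node 5 reaches node 0
```

    On this graph, the two end nodes only exchange information once the depth matches their distance, which is the kind of dependency a fully connected Transformer layer would cover in a single step.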
    New Versions of Gradient Temporal Difference Learning. (arXiv:2109.04033v2 [cs.LG] UPDATED)
    Sutton, Szepesvári and Maei introduced the first gradient temporal-difference (GTD) learning algorithms compatible with both linear function approximation and off-policy training. The goal of this paper is (a) to propose some variants of GTDs with extensive comparative analysis and (b) to establish new theoretical analysis frameworks for the GTDs. These variants are based on convex-concave saddle-point interpretations of GTDs, which effectively unify all the GTDs into a single framework, and provide simple stability analysis based on recent results on primal-dual gradient dynamics. Finally, numerical comparative analysis is given to evaluate these approaches.
    Phase transitions in nonparametric regressions: a curse of exploiting higher degree smoothness assumptions in finite samples. (arXiv:2112.03626v3 [math.ST] UPDATED)
    When the regression function belongs to the smooth classes consisting of univariate functions with derivatives up to the $(\gamma+1)$th order bounded in absolute values by a common constant everywhere or a.e., it is generally viewed that exploiting a higher degree smoothness assumption helps reduce the estimation error. This paper shows that the minimax optimal mean integrated squared error (MISE) rate increases in $\gamma$ when the sample size $n$ is small relative to $\left(\gamma+1\right)^{2\gamma+3}$ (e.g., $\left(\gamma+1\right)^{2\gamma+3}=262144$ when $\gamma=3$), and decreases in $\gamma$ when $n$ is large relative to $\left(\gamma+1\right)^{2\gamma+3}$. In particular, this phase transition property is shown to be achieved by common nonparametric procedures. Consider $\gamma_{1}$ and $\gamma_{2}$ such that $\gamma_{1}<\gamma_{2}$, where the $(\gamma_{2}+1)$th degree smoothness class is a subset of the $(\gamma_{1}+1)$th degree class. What is interesting about our results is that they imply, if $n$ is small relative to $\left(\gamma_{1}+1\right)^{2\gamma_{1}+3}$, the optimal rate achieved by the estimator constrained to be in the smoother class is larger. In data sets with fewer than hundreds of thousands of observations, our results suggest that one should not exploit beyond the third degree of smoothness. To some extent, our results provide a theoretical basis for the widely adopted practical recommendation given by Gelman and Imbens (2019). The building blocks of our minimax optimality results are a set of metric entropy bounds we develop in this paper for smooth function classes. Some of our bounds are original, and some of them refine and/or generalize the ones in the literature.
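    The threshold $\left(\gamma+1\right)^{2\gamma+3}$ grows very quickly, which is why the phase transition matters at realistic sample sizes. The short sketch below simply tabulates it for small $\gamma$; the helper name `smoothness_threshold` is introduced here for illustration and is not from the paper.

```python
def smoothness_threshold(gamma: int) -> int:
    """Sample-size scale (gamma + 1) ** (2 * gamma + 3) quoted in the abstract."""
    return (gamma + 1) ** (2 * gamma + 3)

# Tabulate the threshold for the first few smoothness degrees.
thresholds = {gamma: smoothness_threshold(gamma) for gamma in range(1, 5)}
for gamma, value in thresholds.items():
    print(f"gamma = {gamma}: (gamma + 1)^(2*gamma + 3) = {value}")
```

    For $\gamma=3$ this reproduces the value 262144 given in the abstract, and one step further ($\gamma=4$) the scale already exceeds 48 million.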
    Double Check Your State Before Trusting It: Confidence-Aware Bidirectional Offline Model-Based Imagination. (arXiv:2206.07989v1 [cs.LG])
    The learned policy of model-free offline reinforcement learning (RL) methods is often constrained to stay within the support of datasets to avoid possible dangerous out-of-distribution actions or states, making it challenging to handle out-of-support regions. Model-based RL methods offer a richer dataset and benefit generalization by generating imaginary trajectories with either a trained forward or reverse dynamics model. However, the imagined transitions may be inaccurate, thus downgrading the performance of the underlying offline RL method. In this paper, we propose to augment the offline dataset by using trained bidirectional dynamics models and rollout policies with double check. We introduce conservatism by trusting samples that the forward model and backward model agree on. Our method, confidence-aware bidirectional offline model-based imagination, generates reliable samples and can be combined with any model-free offline RL method. Experimental results on the D4RL benchmarks demonstrate that our method significantly boosts the performance of existing model-free offline RL algorithms and achieves competitive or better scores against baseline methods.
    Learning to Infer Structures of Network Games. (arXiv:2206.08119v1 [cs.LG])
    Strategic interactions between a group of individuals or organisations can be modelled as games played on networks, where a player's payoff depends not only on their actions but also on those of their neighbours. Inferring the network structure from observed game outcomes (equilibrium actions) is an important problem with numerous potential applications in economics and social sciences. Existing methods mostly require the knowledge of the utility function associated with the game, which is often unrealistic to obtain in real-world scenarios. We adopt a transformer-like architecture which correctly accounts for the symmetries of the problem and learns a mapping from the equilibrium actions to the network structure of the game without explicit knowledge of the utility function. We test our method on three different types of network games using both synthetic and real-world data, and demonstrate its effectiveness in network structure inference and superior performance over existing methods.
    NCGNN: Node-Level Capsule Graph Neural Network for Semisupervised Classification. (arXiv:2012.03476v2 [cs.LG] UPDATED)
    Message passing has evolved as an effective tool for designing Graph Neural Networks (GNNs). However, most existing methods for message passing simply sum or average all the neighboring features to update node representations. They are restricted by two problems, i.e., (i) lack of interpretability to identify node features significant to the prediction of GNNs, and (ii) feature over-mixing that leads to the over-smoothing issue in capturing long-range dependencies and inability to handle graphs under heterophily or low homophily. In this paper, we propose a Node-level Capsule Graph Neural Network (NCGNN) to address these problems with an improved message passing scheme. Specifically, NCGNN represents nodes as groups of node-level capsules, in which each capsule extracts distinctive features of its corresponding node. For each node-level capsule, a novel dynamic routing procedure is developed to adaptively select appropriate capsules for aggregation from a subgraph identified by the designed graph filter. NCGNN aggregates only the advantageous capsules and restrains irrelevant messages to avoid over-mixing features of interacting nodes. Therefore, it can relieve the over-smoothing issue and learn effective node representations over graphs with homophily or heterophily. Furthermore, our proposed message passing scheme is inherently interpretable and exempt from complex post-hoc explanations, as the graph filter and the dynamic routing procedure identify a subset of node features that are most significant to the model prediction from the extracted subgraph. Extensive experiments on synthetic as well as real-world graphs demonstrate that NCGNN can effectively address the over-smoothing issue and produce better node representations for semisupervised node classification. It outperforms the state of the art under both homophily and heterophily.
    Simultaneously Learning Stochastic and Adversarial Bandits with General Graph Feedback. (arXiv:2206.07908v1 [cs.LG])
    The problem of online learning with graph feedback has been extensively studied in the literature due to its generality and potential to model various learning tasks. Existing works mainly study the adversarial and stochastic feedback separately. If the prior knowledge of the feedback mechanism is unavailable or wrong, such specially designed algorithms could suffer great loss. To avoid this problem, Erez and Koren (2021) try to optimize for both environments. However, they assume the feedback graphs are undirected and each vertex has a self-loop, which compromises the generality of the framework and may not be satisfied in applications. With a general feedback graph, the observation of an arm may not be available when this arm is pulled, which makes the exploration more expensive and the algorithms more challenging to perform optimally in both environments. In this work, we overcome this difficulty by a new trade-off mechanism with a carefully-designed proportion for exploration and exploitation. We prove the proposed algorithm simultaneously achieves $\mathrm{poly} \log T$ regret in the stochastic setting and minimax-optimal regret of $\tilde{O}(T^{2/3})$ in the adversarial setting where $T$ is the horizon and $\tilde{O}$ hides parameters independent of $T$ as well as logarithmic terms. To our knowledge, this is the first best-of-both-worlds result for general feedback graphs.
    Analysis and Extensions of Adversarial Training for Video Classification. (arXiv:2206.07953v1 [cs.CV])
    Adversarial training (AT) is a simple yet effective defense against adversarial attacks to image classification systems, which is based on augmenting the training set with attacks that maximize the loss. However, the effectiveness of AT as a defense for video classification has not been thoroughly studied. Our first contribution is to show that generating optimal attacks for video requires carefully tuning the attack parameters, especially the step size. Notably, we show that the optimal step size varies linearly with the attack budget. Our second contribution is to show that using a smaller (sub-optimal) attack budget at training time leads to a more robust performance at test time. Based on these findings, we propose three defenses against attacks with variable attack budgets. The first one, Adaptive AT, is a technique where the attack budget is drawn from a distribution that is adapted as training iterations proceed. The second, Curriculum AT, is a technique where the attack budget is increased as training iterations proceed. The third, Generative AT, further couples AT with a denoising generative adversarial network to boost robust performance. Experiments on the UCF101 dataset demonstrate that the proposed methods improve adversarial robustness against multiple attack types.
    A machine-generated catalogue of Charon's craters and implications for the Kuiper belt. (arXiv:2206.08277v1 [astro-ph.EP])
    In this paper we investigate the size distribution of Charon's craters using a deep learning model. This is motivated by the recent results of Singer et al. (2019) who, using manual cataloging, found a change in the size distribution slope of craters smaller than 12 km in diameter, translating into a paucity of small Kuiper Belt objects. These results were corroborated by Robbins and Singer (2021), but opposed by Morbidelli et al. (2021), necessitating an independent review. Our MaskRCNN-based ensemble of models was trained on Lunar, Mercurian, and Martian crater catalogues and both optical and digital elevation images. We use a robust image augmentation scheme to force the model to generalize and transfer-learn into icy objects. With no prior bias or exposure to Charon, our model finds best fit slopes of $q = -1.47 \pm 0.33$ for craters smaller than 10 km, and $q = -2.91 \pm 0.51$ for craters larger than 15 km. These values indicate a clear change in slope around 15 km as suggested by Singer et al. (2019) and thus independently confirm their conclusions. Our slopes however are both slightly flatter than those found more recently by Robbins and Singer (2021). Our trained models and relevant code are available online on github.com/malidib/ACID.
    Lifelong Wandering: A realistic few-shot online continual learning setting. (arXiv:2206.07932v1 [cs.CV])
    Online few-shot learning describes a setting where models are trained and evaluated on a stream of data while learning emerging classes. While prior work in this setting has achieved very promising performance on instance classification when learning from data-streams composed of a single indoor environment, we propose to extend this setting to consider object classification on a series of several indoor environments, which is likely to occur in applications such as robotics. Importantly, our setting, which we refer to as online few-shot continual learning, injects the well-studied issue of catastrophic forgetting into the few-shot online learning paradigm. In this work, we benchmark several existing methods and adapted baselines within our setting, and show there exists a trade-off between catastrophic forgetting and online performance. Our findings motivate the need for future work in this setting, which can achieve better online performance without catastrophic forgetting.
    Large-scale, multi-centre, multi-disease validation of an AI clinical tool for cine CMR analysis. (arXiv:2206.08137v1 [eess.IV])
    INTRODUCTION: Artificial intelligence (AI) has the potential to facilitate the automation of CMR analysis for biomarker extraction. However, most AI algorithms are trained on a specific input domain (e.g., single scanner vendor or hospital-tailored imaging protocol) and lack the robustness to perform optimally when applied to CMR data from other input domains. METHODS: Our proposed framework consists of an AI-based algorithm for biventricular segmentation of short-axis images, followed by a post-analysis quality control to detect erroneous results. The segmentation algorithm was trained on a large dataset of clinical CMR scans from two NHS hospitals (n=2793) and validated on additional cases from this dataset (n=441) and on five external datasets (n=6808). The validation data included CMR scans of patients with a range of diseases acquired at 12 different centres using CMR scanners from all major vendors. RESULTS: Our method yielded median Dice scores over 87%, translating into median absolute errors in cardiac biomarkers within the range of inter-observer variability: <8.4mL (left ventricle), <9.2mL (right ventricle), <13.3g (left ventricular mass), and <5.9% (ejection fraction) across all datasets. Stratification of cases according to phenotypes of cardiac disease and scanner vendors showed good agreement. CONCLUSIONS: We show that our proposed tool, which combines a state-of-the-art AI algorithm trained on a large-scale multi-domain CMR dataset with a post-analysis quality control, allows us to robustly deal with routine clinical data from multiple centres, vendors, and cardiac diseases. This is a fundamental step for the clinical translation of AI algorithms. Moreover, our method yields a range of additional biomarkers of cardiac function (filling and ejection rates, regional wall motion, and strain) at no extra computational cost.
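    The Dice score reported above measures overlap between a predicted and a reference segmentation mask as twice the intersection divided by the sum of the two mask sizes. The sketch below shows that standard definition on a toy pair of binary masks; it is not the authors' evaluation code, and the masks and the empty-mask convention are assumptions made here.

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray) -> float:
    """Dice coefficient 2 * |A & B| / (|A| + |B|) for two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return float(2.0 * np.logical_and(pred, target).sum() / denom)

# Toy masks (hypothetical): a 2x2 prediction inside a 2x3 reference region.
pred = np.zeros((4, 4), dtype=int)
pred[1:3, 1:3] = 1      # 4 foreground pixels
target = np.zeros((4, 4), dtype=int)
target[1:3, 1:4] = 1    # 6 foreground pixels, 4 of them shared with pred

score = dice_score(pred, target)
print(score)  # 2 * 4 / (4 + 6) = 0.8
```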
    Barrier Certified Safety Learning Control: When Sum-of-Square Programming Meets Reinforcement Learning. (arXiv:2206.07915v1 [eess.SY])
    Safety guarantees are essential in many engineering implementations. Reinforcement learning provides a useful way to strengthen safety. However, reinforcement learning algorithms cannot completely guarantee safety over realistic operations. To address this issue, this work adopts control barrier functions over reinforcement learning, and proposes a compensated algorithm to completely maintain safety. Specifically, sum-of-squares programming is exploited to search for the optimal controller and tune the learning hyperparameters simultaneously. Thus, the control actions are guaranteed to always remain within the safe region. The effectiveness of the proposed method is demonstrated via an inverted pendulum model. Compared to quadratic programming based reinforcement learning methods, our sum-of-squares programming based reinforcement learning has shown its superiority.
    Large-Scale Differentiable Causal Discovery of Factor Graphs. (arXiv:2206.07824v1 [stat.ML])
    A common theme in causal inference is learning causal relationships between observed variables, also known as causal discovery. This is usually a daunting task, given the large number of candidate causal graphs and the combinatorial nature of the search space. Perhaps for this reason, most research has so far focused on relatively small causal graphs, with up to hundreds of nodes. However, recent advances in fields like biology enable generating experimental data sets with thousands of interventions followed by rich profiling of thousands of variables, raising the opportunity and urgent need for large causal graph models. Here, we introduce the notion of factor directed acyclic graphs (f-DAGs) as a way to restrict the search space to non-linear low-rank causal interaction models. Combining this novel structural assumption with recent advances that bridge the gap between causal discovery and continuous optimization, we achieve causal discovery on thousands of variables. Additionally, as a model for the impact of statistical noise on this estimation procedure, we study a model of edge perturbations of the f-DAG skeleton based on random graphs and quantify the effect of such perturbations on the f-DAG rank. This theoretical analysis suggests that the set of candidate f-DAGs is much smaller than the whole DAG space and thus more statistically robust in the high-dimensional regime where the underlying skeleton is hard to assess. We propose Differentiable Causal Discovery of Factor Graphs (DCD-FG), a scalable implementation of f-DAG constrained causal discovery for high-dimensional interventional data. DCD-FG uses a Gaussian non-linear low-rank structural equation model and shows significant improvements compared to state-of-the-art methods in both simulations as well as a recent large-scale single-cell RNA sequencing data set with hundreds of genetic interventions.
    DCASE 2022: Comparative Analysis Of CNNs For Acoustic Scene Classification Under Low-Complexity Considerations. (arXiv:2206.08007v1 [cs.SD])
    Acoustic scene classification is an automatic listening problem that aims to assign an audio recording to a pre-defined scene based on its audio data. Over the years (and in past editions of the DCASE) this problem has often been solved with techniques known as ensembles (use of several machine learning models to combine their predictions in the inference phase). While these solutions can perform well in terms of accuracy, they can be very expensive in terms of computational capacity, making it impossible to deploy them in IoT devices. Reflecting this trend in the field, the task imposes two limits on model complexity. It should be noted that there is also the added complexity of mismatching devices (the audio recordings provided come from different sources). This technical report makes a comparative study of two different network architectures: conventional CNN and Conv-mixer. Although both networks exceed the baseline required by the competition, the conventional CNN shows a higher performance, exceeding the baseline by 8 percentage points. Solutions based on Conv-mixer architectures show worse performance although they are much lighter solutions.
    Disparate Impact in Differential Privacy from Gradient Misalignment. (arXiv:2206.07737v1 [cs.LG])
    As machine learning becomes more widespread throughout society, aspects including data privacy and fairness must be carefully considered, and are crucial for deployment in highly regulated industries. Unfortunately, the application of privacy enhancing technologies can worsen unfair tendencies in models. In particular, one of the most widely used techniques for private model training, differentially private stochastic gradient descent (DPSGD), frequently intensifies disparate impact on groups within data. In this work we study the fine-grained causes of unfairness in DPSGD and identify gradient misalignment due to inequitable gradient clipping as the most significant source. This observation leads us to a new method for reducing unfairness by preventing gradient misalignment in DPSGD.
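    The gradient misalignment mentioned above can be seen in a few lines: per-example clipping rescales large gradients more than small ones, so the averaged clipped gradient can point in a different direction from the true average. The sketch below is a toy illustration under assumed values, not the paper's method; the noise that DPSGD adds after clipping is omitted so that the effect of clipping alone is visible, and all names and numbers are hypothetical.

```python
import numpy as np

def clip_per_example(grads: np.ndarray, clip_norm: float) -> np.ndarray:
    """Scale each row (one per-example gradient) to have L2 norm <= clip_norm."""
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return grads * scale

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical batch: three small gradients from a majority group and one
# large gradient from an under-represented group.
grads = np.array([[1.0, 0.0],
                  [1.0, 0.0],
                  [1.0, 0.0],
                  [0.0, 10.0]])

clipped = clip_per_example(grads, clip_norm=1.0)
true_mean = grads.mean(axis=0)        # direction of the non-private update
clipped_mean = clipped.mean(axis=0)   # direction after per-example clipping
alignment = cosine(true_mean, clipped_mean)
print(true_mean, clipped_mean, alignment)
```

    Here only the large gradient is shrunk, so the clipped update leans toward the majority direction and its cosine similarity with the true mean gradient falls well below 1.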
    On the Identifiability of Nonlinear ICA: Sparsity and Beyond. (arXiv:2206.07751v1 [cs.LG])
    Nonlinear independent component analysis (ICA) aims to recover the underlying independent latent sources from their observable nonlinear mixtures. How to make the nonlinear ICA model identifiable up to certain trivial indeterminacies is a long-standing problem in unsupervised learning. Recent breakthroughs reformulate the standard independence assumption of sources as conditional independence given some auxiliary variables (e.g., class labels and/or domain/time indexes) as weak supervision or inductive bias. However, nonlinear ICA with unconditional priors cannot benefit from such developments. We explore an alternative path and consider only assumptions on the mixing process, such as Structural Sparsity or Independent Influences. We show that under specific instantiations of such constraints, the independent latent sources can be identified from their nonlinear mixtures up to a permutation and a component-wise transformation, thus achieving nontrivial identifiability of nonlinear ICA without auxiliary variables. We provide estimation methods and validate the theoretical results experimentally. The results on image data suggest that our conditions may hold in a number of practical data generating processes.
    Efficient Approximation of Expected Hypervolume Improvement using Gauss-Hermite Quadrature. (arXiv:2206.07834v1 [cs.LG])
    Many methods for performing multi-objective optimisation of computationally expensive problems have been proposed recently. Typically, a probabilistic surrogate for each objective is constructed from an initial dataset. The surrogates can then be used to produce predictive densities in the objective space for any solution. Using the predictive densities, we can compute the expected hypervolume improvement (EHVI) due to a solution. Maximising the EHVI, we can locate the most promising solution that may be expensively evaluated next. There are closed-form expressions for computing the EHVI, integrating over the multivariate predictive densities. However, they require partitioning the objective space, which can be prohibitively expensive for more than three objectives. Furthermore, there are no closed-form expressions for a problem where the predictive densities are dependent, capturing the correlations between objectives. Monte Carlo approximation is used instead in such cases, which is not cheap. Hence, the need to develop new accurate but cheaper approximation methods remains. Here we investigate an alternative approach toward approximating the EHVI using Gauss-Hermite quadrature. We show that it can be an accurate alternative to Monte Carlo for both independent and correlated predictive densities with statistically significant rank correlations for a range of popular test problems.
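    As a rough illustration of the quadrature idea, the sketch below approximates an expectation under a one-dimensional Gaussian predictive density with Gauss-Hermite nodes from NumPy's `hermgauss`. It uses the single-objective improvement $\max(r - Y, 0)$ as a stand-in for hypervolume improvement, so it is a simplified analogue rather than the paper's EHVI estimator; the reference value, mean, and standard deviation are assumed for the example.

```python
import math
import numpy as np

def gauss_hermite_expectation(f, mean: float, std: float, num_nodes: int = 20) -> float:
    """Approximate E[f(Y)] for Y ~ N(mean, std^2) with Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(num_nodes)
    # hermgauss integrates against exp(-x^2); substitute y = mean + sqrt(2) * std * x.
    samples = mean + math.sqrt(2.0) * std * nodes
    return float(np.sum(weights * f(samples)) / math.sqrt(math.pi))

def closed_form_improvement(ref: float, mean: float, std: float) -> float:
    """Exact E[max(ref - Y, 0)] for Y ~ N(mean, std^2) (minimisation setting)."""
    z = (ref - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (ref - mean) * cdf + std * pdf

# Smooth integrand: the rule is exact for low-degree polynomials.
second_moment = gauss_hermite_expectation(lambda y: y ** 2, mean=1.0, std=2.0)

# Improvement over a hypothetical reference value; the kink at y = ref means
# the quadrature is only approximate here, so more nodes are used.
ref, mean, std = 0.3, 0.0, 1.0
approx = gauss_hermite_expectation(lambda y: np.maximum(ref - y, 0.0), mean, std, num_nodes=64)
exact = closed_form_improvement(ref, mean, std)
print(second_moment, approx, exact)
```

    The deterministic nodes replace Monte Carlo samples, which is the source of the cost saving; for correlated multi-objective densities the same change of variables would be applied after a Cholesky factorisation of the covariance, which this sketch does not cover.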
    Metric-Fair Classifier Derandomization. (arXiv:2206.07826v1 [cs.LG])
    We study the problem of \emph{classifier derandomization} in machine learning: given a stochastic binary classifier $f: X \to [0,1]$, sample a deterministic classifier $\hat{f}: X \to \{0,1\}$ that approximates the output of $f$ in aggregate over any data distribution. Recent work revealed how to efficiently derandomize a stochastic classifier with strong output approximation guarantees, but at the cost of individual fairness -- that is, if $f$ treated similar inputs similarly, $\hat{f}$ did not. In this paper, we initiate a systematic study of classifier derandomization with metric fairness guarantees. We show that the prior derandomization approach is almost maximally metric-unfair, and that a simple ``random threshold'' derandomization achieves optimal fairness preservation but with weaker output approximation. We then devise a derandomization procedure that provides an appealing tradeoff between these two: if $f$ is $\alpha$-metric fair according to a metric $d$ with a locality-sensitive hash (LSH) family, then our derandomized $\hat{f}$ is, with high probability, $O(\alpha)$-metric fair and a close approximation of $f$. We also prove generic results applicable to all (fair and unfair) classifier derandomization procedures, including a bias-variance decomposition and reductions between various notions of metric fairness.
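    One natural reading of the "random threshold" derandomization is to draw a single threshold $t$ uniformly from $[0,1]$ and set $\hat{f}(x) = 1$ exactly when $f(x) \ge t$. The sketch below implements that reading; it is an assumption about the construction rather than the paper's exact procedure, and the classifier `f` is a hypothetical example.

```python
import random

def random_threshold_derandomize(f, rng: random.Random):
    """Sample one threshold t ~ U[0, 1) and return the deterministic
    classifier f_hat(x) = 1 if f(x) >= t else 0."""
    t = rng.random()

    def f_hat(x) -> int:
        return 1 if f(x) >= t else 0

    return f_hat

def f(x: float) -> float:
    """Hypothetical stochastic classifier: probability of predicting 1."""
    return min(1.0, max(0.0, x))

rng = random.Random(0)
f_hat = random_threshold_derandomize(f, rng)
print(f_hat(0.2), f_hat(0.8))  # deterministic once the threshold is fixed

# Averaged over many independently sampled thresholds, f_hat(x) matches f(x).
trials = 20000
avg = sum(random_threshold_derandomize(f, rng)(0.3) for _ in range(trials)) / trials
print(avg)
```

    Because all inputs share the same threshold, two inputs receive different labels only when $t$ falls between $f(x)$ and $f(x')$, which happens with probability $|f(x) - f(x')|$; this is why similar treatment under $f$ carries over to $\hat{f}$.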
    OmniMAE: Single Model Masked Pretraining on Images and Videos. (arXiv:2206.08356v1 [cs.CV])
    Transformer-based architectures have become competitive across a variety of visual domains, most notably images and videos. While prior work has studied these modalities in isolation, having a common architecture suggests that one can train a single unified model for multiple visual modalities. Prior attempts at unified modeling typically use architectures tailored for vision tasks, or obtain worse performance compared to single modality models. In this work, we show that masked autoencoding can be used to train a simple Vision Transformer on images and videos, without requiring any labeled data. This single model learns visual representations that are comparable to or better than single-modality representations on both image and video benchmarks, while using a much simpler architecture. In particular, our single pretrained model can be finetuned to achieve 86.5% on ImageNet and 75.3% on the challenging Something Something-v2 video benchmark. Furthermore, this model can be learned by dropping 90% of the image and 95% of the video patches, enabling extremely fast training.
    Spatially-Adaptive Multilayer Selection for GAN Inversion and Editing. (arXiv:2206.08357v1 [cs.CV])
    Existing GAN inversion and editing methods work well for aligned objects with a clean background, such as portraits and animal faces, but often struggle for more difficult categories with complex scene layouts and object occlusions, such as cars, animals, and outdoor images. We propose a new method to invert and edit such complex images in the latent space of GANs, such as StyleGAN2. Our key idea is to explore inversion with a collection of layers, spatially adapting the inversion process to the difficulty of the image. We learn to predict the "invertibility" of different image segments and project each segment into a latent layer. Easier regions can be inverted into an earlier layer in the generator's latent space, while more challenging regions can be inverted into a later feature space. Experiments show that our method obtains better inversion results compared to the recent approaches on complex categories, while maintaining downstream editability. Please refer to our project page at https://www.cs.cmu.edu/~SAMInversion.
    Research Topic Flows in Co-Authorship Networks. (arXiv:2206.07980v1 [cs.SI])
    In scientometrics, scientific collaboration is often analyzed by means of co-authorships. An aspect which is often overlooked and more difficult to quantify is the flow of expertise between authors from different research topics, which is an important part of scientific progress. With the Topic Flow Network (TFN) we propose a graph structure for the analysis of research topic flows between scientific authors and their respective research fields. Based on a multi-graph and a topic model, our proposed network structure accounts for intratopic as well as intertopic flows. To construct a TFN, our method requires solely a corpus of publications (i.e., author and abstract information). From this, research topics are discovered automatically through non-negative matrix factorization. The TFN derived from these topics allows for the application of social network analysis techniques, such as common metrics and community detection. Most importantly, it allows for the analysis of intertopic flows on a large, macroscopic scale, i.e., between research topics, as well as on a microscopic scale, i.e., between certain sets of authors. We demonstrate the utility of TFNs by applying our method to two comprehensive corpora of altogether 20 million publications spanning more than 60 years of research in the fields of computer science and mathematics. Our results give evidence that TFNs are suitable, e.g., for the analysis of topical communities, the discovery of important authors in different fields, and, most notably, the analysis of intertopic flows, i.e., the transfer of topical expertise. Besides that, our method opens new directions for future research, such as the investigation of influence relationships between research fields.
    Automated analysis of continuum fields from atomistic simulations using statistical machine learning. (arXiv:2206.08048v1 [cond-mat.mtrl-sci])
    Atomistic simulations of the molecular dynamics/statics kind are regularly used to study small scale plasticity. Contemporary simulations are performed with tens to hundreds of millions of atoms, with snapshots of these configurations written out at regular intervals for further analysis. Continuum scale constitutive models for material behavior can benefit from information on the atomic scale, in particular in terms of the deformation mechanisms, the accommodation of the total strain and partitioning of stress and strain fields in individual grains. In this work we develop a methodology using statistical data mining and machine learning algorithms to automate the analysis of continuum field variables in atomistic simulations. We focus on three important field variables: total strain, elastic strain and microrotation. Our results show that the elastic strain in individual grains exhibits a unimodal log-normal distribution, whilst the total strain and microrotation fields evidence a multimodal distribution. The peaks in the distribution of total strain are identified with a Gaussian mixture model and methods to circumvent overfitting problems are presented. Subsequently, we evaluate the identified peaks in terms of deformation mechanisms in a grain, which e.g., helps to quantify the strain for which individual deformation mechanisms are responsible. The overall statistics of the distributions over all grains are an important input for higher scale models, which ultimately also helps to be able to quantitatively discuss the implications for information transfer to phenomenological models.
    Noisy Learning for Neural ODEs Acts as a Robustness Locus Widening. (arXiv:2206.08237v1 [cs.LG])
    We investigate the problems and challenges of evaluating the robustness of Differential Equation-based (DE) networks against synthetic distribution shifts. We propose a novel and simple accuracy metric which can be used to evaluate intrinsic robustness and to validate dataset corruption simulators. We also propose methodology recommendations for evaluating the many faces of neural DEs' robustness and for rigorously comparing them with their discrete counterparts. We then use these criteria to evaluate a cheap data augmentation technique as a reliable way of demonstrating the natural robustness of neural ODEs against simulated image corruptions across multiple datasets.
    ProGNNosis: A Data-driven Model to Predict GNN Computation Time Using Graph Metrics. (arXiv:2206.08258v1 [cs.LG])
    Graph Neural Networks (GNN) show great promise in problems dealing with graph-structured data. One of the unique points of GNNs is their flexibility to adapt to multiple problems, which not only leads to wide applicability, but also poses important challenges when finding the best model or acceleration technique for a particular problem. An example of such challenges resides in the fact that the accuracy or effectiveness of a GNN model or acceleration technique generally depends on the structure of the underlying graph. In this paper, in an attempt to address the problem of graph-dependent acceleration, we propose ProGNNosis, a data-driven model that can predict the GNN training time of a given GNN model running over a graph of arbitrary characteristics by inspecting the input graph metrics. Such prediction is made based on a regression that was previously trained offline using a diverse synthetic graph dataset. In practice, our method allows making informed decisions on which design to use for a specific problem. In the paper, the methodology to build ProGNNosis is defined and applied for a specific use case, where it helps to decide which graph representation is better. Our results show that ProGNNosis helps achieve an average speedup of 1.22X over randomly selecting a graph representation in multiple widely used GNN models such as GCN, GIN, GAT, or GraphSAGE.
    Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis Using Linguistic and Prosodic Contexts of Dialogue History. (arXiv:2206.08039v1 [cs.SD])
    We propose an end-to-end empathetic dialogue speech synthesis (DSS) model that considers both the linguistic and prosodic contexts of dialogue history. Empathy is the active attempt by humans to get inside the interlocutor in dialogue, and empathetic DSS is a technology to implement this act in spoken dialogue systems. Our model is conditioned by the history of linguistic and prosody features for predicting appropriate dialogue context. As such, it can be regarded as an extension of the conventional linguistic-feature-based dialogue history modeling. To train the empathetic DSS model effectively, we investigate 1) a self-supervised learning model pretrained with large speech corpora, 2) a style-guided training using a prosody embedding of the current utterance to be predicted by the dialogue context embedding, 3) a cross-modal attention to combine text and speech modalities, and 4) a sentence-wise embedding to achieve fine-grained prosody modeling rather than utterance-wise modeling. The evaluation results demonstrate that 1) simply considering prosodic contexts of the dialogue history does not improve the quality of speech in empathetic DSS and 2) introducing style-guided training and sentence-wise embedding modeling achieves higher speech quality than that by the conventional method.
    On Private Online Convex Optimization: Optimal Algorithms in $\ell_p$-Geometry and High Dimensional Contextual Bandits. (arXiv:2206.08111v1 [cs.LG])
    Differentially private (DP) stochastic convex optimization (SCO) is ubiquitous in trustworthy machine learning algorithm design. This paper studies the DP-SCO problem with streaming data that is sampled from a distribution and arrives sequentially. We also consider the continual release model, where parameters related to private information are updated and released upon each new data point, a setting often known as online algorithms. Although numerous algorithms have been developed to achieve the optimal excess risks in different $\ell_p$ norm geometries, none of the existing ones can be adapted to the streaming and continual release setting. To address this challenge of online convex optimization with privacy protection, we propose a private variant of the online Frank-Wolfe algorithm with recursive gradients for variance reduction to update and reveal the parameters upon each data point. Combined with the adaptive differential privacy analysis, our online algorithm achieves in linear time the optimal excess risk when $1<p\leq 2$ and the state-of-the-art excess risk meeting the non-private lower ones when $2<p\leq\infty$. Our algorithm can also be extended to the case $p=1$ to achieve nearly dimension-independent excess risk. While previous variance reduction results on recursive gradients have theoretical guarantees only in the independent and identically distributed sample setting, we establish such a guarantee in a non-stationary setting. To demonstrate the virtues of our method, we design the first DP algorithm for high-dimensional generalized linear bandits with logarithmic regret. Comparative experiments with a variety of DP-SCO and DP-Bandit algorithms exhibit the efficacy and utility of the proposed algorithms.
    Attention-wise masked graph contrastive learning for predicting molecular property. (arXiv:2206.08262v1 [q-bio.BM])
    Accurate and efficient prediction of the molecular properties of drugs is one of the fundamental problems in drug research and development. Recent advancements in representation learning have been shown to greatly improve the performance of molecular property prediction. However, due to limited labeled data, supervised learning-based molecular representation algorithms can only search limited chemical space, which results in poor generalizability. In this work, we proposed a self-supervised representation learning framework for large-scale unlabeled molecules. We developed a novel molecular graph augmentation strategy, referred to as attention-wise graph mask, to generate challenging positive samples for contrastive learning. We adopted the graph attention network (GAT) as the molecular graph encoder, and leveraged the learned attention scores as masking guidance to generate molecular augmentation graphs. By minimizing the contrastive loss between the original graph and the masked graph, our model can capture important molecular structure and higher-order semantic information. Extensive experiments showed that our attention-wise graph mask contrastive learning exhibits state-of-the-art performance in a couple of downstream molecular property prediction tasks.
    BYOL-Explore: Exploration by Bootstrapped Prediction. (arXiv:2206.08332v1 [cs.LG])
    We present BYOL-Explore, a conceptually simple yet general approach for curiosity-driven exploration in visually-complex environments. BYOL-Explore learns a world representation, the world dynamics, and an exploration policy all together by optimizing a single prediction loss in the latent space with no additional auxiliary objective. We show that BYOL-Explore is effective in DM-HARD-8, a challenging partially-observable continuous-action hard-exploration benchmark with visually-rich 3-D environments. On this benchmark, we solve the majority of the tasks purely through augmenting the extrinsic reward with BYOL-Explore's intrinsic reward, whereas prior work could only get off the ground with human demonstrations. As further evidence of the generality of BYOL-Explore, we show that it achieves superhuman performance on the ten hardest exploration games in Atari while having a much simpler design than other competitive agents.
    Continual Learning with Guarantees via Weight Interval Constraints. (arXiv:2206.07996v1 [cs.LG])
    We introduce a new training paradigm that enforces interval constraints on neural network parameter space to control forgetting. Contemporary Continual Learning (CL) methods focus on training neural networks efficiently from a stream of data, while reducing the negative impact of catastrophic forgetting, yet they do not provide any firm guarantees that network performance will not deteriorate uncontrollably over time. In this work, we show how to put bounds on forgetting by reformulating continual learning of a model as a continual contraction of its parameter space. To that end, we propose Hyperrectangle Training, a new training methodology where each task is represented by a hyperrectangle in the parameter space, fully contained in the hyperrectangles of the previous tasks. This formulation reduces the NP-hard CL problem back to polynomial time while providing full resilience against forgetting. We validate our claim by developing InterContiNet (Interval Continual Learning) algorithm which leverages interval arithmetic to effectively model parameter regions as hyperrectangles. Through experimental results, we show that our approach performs well in a continual learning setup without storing data from previous tasks.
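    The interval-arithmetic idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes a single linear layer whose weights and biases are given as (lower, upper) intervals and a fixed input vector; the function names are hypothetical and this is not the InterContiNet implementation. It only shows how output bounds follow from weight bounds, and that shrinking the weight intervals shrinks the output intervals.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (lower, upper)


def interval_times_scalar(w: Interval, x: float) -> Interval:
    """Multiply a weight interval by a fixed scalar input."""
    a, b = w[0] * x, w[1] * x
    return (min(a, b), max(a, b))


def interval_linear(weights: List[List[Interval]],
                    bias: List[Interval],
                    x: List[float]) -> List[Interval]:
    """Bounds on W @ x + b when every weight and bias lies in its interval."""
    out = []
    for row, (lo, hi) in zip(weights, bias):
        for w, xi in zip(row, x):
            p_lo, p_hi = interval_times_scalar(w, xi)
            lo += p_lo
            hi += p_hi
        out.append((lo, hi))
    return out


def is_nested(inner: Interval, outer: Interval) -> bool:
    """True if `inner` is fully contained in `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


# Hypothetical example: the second task's weight box sits inside the first one.
x = [2.0, -3.0]
task1 = interval_linear([[(0.0, 1.0), (-1.0, 1.0)]], [(0.0, 0.0)], x)
task2 = interval_linear([[(0.25, 0.75), (-0.5, 0.5)]], [(0.0, 0.0)], x)
```

    With these illustrative numbers, the output interval for the second task is contained in the one for the first task, which is the nesting property the abstract relies on.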
    Gradient-Based Adversarial and Out-of-Distribution Detection. (arXiv:2206.08255v1 [cs.LG])
    We propose to utilize gradients for detecting adversarial and out-of-distribution samples. We introduce confounding labels -- labels that differ from normal labels seen during training -- in gradient generation to probe the effective expressivity of neural networks. Gradients depict the amount of change required for a model to properly represent given inputs, providing insight into the representational power of the model established by network architectural properties as well as training data. By introducing a label of different design, we remove the dependency on ground truth labels for gradient generation during inference. We show that our gradient-based approach allows for capturing the anomaly in inputs based on the effective expressivity of the models with no hyperparameter tuning or additional processing, and outperforms state-of-the-art methods for adversarial and out-of-distribution detection.
    Know your audience: specializing grounded language models with the game of Dixit. (arXiv:2206.08349v1 [cs.LG])
    Effective communication requires adapting to the idiosyncratic common ground shared with each communicative partner. We study a particularly challenging instantiation of this problem: the popular game Dixit. We formulate a round of Dixit as a multi-agent image reference game where a (trained) speaker model is rewarded for describing a target image such that one (pretrained) listener model can correctly identify it from a pool of distractors, but another listener cannot. To adapt to this setting, the speaker must exploit differences in the common ground it shares with the different listeners. We show that finetuning an attention-based adapter between a CLIP vision encoder and a large language model in this contrastive, multi-agent setting gives rise to context-dependent natural language specialization from rewards only, without direct supervision. In a series of controlled experiments, we show that the speaker can adapt according to the idiosyncratic strengths and weaknesses of various pairs of different listeners. Furthermore, we show zero-shot transfer of the speaker's specialization to unseen real-world data. Our experiments offer a step towards adaptive communication in complex multi-partner settings and highlight the interesting research challenges posed by games like Dixit. We hope that our work will inspire creative new approaches to adapting pretrained models.
    Unsupervised Space Partitioning for Nearest Neighbor Search. (arXiv:2206.08091v1 [cs.LG])
    Approximate Nearest Neighbor Search (ANNS) in high dimensional spaces is crucial for many real-life applications (e.g., e-commerce, web, multimedia, etc.) dealing with an abundance of data. In this paper, we propose an end-to-end learning framework that couples the partitioning (one key step of ANNS) and learning-to-search steps using a custom loss function. A key advantage of our proposed solution is that it does not require any expensive pre-processing of the dataset, which is one of the key limitations of the state-of-the-art approach. We achieve the above edge by formulating a multi-objective custom loss function that does not need ground truth labels to quantify the quality of a given partition of the data space, making it entirely unsupervised. We also propose an ensembling technique by adding varying input weights to the loss function to train an ensemble of models to enhance the search quality. On several standard benchmarks for ANNS, we show that our method beats the state-of-the-art space partitioning method and the ubiquitous K-means clustering method while using fewer parameters and shorter offline training times. Without loss of generality, our unsupervised partitioning approach is shown as a promising alternative to many widely used clustering methods like K-means clustering and DBSCAN.
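    The partition-then-search pipeline that this abstract builds on can be sketched as follows. This is a generic illustration with fixed, hand-picked centroids (as in the K-means baseline), not the learned partitioning proposed in the paper; all function names and the `n_probe` parameter are assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence

Point = Sequence[float]


def nearest_index(query: Point, candidates: List[Point]) -> int:
    """Index of the candidate closest to `query` in Euclidean distance."""
    return min(range(len(candidates)),
               key=lambda i: math.dist(query, candidates[i]))


def build_partition(points: List[Point],
                    centroids: List[Point]) -> Dict[int, List[int]]:
    """Assign every point index to the bin of its nearest centroid."""
    bins: Dict[int, List[int]] = {i: [] for i in range(len(centroids))}
    for idx, p in enumerate(points):
        bins[nearest_index(p, centroids)].append(idx)
    return bins


def search(query: Point, points: List[Point], centroids: List[Point],
           bins: Dict[int, List[int]], n_probe: int = 1) -> int:
    """Scan only the `n_probe` closest bins; return the best point index or -1."""
    order = sorted(range(len(centroids)),
                   key=lambda i: math.dist(query, centroids[i]))
    candidates = [idx for b in order[:n_probe] for idx in bins[b]]
    if not candidates:
        return -1
    return min(candidates, key=lambda idx: math.dist(query, points[idx]))


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = [(0.0, 0.5), (10.0, 10.5)]  # hypothetical, fixed by hand
bins = build_partition(points, centroids)
best = search((9.0, 9.0), points, centroids, bins)
```

    Only the points in the probed bins are compared against the query, which is where the speedup over exhaustive search comes from; the quality of the partition determines how often the true neighbor falls in a probed bin.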
    FixEval: Execution-based Evaluation of Program Fixes for Competitive Programming Problems. (arXiv:2206.07796v1 [cs.SE])
    Source code repositories consist of large codebases, often containing error-prone programs. The increasing complexity of software has led to a drastic rise in time and costs for identifying and fixing these defects. Various methods exist to automatically generate fixes for buggy code. However, due to the large combinatorial space of possible solutions for a particular bug, there are not many tools and datasets available to evaluate generated code effectively. In this work, we introduce FixEval, a benchmark comprising buggy code submissions to competitive programming problems and their respective fixes. We introduce a rich test suite to evaluate and assess the correctness of model-generated program fixes. We consider two Transformer language models pretrained on programming languages as our baselines, and compare them using match-based and execution-based evaluation metrics. Our experiments show that match-based metrics do not reflect model-generated program fixes accurately, while execution-based methods evaluate programs through all cases and scenarios specifically designed for that solution. Therefore, we believe FixEval provides a step towards real-world automatic bug fixing and model-generated code evaluation.
    Beyond Adult and COMPAS: Fairness in Multi-Class Prediction. (arXiv:2206.07801v1 [cs.LG])
    We consider the problem of producing fair probabilistic classifiers for multi-class classification tasks. We formulate this problem in terms of "projecting" a pre-trained (and potentially unfair) classifier onto the set of models that satisfy target group-fairness requirements. The new, projected model is given by post-processing the outputs of the pre-trained classifier by a multiplicative factor. We provide a parallelizable iterative algorithm for computing the projected classifier and derive both sample complexity and convergence guarantees. Comprehensive numerical comparisons with state-of-the-art benchmarks demonstrate that our approach maintains competitive performance in terms of accuracy-fairness trade-off curves, while achieving favorable runtime on large datasets. We also evaluate our method at scale on an open dataset with multiple classes, multiple intersectional protected groups, and over 1M samples.
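    The post-processing step described in this abstract, multiplying the pre-trained classifier's outputs by a factor, can be sketched as below. The per-class factors here are made-up numbers; in the paper they would come from the iterative projection algorithm, which is not reproduced. The function name is hypothetical.

```python
from typing import List


def reweight(probs: List[float], factors: List[float]) -> List[float]:
    """Scale each class probability by its factor, then renormalize to sum to 1."""
    if len(probs) != len(factors):
        raise ValueError("probs and factors must have the same length")
    scaled = [p * f for p, f in zip(probs, factors)]
    total = sum(scaled)
    if total <= 0:
        raise ValueError("scaled probabilities must have positive mass")
    return [s / total for s in scaled]


# Hypothetical factors that boost the two minority classes.
adjusted = reweight([0.5, 0.25, 0.25], [1.0, 2.0, 2.0])
```

    Because the base classifier is left untouched and only its outputs are rescaled, a step of this kind can be applied per sample and in parallel.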
    Introducing the Huber mechanism for differentially private low-rank matrix completion. (arXiv:2206.07910v1 [cs.CR])
    Performing low-rank matrix completion with sensitive user data calls for privacy-preserving approaches. In this work, we propose a novel noise addition mechanism for preserving differential privacy where the noise distribution is inspired by Huber loss, a well-known loss function in robust statistics. The proposed Huber mechanism is evaluated against existing differential privacy mechanisms while solving the matrix completion problem using the Alternating Least Squares approach. We also propose using the Iteratively Re-Weighted Least Squares algorithm to complete low-rank matrices and study the performance of different noise mechanisms in both synthetic and real datasets. We prove that the proposed mechanism achieves $\epsilon$-differential privacy similar to the Laplace mechanism. Furthermore, empirical results indicate that the Huber mechanism outperforms Laplacian and Gaussian in some cases and is comparable otherwise.
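    For orientation, the two ingredients this abstract names can be sketched separately: the standard Huber loss and the standard Laplace mechanism that serves as the comparison baseline. The abstract does not give the Huber mechanism's noise distribution, so it is not reproduced here; the function names and parameter values are illustrative assumptions.

```python
import random


def huber_loss(residual: float, delta: float = 1.0) -> float:
    """Standard Huber loss: quadratic near zero, linear in the tails."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(value: float, sensitivity: float, epsilon: float,
                      rng: random.Random) -> float:
    """Release `value` with Laplace noise of scale sensitivity / epsilon."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return value + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
# Hypothetical query: true value 10.0, sensitivity 1.0, epsilon 0.5.
releases = [laplace_mechanism(10.0, 1.0, 0.5, rng) for _ in range(20000)]
```

    A smaller epsilon gives a larger noise scale and therefore stronger privacy at the cost of accuracy, which is the trade-off any alternative noise distribution is evaluated against.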
    On the Surprising Behaviour of node2vec. (arXiv:2206.08252v1 [cs.LG])
    Graph embedding techniques are a staple of modern graph learning research. When using embeddings for downstream tasks such as classification, information about their stability and robustness, i.e., their susceptibility to sources of noise, stochastic effects, or specific parameter choices, becomes increasingly important. As one of the most prominent graph embedding schemes, we focus on node2vec and analyse its embedding quality from multiple perspectives. Our findings indicate that embedding quality is unstable with respect to parameter choices, and we propose strategies to remedy this in practice.
    Edge Inference with Fully Differentiable Quantized Mixed Precision Neural Networks. (arXiv:2206.07741v1 [cs.LG])
    The large computing and memory cost of deep neural networks (DNNs) often precludes their use in resource-constrained devices. Quantizing the parameters and operations to lower bit-precision offers substantial memory and energy savings for neural network inference, facilitating the use of DNNs on edge computing platforms. Recent efforts at quantizing DNNs have employed a range of techniques encompassing progressive quantization, step-size adaptation, and gradient scaling. This paper proposes a new quantization approach for mixed precision convolutional neural networks (CNNs) targeting edge-computing. Our method establishes a new pareto frontier in model accuracy and memory footprint demonstrating a range of quantized models, delivering best-in-class accuracy below 4.3 MB of weights (wgts.) and activations (acts.). Our main contributions are: (i) hardware-aware heterogeneous differentiable quantization with tensor-sliced learned precision, (ii) targeted gradient modification for wgts. and acts. to mitigate quantization errors, and (iii) a multi-phase learning schedule to address instability in learning arising from updates to the learned quantizer and model parameters. We demonstrate the effectiveness of our techniques on the ImageNet dataset across a range of models including EfficientNet-Lite0 (e.g., 4.14MB of wgts. and acts. at 67.66% accuracy) and MobileNetV2 (e.g., 3.51MB wgts. and acts. at 65.39% accuracy).
    Adaptive Expert Models for Personalization in Federated Learning. (arXiv:2206.07832v1 [cs.LG])
    Federated Learning (FL) is a promising framework for distributed learning when data is private and sensitive. However, the state-of-the-art solutions in this framework are not optimal when data is heterogeneous and non-Independent and Identically Distributed (non-IID). We propose a practical and robust approach to personalization in FL that adjusts to heterogeneous and non-IID data by balancing exploration and exploitation of several global models. To achieve our aim of personalization, we use a Mixture of Experts (MoE) that learns to group clients that are similar to each other, while using the global models more efficiently. We show that our approach achieves an accuracy up to 29.78 % and up to 4.38 % better compared to a local model in a pathological non-IID setting, even though we tune our approach in the IID setting.
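    The Mixture of Experts combination step referred to in this abstract can be sketched in a few lines. This is a generic softmax-gated mixture, with hand-set gate logits standing in for a learned gating network; it is not the paper's training procedure, and the function names are assumptions.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_predict(gate_logits: List[float],
                    expert_probs: List[List[float]]) -> List[float]:
    """Weight each expert's class probabilities by the softmax gate and sum."""
    weights = softmax(gate_logits)
    n_classes = len(expert_probs[0])
    return [sum(w * probs[c] for w, probs in zip(weights, expert_probs))
            for c in range(n_classes)]


# Hypothetical client: a local expert and a global expert that disagree.
experts = [[1.0, 0.0], [0.0, 1.0]]
balanced = mixture_predict([0.0, 0.0], experts)
local_heavy = mixture_predict([100.0, 0.0], experts)
```

    Equal gate logits average the experts, while a strongly skewed gate effectively selects one of them, which is how a per-client gate can interpolate between local and global models.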
    Explainable Models via Compression of Tree Ensembles. (arXiv:2206.07904v1 [cs.LG])
    Ensemble models (bagging and gradient-boosting) of relational decision trees have proved to be one of the most effective learning methods in the area of probabilistic logic models (PLMs). While effective, they lose one of the most important aspects of PLMs -- interpretability. In this paper we consider the problem of compressing a large set of learned trees into a single explainable model. To this effect, we propose CoTE -- Compression of Tree Ensembles -- that produces a single small decision list as a compressed representation. CoTE first converts the trees to decision lists and then performs the combination and compression with the aid of the original training set. An experimental evaluation demonstrates the effectiveness of CoTE in several benchmark relational data sets.
    Scalable First-Order Bayesian Optimization via Structured Automatic Differentiation. (arXiv:2206.08366v1 [cs.LG])
    Bayesian Optimization (BO) has shown great promise for the global optimization of functions that are expensive to evaluate, but despite many successes, standard approaches can struggle in high dimensions. To improve the performance of BO, prior work suggested incorporating gradient information into a Gaussian process surrogate of the objective, giving rise to kernel matrices of size $nd \times nd$ for $n$ observations in $d$ dimensions. Naïvely multiplying with (resp. inverting) these matrices requires $\mathcal{O}(n^2d^2)$ (resp. $\mathcal{O}(n^3d^3)$) operations, which becomes infeasible for moderate dimensions and sample sizes. Here, we observe that a wide range of kernels gives rise to structured matrices, enabling an exact $\mathcal{O}(n^2d)$ matrix-vector multiply for gradient observations and $\mathcal{O}(n^2d^2)$ for Hessian observations. Beyond canonical kernel classes, we derive a programmatic approach to leveraging this type of structure for transformations and combinations of the discussed kernel classes, which constitutes a structure-aware automatic differentiation algorithm. Our methods apply to virtually all canonical kernels and automatically extend to complex kernels, like the neural network, radial basis function network, and spectral mixture kernels without any additional derivations, enabling flexible, problem-dependent modeling while scaling first-order BO to high $d$.
    Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization. (arXiv:2206.07882v1 [cs.CL])
    We report on aggressive quantization strategies that greatly accelerate inference of Recurrent Neural Network Transducers (RNN-T). We use a 4 bit integer representation for both weights and activations and apply Quantization Aware Training (QAT) to retrain the full model (acoustic encoder and language model) and achieve near-iso-accuracy. We show that customized quantization schemes that are tailored to the local properties of the network are essential to achieve good performance while limiting the computational overhead of QAT. Density ratio Language Model fusion has shown remarkable accuracy gains on RNN-T workloads but it severely increases the computational cost of inference. We show that our quantization strategies enable using large beam widths for hypothesis search while achieving streaming-compatible runtimes and a full model compression ratio of 7.6$\times$ compared to the full precision model. Via hardware simulations, we estimate a 3.4$\times$ acceleration from FP16 to INT4 for the end-to-end quantized RNN-T inclusive of LM fusion, resulting in a Real Time Factor (RTF) of 0.06. On the NIST Hub5 2000, Hub5 2001, and RT-03 test sets, we retain most of the gains associated with LM fusion, improving the average WER by $>$1.5%.
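    To make the 4-bit integer representation in this abstract concrete, here is a minimal sketch of symmetric uniform quantization to the signed 4-bit range [-8, 7]. It assumes one scale per tensor chosen from the maximum absolute value; the customized schemes and quantization-aware training described in the paper are not reproduced, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int4(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to signed 4-bit codes with a single per-tensor scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Python's round() rounds to nearest, with ties going to the even integer.
    # The clamp to [-8, 7] is a defensive guard for the int4 range.
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


weights = [0.31, -0.92, 0.05, 0.77]  # hypothetical weight values
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
```

    Each restored value differs from the original by at most half a quantization step, so the choice of scale directly controls the error that retraining then has to absorb.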
    Double Sampling Randomized Smoothing. (arXiv:2206.07912v1 [cs.LG])
    Neural networks (NNs) are known to be vulnerable against adversarial perturbations, and thus there is a line of work aiming to provide robustness certification for NNs, such as randomized smoothing, which samples smoothing noises from a certain distribution to certify the robustness for a smoothed classifier. However, as previous work shows, the certified robust radius in randomized smoothing suffers from scaling to large datasets ("curse of dimensionality"). To overcome this hurdle, we propose a Double Sampling Randomized Smoothing (DSRS) framework, which exploits the sampled probability from an additional smoothing distribution to tighten the robustness certification of the previous smoothed classifier. Theoretically, under mild assumptions, we prove that DSRS can certify $\Theta(\sqrt d)$ robust radius under $\ell_2$ norm where $d$ is the input dimension, which implies that DSRS may be able to break the curse of dimensionality of randomized smoothing. We instantiate DSRS for a generalized family of Gaussian smoothing and propose an efficient and sound computing method based on customized dual optimization considering sampling error. Extensive experiments on MNIST, CIFAR-10, and ImageNet verify our theory and show that DSRS certifies larger robust radii than existing baselines consistently under different settings. Code is available at https://github.com/llylly/DSRS.
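    As background for the certification this abstract improves on, the following sketch computes the standard single-distribution $\ell_2$ certificate for Gaussian randomized smoothing, $R = \sigma\,\Phi^{-1}(p_A)$, where $p_A$ is a lower bound on the top-class probability under noise. This is the baseline certificate, not DSRS; the function name and the example numbers are assumptions.

```python
from statistics import NormalDist


def certified_radius(sigma: float, p_a_lower: float) -> float:
    """Standard Gaussian-smoothing l2 radius; 0.0 means no certificate."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if not 0.0 < p_a_lower < 1.0:
        raise ValueError("p_a_lower must lie strictly between 0 and 1")
    if p_a_lower <= 0.5:
        return 0.0  # the top class is not certified as the majority vote
    return sigma * NormalDist().inv_cdf(p_a_lower)


# Hypothetical setting: noise level 0.25, top-class lower bound 0.9.
radius = certified_radius(0.25, 0.9)
```

    The radius depends on the input only through a single probability, which is the limitation that motivates using information from a second smoothing distribution.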
    Taxonomy of Benchmarks in Graph Representation Learning. (arXiv:2206.07729v1 [cs.LG])
    Graph Neural Networks (GNNs) extend the success of neural networks to graph-structured data by accounting for their intrinsic geometry. While extensive research has been done on developing GNN models with superior performance according to a collection of graph representation learning benchmarks, it is currently not well understood what aspects of a given model are probed by them. For example, to what extent do they test the ability of a model to leverage graph structure vs. node features? Here, we develop a principled approach to taxonomize benchmarking datasets according to a $\textit{sensitivity profile}$ that is based on how much GNN performance changes due to a collection of graph perturbations. Our data-driven analysis provides a deeper understanding of which benchmarking data characteristics are leveraged by GNNs. Consequently, our taxonomy can aid in selection and development of adequate graph benchmarks, and better informed evaluation of future GNN methods. Finally, our approach and implementation in $\texttt{GTaxoGym}$ package are extendable to multiple graph prediction task types and future datasets.
    Interaction-Grounded Learning with Action-inclusive Feedback. (arXiv:2206.08364v1 [cs.LG])
    Consider the problem setting of Interaction-Grounded Learning (IGL), in which a learner's goal is to optimally interact with the environment with no explicit reward to ground its policies. The agent observes a context vector, takes an action, and receives a feedback vector, using this information to effectively optimize a policy with respect to a latent reward function. Prior analyzed approaches fail when the feedback vector contains the action, which significantly limits IGL's success in many potential scenarios such as Brain-computer interface (BCI) or Human-computer interface (HCI) applications. We address this by creating an algorithm and analysis which allows IGL to work even when the feedback vector contains the action, encoded in any fashion. We provide theoretical guarantees and large-scale experiments based on supervised datasets to demonstrate the effectiveness of the new approach.
    Active Nearest Neighbor Regression Through Delaunay Refinement. (arXiv:2206.08061v1 [cs.LG])
    We introduce an algorithm for active function approximation based on nearest neighbor regression. Our Active Nearest Neighbor Regressor (ANNR) relies on the Voronoi-Delaunay framework from computational geometry to subdivide the space into cells with constant estimated function value and select novel query points in a way that takes the geometry of the function graph into account. We consider the recent state-of-the-art active function approximator called DEFER, which is based on incremental rectangular partitioning of the space, as the main baseline. The ANNR addresses a number of limitations that arise from the space subdivision strategy used in DEFER. We provide a computationally efficient implementation of our method, as well as theoretical halting guarantees. Empirical results show that ANNR outperforms the baseline for both closed-form functions and real-world examples, such as gravitational wave parameter inference and exploration of the latent space of a generative model.
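    The regressor underlying this abstract assigns each query the value of its nearest sampled point, so the estimate is constant on each Voronoi cell. A minimal sketch of that prediction rule follows; the active selection of new query points via Delaunay refinement is not shown, and the function name and sample data are assumptions.

```python
import math
from typing import List, Sequence


def nn_predict(query: Sequence[float],
               xs: List[Sequence[float]],
               ys: List[float]) -> float:
    """Return the function value stored at the sample nearest to `query`."""
    if not xs or len(xs) != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    i = min(range(len(xs)), key=lambda k: math.dist(query, xs[k]))
    return ys[i]


# Hypothetical 1-D samples of an unknown function.
xs = [(0.0,), (1.0,), (3.0,)]
ys = [0.0, 10.0, 30.0]
estimate = nn_predict((2.5,), xs, ys)
```

    Because the estimate only changes at cell boundaries, where new samples are placed determines how well the piecewise-constant surface tracks the true function.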
    Adversarial Patch Attacks and Defences in Vision-Based Tasks: A Survey. (arXiv:2206.08304v1 [cs.CV])
    Adversarial attacks in deep learning models, especially for safety-critical systems, are gaining more and more attention in recent years, due to the lack of trust in the security and robustness of AI models. Yet the more primitive adversarial attacks might be physically infeasible or require some resources that are hard to access like the training data, which motivated the emergence of patch attacks. In this survey, we provide a comprehensive overview to cover existing techniques of adversarial patch attacks, aiming to help interested researchers quickly catch up with the progress in this field. We also discuss existing techniques for developing detection and defences against adversarial patches, aiming to help the community better understand this field and its applications in the real world.
    Boosting the Adversarial Transferability of Surrogate Model with Dark Knowledge. (arXiv:2206.08316v1 [cs.LG])
    Deep neural networks (DNNs) for image classification are known to be vulnerable to adversarial examples. Moreover, adversarial examples are transferable: an adversarial example for one DNN model can fool another black-box model with non-trivial probability. This gave rise to the transfer-based adversarial attack, where the adversarial examples generated by a pretrained or known model (called the surrogate model) are used to conduct a black-box attack. There is some work on how to generate adversarial examples from a given surrogate model to achieve better transferability. However, training a special surrogate model to generate adversarial examples with better transferability is relatively under-explored. In this paper, we propose a method of training a surrogate model with abundant dark knowledge to boost the transferability of the adversarial examples generated by the surrogate model. This trained surrogate model is named dark surrogate model (DSM), and the proposed method to train DSM consists of two key components: a teacher model extracting dark knowledge and providing soft labels, and the mixing augmentation skill which enhances the dark knowledge of training data. Extensive experiments have been conducted to show that the proposed method can substantially improve the adversarial transferability of the surrogate model across different surrogate model architectures and optimizers for generating adversarial examples. We also show that the proposed method can be applied to other scenarios of transfer-based attack that contain dark knowledge, like face verification.
    The convergent Indian buffet process. (arXiv:2206.08002v1 [stat.ML])
    We propose a new Bayesian nonparametric prior for latent feature models, which we call the convergent Indian buffet process (CIBP). We show that under the CIBP, the number of latent features is distributed as a Poisson distribution with the mean monotonically increasing but converging to a certain value as the number of objects goes to infinity. That is, the expected number of features is bounded above even when the number of objects goes to infinity, unlike the standard Indian buffet process under which the expected number of features increases with the number of objects. We provide two alternative representations of the CIBP based on a hierarchical distribution and a completely random measure, respectively, which are of independent interest. The proposed CIBP is assessed on a high-dimensional sparse factor model.
    Learning Physics between Digital Twins with Low-Fidelity Models and Physics-Informed Gaussian Processes. (arXiv:2206.08201v1 [stat.ML])
    A digital twin is a computer model that represents an individual, for example, a component, a patient or a process. In many situations, we want to gain knowledge about an individual from its data while incorporating imperfect physical knowledge and also learn from data from other individuals. In this paper, we introduce and demonstrate a fully Bayesian methodology for learning between digital twins in a setting where the physical parameters of each individual are of interest. For each individual, the methodology is based on Bayesian calibration with model discrepancy. Through the discrepancy, modelled as a Gaussian process, the imperfect low-fidelity physical model is accounted for. Using ideas from Bayesian hierarchical models, a joint probabilistic model of digital twins is constructed by connecting them through a new level in the hierarchy. For the physical parameters, the methodology can be seen as using a prior distribution in the individual model that is the posterior of the corresponding hyperparameter in the joint model. For learning the imperfect physics between individuals, two approaches are introduced: one that assumes the same discrepancy for all individuals, and one that can be seen as using a prior learned from all individuals for the parameters of the Gaussian processes representing the discrepancies. Based on recent advances related to physics-informed priors and Hamiltonian Monte Carlo methods, and on their use for inverse problems, we set up an inference methodology that allows our approach to be computationally feasible also for physical models based on partial differential equations and for individual data that are not aligned. The methodology is demonstrated in two synthetic case studies: a toy example previously used in the literature, extended to more individuals, and an example based on a cardiovascular differential equation model relevant for the treatment of hypertension.
    MoDi: Unconditional Motion Synthesis from Diverse Data. (arXiv:2206.08010v1 [cs.GR])
    The emergence of neural networks has revolutionized the field of motion synthesis. Yet, learning to unconditionally synthesize motions from a given distribution remains a challenging task, especially when the motions are highly diverse. We present MoDi, an unconditional generative model that synthesizes diverse motions. Our model is trained in a completely unsupervised setting from a diverse, unstructured and unlabeled motion dataset and yields a well-behaved, highly semantic latent space. The design of our model follows the prolific architecture of StyleGAN and adapts two of its key technical components into the motion domain: a set of style-codes injected into each level of the generator hierarchy and a mapping function that learns and forms a disentangled latent space. We show that despite the lack of any structure in the dataset, the latent space can be semantically clustered, and facilitates semantic editing and motion interpolation. In addition, we propose a technique to invert unseen motions into the latent space, and demonstrate latent-based motion editing operations that otherwise cannot be achieved by naive manipulation of explicit motion representations. Our qualitative and quantitative experiments show that our framework achieves state-of-the-art synthesis quality that can follow the distribution of highly diverse motion datasets. Code and trained models will be released at https://sigal-raab.github.io/MoDi.
    A Contextual Combinatorial Semi-Bandit Approach to Network Bottleneck Identification. (arXiv:2206.08144v1 [cs.LG])
    Bottleneck identification is a challenging task in network analysis, especially when the network is not fully specified. To address this task, we develop a unified online learning framework based on combinatorial semi-bandits that performs bottleneck identification alongside learning the specifications of the underlying network. Within this framework, we adapt and investigate several combinatorial semi-bandit methods such as epsilon-greedy, LinUCB, BayesUCB, and Thompson Sampling. Our framework is able to employ contextual information in the form of contextual bandits. We evaluate our framework on the real-world application of road networks and demonstrate its effectiveness in different settings.
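To make the epsilon-greedy variant mentioned above concrete, here is a minimal sketch. It assumes the goal is to pick, among a fixed list of candidate paths, the one whose worst (largest) edge cost is smallest, and that each round reveals the cost of every edge on the chosen path (semi-bandit feedback). The function name, the `sample_cost` callback, and the toy graph are hypothetical and are not taken from the paper.

```python
import random

def epsilon_greedy_bottleneck(paths, sample_cost, rounds, epsilon=0.1, seed=0):
    """Epsilon-greedy path selection with per-edge (semi-bandit) feedback.

    paths: list of tuples of edge identifiers (hypothetical candidate paths).
    sample_cost: callable (edge, rng) -> observed cost of that edge.
    Returns the path with the smallest estimated bottleneck and the edge counts.
    """
    rng = random.Random(seed)
    edges = {e for p in paths for e in p}
    total = {e: 0.0 for e in edges}
    count = {e: 0 for e in edges}

    def estimate(e):
        # Unobserved edges get an optimistic estimate so they are tried early.
        return total[e] / count[e] if count[e] else float("-inf")

    def bottleneck(p):
        # Estimated bottleneck of a path: its largest estimated edge cost.
        return max(estimate(e) for e in p)

    for _ in range(rounds):
        if rng.random() < epsilon:
            path = rng.choice(paths)           # explore
        else:
            path = min(paths, key=bottleneck)  # exploit current estimates
        for e in path:                         # one observation per edge on the path
            total[e] += sample_cost(e, rng)
            count[e] += 1
    return min(paths, key=bottleneck), count
```

A contextual version would replace the per-edge running mean with a model that predicts edge cost from features, which is where methods such as LinUCB come in.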
    A Truthful Owner-Assisted Scoring Mechanism. (arXiv:2206.08149v1 [cs.LG])
    Alice (owner) has knowledge of the underlying quality of her items measured in grades. Given the noisy grades provided by an independent party, can Bob (appraiser) obtain accurate estimates of the ground-truth grades of the items by asking Alice a question about the grades? We address this when the payoff to Alice is additive convex utility over all her items. We establish that if Alice has to truthfully answer the question so that her payoff is maximized, the question must be formulated as pairwise comparisons between her items. Next, we prove that if Alice is required to provide a ranking of her items, which is the most fine-grained question via pairwise comparisons, she would be truthful. By incorporating the ground-truth ranking, we show that Bob can obtain an estimator with the optimal squared error in certain regimes based on any possible way of truthful information elicitation. Moreover, the estimated grades are substantially more accurate than the raw grades when the number of items is large and the raw grades are very noisy. Finally, we conclude the paper with several extensions and some refinements for practical considerations.
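The abstract does not name the estimator, so the following is only an illustration of one natural way to combine noisy grades with a truthful ranking: project the raw grades onto the set of sequences that respect the ranking (isotonic regression, computed with the pool-adjacent-violators algorithm). The function name and the example grades are hypothetical.

```python
def isotonic_nonincreasing(grades):
    """Least-squares projection of `grades` onto non-increasing sequences.

    `grades` must be ordered by the owner's ranking, best item first.
    Uses pool adjacent violators: neighbouring blocks that break the
    ordering are merged and replaced by their mean.
    """
    blocks = []  # each block is [sum of grades, number of items]
    for g in grades:
        blocks.append([g, 1])
        # Merge while the previous block's mean is below the current one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    adjusted = []
    for s, c in blocks:
        adjusted.extend([s / c] * c)
    return adjusted

# Noisy grades listed in the owner's claimed order (best to worst).
print(isotonic_nonincreasing([8, 9, 5, 6, 2]))  # [8.5, 8.5, 5.5, 5.5, 2.0]
```

Grades that already agree with the ranking are left unchanged; only the pairs that contradict it are averaged.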
    Fault-Tolerant Collaborative Inference through the Edge-PRUNE Framework. (arXiv:2206.08152v1 [cs.LG])
    Collaborative inference has received significant research interest in machine learning as a vehicle for distributing computation load, reducing latency, as well as addressing privacy preservation in communications. Recent collaborative inference frameworks have adopted dynamic inference methodologies such as early-exit and run-time partitioning of neural networks. However, as machine learning frameworks scale in the number of inference inputs, e.g., in surveillance applications, fault tolerance related to device failure needs to be considered. This paper presents the Edge-PRUNE distributed computing framework, built on a formally defined model of computation, which provides a flexible infrastructure for fault tolerant collaborative inference. The experimental section of this work shows results on achievable inference time savings by collaborative inference, presents fault tolerant system topologies and analyzes their cost in terms of execution time overhead.
    Gradient Descent for Low-Rank Functions. (arXiv:2206.08257v1 [cs.LG])
    Several recent empirical studies demonstrate that important machine learning tasks, e.g., training deep neural networks, exhibit low-rank structure, where the loss function varies significantly in only a few directions of the input space. In this paper, we leverage such low-rank structure to reduce the high computational cost of canonical gradient-based methods such as gradient descent (GD). Our proposed Low-Rank Gradient Descent (LRGD) algorithm finds an $\epsilon$-approximate stationary point of a $p$-dimensional function by first identifying $r \leq p$ significant directions, and then estimating the true $p$-dimensional gradient at every iteration by computing directional derivatives only along those $r$ directions. We establish that the "directional oracle complexities" of LRGD for strongly convex and non-convex objective functions are $\mathcal{O}(r \log(1/\epsilon) + rp)$ and $\mathcal{O}(r/\epsilon^2 + rp)$, respectively. When $r \ll p$, these complexities are smaller than the known complexities of $\mathcal{O}(p \log(1/\epsilon))$ and $\mathcal{O}(p/\epsilon^2)$ of GD in the strongly convex and non-convex settings, respectively. Thus, LRGD significantly reduces the computational cost of gradient-based methods for sufficiently low-rank functions. In the course of our analysis, we also formally define and characterize the classes of exact and approximately low-rank functions.
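The second phase of the idea (estimating the gradient from only $r$ directional derivatives) can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it assumes the $r$ significant directions are already known and orthonormal, and it uses central finite differences as a stand-in for the directional-derivative oracle. All names are hypothetical.

```python
def directional_derivative(f, x, u, h=1e-5):
    """Central finite-difference estimate of the derivative of f at x along u."""
    x_plus = [xi + h * ui for xi, ui in zip(x, u)]
    x_minus = [xi - h * ui for xi, ui in zip(x, u)]
    return (f(x_plus) - f(x_minus)) / (2 * h)

def low_rank_gradient_descent(f, x0, directions, step, iters):
    """Gradient descent that queries only r directional derivatives per step.

    directions: r orthonormal vectors assumed to span the subspace in which
    f varies (given here; the paper identifies them in a first phase).
    """
    x = list(x0)
    for _ in range(iters):
        # r oracle calls instead of p partial derivatives.
        coeffs = [directional_derivative(f, x, u) for u in directions]
        # Reassemble the p-dimensional gradient estimate: sum_k c_k * u_k.
        grad = [
            sum(c * u[i] for c, u in zip(coeffs, directions))
            for i in range(len(x))
        ]
        x = [xi - step * gi for xi, gi in zip(x, grad)]
    return x

# A 6-dimensional function that varies only along the first two coordinates.
def f(x):
    return 0.5 * (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2

e0 = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
e1 = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
x_star = low_rank_gradient_descent(f, [0.0] * 6, [e0, e1], step=0.2, iters=100)
```

Each iteration costs two directional-derivative queries here, against six partial derivatives for plain GD, which is the source of the $r$ versus $p$ gap in the stated complexities.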
    Constrained Submodular Optimization for Vaccine Design. (arXiv:2206.08336v1 [q-bio.QM])
    Advances in machine learning have enabled the prediction of immune system responses to prophylactic and therapeutic vaccines. However, the engineering task of designing vaccines remains a challenge. In particular, the genetic variability of the human immune system makes it difficult to design peptide vaccines that provide widespread immunity in vaccinated populations. We introduce a framework for evaluating and designing peptide vaccines that uses probabilistic machine learning models, and demonstrate its ability to produce designs for a SARS-CoV-2 vaccine that outperform previous designs. We provide a theoretical analysis of the approximability, scalability, and complexity of our framework.
    Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning. (arXiv:2206.08307v1 [cs.LG])
    We study the asynchronous stochastic gradient descent algorithm for distributed training over $n$ workers which have varying computation and communication frequency over time. In this algorithm, workers compute stochastic gradients in parallel at their own pace and return those to the server without any synchronization. Existing convergence rates of this algorithm for non-convex smooth objectives depend on the maximum gradient delay $\tau_{\max}$ and show that an $\epsilon$-stationary point is reached after $\mathcal{O}\!\left(\sigma^2\epsilon^{-2}+ \tau_{\max}\epsilon^{-1}\right)$ iterations, where $\sigma$ denotes the variance of stochastic gradients. In this work (i) we obtain a tighter convergence rate of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2}+ \sqrt{\tau_{\max}\tau_{avg}}\epsilon^{-1}\right)$ without any change in the algorithm, where $\tau_{avg}$ is the average delay, which can be significantly smaller than $\tau_{\max}$. We also provide (ii) a simple delay-adaptive learning rate scheme, under which asynchronous SGD achieves a convergence rate of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2}+ \tau_{avg}\epsilon^{-1}\right)$ and requires neither extra hyperparameter tuning nor extra communication. Our result allows us to show for the first time that asynchronous SGD is always faster than mini-batch SGD. In addition, (iii) we consider the case of heterogeneous functions motivated by federated learning applications and improve the convergence rate by proving a weaker dependence on the maximum delay compared to prior works. In particular, we show that the heterogeneity term in the convergence rate is only affected by the average delay within each worker.
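A quick way to see how the three bounds compare is to evaluate them, with constants dropped, for one setting. The function below and the numbers passed to it are purely illustrative assumptions; only the three expressions come from the abstract.

```python
import math

def iteration_bounds(sigma2, eps, tau_max, tau_avg):
    """Evaluate the three iteration bounds quoted above, ignoring constants.

    sigma2 stands for the sigma^2 term, eps for the target accuracy.
    """
    noise = sigma2 / eps ** 2  # statistical term shared by all three bounds
    return {
        "prior": noise + tau_max / eps,
        "tighter": noise + math.sqrt(tau_max * tau_avg) / eps,
        "delay_adaptive": noise + tau_avg / eps,
    }

# Hypothetical setting: a few slow workers make tau_max much larger than tau_avg.
bounds = iteration_bounds(sigma2=1.0, eps=0.01, tau_max=100, tau_avg=4)
for name, value in bounds.items():
    print(f"{name}: about {value:.0f} iterations")
```

Because $\tau_{avg} \leq \sqrt{\tau_{\max}\tau_{avg}} \leq \tau_{\max}$ always holds, the ordering of the three bounds does not depend on the particular values chosen; the gap only grows as the delay distribution becomes more skewed.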
    Generalized Leverage Scores: Geometric Interpretation and Applications. (arXiv:2206.08054v1 [cs.LG])
    In problems involving matrix computations, the concept of leverage has found a large number of applications. In particular, leverage scores, which relate the columns of a matrix to the subspaces spanned by its leading singular vectors, are helpful in revealing column subsets to approximately factorize a matrix with quality guarantees. As such, they provide a solid foundation for a variety of machine-learning methods. In this paper we extend the definition of leverage scores to relate the columns of a matrix to arbitrary subsets of singular vectors. We establish a precise connection between column and singular-vector subsets, by relating the concepts of leverage scores and principal angles between subspaces. We employ this result to design approximation algorithms with provable guarantees for two well-known problems: generalized column subset selection and sparse canonical correlation analysis. We run numerical experiments to provide further insight on the proposed methods. The novel bounds we derive improve our understanding of fundamental concepts in matrix approximations. In addition, our insights may serve as building blocks for further contributions.
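For reference, the standard definition being extended can be written out as follows. The first formula is the usual rank-$k$ column leverage score; the second is one natural reading of "arbitrary subsets of singular vectors" and is an assumption here, since the abstract does not state the paper's exact definition. The symbols $V$, $S$ and $\ell_j$ are notation chosen for this note.

```latex
% Thin SVD of A (m x n, rank rho): A = U \Sigma V^\top, with V = [v_1, \dots, v_\rho].
% Standard rank-k leverage score of column j:
\ell_j^{(k)} = \sum_{i=1}^{k} V_{ji}^{2}, \qquad j = 1, \dots, n.

% Assumed generalization to an arbitrary index subset S \subseteq \{1, \dots, \rho\}:
\ell_j^{(S)} = \sum_{i \in S} V_{ji}^{2}.

% Since the columns of V are orthonormal, in both cases
0 \le \ell_j^{(S)} \le 1, \qquad \sum_{j=1}^{n} \ell_j^{(S)} = |S|.
```

Taking $S = \{1, \dots, k\}$ recovers the standard scores, so the subset version contains the usual definition as a special case.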
    SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation. (arXiv:2206.08367v1 [cs.CV])
    Adapting to a continuously evolving environment is a safety-critical challenge inevitably faced by all autonomous driving systems. Existing image and video driving datasets, however, fall short of capturing the mutable nature of the real world. In this paper, we introduce the largest multi-task synthetic dataset for autonomous driving, SHIFT. It presents discrete and continuous shifts in cloudiness, rain and fog intensity, time of day, and vehicle and pedestrian density. Featuring a comprehensive sensor suite and annotations for several mainstream perception tasks, SHIFT allows investigating the degradation of a perception system performance at increasing levels of domain shift, fostering the development of continuous adaptation strategies to mitigate this problem and assess model robustness and generality. Our dataset and benchmark toolkit are publicly available at www.vis.xyz/shift.
    Rank the triplets: A ranking-based multiple instance learning framework for detecting HPV infection in head and neck cancers using routine H&E images. (arXiv:2206.08275v1 [cs.CV])
    The aetiology of head and neck squamous cell carcinoma (HNSCC) involves multiple carcinogens such as alcohol, tobacco and infection with human papillomavirus (HPV). As the HPV infection influences the prognosis, treatment and survival of patients with HNSCC, it is important to determine the HPV status of these tumours. In this paper, we propose a novel triplet-ranking loss function and a multiple instance learning pipeline for HPV status prediction. This achieves a new state-of-the-art performance in HPV detection using only the routine H&E stained WSIs on two HNSCC cohorts. Furthermore, a comprehensive tumour microenvironment profiling was performed, which characterised the unique patterns between HPV+/- HNSCC from genomic, immunology and cellular perspectives. Positive correlations of the proposed score with different subtypes of T cells (e.g. T cells follicular helper, CD8+ T cells), and negative correlations with macrophages and connective cells (e.g. fibroblast) were identified, which is in line with clinical findings. Unique gene expression profiles were also identified with respect to HPV infection status, which is likewise in line with existing findings.
    Deep Neural Imputation: A Framework for Recovering Incomplete Brain Recordings. (arXiv:2206.08094v1 [cs.LG])
    Neuroscientists and neuroengineers have long relied on multielectrode neural recordings to study the brain. However, in a typical experiment, many factors corrupt neural recordings from individual electrodes, including electrical noise, movement artifacts, and faulty manufacturing. Currently, common practice is to discard these corrupted recordings, reducing already limited data that is difficult to collect. To address this challenge, we propose Deep Neural Imputation (DNI), a framework to recover missing values from electrodes by learning from data collected across spatial locations, days, and participants. We explore our framework with a linear nearest-neighbor approach and two deep generative autoencoders, demonstrating DNI's flexibility. One deep autoencoder models participants individually, while the other extends this architecture to model many participants jointly. We evaluate our models across 12 human participants implanted with multielectrode intracranial electrocorticography arrays; participants had no explicit task and behaved naturally across hundreds of recording hours. We show that DNI recovers not only time series but also frequency content, and further establish DNI's practical value by recovering significant performance on a scientifically-relevant downstream neural decoding task.
    General Cyclical Training of Neural Networks. (arXiv:2202.08835v2 [cs.LG] UPDATED)
    This paper describes the principle of "General Cyclical Training" in machine learning, where training starts and ends with "easy training" and the "hard training" happens during the middle epochs. We propose several manifestations for training neural networks, including algorithmic examples (via hyper-parameters and loss functions), data-based examples, and model-based examples. Specifically, we introduce several novel techniques: cyclical weight decay, cyclical batch size, cyclical focal loss, cyclical softmax temperature, cyclical data augmentation, cyclical gradient clipping, and cyclical semi-supervised learning. In addition, we demonstrate that cyclical weight decay, cyclical softmax temperature, and cyclical gradient clipping (as three examples of this principle) are beneficial in the test accuracy performance of a trained model. Furthermore, we discuss model-based examples (such as pretraining and knowledge distillation) from the perspective of general cyclical training and recommend some changes to the typical training methodology. In summary, this paper defines the general cyclical training concept and discusses several specific ways in which this concept can be applied to training neural networks. In the spirit of reproducibility, the code used in our experiments is available at https://github.com/lnsmith54/CFL.
    Active Learning on a Budget: Opposite Strategies Suit High and Low Budgets. (arXiv:2202.02794v4 [cs.LG] UPDATED)
    Investigating active learning, we focus on the relation between the number of labeled examples (budget size), and suitable querying strategies. Our theoretical analysis shows a behavior reminiscent of phase transition: typical examples are best queried when the budget is low, while unrepresentative examples are best queried when the budget is large. Combined evidence shows that a similar phenomenon occurs in common classification models. Accordingly, we propose TypiClust -- a deep active learning strategy suited for low budgets. In a comparative empirical investigation of supervised learning, using a variety of architectures and image datasets, TypiClust outperforms all other active learning strategies in the low-budget regime. Using TypiClust in the semi-supervised framework, performance gets an even more significant boost. In particular, state-of-the-art semi-supervised methods trained on CIFAR-10 with 10 labeled examples selected by TypiClust, reach 93.2% accuracy -- an improvement of 39.4% over random selection. Code is available at https://github.com/avihu111/TypiClust.
    Data-Free Adversarial Knowledge Distillation for Graph Neural Networks. (arXiv:2205.03811v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have been widely used in modeling graph structured data, owing to their impressive performance in a wide range of practical applications. Recently, knowledge distillation (KD) for GNNs has enabled remarkable progress in graph model compression and knowledge transfer. However, most of the existing KD methods require a large volume of real data, which are not readily available in practice, and may preclude their applicability in scenarios where the teacher model is trained on rare or hard to acquire datasets. To address this problem, we propose the first end-to-end framework for data-free adversarial knowledge distillation on graph structured data (DFAD-GNN). To be specific, our DFAD-GNN employs a generative adversarial network, which mainly consists of three components: a pre-trained teacher model and a student model are regarded as two discriminators, and a generator is utilized for deriving training graphs to distill knowledge from the teacher model into the student model. Extensive experiments on various benchmark models and six representative datasets demonstrate that our DFAD-GNN significantly surpasses state-of-the-art data-free baselines in the graph classification task.
    Cyclocopula Technique to Study the Relationship Between Two Cyclostationary Time Series with Fractional Brownian Motion Errors. (arXiv:2206.07976v1 [stat.ME])
    Detecting the relationship between two time series is important in environmental and hydrological studies. Several parametric and non-parametric approaches can be applied to detect relationships. These techniques are usually sensitive to stationarity assumptions. In this research, a new copula-based method is introduced to detect the relationship between two cyclostationary time series with fractional Brownian motion (fBm) errors. The numerical studies verify the performance of the introduced approach.
    User Engagement and Churn in Mobile Health Applications. (arXiv:2206.08178v1 [stat.ML])
    Mobile health apps are revolutionizing the healthcare ecosystem by improving communication, efficiency, and quality of service. In low- and middle-income countries, they also play a unique role as a source of information about health outcomes and behaviors of patients and healthcare workers, while providing a suitable channel to deliver both personalized and collective policy interventions. We propose a framework to study user engagement with mobile health, focusing on healthcare workers and digital health apps designed to support them in resource-poor settings. The behavioral logs produced by these apps can be transformed into daily time series characterizing each user's activity. We use probabilistic and survival analysis to build multiple personalized measures of meaningful engagement, which could serve to tailor content and digital interventions suiting each health worker's specific needs. Special attention is given to the problem of detecting churn, understood as a marker of complete disengagement. We discuss the application of our methods to the Indian and Ethiopian users of the Safe Delivery App, a capacity-building tool for skilled birth attendants. This work represents an important step towards a full characterization of user engagement in mobile health applications, which can significantly enhance the abilities of health workers and, ultimately, save lives.
    ResNorm: Tackling Long-tailed Degree Distribution Issue in Graph Neural Networks via Normalization. (arXiv:2206.08181v1 [cs.LG])
    Graph Neural Networks (GNNs) have attracted much attention due to their ability to learn representations from graph-structured data. Despite the successful applications of GNNs in many domains, the optimization of GNNs is less well studied, and the performance on node classification heavily suffers from the long-tailed node degree distribution. This paper focuses on improving the performance of GNNs via normalization. In detail, by studying the long-tailed distribution of node degrees in the graph, we propose a novel normalization method for GNNs, which is termed ResNorm (Reshaping the long-tailed distribution into a normal-like distribution via normalization). The $scale$ operation of ResNorm reshapes the node-wise standard deviation (NStd) distribution so as to improve the accuracy of tail nodes (i.e., low-degree nodes). We provide a theoretical interpretation and empirical evidence for understanding the mechanism of the above $scale$. In addition to the long-tailed distribution issue, over-smoothing is also a fundamental issue plaguing the community. To this end, we analyze the behavior of the standard shift and prove that the standard shift serves as a preconditioner on the weight matrix, increasing the risk of over-smoothing. With the over-smoothing issue in mind, we design a $shift$ operation for ResNorm that simulates the degree-specific parameter strategy in a low-cost manner. Extensive experiments have validated the effectiveness of ResNorm on several node classification benchmark datasets.
    PROFHIT: Probabilistic Robust Forecasting for Hierarchical Time-series. (arXiv:2206.07940v1 [cs.LG])
    Probabilistic hierarchical time-series forecasting is an important variant of time-series forecasting, where the goal is to model and forecast multivariate time-series that have underlying hierarchical relations. Most methods focus on point predictions and do not provide well-calibrated probabilistic forecast distributions. Recent state-of-the-art probabilistic forecasting methods also impose hierarchical relations on point predictions and samples of the distribution, which does not account for the coherency of forecast distributions. Previous works also silently assume that datasets are always consistent with the given hierarchical relations and do not adapt to real-world datasets that deviate from this assumption. We close both these gaps and propose PROFHIT, a fully probabilistic hierarchical forecasting model that jointly models the forecast distribution of the entire hierarchy. PROFHIT uses a flexible probabilistic Bayesian approach and introduces a novel Distributional Coherency regularization to learn from hierarchical relations for the entire forecast distribution, which enables robust and calibrated forecasts and adapts to datasets of varying hierarchical consistency. On evaluating PROFHIT over a wide range of datasets, we observed 41-88% better performance in accuracy and calibration. Because it models coherency over the full distribution, PROFHIT can robustly provide reliable forecasts even if up to 10% of the input time-series data is missing, where other methods' performance severely degrades by over 70%.
    iBoot: Image-bootstrapped Self-Supervised Video Representation Learning. (arXiv:2206.08339v1 [cs.CV])
    Learning visual representations through self-supervision is an extremely challenging task as the network needs to sieve relevant patterns from spurious distractors without the active guidance provided by supervision. This is achieved through heavy data augmentation, large-scale datasets and prohibitive amounts of compute. Video self-supervised learning (SSL) suffers from added challenges: video datasets are typically not as large as image datasets, compute is an order of magnitude larger, and the number of spurious patterns the optimizer has to sieve through is multiplied several fold. Thus, directly learning self-supervised representations from video data might result in sub-optimal performance. To address this, we propose to utilize a strong image-based model, pre-trained with self- or language supervision, in a video representation learning framework, enabling the model to learn strong spatial and temporal information without relying on labeled video data. To this end, we modify the typical video-based SSL design and objective to encourage the video encoder to subsume the semantic content of an image-based model trained on a general domain. The proposed algorithm is shown to learn much more efficiently (i.e. in fewer epochs and with a smaller batch) and results in a new state-of-the-art performance on standard downstream tasks among single-modality SSL methods.
    When a RF Beats a CNN and GRU, Together -- A Comparison of Deep Learning and Classical Machine Learning Approaches for Encrypted Malware Traffic Classification. (arXiv:2206.08004v1 [cs.CR])
    Internet traffic classification is widely used to facilitate network management. It plays a crucial role in Quality of Service (QoS), Quality of Experience (QoE), network visibility, intrusion detection, and traffic trend analyses. While there is no theoretical guarantee that deep learning (DL)-based solutions perform better than classic machine learning (ML)-based ones, DL-based models have become the common default. This paper compares well-known DL-based and ML-based models and shows that in the case of malicious traffic classification, state-of-the-art DL-based solutions do not necessarily outperform the classical ML-based ones. We exemplify this finding using two well-known datasets for a varied set of tasks, such as malware detection, malware family classification, detection of zero-day attacks, and classification of an iteratively growing dataset. Note that it is not feasible to evaluate all possible models to make a concrete statement; thus, the above finding is not a recommendation to avoid DL-based models, but rather empirical proof that in some cases there are simpler solutions that may perform even better.
    Concentration of Data Encoding in Parameterized Quantum Circuits. (arXiv:2206.08273v1 [quant-ph])
    Variational quantum algorithms have been acknowledged as a leading strategy to realize near-term quantum advantages in meaningful tasks, including machine learning and combinatorial optimization. When applied to tasks involving classical data, such algorithms generally begin with quantum circuits for data encoding and then train quantum neural networks (QNNs) to minimize target functions. Although QNNs have been widely studied to improve these algorithms' performance on practical tasks, there is a gap in systematically understanding the influence of data encoding on the eventual performance. In this paper, we make progress in filling this gap by considering the common data encoding strategies based on parameterized quantum circuits. We prove that, under reasonable assumptions, the distance between the average encoded state and the maximally mixed state can be explicitly upper-bounded with respect to the width and depth of the encoding circuit. This result in particular implies that the average encoded state concentrates on the maximally mixed state exponentially fast in the depth. Such concentration seriously limits the capabilities of quantum classifiers, and strictly restricts the distinguishability of encoded states from a quantum information perspective. We further support our findings by numerically verifying these results on both synthetic and public data sets. Our results highlight the significance of quantum data encoding in machine learning tasks and may shed light on future encoding strategies.
    Zero-Shot Video Question Answering via Frozen Bidirectional Language Models. (arXiv:2206.08155v1 [cs.CV])
    Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we here build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised settings. Our code and models will be made publicly available at https://antoyang.github.io/frozenbilm.html.
    A Closer Look at Smoothness in Domain Adversarial Training. (arXiv:2206.08213v1 [cs.LG])
    Domain adversarial training has been ubiquitous for achieving invariant representations and is used widely for various domain adaptation tasks. In recent times, methods converging to smooth optima have shown improved generalization for supervised learning tasks like classification. In this work, we analyze the effect of smoothness-enhancing formulations on domain adversarial training, the objective of which is a combination of task loss (e.g., classification, regression, etc.) and adversarial terms. We find that converging to smooth minima with respect to (w.r.t.) task loss stabilizes the adversarial training, leading to better performance on the target domain. In contrast to task loss, our analysis shows that converging to smooth minima w.r.t. adversarial loss leads to sub-optimal generalization on the target domain. Based on the analysis, we introduce the Smooth Domain Adversarial Training (SDAT) procedure, which effectively enhances the performance of existing domain adversarial methods for both classification and object detection tasks. Our analysis also provides insight into the extensive usage of SGD over Adam in the community for domain adversarial training.
    MAGIC: Microlensing Analysis Guided by Intelligent Computation. (arXiv:2206.08199v1 [astro-ph.IM])
    The modeling of binary microlensing light curves via the standard sampling-based method can be challenging, because of the time-consuming light curve computation and the pathological likelihood landscape in the high-dimensional parameter space. In this work, we present MAGIC, which is a machine learning framework to efficiently and accurately infer the microlensing parameters of binary events with realistic data quality. In MAGIC, binary microlensing parameters are divided into two groups and inferred separately with different neural networks. The key feature of MAGIC is the introduction of neural controlled differential equations, which provide the capability to handle light curves with irregular sampling and large data gaps. Based on simulated light curves, we show that MAGIC can achieve fractional uncertainties of a few percent on the binary mass ratio and separation. We also test MAGIC on a real microlensing event. MAGIC is able to locate the degenerate solutions even when large data gaps are introduced. As irregular sampling is common in astronomical surveys, our method also has implications for other studies that involve time series.
    Pythae: Unifying Generative Autoencoders in Python -- A Benchmarking Use Case. (arXiv:2206.08309v1 [cs.LG])
    In recent years, deep generative models have attracted increasing interest due to their capacity to model complex distributions. Among those models, variational autoencoders have gained popularity as they have proven both to be computationally efficient and yield impressive results in multiple fields. Following this breakthrough, extensive research has been done in order to improve the original publication, resulting in a variety of different VAE models in response to different tasks. In this paper we present Pythae, a versatile open-source Python library providing both a unified implementation and a dedicated framework allowing straightforward, reproducible and reliable use of generative autoencoder models. We then propose to use this library to perform a case study benchmark where we present and compare 19 generative autoencoder models representative of some of the main improvements on downstream tasks such as image reconstruction, generation, classification, clustering and interpolation. The open-source library can be found at https://github.com/clementchadebec/benchmark_VAE.
    Functional Output Regression with Infimal Convolution: Exploring the Huber and $\epsilon$-insensitive Losses. (arXiv:2206.08220v1 [stat.ML])
    The focus of the paper is functional output regression (FOR) with convoluted losses. While most existing work considers the square loss setting, we leverage extensions of the Huber and the $\epsilon$-insensitive loss (induced by infimal convolution) and propose a flexible framework capable of handling various forms of outliers and sparsity in the FOR family. We derive computationally tractable algorithms relying on duality to tackle the resulting tasks in the context of vector-valued reproducing kernel Hilbert spaces. The efficiency of the approach is demonstrated and contrasted with the classical squared loss setting on both synthetic and real-world benchmarks.
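    For orientation, the scalar versions of the two losses named above can be written as infimal convolutions. This is the standard textbook form, not the functional-output extension developed in the paper; the symbols $\kappa$ (Huber threshold) and $\iota$ (convex indicator function) are notation assumed here for illustration.

```latex
% Infimal convolution of two functions f and g:
(f \,\square\, g)(x) = \inf_{y} \; f(x - y) + g(y)

% Huber loss with threshold \kappa > 0: square loss convolved with \kappa |\cdot|
H_\kappa(x) = \inf_{y} \; \tfrac{1}{2}(x - y)^2 + \kappa |y|
            = \begin{cases}
                \tfrac{1}{2} x^2, & |x| \le \kappa, \\
                \kappa |x| - \tfrac{1}{2}\kappa^2, & |x| > \kappa.
              \end{cases}

% \epsilon-insensitive loss: absolute loss convolved with the indicator of [-\epsilon, \epsilon]
\ell_\epsilon(x) = \inf_{|y| \le \epsilon} |x - y|
                 = \bigl(|\cdot| \,\square\, \iota_{[-\epsilon,\epsilon]}\bigr)(x)
                 = \max(|x| - \epsilon, 0).
```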
    Maximum Likelihood Training for Score-Based Diffusion ODEs by High-Order Denoising Score Matching. (arXiv:2206.08265v1 [stat.ML])
    Score-based generative models have excellent performance in terms of generation quality and likelihood. They model the data distribution by matching a parameterized score network with first-order data score functions. The score network can be used to define an ODE ("score-based diffusion ODE") for exact likelihood evaluation. However, the relationship between the likelihood of the ODE and the score matching objective is unclear. In this work, we prove that matching the first-order score is not sufficient to maximize the likelihood of the ODE, by showing a gap between the maximum likelihood and score matching objectives. To fill this gap, we show that the negative likelihood of the ODE can be bounded by controlling the first, second, and third-order score matching errors; and we further present a novel high-order denoising score matching method to enable maximum likelihood training of score-based diffusion ODEs. Our algorithm guarantees that the higher-order matching error is bounded by the training error and the lower-order errors. We empirically observe that by high-order score matching, score-based diffusion ODEs achieve better likelihood on both synthetic data and CIFAR-10, while retaining the high generation quality.
    Continuous-Time Modeling of Counterfactual Outcomes Using Neural Controlled Differential Equations. (arXiv:2206.08311v1 [cs.LG])
    Estimating counterfactual outcomes over time has the potential to unlock personalized healthcare by helping decision-makers answer "what-if" questions. Existing causal inference approaches typically consider regular, discrete-time intervals between observations and treatment decisions and hence are unable to naturally model irregularly sampled data, which is the common setting in practice. To handle arbitrary observation patterns, we interpret the data as samples from an underlying continuous-time process and propose to model its latent trajectory explicitly using the mathematics of controlled differential equations. This leads to a new approach, the Treatment Effect Neural Controlled Differential Equation (TE-CDE), that allows the potential outcomes to be evaluated at any time point. In addition, adversarial training is used to adjust for time-dependent confounding, which is critical in longitudinal settings and is an added challenge not encountered in conventional time series. To assess solutions to this problem, we propose a controllable simulation environment based on a model of tumor growth for a range of scenarios with irregular sampling reflective of a variety of clinical scenarios. TE-CDE consistently outperforms existing approaches in all simulated scenarios with irregular sampling.
    Inherent Inconsistencies of Feature Importance. (arXiv:2206.08204v1 [cs.LG])
    The black-box nature of modern machine learning techniques invokes a practical and ethical need for explainability. Feature importance aims to meet this need by assigning scores to features, so humans can understand their influence on predictions. Feature importance can be used to explain predictions under different settings: of the entire sample space or a specific instance; of model behavior, or the dependencies in the data themselves. However, in most cases thus far, each of these settings was studied in isolation. We attempt to develop a sound feature importance score framework by defining a small set of desired properties. Surprisingly, we prove an inconsistency theorem, showing that the expected properties cannot hold simultaneously. To overcome this difficulty, we propose the novel notion of re-partitioning the feature space into separable sets. Such sets are constructed to contain features that exhibit inter-set independence with respect to the target variable. We show that there exists a unique maximal partitioning into separable sets. Moreover, assigning scores to separable sets, instead of single features, unifies the results of commonly used feature importance scores and annihilates the inconsistencies we demonstrated.
    On Scaled Methods for Saddle Point Problems. (arXiv:2206.08303v1 [cs.LG])
    Methods with adaptive scaling of different features play a key role in solving saddle point problems (SPPs), primarily due to Adam's popularity for solving adversarial machine learning problems, including GAN training. This paper carries out a theoretical analysis of the following scaling techniques for solving SPPs: the well-known Adam and RMSProp scaling and the newer AdaHessian and OASIS based on the Hutchinson approximation. We use the Extra Gradient method and its improved version with negative momentum as the base methods. Experimental studies on GANs show good applicability not only for Adam, but also for other less popular methods.
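    As a rough illustration of the base method mentioned above, the sketch below runs plain (unscaled) Extra Gradient on a toy bilinear saddle point problem $\min_x \max_y x^\top A y$ and compares it with simultaneous gradient descent-ascent. The problem, step size, iteration count, and function names are assumptions chosen for illustration; the scaled variants analyzed in the paper (Adam, RMSProp, AdaHessian, OASIS) are not reproduced here.

```python
import numpy as np


def extragradient(A, x0, y0, step=0.1, iters=2000):
    """Plain Extra Gradient for min_x max_y x^T A y (no adaptive scaling)."""
    x, y = np.array(x0, dtype=float), np.array(y0, dtype=float)
    for _ in range(iters):
        # Extrapolation: take a trial step using gradients at the current point.
        x_half = x - step * (A @ y)
        y_half = y + step * (A.T @ x)
        # Update: step from the current point using gradients at the trial point.
        x = x - step * (A @ y_half)
        y = y + step * (A.T @ x_half)
    return x, y


def gradient_descent_ascent(A, x0, y0, step=0.1, iters=2000):
    """Simultaneous gradient descent-ascent, shown only as a baseline."""
    x, y = np.array(x0, dtype=float), np.array(y0, dtype=float)
    for _ in range(iters):
        x, y = x - step * (A @ y), y + step * (A.T @ x)
    return x, y


A = np.eye(2)                      # toy coupling matrix (assumption)
x0, y0 = np.ones(2), np.ones(2)    # arbitrary starting point
x_eg, y_eg = extragradient(A, x0, y0)
x_gda, y_gda = gradient_descent_ascent(A, x0, y0)
eg_norm = float(np.linalg.norm(np.concatenate([x_eg, y_eg])))
gda_norm = float(np.linalg.norm(np.concatenate([x_gda, y_gda])))
print("extragradient distance to saddle point:", eg_norm)
print("descent-ascent distance to saddle point:", gda_norm)
```

    On this toy problem the unique saddle point is the origin. Extra Gradient contracts toward it, while simultaneous descent-ascent spirals outward, which is one common motivation for using Extra Gradient as a base method.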
    Search-Based Testing Approach for Deep Reinforcement Learning Agents. (arXiv:2206.07813v1 [cs.SE])
    Deep Reinforcement Learning (DRL) algorithms have been increasingly employed during the last decade to solve various decision-making problems such as autonomous driving and robotics. However, these algorithms have faced great challenges when deployed in safety-critical environments since they often exhibit erroneous behaviors that can lead to potentially critical errors. One way to assess the safety of DRL agents is to test them to detect possible faults leading to critical failures during their execution. This raises the question of how we can efficiently test DRL policies to ensure their correctness and adherence to safety requirements. Most existing works on testing DRL agents use adversarial attacks that perturb states or actions of the agent. However, such attacks often lead to unrealistic states of the environment. Their main goal is to test the robustness of DRL agents rather than testing the compliance of agents' policies with respect to requirements. Due to the huge state space of DRL environments, the high cost of test execution, and the black-box nature of DRL algorithms, the exhaustive testing of DRL agents is impossible. In this paper, we propose a Search-based Testing Approach of Reinforcement Learning Agents (STARLA) to test the policy of a DRL agent by effectively searching for failing executions of the agent within a limited testing budget. We use machine learning models and a dedicated genetic algorithm to narrow the search towards faulty episodes. We apply STARLA on a Deep-Q-Learning agent which is widely used as a benchmark and show that it significantly outperforms Random Testing by detecting more faults related to the agent's policy. We also investigate how to extract rules that characterize faulty episodes of the DRL agent using our search results. Such rules can be used to understand the conditions under which the agent fails and thus assess its deployment risks.
    On the well-spread property and its relation to linear regression. (arXiv:2206.08092v1 [cs.LG])
    We consider the robust linear regression model $\boldsymbol{y} = X\beta^* + \boldsymbol{\eta}$, where an adversary oblivious to the design $X \in \mathbb{R}^{n \times d}$ may choose $\boldsymbol{\eta}$ to corrupt all but a (possibly vanishing) fraction of the observations $\boldsymbol{y}$ in an arbitrary way. Recent work [dLN+21, dNS21] has introduced efficient algorithms for consistent recovery of the parameter vector. These algorithms crucially rely on the design matrix being well-spread (a matrix is well-spread if its column span is far from any sparse vector). In this paper, we show that there exists a family of design matrices lacking well-spreadness such that consistent recovery of the parameter vector in the above robust linear regression model is information-theoretically impossible. We further investigate the average-case time complexity of certifying well-spreadness of random matrices. We show that it is possible to efficiently certify whether a given $n$-by-$d$ Gaussian matrix is well-spread if the number of observations is quadratic in the ambient dimension. We complement this result by showing rigorous evidence -- in the form of a lower bound against low-degree polynomials -- of the computational hardness of this same certification problem when the number of observations is $o(d^2)$.
    Adapting Self-Supervised Vision Transformers by Probing Attention-Conditioned Masking Consistency. (arXiv:2206.08222v1 [cs.CV])
    Visual domain adaptation (DA) seeks to transfer trained models to unseen, unlabeled domains across distribution shift, but approaches typically focus on adapting convolutional neural network architectures initialized with supervised ImageNet representations. In this work, we shift focus to adapting modern architectures for object recognition -- the increasingly popular Vision Transformer (ViT) -- and modern pretraining based on self-supervised learning (SSL). Inspired by the design of recent SSL approaches based on learning from partial image inputs generated via masking or cropping -- either by learning to predict the missing pixels, or learning representational invariances to such augmentations -- we propose PACMAC, a simple two-stage adaptation algorithm for self-supervised ViTs. PACMAC first performs in-domain SSL on pooled source and target data to learn task-discriminative features, and then probes the model's predictive consistency across a set of partial target inputs generated via a novel attention-conditioned masking strategy, to identify reliable candidates for self-training. Our simple approach leads to consistent performance gains over competing methods that use ViTs and self-supervised initializations on standard object recognition benchmarks. Code available at https://github.com/virajprabhu/PACMAC
    Linearity Grafting: Relaxed Neuron Pruning Helps Certifiable Robustness. (arXiv:2206.07839v1 [cs.LG])
    Certifiable robustness is a highly desirable property for adopting deep neural networks (DNNs) in safety-critical scenarios, but often demands tedious computations to establish. The main hurdle lies in the massive amount of non-linearity in large DNNs. To trade off the DNN expressiveness (which calls for more non-linearity) and robustness certification scalability (which prefers more linearity), we propose a novel solution to strategically manipulate neurons, by "grafting" appropriate levels of linearity. The core of our proposal is to first linearize insignificant ReLU neurons, to eliminate the non-linear components that are both redundant for DNN performance and harmful to its certification. We then optimize the associated slopes and intercepts of the replaced linear activations for restoring model performance while maintaining certifiability. Hence, typical neuron pruning could be viewed as a special case of grafting a linear function of the fixed zero slopes and intercept, that might overly restrict the network flexibility and sacrifice its performance. Extensive experiments on multiple datasets and network backbones show that our linearity grafting can (1) effectively tighten certified bounds; (2) achieve competitive certifiable robustness without certified robust training (i.e., over 30% improvements on CIFAR-10 models); and (3) scale up complete verification to large adversarially trained models with 17M parameters. Codes are available at https://github.com/VITA-Group/Linearity-Grafting.
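    The relationship between grafting and pruning described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `grafted_activation`, the boolean `graft_mask`, and the example values are hypothetical, and how neurons are selected as insignificant is not modeled.

```python
import numpy as np


def grafted_activation(z, graft_mask, slope, intercept):
    """Apply ReLU to kept neurons and a linear map a*z + b to grafted ones.

    z, slope, intercept: arrays of pre-activations and per-neuron parameters.
    graft_mask: boolean array, True where the ReLU is replaced (assumed given).
    """
    return np.where(graft_mask, slope * z + intercept, np.maximum(z, 0.0))


z = np.array([-1.0, 2.0, 3.0])
graft_mask = np.array([False, True, True])

# Neuron 1 is grafted with a learned slope and intercept (illustrative values);
# neuron 2 uses zero slope and zero intercept, which is ordinary pruning.
slope = np.array([0.0, 0.5, 0.0])
intercept = np.array([0.0, 0.1, 0.0])

out = grafted_activation(z, graft_mask, slope, intercept)
print(out)
```

    With zero slope and intercept the grafted neuron always outputs zero, which is the pruning special case the abstract refers to; nonzero values keep the neuron linear but still informative.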
    Patch-level Representation Learning for Self-supervised Vision Transformers. (arXiv:2206.07990v1 [cs.CV])
    Recent self-supervised learning (SSL) methods have shown impressive results in learning visual representations from unlabeled images. This paper aims to improve their performance further by utilizing the architectural advantages of the underlying neural network, as the current state-of-the-art visual pretext tasks for SSL do not enjoy the benefit, i.e., they are architecture-agnostic. In particular, we focus on Vision Transformers (ViTs), which have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks. The unique characteristic of ViT is that it takes a sequence of disjoint patches from an image and processes patch-level representations internally. Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations. To be specific, we enforce invariance against each patch and its neighbors, i.e., each patch treats similar neighboring patches as positive samples. Consequently, training ViTs with SelfPatch learns more semantically meaningful relations among patches (without using human-annotated labels), which can be beneficial, in particular, to downstream tasks of a dense prediction type. Despite its simplicity, we demonstrate that it can significantly improve the performance of existing SSL methods for various visual tasks, including object detection and semantic segmentation. Specifically, SelfPatch significantly improves the recent self-supervised ViT, DINO, by achieving +1.3 AP on COCO object detection, +1.2 AP on COCO instance segmentation, and +2.9 mIoU on ADE20K semantic segmentation.
    Neural Scene Representation for Locomotion on Structured Terrain. (arXiv:2206.08077v1 [cs.RO])
    We propose a learning-based method to reconstruct the local terrain for locomotion with a mobile robot traversing urban environments. Using a stream of depth measurements from the onboard cameras and the robot's trajectory, the algorithm estimates the topography in the robot's vicinity. The raw measurements from these cameras are noisy and only provide partial and occluded observations that in many cases do not show the terrain the robot stands on. Therefore, we propose a 3D reconstruction model that faithfully reconstructs the scene, despite the noisy measurements and large amounts of missing data coming from the blind spots of the camera arrangement. The model consists of a 4D fully convolutional network on point clouds that learns the geometric priors to complete the scene from the context and an auto-regressive feedback to leverage spatio-temporal consistency and use evidence from the past. The network can be solely trained with synthetic data, and due to extensive augmentation, it is robust in the real world, as shown in the validation on a quadrupedal robot, ANYmal, traversing challenging settings. We run the pipeline on the robot's onboard low-power computer using an efficient sparse tensor implementation and show that the proposed method outperforms classical map representations.
    All the World's a (Hyper)Graph: A Data Drama. (arXiv:2206.08225v1 [cs.LG])
    We introduce Hyperbard, a dataset of diverse relational data representations derived from Shakespeare's plays. Our representations range from simple graphs capturing character co-occurrence in single scenes to hypergraphs encoding complex communication settings and character contributions as hyperedges with edge-specific node weights. By making multiple intuitive representations readily available for experimentation, we facilitate rigorous representation robustness checks in graph learning, graph mining, and network analysis, highlighting the advantages and drawbacks of specific representations. Leveraging the data released in Hyperbard, we demonstrate that many solutions to popular graph mining problems are highly dependent on the representation choice, thus calling current graph curation practices into question. As an homage to our data source, and asserting that science can also be art, we present all our points in the form of a play.
    Adversarial Privacy Protection on Speech Enhancement. (arXiv:2206.08170v1 [cs.SD])
    Speech is easily leaked without the speaker noticing, for example when it is recorded by mobile phones in different situations. Private content in speech may be maliciously extracted through speech enhancement technology. Speech enhancement technology has developed rapidly along with deep neural networks (DNNs), but adversarial examples can cause DNNs to fail. In this work, we propose an adversarial method to degrade speech enhancement systems. Experimental results show that generated adversarial examples can erase most content information in original examples or replace it with target speech content through speech enhancement. The word error rate (WER) between the recognition results of an enhanced original example and an enhanced adversarial example can reach 89.0%. The WER of the targeted attack between the enhanced adversarial example and the target example is as low as 33.75%. Adversarial perturbation can bring the rate of change to the original example to more than 1.4430. This work can prevent the malicious extraction of speech.
    Learning Interpretable Representations of Entanglement in Quantum Optics Experiments using Deep Generative Models. (arXiv:2109.02490v2 [cs.LG] UPDATED)
    Quantum physics experiments produce interesting phenomena such as interference or entanglement, which are core properties of numerous future quantum technologies. The complex relationship between the setup structure of a quantum experiment and its entanglement properties is essential to fundamental research in quantum optics but is difficult to intuitively understand. We present a deep generative model of quantum optics experiments where a variational autoencoder is trained on a dataset of quantum optics experimental setups. In a series of computational experiments, we investigate the learned representation of our Quantum Optics Variational Auto Encoder (QOVAE) and its internal understanding of the quantum optics world. We demonstrate that the QOVAE learns an interpretable representation of quantum optics experiments and the relationship between experiment structure and entanglement. We show the QOVAE is able to generate novel experiments for highly entangled quantum states with specific distributions that match its training data. The QOVAE can learn to generate specific entangled states and efficiently search the space of experiments that produce highly entangled quantum states. Importantly, we are able to interpret how the QOVAE structures its latent space, finding curious patterns that we can explain in terms of quantum physics. The results demonstrate how we can use and understand the internal representations of deep generative models in a complex scientific domain. The QOVAE and the insights from our investigations can be immediately applied to other physical systems.
    Multi-Agent Learning for Iterative Dominance Elimination: Formal Barriers and New Algorithms. (arXiv:2111.05486v2 [cs.GT] UPDATED)
    Dominated actions are natural (and perhaps the simplest possible) multi-agent generalizations of sub-optimal actions as in standard single-agent decision making. Thus similar to standard bandit learning, a basic learning question in multi-agent systems is whether agents can learn to efficiently eliminate all dominated actions in an unknown game if they can only observe noisy bandit feedback about the payoff of their played actions. Surprisingly, despite a seemingly simple task, we show a quite negative result; that is, standard no regret algorithms -- including the entire family of Dual Averaging algorithms -- provably take exponentially many rounds to eliminate all dominated actions. Moreover, algorithms with the stronger no swap regret also suffer similar exponential inefficiency. To overcome these barriers, we develop a new algorithm that adjusts Exp3 with Diminishing Historical rewards (termed Exp3-DH); Exp3-DH gradually forgets history at carefully tailored rates. We prove that when all agents run Exp3-DH (a.k.a., self-play in multi-agent learning), all dominated actions can be iteratively eliminated within polynomially many rounds. Our experimental results further demonstrate the efficiency of Exp3-DH, and that state-of-the-art bandit algorithms, even those developed specifically for learning in games, fail to eliminate all dominated actions efficiently.
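    For readers unfamiliar with the baseline, the sketch below implements standard single-agent Exp3 with bandit feedback. The `decay` argument is a hypothetical stand-in for forgetting history: it geometrically discounts past reward estimates and is not the carefully tailored schedule of Exp3-DH, which the abstract does not specify. With the default `decay=1.0` the code is ordinary Exp3. The reward function, arm means, and parameter values are assumptions for illustration.

```python
import numpy as np


def exp3(reward_fn, n_arms, horizon, gamma=0.1, decay=1.0, seed=0):
    """Standard Exp3; decay < 1 is a hypothetical discount on past estimates."""
    rng = np.random.default_rng(seed)
    log_w = np.zeros(n_arms)               # log-weights, for numerical stability
    counts = np.zeros(n_arms, dtype=int)
    probs = np.full(n_arms, 1.0 / n_arms)
    for _ in range(horizon):
        w = np.exp(log_w - log_w.max())
        # Mix the exponential-weights distribution with uniform exploration.
        probs = (1.0 - gamma) * w / w.sum() + gamma / n_arms
        arm = rng.choice(n_arms, p=probs)
        reward = reward_fn(arm, rng)        # bandit feedback, assumed in [0, 1]
        counts[arm] += 1
        log_w *= decay                      # placeholder forgetting (no-op at 1.0)
        # Importance-weighted reward estimate for the played arm only.
        log_w[arm] += gamma * (reward / probs[arm]) / n_arms
    return counts, probs


means = np.array([0.1, 0.2, 0.9])           # toy Bernoulli arm means (assumption)


def bernoulli_reward(arm, rng):
    return float(rng.random() < means[arm])


counts, probs = exp3(bernoulli_reward, n_arms=3, horizon=5000)
print("pull counts:", counts)
print("final sampling distribution:", probs)
```

    In the multi-agent setting of the paper, each player would run such a learner on its own actions, with rewards determined by the joint play; that game loop is omitted here.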
    Tracking Most Significant Arm Switches in Bandits. (arXiv:2112.13838v6 [cs.LG] UPDATED)
    In bandits with distribution shifts, one aims to automatically adapt to unknown changes in reward distribution, and restart exploration when necessary. While this problem has been studied for many years, a recent breakthrough of Auer et al. (2018, 2019) provides the first adaptive procedure to guarantee an optimal (dynamic) regret $\sqrt{LT}$, for $T$ rounds, and an unknown number $L$ of changes. However, while this rate is tight in the worst case, it remained open whether faster rates are possible, without prior knowledge, if few changes in distribution are actually severe. To resolve this question, we propose a new notion of significant shift, which only counts very severe changes that clearly necessitate a restart: roughly, these are changes involving not only best arm switches, but also involving large aggregate differences in reward over time. Thus, our resulting procedure adaptively achieves rates always faster (sometimes significantly) than $O(\sqrt{ST})$, where $S\ll L$ only counts best arm switches, while at the same time, always faster than the optimal $O(V^{\frac{1}{3}}T^{\frac{2}{3}})$ when expressed in terms of total variation $V$ (which aggregates differences over time). Our results are expressed in enough generality to also capture non-stochastic adversarial settings.
    When to intervene? Prescriptive Process Monitoring Under Uncertainty and Resource Constraints. (arXiv:2206.07745v1 [cs.AI])
    Prescriptive process monitoring approaches leverage historical data to prescribe runtime interventions that will likely prevent negative case outcomes or improve a process's performance. A centerpiece of a prescriptive process monitoring method is its intervention policy: a decision function determining if and when to trigger an intervention on an ongoing case. Previous proposals in this field rely on intervention policies that consider only the current state of a given case. These approaches do not consider the tradeoff between triggering an intervention in the current state, given the level of uncertainty of the underlying predictive models, versus delaying the intervention to a later state. Moreover, they assume that a resource is always available to perform an intervention (infinite capacity). This paper addresses these gaps by introducing a prescriptive process monitoring method that filters and ranks ongoing cases based on prediction scores, prediction uncertainty, and causal effect of the intervention, and triggers interventions to maximize a gain function, considering the available resources. The proposal is evaluated using a real-life event log. The results show that the proposed method outperforms existing baselines regarding total gain.
    Optimization-Derived Learning with Essential Convergence Analysis of Training and Hyper-training. (arXiv:2206.07875v1 [cs.LG])
    Recently, Optimization-Derived Learning (ODL), which designs learning models from the perspective of optimization, has attracted attention from the learning and vision communities. However, previous ODL approaches regard the training and hyper-training procedures as two separate stages, meaning that the hyper-training variables have to be fixed during the training process, and thus it is also impossible to simultaneously obtain the convergence of training and hyper-training variables. In this work, we design a Generalized Krasnoselskii-Mann (GKM) scheme based on fixed-point iterations as our fundamental ODL module, which unifies existing ODL methods as special cases. Under the GKM scheme, a Bilevel Meta Optimization (BMO) algorithmic framework is constructed to solve for the optimal training and hyper-training variables together. We rigorously prove the essential joint convergence of the fixed-point iteration for training and the process of optimizing hyper-parameters for hyper-training, both on the approximation quality, and on the stationary analysis. Experiments demonstrate the efficiency of BMO with competitive performance on sparse coding and real-world applications such as image deconvolution and rain streak removal.
    Challenges and Opportunities in Deep Reinforcement Learning with Graph Neural Networks: A Comprehensive review of Algorithms and Applications. (arXiv:2206.07922v1 [cs.LG])
    Deep reinforcement learning (DRL) has empowered a variety of artificial intelligence fields, including pattern recognition, robotics, recommendation-systems, and gaming. Similarly, graph neural networks (GNN) have also demonstrated their superior performance in supervised learning for graph-structured data. In recent times, the fusion of GNN with DRL for graph-structured environments has attracted a lot of attention. This paper provides a comprehensive review of these hybrid works. These works can be classified into two categories: (1) algorithmic enhancement, where DRL and GNN complement each other for better utility; (2) application-specific enhancement, where DRL and GNN support each other. This fusion effectively addresses various complex problems in engineering and life sciences. Based on the review, we further analyze the applicability and benefits of fusing these two domains, especially in terms of increasing generalizability and reducing computational complexity. Finally, the key challenges in integrating DRL and GNN, and potential future research directions are highlighted, which will be of interest to the broader machine learning community.
    Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems. (arXiv:2206.07808v1 [cs.CL])
    We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M-170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system. Though we train using 70% spoken-form data, our teacher models perform comparably to XLM-R and mT5 when evaluated on the written-form Cross-lingual Natural Language Inference (XNLI) corpus. We perform a second stage of pretraining on our teacher models using in-domain data from our system, improving error rates by 3.86% relative for intent classification and 7.01% relative for slot filling. We find that even a 170M-parameter model distilled from our Stage 2 teacher model has 2.88% better intent classification and 7.69% better slot filling error rates when compared to the 2.3B-parameter teacher trained only on public data (Stage 1), emphasizing the importance of in-domain data for pretraining. When evaluated offline using labeled NLU data, our 17M-parameter Stage 2 distilled model outperforms both XLM-R Base (85M params) and DistilBERT (42M params) by 4.23% to 6.14%, respectively. Finally, we present results from a full virtual assistant experimentation platform, where we find that models trained using our pretraining and distillation pipeline outperform models distilled from 85M-parameter teachers by 3.74%-4.91% on an automatic measurement of full-system user dissatisfaction.
    Is Continual Learning Truly Learning Representations Continually?. (arXiv:2206.08101v1 [cs.LG])
    Continual learning (CL) aims to learn from sequentially arriving tasks without forgetting previous tasks. Whereas CL algorithms have tried to achieve higher average test accuracy across all the tasks learned so far, learning continuously useful representations is critical for successful generalization and downstream transfer. To measure representational quality, we re-train only the output layers using a small balanced dataset for all the tasks, evaluating the average accuracy without any biased predictions toward the current task. We also test on several downstream tasks, measuring transfer learning accuracy of the learned representations. By testing our new formalism on ImageNet-100 and ImageNet-1000, we find that using more exemplar memory is the only option to make a meaningful difference in learned representations, and most of the regularization- or distillation-based CL algorithms that use the exemplar memory fail to learn continuously useful representations in class-incremental learning. Surprisingly, unsupervised (or self-supervised) CL with sufficient memory size can achieve comparable performance to the supervised counterparts. Considering non-trivial labeling costs, we claim that finding more efficient unsupervised CL algorithms that minimally use exemplar memory would be the next promising direction for CL research.
    Process, Bias and Temperature Scalable CMOS Analog Computing Circuits for Machine Learning. (arXiv:2205.05664v2 [cs.AR] UPDATED)
    Analog computing is attractive compared to digital computing due to its potential for achieving higher computational density and higher energy efficiency. However, unlike digital circuits, conventional analog computing circuits cannot be easily mapped across different process nodes due to differences in transistor biasing regimes, temperature variations and limited dynamic range. In this work, we generalize the previously reported margin-propagation-based analog computing framework for designing novel \textit{shape-based analog computing} (S-AC) circuits that can be easily cross-mapped across different process nodes. Similar to digital designs S-AC designs can also be scaled for precision, speed, and power. As a proof-of-concept, we show several examples of S-AC circuits implementing mathematical functions that are commonly used in machine learning (ML) architectures. Using circuit simulations we demonstrate that the circuit input/output characteristics remain robust when mapped from a planar CMOS 180nm process to a FinFET 7nm process. Also, using benchmark datasets we demonstrate that the classification accuracy of a S-AC based neural network remains robust when mapped across the two processes and to changes in temperature.
    Evaluating Short-Term Forecasting of Multiple Time Series in IoT Environments. (arXiv:2206.07784v1 [cs.LG])
    Modern Internet of Things (IoT) environments are monitored via a large number of IoT enabled sensing devices, with the data acquisition and processing infrastructure setting restrictions in terms of computational power and energy resources. To alleviate this issue, sensors are often configured to operate at relatively low sampling frequencies, yielding a reduced set of observations. Nevertheless, this can hamper dramatically subsequent decision-making, such as forecasting. To address this problem, in this work we evaluate short-term forecasting in highly underdetermined cases, i.e., the number of sensor streams is much higher than the number of observations. Several statistical, machine learning and neural network-based models are thoroughly examined with respect to the resulting forecasting accuracy on five different real-world datasets. The focus is given on a unified experimental protocol especially designed for short-term prediction of multiple time series at the IoT edge. The proposed framework can be considered as an important step towards establishing a solid forecasting strategy in resource constrained IoT applications.
    On Calibrated Model Uncertainty in Deep Learning. (arXiv:2206.07795v1 [cs.LG])
    Uncertainty estimated by approximate posteriors in Bayesian neural networks is prone to miscalibration, which leads to overconfident predictions in critical tasks that have a clear asymmetric cost or significant losses. Here, we extend the approximate inference for the loss-calibrated Bayesian framework to dropweights-based Bayesian neural networks by maximising expected utility over a model posterior to calibrate uncertainty in deep learning. Furthermore, we show that decisions informed by loss-calibrated uncertainty can improve diagnostic performance to a greater extent than straightforward alternatives. We propose Maximum Uncertainty Calibration Error (MUCE) as a metric to measure calibrated confidence, in addition to its prediction, especially for high-risk applications, where the goal is to minimise the worst-case deviation between error and estimated uncertainty. In experiments, we show the correlation between error in prediction and estimated uncertainty by interpreting Wasserstein distance as the accuracy of prediction. We evaluated the effectiveness of our approach in detecting Covid-19 from X-ray images. Experimental results show that our method reduces miscalibration considerably without impacting the model's accuracy, and improves the reliability of computer-based diagnostics.
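    The "worst-case deviation between error and estimated uncertainty" can be illustrated with a binned maximum gap, by analogy with the maximum calibration error used for confidences. This is only a sketch: the binning scheme, the function name `max_uncertainty_calibration_gap`, and the assumption that uncertainties lie in [0, 1] are illustrative choices, not the paper's definition of MUCE, which may differ.

```python
def max_uncertainty_calibration_gap(errors, uncertainties, n_bins=5):
    """Largest per-bin gap between mean error and mean estimated uncertainty.

    errors:        0/1 indicators (1 = wrong prediction), one per sample
    uncertainties: estimated uncertainties, assumed to lie in [0, 1]
    """
    bins = [[] for _ in range(n_bins)]
    for err, unc in zip(errors, uncertainties):
        # Equal-width bins over [0, 1]; an uncertainty of exactly 1.0 goes in the last bin.
        idx = min(int(unc * n_bins), n_bins - 1)
        bins[idx].append((err, unc))

    worst = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_err = sum(e for e, _ in members) / len(members)
        mean_unc = sum(u for _, u in members) / len(members)
        worst = max(worst, abs(mean_err - mean_unc))
    return worst


# Hypothetical toy data: the third sample is wrong but only moderately uncertain.
gap = max_uncertainty_calibration_gap([0, 0, 1, 1], [0.1, 0.1, 0.9, 0.5])
```

    A perfectly calibrated model would drive this gap towards zero in every bin; taking the maximum rather than the average penalises the single worst bin, which is the property that matters in high-risk settings.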
    The Scattering Transform Network with Generalized Morse Wavelets and Its Application to Music Genre Classification. (arXiv:2206.07857v1 [eess.AS])
    We propose to use the Generalized Morse Wavelets (GMWs) instead of commonly-used Morlet (or Gabor) wavelets in the Scattering Transform Network (STN), which we call the GMW-STN, for signal classification problems. The GMWs form a parameterized family of truly analytic wavelets while the Morlet wavelets are only approximately analytic. The analyticity of underlying wavelet filters in the STN is particularly important for nonstationary oscillatory signals such as music signals because it improves interpretability of the STN representations by providing multiscale amplitude and phase (and consequently frequency) information of input signals. We demonstrate the superiority of the GMW-STN over the conventional STN in music genre classification using the so-called GTZAN database. Moreover, we show the performance improvement of the GMW-STN by increasing its number of layers to three over the typical two-layer STN.
    Evaluating Self-Supervised Learning for Molecular Graph Embeddings. (arXiv:2206.08005v1 [cs.LG])
    Graph Self-Supervised Learning (GSSL) paves the way for learning graph embeddings without expert annotation, which is particularly impactful for molecular graphs since the number of possible molecules is enormous and labels are expensive to obtain. However, by design, GSSL methods are not trained to perform well on one downstream task but aim for transferability to many, making evaluating them less straightforward. As a step toward obtaining profiles of molecular graph embeddings with diverse and interpretable attributes, we introduce Molecular Graph Representation Evaluation (MolGraphEval), a suite of probe tasks, categorised into (i) topological-, (ii) substructure-, and (iii) embedding space properties. By benchmarking existing GSSL methods on both existing downstream datasets and MolGraphEval, we discover surprising discrepancies between conclusions drawn from existing datasets alone versus more fine-grained probing, suggesting that current evaluation protocols do not provide the whole picture. Our modular, automated end-to-end GSSL pipeline code will be released upon acceptance, including standardised graph loading, experiment management, and embedding evaluation.
    Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them. (arXiv:2107.11630v2 [cs.LG] UPDATED)
    Making classifiers robust to adversarial examples is hard. Thus, many defenses tackle the seemingly easier task of detecting perturbed inputs. We show a barrier towards this goal. We prove a general hardness reduction between detection and classification of adversarial examples: given a robust detector for attacks at distance $\epsilon$ (in some metric), we can build a similarly robust (but inefficient) classifier for attacks at distance $\epsilon/2$. Our reduction is computationally inefficient, and thus cannot be used to build practical classifiers. Instead, it is a useful sanity check to test whether empirical detection results imply something much stronger than the authors presumably anticipated. To illustrate, we revisit 13 detector defenses. For 11/13 cases, we show that the claimed detection results would imply an inefficient classifier with robustness far beyond the state-of-the-art.
    Hybrid full-field thermal characterization of additive manufacturing processes using physics-informed neural networks with data. (arXiv:2206.07756v1 [cs.LG])
    Understanding the thermal behavior of additive manufacturing (AM) processes is crucial for enhancing quality control and enabling customized process design. Most purely physics-based computational models suffer from high computational costs and are thus unsuitable for online control and iterative design applications. Data-driven models taking advantage of the latest computational tools can serve as a more efficient surrogate, but they are usually trained over a large amount of simulation data and often fail to effectively use small but high-quality experimental data. In this work, we developed a hybrid physics-based data-driven thermal modeling approach for AM processes using physics-informed neural networks. Specifically, partially observed temperature data measured from an infrared camera is combined with the physics laws to predict full-field temperature history and to discover unknown material and process parameters. In the numerical and experimental examples, the effectiveness of adding auxiliary training data and using the technique of transfer learning on training efficiency and prediction accuracy, as well as the ability to identify unknown parameters with partially observed data, are demonstrated. The results show that the hybrid thermal model can effectively identify unknown parameters and capture the full-field temperature accurately, and thus it has the potential to be used in iterative process design and real-time process control of AM.
    Architectural Backdoors in Neural Networks. (arXiv:2206.07840v1 [cs.LG])
    Machine learning is vulnerable to adversarial manipulation. Previous literature has demonstrated that at the training stage attackers can manipulate data and data sampling procedures to control model behaviour. A common attack goal is to plant backdoors i.e. force the victim model to learn to recognise a trigger known only by the adversary. In this paper, we introduce a new class of backdoor attacks that hide inside model architectures i.e. in the inductive bias of the functions used to train. These backdoors are simple to implement, for instance by publishing open-source code for a backdoored model architecture that others will reuse unknowingly. We demonstrate that model architectural backdoors represent a real threat and, unlike other approaches, can survive a complete re-training from scratch. We formalise the main construction principles behind architectural backdoors, such as a link between the input and the output, and describe some possible protections against them. We evaluate our attacks on computer vision benchmarks of different scales and demonstrate the underlying vulnerability is pervasive in a variety of training settings.
    Equivariant Diffusion for Molecule Generation in 3D. (arXiv:2203.17003v2 [cs.LG] UPDATED)
    This work introduces a diffusion model for molecule generation in 3D that is equivariant to Euclidean transformations. Our E(3) Equivariant Diffusion Model (EDM) learns to denoise a diffusion process with an equivariant network that jointly operates on both continuous (atom coordinates) and categorical features (atom types). In addition, we provide a probabilistic analysis which admits likelihood computation of molecules using our model. Experimentally, the proposed method significantly outperforms previous 3D molecular generative methods regarding the quality of generated samples and efficiency at training time.
    Federated Data Analytics: A Study on Linear Models. (arXiv:2206.07786v1 [stat.AP])
    As edge devices become increasingly powerful, data analytics are gradually moving from a centralized to a decentralized regime where edge compute resources are exploited to process more of the data locally. This regime of analytics is coined as federated data analytics (FDA). In spite of the recent success stories of FDA, most literature focuses exclusively on deep neural networks. In this work, we take a step back to develop an FDA treatment for one of the most fundamental statistical models: linear regression. Our treatment is built upon hierarchical modeling that allows borrowing strength across multiple groups. To this end, we propose two federated hierarchical model structures that provide a shared representation across devices to facilitate information sharing. Notably, our proposed frameworks are capable of providing uncertainty quantification, variable selection, hypothesis testing and fast adaptation to new unseen data. We validate our methods on a range of real-life applications including condition monitoring for aircraft engines. The results show that our FDA treatment for linear models can serve as a competing benchmark model for future development of federated algorithms.
    Modeling the Data-Generating Process is Necessary for Out-of-Distribution Generalization. (arXiv:2206.07837v1 [cs.LG])
    Real-world data collected from multiple domains can have multiple, distinct distribution shifts over multiple attributes. However, state-of-the-art advances in domain generalization (DG) algorithms focus only on specific shifts over a single attribute. We introduce datasets with multi-attribute distribution shifts and find that existing DG algorithms fail to generalize. To explain this, we use causal graphs to characterize the different types of shifts based on the relationship between spurious attributes and the classification label. Each multi-attribute causal graph entails different constraints over observed variables, and therefore any algorithm based on a single, fixed independence constraint cannot work well across all shifts. We present Causally Adaptive Constraint Minimization (CACM), a new algorithm for identifying the correct independence constraints for regularization. Results on fully synthetic, MNIST and small NORB datasets, covering binary and multi-valued attributes and labels, confirm our theoretical claim: correct independence constraints lead to the highest accuracy on unseen domains whereas incorrect constraints fail to do so. Our results demonstrate the importance of modeling the causal relationships inherent in the data-generating process: in many cases, it is impossible to know the correct regularization constraints without this information.
    Differentially Private Multi-Party Data Release for Linear Regression. (arXiv:2206.07998v1 [cs.CR])
    Differentially Private (DP) data release is a promising technique to disseminate data without compromising the privacy of data subjects. However, the majority of prior work has focused on scenarios where a single party owns all the data. In this paper we focus on the multi-party setting, where different stakeholders own disjoint sets of attributes belonging to the same group of data subjects. In the context of linear regression, where the goal is to allow all parties to train models on the complete data without the ability to infer private attributes or identities of individuals, we start by directly applying the Gaussian mechanism and show that it suffers from the small eigenvalue problem. We further propose our novel method and prove that it asymptotically converges to the optimal (non-private) solutions with increasing dataset size. We substantiate the theoretical results through experiments on both artificial and real-world datasets.
    TransDrift: Modeling Word-Embedding Drift using Transformer. (arXiv:2206.08081v1 [cs.CL])
    In modern NLP applications, word embeddings are a crucial backbone that can be readily shared across a number of tasks. However as the text distributions change and word semantics evolve over time, the downstream applications using the embeddings can suffer if the word representations do not conform to the data drift. Thus, maintaining word embeddings to be consistent with the underlying data distribution is a key problem. In this work, we tackle this problem and propose TransDrift, a transformer-based prediction model for word embeddings. Leveraging the flexibility of transformer, our model accurately learns the dynamics of the embedding drift and predicts the future embedding. In experiments, we compare with existing methods and show that our model makes significantly more accurate predictions of the word embedding than the baselines. Crucially, by applying the predicted embeddings as a backbone for downstream classification tasks, we show that our embeddings lead to superior performance compared to the previous methods.
    Towards Understanding How Machines Can Learn Causal Overhypotheses. (arXiv:2206.08353v1 [cs.LG])
    Recent work in machine learning and cognitive science has suggested that understanding causal information is essential to the development of intelligence. The extensive literature in cognitive science using the ``blicket detector'' environment shows that children are adept at many kinds of causal inference and learning. We propose to adapt that environment for machine learning agents. One of the key challenges for current machine learning algorithms is modeling and understanding causal overhypotheses: transferable abstract hypotheses about sets of causal relationships. In contrast, even young children spontaneously learn and use causal overhypotheses. In this work, we present a new benchmark -- a flexible environment which allows for the evaluation of existing techniques under variable causal overhypotheses -- and demonstrate that many existing state-of-the-art methods have trouble generalizing in this environment. The code and resources for this benchmark are available at https://github.com/CannyLab/casual_overhypotheses.
    Meta-Learning Dynamics Forecasting Using Task Inference. (arXiv:2102.10271v4 [cs.LG] UPDATED)
    Current deep learning models for dynamics forecasting struggle with generalization. They can only forecast in a specific domain and fail when applied to systems with different parameters, external forces, or boundary conditions. We propose a model-based meta-learning method called DyAd which can generalize across heterogeneous domains by partitioning them into different tasks. DyAd has two parts: an encoder which infers the time-invariant hidden features of the task with weak supervision, and a forecaster which learns the shared dynamics of the entire domain. The encoder adapts and controls the forecaster during inference using adaptive instance normalization and adaptive padding. Theoretically, we prove that the generalization error of such procedure is related to the task relatedness in the source domain, as well as the domain differences between source and target. Experimentally, we demonstrate that our model outperforms state-of-the-art approaches on both turbulent flow and real-world ocean data forecasting tasks.
    Approximate Frank-Wolfe Algorithms over Graph-structured Support Sets. (arXiv:2107.00472v2 [math.OC] UPDATED)
    In this paper, we propose approximate Frank-Wolfe (FW) algorithms to solve convex optimization problems over graph-structured support sets where the \textit{linear minimization oracle} (LMO) cannot be efficiently obtained in general. We first demonstrate that two popular approximation assumptions (\textit{additive} and \textit{multiplicative gap errors}) are not valid for our problem, in that no cheap gap-approximate LMO oracle exists in general. Instead, a new \textit{approximate dual maximization oracle} (DMO) is proposed, which approximates the inner product rather than the gap. When the objective is $L$-smooth, we prove that the standard FW method using a $\delta$-approximate DMO converges as $\mathcal{O}(L / \delta t + (1-\delta)(\delta^{-1} + \delta^{-2}))$ in general, and as $\mathcal{O}(L/(\delta^2(t+2)))$ over a $\delta$-relaxation of the constraint set. Additionally, when the objective is $\mu$-strongly convex and the solution is unique, a variant of FW converges to $\mathcal{O}(L^2\log(t)/(\mu \delta^6 t^2))$ with the same per-iteration complexity. Our empirical results suggest that even these improved bounds are pessimistic, with significant improvement in recovering real-world images with graph-structured sparsity.
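    For readers unfamiliar with the baseline, the standard FW iteration with an exact LMO can be sketched in Python as follows. This is a generic textbook version over the probability simplex, where the LMO is trivial; it does not implement the approximate DMO or the graph-structured support sets studied in the paper. The function names, the quadratic objective, and the step size $2/(t+2)$ are illustrative assumptions.

```python
def frank_wolfe_simplex(grad, dim, steps=200):
    """Standard Frank-Wolfe over the probability simplex with an exact LMO."""
    x = [1.0 / dim] * dim  # feasible start: the uniform distribution
    for t in range(steps):
        g = grad(x)
        # Exact LMO: a linear function over the simplex is minimised at the
        # vertex e_i with the smallest gradient coordinate.
        i = min(range(dim), key=lambda k: g[k])
        gamma = 2.0 / (t + 2)  # classical open-loop step size
        # Convex combination x <- (1 - gamma) * x + gamma * e_i keeps x feasible.
        x = [(1.0 - gamma) * xk for xk in x]
        x[i] += gamma
    return x


def make_quadratic_grad(b):
    """Gradient of f(x) = 0.5 * ||x - b||^2, a 1-smooth toy objective."""
    return lambda x: [xk - bk for xk, bk in zip(x, b)]


# Toy problem: the target already lies in the simplex, so it is the minimiser.
target = [0.2, 0.5, 0.3]
solution = frank_wolfe_simplex(make_quadratic_grad(target), dim=3, steps=2000)
```

    Each iterate is a convex combination of simplex vertices, so no projection is ever needed; this is the property that makes FW attractive when an LMO is cheap, and the reason an approximate oracle is needed when it is not.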
    Closed-Form Diffeomorphic Transformations for Time Series Alignment. (arXiv:2206.08107v1 [cs.LG])
    Time series alignment methods call for highly expressive, differentiable and invertible warping functions which preserve temporal topology, i.e., diffeomorphisms. Diffeomorphic warping functions can be generated from the integration of velocity fields governed by an ordinary differential equation (ODE). Gradient-based optimization frameworks containing diffeomorphic transformations require computing derivatives of the differential equation's solution with respect to the model parameters, i.e., sensitivity analysis. Unfortunately, deep learning frameworks typically lack automatic-differentiation-compatible sensitivity analysis methods; and implicit functions, such as the solution of an ODE, require particular care. Current solutions appeal to adjoint sensitivity methods, ad-hoc numerical solvers or ResNet's Eulerian discretization. In this work, we present a closed-form expression for the ODE solution and its gradient under continuous piecewise-affine (CPA) velocity functions. We present a highly optimized implementation of the results on CPU and GPU. Furthermore, we conduct extensive experiments on several datasets to validate the generalization ability of our model to unseen data for time-series joint alignment. Results show significant improvements both in terms of efficiency and accuracy.
    Deep Learning for Time Series Forecasting: Tutorial and Literature Survey. (arXiv:2004.10240v2 [cs.LG] UPDATED)
    Deep learning based forecasting methods have become the methods of choice in many applications of time series prediction or forecasting, often outperforming other approaches. Consequently, over the last years, these methods have become ubiquitous in large-scale industrial forecasting applications and have consistently ranked among the best entries in forecasting competitions (e.g., M4 and M5). This practical success has further increased the academic interest to understand and improve deep forecasting methods. In this article we provide an introduction and overview of the field: We present important building blocks for deep forecasting in some depth; using these building blocks, we then survey the breadth of the recent deep forecasting literature.
    On Privacy and Personalization in Cross-Silo Federated Learning. (arXiv:2206.07902v1 [cs.LG])
    While the application of differential privacy (DP) has been well-studied in cross-device federated learning (FL), there is a lack of work considering DP for cross-silo FL, a setting characterized by a limited number of clients each containing many data subjects. In cross-silo FL, usual notions of client-level privacy are less suitable as real-world privacy regulations typically concern in-silo data subjects rather than the silos themselves. In this work, we instead consider the more realistic notion of silo-specific item-level privacy, where silos set their own privacy targets for their local examples. Under this setting, we reconsider the roles of personalization in federated learning. In particular, we show that mean-regularized multi-task learning (MR-MTL), a simple personalization framework, is a strong baseline for cross-silo FL: under stronger privacy, silos are further incentivized to "federate" with each other to mitigate DP noise, resulting in consistent improvements relative to standard baseline methods. We provide a thorough empirical study of competing methods as well as a theoretical characterization of MR-MTL for a mean estimation problem, highlighting the interplay between privacy and cross-silo data heterogeneity. Our work serves to establish baselines for private cross-silo FL as well as identify key directions of future work in this area.
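    The interpolation that MR-MTL performs can be made concrete for mean estimation. The sketch below assumes the commonly used mean-regularized objective, in which each silo $k$ minimises its local loss plus $\frac{\lambda}{2}\|w_k - \bar{w}\|^2$, with a squared local loss around the silo's sample mean; under that assumption the personalised estimate has a closed form. No DP noise is added, and the function name and toy data are hypothetical, so this illustrates only the personalization side of the trade-off.

```python
def mr_mtl_mean_estimates(silo_data, lam):
    """Closed-form mean-regularized estimates for one-dimensional mean estimation.

    With local loss 0.5 * (w_k - m_k)^2 and penalty (lam / 2) * (w_k - w_bar)^2,
    the stationary point is w_k = (m_k + lam * w_bar) / (1 + lam), and averaging
    over silos shows that w_bar equals the average of the local means m_k.
    """
    local_means = [sum(xs) / len(xs) for xs in silo_data]
    global_mean = sum(local_means) / len(local_means)
    return [(m + lam * global_mean) / (1.0 + lam) for m in local_means]


# Two hypothetical silos with heterogeneous data.
silos = [[1.0, 3.0], [10.0, 14.0]]
purely_local = mr_mtl_mean_estimates(silos, lam=0.0)    # each silo keeps its own mean
balanced = mr_mtl_mean_estimates(silos, lam=1.0)        # halfway to the shared mean
nearly_global = mr_mtl_mean_estimates(silos, lam=1e6)   # effectively one shared model
```

    Setting `lam` to zero recovers purely local training, while large values approach a single shared model; the abstract's observation is that stronger privacy noise pushes the best choice towards larger values.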
    BlindFL: Vertical Federated Machine Learning without Peeking into Your Data. (arXiv:2206.07975v1 [cs.LG])
    Due to the rising concerns about privacy protection, how to build machine learning (ML) models over different data sources with security guarantees is attracting growing attention. Vertical federated learning (VFL) describes such a case where ML models are built upon the private data of different participating parties that own disjoint features for the same set of instances, which fits many real-world collaborative tasks. Nevertheless, we find that existing solutions for VFL either support limited kinds of input features or suffer from potential data leakage during the federated execution. To this end, this paper aims to investigate both the functionality and security of ML models in the VFL scenario. To be specific, we introduce BlindFL, a novel framework for VFL training and inference. First, to address the functionality of VFL models, we propose the federated source layers to unite the data from different parties. Various kinds of features can be supported efficiently by the federated source layers, including dense, sparse, numerical, and categorical features. Second, we carefully analyze the security during the federated execution and formalize the privacy requirements. Based on the analysis, we devise secure and accurate algorithm protocols, and further prove the security guarantees under the ideal-real simulation paradigm. Extensive experiments show that BlindFL supports diverse datasets and models efficiently while achieving robust privacy guarantees.
    Balancing Discriminability and Transferability for Source-Free Domain Adaptation. (arXiv:2206.08009v1 [cs.CV])
    Conventional domain adaptation (DA) techniques aim to improve domain transferability by learning domain-invariant representations; while concurrently preserving the task-discriminability knowledge gathered from the labeled source data. However, the requirement of simultaneous access to labeled source and unlabeled target renders them unsuitable for the challenging source-free DA setting. The trivial solution of realizing an effective original to generic domain mapping improves transferability but degrades task discriminability. Upon analyzing the hurdles from both theoretical and empirical standpoints, we derive novel insights to show that a mixup between original and corresponding translated generic samples enhances the discriminability-transferability trade-off while duly respecting the privacy-oriented source-free setting. A simple but effective realization of the proposed insights on top of the existing source-free DA approaches yields state-of-the-art performance with faster convergence. Beyond single-source, we also outperform multi-source prior-arts across both classification and semantic segmentation benchmarks.
    Forming Effective Human-AI Teams: Building Machine Learning Models that Complement the Capabilities of Multiple Experts. (arXiv:2206.07948v1 [cs.AI])
    Machine learning (ML) models are increasingly being used in application domains that often involve working together with human experts. In this context, it can be advantageous to defer certain instances to a single human expert when they are difficult to predict for the ML model. While previous work has focused on scenarios with one distinct human expert, in many real-world situations several human experts with varying capabilities may be available. In this work, we propose an approach that trains a classification model to complement the capabilities of multiple human experts. By jointly training the classifier together with an allocation system, the classifier learns to accurately predict those instances that are difficult for the human experts, while the allocation system learns to pass each instance to the most suitable team member -- either the classifier or one of the human experts. We evaluate our proposed approach in multiple experiments on public datasets with "synthetic" experts and a real-world medical dataset annotated by multiple radiologists. Our approach outperforms prior work and is more accurate than the best human expert or a classifier. Furthermore, it is flexibly adaptable to teams of varying sizes and different levels of expert diversity.
    Generalization Bounds via Convex Analysis. (arXiv:2202.04985v2 [stat.ML] UPDATED)
    Since the celebrated works of Russo and Zou (2016,2019) and Xu and Raginsky (2017), it has been well known that the generalization error of supervised learning algorithms can be bounded in terms of the mutual information between their input and the output, given that the loss of any fixed hypothesis has a subgaussian tail. In this work, we generalize this result beyond the standard choice of Shannon's mutual information to measure the dependence between the input and the output. Our main result shows that it is indeed possible to replace the mutual information by any strongly convex function of the joint input-output distribution, with the subgaussianity condition on the losses replaced by a bound on an appropriately chosen norm capturing the geometry of the dependence measure. This allows us to derive a range of generalization bounds that are either entirely new or strengthen previously known ones. Examples include bounds stated in terms of $p$-norm divergences and the Wasserstein-2 distance, which are respectively applicable for heavy-tailed loss distributions and highly smooth loss functions. Our analysis is entirely based on elementary tools from convex analysis by tracking the growth of a potential function associated with the dependence measure and the loss function.
    Hardness prediction of age-hardening aluminum alloy based on ensemble learning. (arXiv:2206.08011v1 [cond-mat.mtrl-sci])
    With the rapid development of artificial intelligence, the combination of material databases and machine learning has driven the progress of material informatics. Because aluminum alloys are widely used in many fields, predicting their properties is important. In this thesis, data on Al-Cu-Mg-X (X: Zn, Zr, etc.) alloys are used to predict hardness from the composition and aging conditions (time and temperature). Two approaches are proposed: an ensemble learning solution based on automatic machine learning, and an attention mechanism introduced into the deep-neural-network secondary learner. The experimental results show that selecting the correct secondary learner can further improve the prediction accuracy of the model. This manuscript introduces the attention mechanism to improve the secondary learner based on a deep neural network, and obtains a fusion model with better performance. The R-Square of the best model is 0.9697 and the MAE is 3.4518 HV.
    U-PET: MRI-based Dementia Detection with Joint Generation of Synthetic FDG-PET Images. (arXiv:2206.08078v1 [eess.IV])
    Alzheimer's disease (AD) is the most common cause of dementia. Early detection is crucial for slowing down the disease and mitigating risks related to the progression. While the combination of MRI and FDG-PET is the best image-based tool for diagnosis, FDG-PET is not always available. The reliable detection of Alzheimer's disease with only MRI could be beneficial, especially in regions where FDG-PET might not be affordable for all patients. To this end, we propose a multi-task method based on U-Net that takes T1-weighted MR images as an input to generate synthetic FDG-PET images and classifies the dementia progression of the patient into cognitively normal (CN), mild cognitive impairment (MCI), and AD. The attention gates used in both task heads can visualize the most relevant parts of the brain, guiding the examiner and adding interpretability. Results show the successful generation of synthetic FDG-PET images and a performance increase in disease classification over the naive single-task baseline.
    HyperImpute: Generalized Iterative Imputation with Automatic Model Selection. (arXiv:2206.07769v1 [stat.ML])
    Consider the problem of imputing missing values in a dataset. On the one hand, conventional approaches using iterative imputation benefit from the simplicity and customizability of learning conditional distributions directly, but suffer from the practical requirement for appropriate model specification of each and every variable. On the other hand, recent methods using deep generative modeling benefit from the capacity and efficiency of learning with neural network function approximators, but are often difficult to optimize and rely on stronger data assumptions. In this work, we study an approach that marries the advantages of both: We propose *HyperImpute*, a generalized iterative imputation framework for adaptively and automatically configuring column-wise models and their hyperparameters. Practically, we provide a concrete implementation with out-of-the-box learners, optimizers, simulators, and extensible interfaces. Empirically, we investigate this framework via comprehensive experiments and sensitivities on a variety of public datasets, and demonstrate its ability to generate accurate imputations relative to a strong suite of benchmarks. Contrary to recent work, we believe our findings constitute a strong defense of the iterative imputation paradigm.
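    The conventional iterative imputation loop that the abstract builds on can be sketched in a few lines of Python. The version below is a toy under stated assumptions: it handles only two columns and always uses a fixed least-squares line as the column-wise learner, whereas HyperImpute selects and configures the learner for each column automatically. The function names and the example table are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, mean_y - slope * mean_x


def iterative_impute(rows, n_iters=10):
    """Round-robin imputation for a two-column table; None marks a missing cell."""
    missing = [[value is None for value in row] for row in rows]
    filled = [list(row) for row in rows]
    # Step 1: initialise every missing cell with its column's observed mean.
    for col in (0, 1):
        observed = [row[col] for row in rows if row[col] is not None]
        col_mean = sum(observed) / len(observed)
        for row in filled:
            if row[col] is None:
                row[col] = col_mean
    # Step 2: repeatedly re-predict each column's missing cells from the other column,
    # fitting only on rows where the target column was actually observed.
    for _ in range(n_iters):
        for col in (0, 1):
            other = 1 - col
            train = [i for i in range(len(rows)) if not missing[i][col]]
            slope, intercept = fit_line(
                [filled[i][other] for i in train],
                [filled[i][col] for i in train],
            )
            for i in range(len(rows)):
                if missing[i][col]:
                    filled[i][col] = slope * filled[i][other] + intercept
    return filled


# Toy table whose observed cells follow second = 2 * first + 1.
table = [[1.0, 3.0], [2.0, 5.0], [3.0, None], [None, 9.0], [6.0, 13.0]]
completed = iterative_impute(table)
```

    On this toy table the two imputed cells should settle close to the values implied by the linear relation. The hard-coded call to `fit_line` is exactly the per-column model specification that the abstract identifies as the practical burden of conventional iterative imputation.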
    Let Invariant Rationale Discovery Inspire Graph Contrastive Learning. (arXiv:2206.07869v1 [cs.LG])
    Leading graph contrastive learning (GCL) methods perform graph augmentations in two fashions: (1) randomly corrupting the anchor graph, which could cause the loss of semantic information, or (2) using domain knowledge to maintain salient features, which undermines the generalization to other domains. Viewing GCL through the lens of invariance, we argue that a high-performing augmentation should preserve the salient semantics of anchor graphs regarding instance-discrimination. To this end, we relate GCL with invariant rationale discovery, and propose a new framework, Rationale-aware Graph Contrastive Learning (RGCL). Specifically, without supervision signals, RGCL uses a rationale generator to reveal salient features about graph instance-discrimination as the rationale, and then creates rationale-aware views for contrastive learning. This rationale-aware pre-training scheme endows the backbone model with powerful representation ability, further facilitating the fine-tuning on downstream tasks. On MNIST-Superpixel and MUTAG datasets, visual inspections on the discovered rationales showcase that the rationale generator successfully captures the salient features (i.e., distinguishing semantic nodes in graphs). On biochemical molecule and social network benchmark datasets, the state-of-the-art performance of RGCL demonstrates the effectiveness of rationale-aware views for contrastive learning. Our code is available at https://github.com/lsh0520/RGCL.
    AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation. (arXiv:2206.08023v1 [eess.IV])
    Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most of the deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods. To mitigate the limitations, we present AMOS, a large-scale, diverse, clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and a test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of the existing methods on this new challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. Information can be found at https://amos22.grand-challenge.org.
    CARLANE: A Lane Detection Benchmark for Unsupervised Domain Adaptation from Simulation to multiple Real-World Domains. (arXiv:2206.08083v1 [cs.CV])
    Unsupervised Domain Adaptation demonstrates great potential to mitigate domain shifts by transferring models from labeled source domains to unlabeled target domains. While Unsupervised Domain Adaptation has been applied to a wide variety of complex vision tasks, only a few works focus on lane detection for autonomous driving. This can be attributed to the lack of publicly available datasets. To facilitate research in these directions, we propose CARLANE, a 3-way sim-to-real domain adaptation benchmark for 2D lane detection. CARLANE encompasses the single-target datasets MoLane and TuLane and the multi-target dataset MuLane. These datasets are built from three different domains, which cover diverse scenes and contain a total of 163K unique images, 118K of which are annotated. In addition, we evaluate and report systematic baselines, including our own method, which builds upon Prototypical Cross-domain Self-supervised Learning. We find that false positive and false negative rates of the evaluated domain adaptation methods are high compared to those of fully supervised baselines. This affirms the need for benchmarks such as CARLANE to further strengthen research in Unsupervised Domain Adaptation for lane detection. CARLANE, all evaluated models and the corresponding implementations are publicly available at https://carlanebenchmark.github.io.
    Feature Selection using e-values. (arXiv:2206.05391v2 [stat.ML] UPDATED)
    In the context of supervised parametric models, we introduce the concept of e-values. An e-value is a scalar quantity that represents the proximity of the sampling distribution of parameter estimates in a model trained on a subset of features to that of the model trained on all features (i.e., the full model). Under general conditions, a rank ordering of e-values separates models that contain all essential features from those that do not. The e-values are applicable to a wide range of parametric models. We use data depths and a fast resampling-based algorithm to implement a feature selection procedure using e-values, providing consistency results. For a $p$-dimensional feature space, this procedure requires fitting only the full model and evaluating $p+1$ models, as opposed to the traditional requirement of fitting and evaluating $2^p$ models. Through experiments across several model settings and synthetic and real datasets, we establish the e-values method as a promising general alternative to existing model-specific methods of feature selection.
    Personalized Federated Learning via Variational Bayesian Inference. (arXiv:2206.07977v1 [cs.LG])
    Federated learning faces huge challenges from model overfitting due to the lack of data and statistical diversity among clients. To address these challenges, this paper proposes a novel personalized federated learning method via Bayesian variational inference named pFedBayes. To alleviate the overfitting, weight uncertainty is introduced to neural networks for clients and the server. To achieve personalization, each client updates its local distribution parameters by balancing its construction error over private data and its KL divergence with the global distribution from the server. Theoretical analysis gives an upper bound of the averaged generalization error and illustrates that the convergence rate of the generalization error is minimax optimal up to a logarithmic factor. Experiments show that the proposed method outperforms other advanced personalized methods on personalized models, e.g., pFedBayes outperforms other SOTA algorithms by 1.25%, 0.42% and 11.71% on MNIST, FMNIST and CIFAR-10, respectively, under non-i.i.d. limited data.
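    The trade-off between data fit and KL divergence to the server distribution can be sketched as follows. This is an illustration under assumptions, not the pFedBayes code: it assumes factorized Gaussian distributions over the weights, and the function names and the weighting parameter `zeta` are hypothetical.

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for factorized Gaussians, given per-weight means and std devs."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        # Closed-form KL divergence between two univariate Gaussians.
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total


def client_objective(data_loss, kl_value, zeta):
    """Hypothetical client loss: data-fit term plus zeta times the KL term."""
    return data_loss + zeta * kl_value


# A local distribution that matches the global one incurs no KL penalty.
same = kl_diag_gaussians([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
# Shifting one mean by one standard deviation costs 0.5 nats.
shifted = kl_diag_gaussians([0.0], [1.0], [1.0], [1.0])
```

    A larger `zeta` pulls the client toward the global distribution, while a smaller one lets it fit its private data more closely.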
    On Error and Compression Rates for Prototype Rules. (arXiv:2206.08014v1 [cs.LG])
    We study the close interplay between error and compression in the non-parametric multiclass classification setting in terms of prototype learning rules. We focus in particular on a close variant of a recently proposed compression-based learning rule termed OptiNet. Beyond its computational merits, this rule has been recently shown to be universally consistent in any metric instance space that admits a universally consistent rule -- the first learning algorithm known to enjoy this property. However, its error and compression rates have been left open. Here we derive such rates in the case where instances reside in Euclidean space under commonly posed smoothness and tail conditions on the data distribution. We first show that OptiNet achieves non-trivial compression rates while enjoying near minimax-optimal error rates. We then proceed to study a novel general compression scheme for further compressing prototype rules that locally adapts to the noise level without sacrificing accuracy. Applying it to OptiNet, we show that under a geometric margin condition, further gain in the compression rate is achieved. Experimental results comparing the performance of the various methods are presented.
    Reinforcement Learning-enhanced Shared-account Cross-domain Sequential Recommendation. (arXiv:2206.08088v1 [cs.IR])
    Shared-account Cross-domain Sequential Recommendation (SCSR) is an emerging yet challenging task that simultaneously considers the shared-account and cross-domain characteristics in the sequential recommendation. Existing works on SCSR are mainly based on Recurrent Neural Network (RNN) and Graph Neural Network (GNN) but they ignore the fact that although multiple users share a single account, it is mainly occupied by one user at a time. This observation motivates us to learn a more accurate user-specific account representation by attentively focusing on its recent behaviors. Furthermore, though existing works endow lower weights to irrelevant interactions, they may still dilute the domain information and impede the cross-domain recommendation. To address the above issues, we propose a reinforcement learning-based solution, namely RL-ISN, which consists of a basic cross-domain recommender and a reinforcement learning-based domain filter. Specifically, to model the account representation in the shared-account scenario, the basic recommender first clusters users' mixed behaviors as latent users, and then leverages an attention model over them to conduct user identification. To reduce the impact of irrelevant domain information, we formulate the domain filter as a hierarchical reinforcement learning task, where a high-level task is utilized to decide whether to revise the whole transferred sequence or not, and if it does, a low-level task is further performed to determine whether to remove each interaction within it or not. To evaluate the performance of our solution, we conduct extensive experiments on two real-world datasets, and the experimental results demonstrate the superiority of our RL-ISN method compared with the state-of-the-art recommendation methods.
    Learning Multi-Task Gaussian Process Over Heterogeneous Input Domains. (arXiv:2202.12636v2 [stat.ML] UPDATED)
    Multi-task Gaussian process (MTGP) is a well-known non-parametric Bayesian model for learning correlated tasks effectively by transferring knowledge across tasks. But current MTGPs are usually limited to the multi-task scenario defined in the same input domain, leaving no room for tackling the heterogeneous case, i.e., the features of input domains vary over tasks. To this end, this paper presents a novel heterogeneous stochastic variational linear model of coregionalization (HSVLMC) for simultaneously learning the tasks with varied input domains. Particularly, we develop the stochastic variational framework with Bayesian calibration that (i) takes into account the effect of dimensionality reduction raised by domain mappings in order to achieve effective input alignment; and (ii) employs a residual modeling strategy to leverage the inductive bias brought by prior domain mappings for better model inference. Finally, the superiority of the proposed model against existing LMC models has been extensively verified on diverse heterogeneous multi-task cases and a practical multi-fidelity steam turbine exhaust problem.
    Faculty Distillation with Optimal Transport. (arXiv:2204.11526v2 [cs.LG] UPDATED)
    The outpouring of various pre-trained models empowers knowledge distillation (KD) by providing abundant teacher resources. Meanwhile, exploring the massive model repository to select a suitable teacher and further extracting its knowledge become daunting challenges. Standard KD fails to surmount two obstacles when training a student with the availability of plentiful pre-trained teachers, i.e., the "faculty". First, we need to seek out the most contributive teacher in the faculty efficiently rather than enumerating all of them for a student. Second, since the teacher may be pre-trained on different tasks w.r.t. the student, we must distill the knowledge from a more general label space. This paper studies this "faculty distillation" where a student performs teacher assessment and generalized knowledge reuse. We take advantage of optimal transport to construct a unifying objective for both problems, which bridges the semantic gap and measures the relatedness between a pair of models. This objective can select the most relevant teacher, and we minimize the same objective over student parameters to transfer the knowledge from the selected teacher subsequently. Experiments in various settings demonstrate the succinctness and versatility of our proposed method.
    Partial Identifiability for Nonnegative Matrix Factorization. (arXiv:2206.08022v1 [math.NA])
    Given a nonnegative matrix, $R$, and a factorization rank, $r$, Exact nonnegative matrix factorization (Exact NMF) decomposes $R$ as the product of two nonnegative matrices, $C$ and $S$ with $r$ columns, such that $R = CS^\top$. A central research topic in the literature is the conditions under which such a decomposition is unique/identifiable, up to trivial ambiguities. In this paper, we focus on partial identifiability, that is, the uniqueness of a subset of columns of $C$ and $S$. We start our investigations with the data-based uniqueness (DBU) theorem from the chemometrics literature. The DBU theorem analyzes all feasible solutions of Exact NMF, and relies on sparsity conditions on $C$ and $S$. First, we provide a mathematically rigorous theorem of a recently published restricted version of the DBU theorem, relying only on simple sparsity and algebraic conditions: it applies to a particular solution of Exact NMF (as opposed to all feasible solutions) and allows us to guarantee the partial uniqueness of a single column of $C$ or $S$. Second, based on a geometric interpretation of the restricted DBU theorem, we obtain a new partial identifiability result. We prove it is stronger than the restricted DBU theorem, given that a proper preprocessing on the Exact NMF is used. This geometric interpretation also leads us to another partial identifiability result in the case $r=3$. Third, we show how partial identifiability results can be used sequentially to guarantee the identifiability of more columns of $C$ and $S$. We illustrate these results on several examples, including one from the chemometrics literature.
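    The "trivial ambiguities" mentioned above are column permutation and positive rescaling: they change $C$ and $S$ without changing the product $CS^\top$. The following small sketch checks this numerically; the matrices are made-up toy values and `matmul_t` is a helper name introduced here, not something from the paper.

```python
def matmul_t(C, S):
    """Return C @ S^T for matrices given as lists of rows."""
    return [[sum(c * s for c, s in zip(crow, srow)) for srow in S] for crow in C]


# Toy nonnegative factors with r = 2 columns each (hypothetical values).
C = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
S = [[1.0, 0.0], [0.0, 1.0], [2.0, 3.0]]
R = matmul_t(C, S)

# Scale column 0 of C by 2 and column 0 of S by 1/2, then swap the two columns.
C2 = [[row[1], 2.0 * row[0]] for row in C]
S2 = [[row[1], 0.5 * row[0]] for row in S]
R2 = matmul_t(C2, S2)

# Both factor pairs are nonnegative and reproduce the same matrix R.
same_product = R == R2
```

    Identifiability results therefore ask for uniqueness only up to these transformations; partial identifiability asks the same question for a subset of the columns.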
    Approximately Equivariant Networks for Imperfectly Symmetric Dynamics. (arXiv:2201.11969v4 [cs.LG] UPDATED)
    Incorporating symmetry as an inductive bias into neural network architecture has led to improvements in generalization, data efficiency, and physical consistency in dynamics modeling. Methods such as CNNs or equivariant neural networks use weight tying to enforce symmetries such as shift invariance or rotational equivariance. However, despite the fact that physical laws obey many symmetries, real-world dynamical data rarely conforms to strict mathematical symmetry either due to noisy or incomplete data or to symmetry breaking features in the underlying dynamical system. We explore approximately equivariant networks which are biased towards preserving symmetry but are not strictly constrained to do so. By relaxing equivariance constraints, we find that our models can outperform both baselines with no symmetry bias and baselines with overly strict symmetry in both simulated turbulence domains and real-world multi-stream jet flow.
    Performance analysis of coreset selection for quantum implementation of K-Means clustering algorithm. (arXiv:2206.07852v1 [quant-ph])
    Quantum computing is anticipated to offer immense computational capabilities which could provide efficient solutions to many data science problems. However, the current generation of quantum devices is small and noisy, which makes it difficult to process large data sets relevant for practical problems. Coreset selection aims to circumvent this problem by reducing the size of input data without compromising the accuracy. Recent work has shown that coreset selection can help to implement quantum K-Means clustering. However, the impact of coreset selection on the performance of quantum K-Means clustering has not been explored. In this work, we compare the relative performance of two coreset techniques (BFL16 and ONESHOT), and the size of coreset construction in each case, with respect to a variety of data sets and lay out the advantages and limitations of coreset selection in implementing quantum algorithms. We also investigate the effect of depolarisation quantum noise and bit-flip error, and implement the Quantum AutoEncoder technique for surpassing the noise effect. Our work provides useful insights for future implementation of data science algorithms on near-term quantum devices where problem size has been reduced by coreset selection.
    Domain Generalization via Selective Consistency Regularization for Time Series Classification. (arXiv:2206.07876v1 [cs.LG])
    Domain generalization methods aim to learn models robust to domain shift with data from a limited number of source domains and without access to target domain samples during training. Popular domain alignment methods for domain generalization seek to extract domain-invariant features by minimizing the discrepancy between feature distributions across all domains, disregarding inter-domain relationships. In this paper, we instead propose a novel representation learning methodology that selectively enforces prediction consistency between source domains estimated to be closely-related. Specifically, we hypothesize that domains share different class-informative representations, so instead of aligning all domains which can cause negative transfer, we only regularize the discrepancy between closely-related domains. We apply our method to time-series classification tasks and conduct comprehensive experiments on three public real-world datasets. Our method significantly improves over the baseline and achieves better or competitive performance in comparison with state-of-the-art methods in terms of both accuracy and model calibration.
    Distributed Online Learning Algorithm With Differential Privacy Strategy for Convex Nondecomposable Global Objectives. (arXiv:2206.07944v1 [math.OC])
    In this paper, we deal with a general distributed constrained online learning problem with privacy over time-varying networks, where a class of nondecomposable objective functions is considered. Under this setting, each node only controls a part of the global decision variable, and the goal of all nodes is to collaboratively minimize the global objective over a time horizon $T$ while guaranteeing the security of the transmitted information. For such problems, we first design a novel generic algorithm framework, named DPSDA, of differentially private distributed online learning using the Laplace mechanism and the stochastic variants of the dual averaging method. Then, we propose two algorithms, named DPSDA-C and DPSDA-PS, under this framework. Theoretical results show that both algorithms attain an expected regret upper bound in $\mathcal{O}( \sqrt{T} )$ when the objective function is convex, which matches the best utility achievable by cutting-edge algorithms. Finally, numerical experiment results on both real-world and randomly generated datasets verify the effectiveness of our algorithms.
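    The Laplace mechanism named above adds Laplace noise, with scale equal to the L1 sensitivity divided by the privacy parameter, to a value before it is shared. The sketch below shows only that generic building block, not the DPSDA algorithms; the function names, the sensitivity value and the message contents are assumptions for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(vector, sensitivity, epsilon, rng):
    """Perturb each coordinate with Laplace noise of scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in vector]


rng = random.Random(0)  # fixed seed so that the example is reproducible
# Hypothetical message a node would transmit to its neighbours.
message = [0.5, -1.0, 2.0]
noisy_message = privatize(message, sensitivity=1.0, epsilon=0.5, rng=rng)
```

    A smaller `epsilon` means stronger privacy and therefore a larger noise scale, which is the source of the privacy-utility tension that regret bounds for private algorithms have to account for.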
    A Machine Learning-based Digital Twin for Electric Vehicle Battery Modeling. (arXiv:2206.08080v1 [cs.LG])
    The widespread adoption of Electric Vehicles (EVs) is limited by their reliance on batteries that presently have low energy and power densities compared to liquid fuels and are subject to aging and performance deterioration over time. For this reason, monitoring the battery State Of Charge (SOC) and State Of Health (SOH) during the EV lifetime is a very relevant problem. This work proposes a battery digital twin structure designed to accurately reflect battery dynamics at run time. To ensure a high degree of correctness concerning non-linear phenomena, the digital twin relies on data-driven models trained on traces of battery evolution over time: a SOH model, repeatedly executed to estimate the degradation of maximum battery capacity, and a SOC model, retrained periodically to reflect the impact of aging. The proposed digital twin structure will be exemplified on a public dataset to motivate its adoption and prove its effectiveness, with high accuracy and inference and retraining times compatible with onboard execution.
    An Intriguing Property of Geophysics Inversion. (arXiv:2204.13731v2 [cs.LG] UPDATED)
    Inversion techniques are widely used to reconstruct subsurface physical properties (e.g., velocity, conductivity) from surface-based geophysical measurements (e.g., seismic, electric/magnetic (EM) data). The problems are governed by partial differential equations (PDEs) like the wave or Maxwell's equations. Solving geophysical inversion problems is challenging due to the ill-posedness and high computational cost. To alleviate those issues, recent studies leverage deep neural networks to learn the inversion mappings from measurements to the property directly. In this paper, we show that such a mapping can be well modeled by a very shallow (but not wide) network with only five layers. This is achieved based on our new finding of an intriguing property: a near-linear relationship between the input and output, after applying integral transform in high dimensional space. In particular, when dealing with the inversion from seismic data to subsurface velocity governed by a wave equation, the integral results of velocity with Gaussian kernels are linearly correlated to the integral of seismic data with sine kernels. Furthermore, this property can be easily turned into a light-weight encoder-decoder network for inversion. The encoder contains the integration of seismic data and the linear transformation without need for fine-tuning. The decoder only consists of a single transformer block to reverse the integral of velocity. Experiments show that this interesting property holds for two geophysics inversion problems over four different datasets. Compared to much deeper InversionNet, our method achieves comparable accuracy, but consumes significantly fewer parameters.
    Integrating User and Item Reviews in Deep Cooperative Neural Networks for Movie Ranking Prediction. (arXiv:2205.06296v4 [cs.IR] UPDATED)
    User evaluations include a significant quantity of information across online platforms. This information source has been neglected by the majority of existing recommendation systems, despite its potential to ease the sparsity issue and enhance the quality of suggestions. This work presents a deep model for concurrently learning item attributes and user behaviour from review text. Deep Cooperative Neural Network (DeepCoNN) is the suggested model consisting of two parallel neural networks connected in their final layers. One of the networks focuses on learning user behaviour from reviews submitted by the user, while the other network learns item attributes from user reviews. On top, a shared layer is added to connect these two networks. Similar to factorization machine approaches, the shared layer allows latent factors acquired for people and things to interact with each other. On a number of datasets, DeepCoNN surpasses all baseline recommendation systems, according to experimental findings.
    Unlocking High-Accuracy Differentially Private Image Classification through Scale. (arXiv:2204.13650v2 [cs.LG] UPDATED)
    Differential Privacy (DP) provides a formal privacy guarantee preventing adversaries with access to a machine learning model from extracting information about individual training points. Differentially Private Stochastic Gradient Descent (DP-SGD), the most popular DP training method for deep learning, realizes this protection by injecting noise during training. However, previous works have found that DP-SGD often leads to a significant degradation in performance on standard image classification benchmarks. Furthermore, some authors have postulated that DP-SGD inherently performs poorly on large models, since the norm of the noise required to preserve privacy is proportional to the model dimension. In contrast, we demonstrate that DP-SGD on over-parameterized models can perform significantly better than previously thought. Combining careful hyper-parameter tuning with simple techniques to ensure signal propagation and improve the convergence rate, we obtain a new SOTA without extra data on CIFAR-10 of 81.4% under $(8, 10^{-5})$-DP using a 40-layer Wide-ResNet, improving over the previous SOTA of 71.7%. When fine-tuning a pre-trained NFNet-F3, we achieve a remarkable 83.8% top-1 accuracy on ImageNet under $(0.5, 8 \cdot 10^{-7})$-DP. Additionally, we achieve 86.7% top-1 accuracy under $(8, 8 \cdot 10^{-7})$-DP, which is just 4.3% below the current non-private SOTA for this task. We believe our results are a significant step towards closing the accuracy gap between private and non-private image classification.
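    For readers unfamiliar with how DP-SGD injects noise, a minimal sketch of one gradient computation follows: each per-example gradient is clipped to a maximum L2 norm, the clipped gradients are summed, Gaussian noise proportional to the clipping norm is added, and the result is averaged. This is the generic textbook step, not the tuned pipeline from the paper; the function names and the toy gradients are assumptions.

```python
import math
import random


def clip(grad, max_norm):
    """Rescale a gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]


def dp_sgd_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[k] for g in clipped) for k in range(dim)]
    std = noise_multiplier * max_norm  # noise scales with the clipping norm
    noisy = [s + rng.gauss(0.0, std) for s in summed]
    return [v / len(per_example_grads) for v in noisy]


rng = random.Random(0)
# Hypothetical per-example gradients for a model with two parameters.
grads = [[3.0, 4.0], [0.3, 0.4]]
private_grad = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
```

    Because the noise is drawn independently for every parameter, its total norm grows with the number of parameters, which is the concern about large models that the abstract sets out to revisit.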
    Queried Unlabeled Data Improves and Robustifies Class-Incremental Learning. (arXiv:2206.07842v1 [cs.LG])
    Class-incremental learning (CIL) suffers from the notorious dilemma between learning newly added classes and preserving previously learned class knowledge. This catastrophic forgetting issue could be mitigated by storing historical data for replay, which, however, would cause memory overheads as well as imbalanced prediction updates. To address this dilemma, we propose to leverage "free" external unlabeled data querying in continual learning. We first present a CIL with Queried Unlabeled Data (CIL-QUD) scheme, where we only store a handful of past training samples as anchors and use them to query relevant unlabeled examples each time. Along with new and past stored data, the queried unlabeled data are effectively utilized, through learning-without-forgetting (LwF) regularizers and class-balance training. Besides preserving model generalization over past and current tasks, we next study the problem of adversarial robustness for CIL-QUD. Inspired by the recent success of learning robust models with unlabeled data, we explore a new robustness-aware CIL setting, where the learned adversarial robustness has to resist forgetting and be transferred as new tasks come in continually. While existing options easily fail, we show queried unlabeled data can continue to benefit, and seamlessly extend CIL-QUD into its robustified version, RCIL-QUD. Extensive experiments demonstrate that CIL-QUD achieves substantial accuracy gains on CIFAR-10 and CIFAR-100, compared to previous state-of-the-art CIL approaches. Moreover, RCIL-QUD establishes the first strong milestone for robustness-aware CIL. Code is available at https://github.com/VITA-Group/CIL-QUD.
    Robust Attack Graph Generation. (arXiv:2206.07776v1 [cs.LG])
    We present a method to learn automaton models that are more robust to input modifications. It iteratively aligns sequences to a learned model, modifies the sequences to their aligned versions, and re-learns the model. Automaton learning algorithms are typically very good at modeling the frequent behavior of a software system. Our solution can be used to also learn the behavior present in infrequent sequences, as these will be aligned to the frequent ones represented by the model. We apply our method to the SAGE tool for modeling attacker behavior from intrusion alerts. In experiments, we demonstrate that our algorithm learns models that can handle noise such as added and removed symbols from sequences. Furthermore, it learns more concise models that fit better to the training data.
    DeepJSCC-Q: Constellation Constrained Deep Joint Source-Channel Coding. (arXiv:2206.08100v1 [eess.IV])
    Recent works have shown that modern machine learning techniques can provide an alternative approach to the long-standing joint source-channel coding (JSCC) problem. Very promising initial results, superior to popular digital schemes that utilize separate source and channel codes, have been demonstrated for wireless image and video transmission using deep neural networks (DNNs). However, end-to-end training of such schemes requires a differentiable channel input representation; hence, prior works have assumed that any complex value can be transmitted over the channel. This can prevent the application of these codes in scenarios where the hardware or protocol can only admit certain sets of channel inputs, prescribed by a digital constellation. Herein, we propose DeepJSCC-Q, an end-to-end optimized JSCC solution for wireless image transmission using a finite channel input alphabet. We show that DeepJSCC-Q can achieve similar performance to prior works that allow any complex valued channel input, especially when high modulation orders are available, and that the performance asymptotically approaches that of unconstrained channel input as the modulation order increases. Importantly, DeepJSCC-Q preserves the graceful degradation of image quality in unpredictable channel conditions, a desirable property for deployment in mobile systems with rapidly changing channel conditions.
    Lessons learned from the NeurIPS 2021 MetaDL challenge: Backbone fine-tuning without episodic meta-learning dominates for few-shot learning image classification. (arXiv:2206.08138v1 [cs.LG])
    Although deep neural networks are capable of achieving performance superior to humans on various tasks, they are notorious for requiring large amounts of data and computing resources, restricting their success to domains where such resources are available. Metalearning methods can address this problem by transferring knowledge from related tasks, thus reducing the amount of data and computing resources needed to learn new tasks. We organize the MetaDL competition series, which provides opportunities for research groups all over the world to create and experimentally assess new meta-(deep)learning solutions for real problems. In this paper, authored collaboratively between the competition organizers and the top-ranked participants, we describe the design of the competition, the datasets, the best experimental results, as well as the top-ranked methods in the NeurIPS 2021 challenge, which attracted 15 active teams who made it to the final phase (by outperforming the baseline), making over 100 code submissions during the feedback phase. The solutions of the top participants have been open-sourced. The lessons learned include that learning good representations is essential for effective transfer learning.
    Feature Overcorrelation in Deep Graph Neural Networks: A New Perspective. (arXiv:2206.07743v1 [cs.LG])
    Recent years have witnessed remarkable success achieved by graph neural networks (GNNs) in many real-world applications such as recommendation and drug discovery. Despite the success, oversmoothing has been identified as one of the key issues which limit the performance of deep GNNs. It indicates that the learned node representations are highly indistinguishable due to the stacked aggregators. In this paper, we propose a new perspective to look at the performance degradation of deep GNNs, i.e., feature overcorrelation. Through empirical and theoretical study on this matter, we demonstrate the existence of feature overcorrelation in deeper GNNs and reveal potential reasons leading to this issue. To reduce the feature correlation, we propose a general framework DeCorr which can encourage GNNs to encode less redundant information. Extensive experiments have demonstrated that DeCorr can help enable deeper GNNs and is complementary to existing techniques tackling the oversmoothing issue.
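    One simple way to quantify how redundant the feature dimensions of a set of node representations are is the mean absolute Pearson correlation over all pairs of feature columns. The sketch below computes that quantity; it is an illustrative measure chosen here and may differ from the exact metric and from the DeCorr framework in the paper.

```python
import math
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def mean_abs_correlation(features):
    """Average |Pearson| over all distinct pairs of feature columns.

    `features` holds one row per node and one column per feature dimension.
    """
    cols = list(zip(*features))
    pairs = [(i, j) for i in range(len(cols)) for j in range(i + 1, len(cols))]
    return sum(abs(pearson(cols[i], cols[j])) for i, j in pairs) / len(pairs)


# Hypothetical node features: every column is a multiple of the first one.
redundant = [[1.0, 2.0, -1.0], [2.0, 4.0, -2.0], [3.0, 6.0, -3.0]]
# Hypothetical node features whose two columns are uncorrelated.
diverse = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
high = mean_abs_correlation(redundant)
low = mean_abs_correlation(diverse)
```

    A value near 1 means the dimensions carry largely the same information, whereas a value near 0 means they are close to linearly independent.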
    Pareto Invariant Risk Minimization. (arXiv:2206.07766v1 [cs.LG])
    Despite the success of invariant risk minimization (IRM) in tackling the Out-of-Distribution generalization problem, IRM can compromise the optimality when applied in practice. The practical variants of IRM, e.g., IRMv1, have been shown to have significant gaps with IRM and thus could fail to capture the invariance even in simple problems. Moreover, the optimization procedure in IRMv1 involves two intrinsically conflicting objectives, and often requires careful tuning for the objective weights. To remedy the above issues, we reformulate IRM as a multi-objective optimization problem, and propose a new optimization scheme for IRM, called PAreto Invariant Risk Minimization (PAIR). PAIR can adaptively adjust the optimization direction under the objective conflicts. Furthermore, we show PAIR can empower the practical IRM variants to overcome the barriers with the original IRM when provided with proper guidance. We conduct experiments with ColoredMNIST to confirm our theory and the effectiveness of PAIR.
    Kantorovich Strikes Back! Wasserstein GANs are not Optimal Transport?. (arXiv:2206.07767v1 [cs.LG])
    Wasserstein Generative Adversarial Networks (WGANs) are the popular generative models built on the theory of Optimal Transport (OT) and the Kantorovich duality. Despite the success of WGANs, it is still unclear how well the underlying OT dual solvers approximate the OT cost (Wasserstein-1 distance, $\mathbb{W}_{1}$) and the OT gradient needed to update the generator. In this paper, we address these questions. We construct 1-Lipschitz functions and use them to build ray monotone transport plans. This strategy yields pairs of continuous benchmark distributions with the analytically known OT plan, OT cost and OT gradient in high-dimensional spaces such as spaces of images. We thoroughly evaluate popular WGAN dual form solvers (gradient penalty, spectral normalization, entropic regularization, etc.) using these benchmark pairs. Even though these solvers perform well in WGANs, none of them faithfully compute $\mathbb{W}_{1}$ in high dimensions. Nevertheless, many provide a meaningful approximation of the OT gradient. These observations suggest that these solvers should not be treated as good estimators of $\mathbb{W}_{1}$, but to some extent they indeed can be used in variational problems requiring the minimization of $\mathbb{W}_{1}$.
    Condensing Graphs via One-Step Gradient Matching. (arXiv:2206.07746v1 [cs.LG])
    As training deep learning models on large datasets takes a lot of time and resources, it is desirable to construct a small synthetic dataset on which deep learning models can be trained sufficiently well. Recent works have explored condensing image datasets through complex bi-level optimization. For instance, dataset condensation (DC) matches network gradients w.r.t. large-real data and small-synthetic data, where the network weights are optimized for multiple steps at each outer iteration. However, existing approaches have inherent limitations: (1) they are not directly applicable to graphs where the data is discrete; and (2) the condensation process is computationally expensive due to the involved nested optimization. To bridge the gap, we investigate efficient dataset condensation tailored for graph datasets where we model the discrete graph structure as a probabilistic model. We further propose a one-step gradient matching scheme, which performs gradient matching for only one single step without training the network weights. Our theoretical analysis shows this strategy can generate synthetic graphs that lead to lower classification loss on real graphs. Extensive experiments on various graph datasets demonstrate the effectiveness and efficiency of the proposed method. In particular, we are able to reduce the dataset size by 90% while approximating up to 98% of the original performance and our method is significantly faster than multi-step gradient matching (e.g. 15x in CIFAR10 for synthesizing 500 graphs).
    SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos. (arXiv:2206.07764v1 [cs.CV])
    The visual world can be parsimoniously characterized in terms of distinct entities with sparse interactions. Discovering this compositional structure in dynamic visual scenes has proven challenging for end-to-end computer vision approaches unless explicit instance-level supervision is provided. Slot-based models leveraging motion cues have recently shown great promise in learning to represent, segment, and track objects without direct supervision, but they still fail to scale to complex real-world multi-object videos. In an effort to bridge this gap, we take inspiration from human development and hypothesize that information about scene geometry in the form of depth signals can facilitate object-centric learning. We introduce SAVi++, an object-centric video model which is trained to predict depth signals from a slot-based video representation. By further leveraging best practices for model scaling, we are able to train SAVi++ to segment complex dynamic scenes recorded with moving cameras, containing both static and moving objects of diverse appearance on naturalistic backgrounds, without the need for segmentation supervision. Finally, we demonstrate that by using sparse depth signals obtained from LiDAR, SAVi++ is able to learn emergent object segmentation and tracking from videos in the real-world Waymo Open dataset.
    Reconstructing Training Data from Trained Neural Networks. (arXiv:2206.07758v1 [cs.LG])
    Understanding to what extent neural networks memorize training data is an intriguing question with practical and theoretical implications. In this paper we show that in some cases a significant fraction of the training data can in fact be reconstructed from the parameters of a trained neural network classifier. We propose a novel reconstruction scheme that stems from recent theoretical results about the implicit bias in training neural networks with gradient-based methods. To the best of our knowledge, our results are the first to show that reconstructing a large portion of the actual training samples from a trained neural network classifier is generally possible. This has negative implications on privacy, as it can be used as an attack for revealing sensitive training data. We demonstrate our method for binary MLP classifiers on a few standard computer vision datasets.
    Improving Diversity with Adversarially Learned Transformations for Domain Generalization. (arXiv:2206.07736v1 [cs.LG])
    To be successful in single source domain generalization, maximizing diversity of synthesized domains has emerged as one of the most effective strategies. Many of the recent successes have come from methods that pre-specify the types of diversity that a model is exposed to during training, so that it can ultimately generalize well to new domains. However, naïve diversity-based augmentations do not work effectively for domain generalization either because they cannot model large domain shift, or because the span of pre-specified transforms does not cover the types of shift commonly occurring in domain generalization. To address this issue, we present a novel framework that uses adversarially learned transformations (ALT) using a neural network to model plausible, yet hard image transformations that fool the classifier. This network is randomly initialized for each batch and trained for a fixed number of steps to maximize classification error. Further, we enforce consistency between the classifier's predictions on the clean and transformed images. With extensive empirical analysis, we find that this new form of adversarial transformations achieves both objectives of diversity and hardness simultaneously, outperforming all existing techniques on competitive benchmarks for single source domain generalization. We also show that ALT can naturally work with existing diversity modules to produce highly distinct and large transformations of the source domain, leading to state-of-the-art performance.
    Gaussian Blue Noise. (arXiv:2206.07798v1 [cs.GR])
    Among the various approaches for producing point distributions with blue noise spectrum, we argue for an optimization framework using Gaussian kernels. We show that with a wise selection of optimization parameters, this approach attains unprecedented quality, provably surpassing the current state of the art attained by the optimal transport (BNOT) approach. Further, we show that our algorithm scales smoothly and feasibly to high dimensions while maintaining the same quality, realizing unprecedented high-quality high-dimensional blue noise sets. Finally, we show an extension to adaptive sampling.
    Simple and Efficient Architectures for Semantic Segmentation. (arXiv:2206.08236v1 [cs.CV])
    Though the state-of-the-art architectures for semantic segmentation, such as HRNet, demonstrate impressive accuracy, the complexity arising from their salient design choices hinders a range of model acceleration tools, and further they make use of operations that are inefficient on current hardware. This paper demonstrates that a simple encoder-decoder architecture with a ResNet-like backbone and a small multi-scale head performs on par with or better than complex semantic segmentation architectures such as HRNet, FANet and DDRNets. Naively applying deep backbones designed for Image Classification to the task of Semantic Segmentation leads to sub-par results, owing to a much smaller effective receptive field of these backbones. Implicit among the various design choices put forth in works like HRNet, DDRNet, and FANet are networks with a large effective receptive field. It is natural to ask if a simple encoder-decoder architecture would compare favorably if composed of backbones that have a larger effective receptive field, though without the use of inefficient operations like dilated convolutions. We show that with minor and inexpensive modifications to ResNets, enlarging the receptive field, very simple and competitive baselines can be created for Semantic Segmentation. We present a family of such simple architectures for desktop as well as mobile targets, which match or exceed the performance of complex models on the Cityscapes dataset. We hope that our work provides simple yet effective baselines for practitioners to develop efficient semantic segmentation models.
    GoodBye WaveNet -- A Language Model for Raw Audio with Context of 1/2 Million Samples. (arXiv:2206.08297v1 [cs.SD])
    Modeling long-term dependencies for audio signals is a particularly challenging problem, as even small time scales yield on the order of a hundred thousand samples. With the recent advent of Transformers, neural architectures became good at modeling dependencies over longer time scales, but their quadratic cost made them hard to scale. We propose a generative auto-regressive architecture that can model audio waveforms over quite a large context, greater than 500,000 samples. Our work is adapted to learn time dependencies by learning a latent representation with a CNN front-end, and then learning dependencies over these representations using Transformer encoders, fully trained end-to-end, thereby allowing the model to learn representations as it deems fit for the next sample. Unlike previous works that compared different time scales to show improvement, we use a standard dataset, with the same number of parameters/context to show improvements. We achieve state-of-the-art performance as compared to other approaches such as Wavenet, SaSHMI, and Sample-RNN on a standard dataset for modeling long-term structure. This work suggests an exciting direction for the field, given improvements in context modeling that can be scaled with more data, as well as potentially better results by using billions/trillions of parameters.
    Towards Robust and Reproducible Active Learning Using Neural Networks. (arXiv:2002.09564v3 [cs.LG] UPDATED)
    Active learning (AL) is a promising ML paradigm that has the potential to parse through large unlabeled data and help reduce annotation cost in domains where labeling data can be prohibitive. Recently proposed neural network based AL methods use different heuristics to accomplish this goal. In this study, we demonstrate that under identical experimental settings, different types of AL algorithms (uncertainty based, diversity based, and committee based) produce an inconsistent gain over the random sampling baseline. Through a variety of experiments, controlling for sources of stochasticity, we show that variance in performance metrics achieved by AL algorithms can lead to results that are not consistent with the previously reported results. We also found that under strong regularization, AL methods show marginal or no advantage over the random sampling baseline under a variety of experimental conditions. Finally, we conclude with a set of recommendations on how to assess the results using a new AL algorithm to ensure results are reproducible and robust under changes in experimental conditions. We believe our findings and recommendations will help advance reproducible research in AL using neural networks. To facilitate AL evaluations, we open source our code at https://github.com/PrateekMunjal/TorchAL
    Compressed-VFL: Communication-Efficient Learning with Vertically Partitioned Data. (arXiv:2206.08330v1 [cs.LG])
    We propose Compressed Vertical Federated Learning (C-VFL) for communication-efficient training on vertically partitioned data. In C-VFL, a server and multiple parties collaboratively train a model on their respective features utilizing several local iterations and sharing compressed intermediate results periodically. Our work provides the first theoretical analysis of the effect message compression has on distributed training over vertically partitioned data. We prove convergence of non-convex objectives at a rate of $O(\frac{1}{\sqrt{T}})$ when the compression error is bounded over the course of training. We provide specific requirements for convergence with common compression techniques, such as quantization and top-$k$ sparsification. Finally, we experimentally show compression can reduce communication by over $90\%$ without a significant decrease in accuracy over VFL without compression.
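    The top-$k$ sparsification mentioned above can be illustrated with a minimal sketch. This is not the C-VFL implementation; the function names `top_k_sparsify` and `compression_error` are hypothetical, and the squared Euclidean error is only one plausible way to measure the compression error that the convergence result requires to be bounded.

```python
def top_k_sparsify(vector, k):
    """Keep the k largest-magnitude entries of `vector` and zero the rest."""
    if k <= 0:
        return [0.0] * len(vector)
    # Indices ordered by descending magnitude; sorted() is stable, so ties
    # are broken in favour of the earlier position.
    by_magnitude = sorted(range(len(vector)), key=lambda i: -abs(vector[i]))
    keep = set(by_magnitude[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]


def compression_error(vector, compressed):
    """Squared Euclidean distance between a message and its compressed form."""
    return sum((a - b) ** 2 for a, b in zip(vector, compressed))


# Hypothetical intermediate result that a party would send to the server.
message = [0.1, -3.0, 2.0, 0.5]
sparse = top_k_sparsify(message, 2)
error = compression_error(message, sparse)
```

    Only the retained entries and their indices would need to be transmitted, which is where the communication saving comes from; the dropped entries determine the error.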
    "Understanding Robustness Lottery": A Comparative Visual Analysis of Neural Network Pruning Approaches. (arXiv:2206.07918v1 [cs.HC])
    Deep learning approaches have provided state-of-the-art performance in many applications by relying on extremely large and heavily overparameterized neural networks. However, such networks have been shown to be very brittle, to generalize poorly to new use cases, and to be difficult if not impossible to deploy on resource-limited platforms. Model pruning, i.e., reducing the size of the network, is a widely adopted strategy that can lead to more robust and generalizable networks -- usually orders of magnitude smaller with the same or even improved performance. While there exist many heuristics for model pruning, our understanding of the pruning process remains limited. Empirical studies show that some heuristics improve performance while others can make models more brittle or have other side effects. This work aims to shed light on how different pruning methods alter the network's internal feature representation, and the corresponding impact on model performance. To provide a meaningful comparison and characterization of model feature space, we use three geometric metrics that are decomposed from the commonly adopted classification loss. With these metrics, we design a visualization system to highlight the impact of pruning on model prediction as well as the latent feature embedding. The proposed tool provides an environment for exploring and studying differences among pruning methods and between pruned and original models. By leveraging our visualization, ML researchers can not only identify samples that are fragile to model pruning and data corruption but also obtain insights and explanations on how some pruned models achieve superior robustness performance.
    Applications of Machine Learning to the Identification of Anomalous ER Claims. (arXiv:2206.08093v1 [cs.LG])
    Improper health insurance payments resulting from fraud and upcoding result in tens of billions of dollars in excess health care costs annually in the United States, motivating machine learning researchers to build anomaly detection models for health insurance claims. This article describes two such strategies specifically for ER claims. The first is an upcoding model based on severity code distributions, stratified by hierarchical diagnosis code clusters. A statistically significant difference in mean upcoding anomaly scores is observed between free-standing ERs and acute care hospitals, with free-standing ERs being more anomalous. The second model is a random forest that minimizes improper payments by optimally sorting ER claims within review queues. Depending on the percentage of claims reviewed, the random forest saved 12% to 40% above a baseline approach that prioritized claims by billed amount.
    Using adversarial images to improve outcomes of federated learning for non-IID data. (arXiv:2206.08124v1 [cs.LG])
    One of the important problems in federated learning is how to deal with unbalanced data. This contribution introduces a novel technique designed to deal with label skewed non-IID data, using adversarial inputs, created by the I-FGSM method. Adversarial inputs guide the training process and allow the Weighted Federated Averaging to give more importance to clients with 'selected' local label distributions. Experimental results, gathered from image classification tasks, for MNIST and CIFAR-10 datasets, are reported and analyzed.
    Not All Lotteries Are Made Equal. (arXiv:2206.08175v1 [cs.LG])
    The Lottery Ticket Hypothesis (LTH) states that for a reasonably sized neural network, a sub-network within the same network yields no less performance than the dense counterpart when trained from the same initialization. This work investigates the relation between model size and the ease of finding these sparse sub-networks. We show through experiments that, surprisingly, under a finite budget, smaller models benefit more from Ticket Search (TS).
    Deepfake histological images for enhancing digital pathology. (arXiv:2206.08308v1 [eess.IV])
    An optical microscopic examination of thinly cut stained tissue on glass slides prepared from FFPE tissue blocks is the gold standard for tissue diagnostics. In addition, the diagnostic abilities and expertise of any pathologist are dependent on their direct experience with common as well as rarer variant morphologies. Recently, deep learning approaches have been used to successfully show a high level of accuracy for such tasks. However, obtaining expert-level annotated images is an expensive and time-consuming task, and artificially synthesized histological images can prove greatly beneficial. Here, we present an approach to not only generate histological images that reproduce the diagnostic morphologic features of common disease but also give the user the ability to generate new and rare morphologies. Our approach involves developing a generative adversarial network model that synthesizes pathology images constrained by class labels. We investigated the ability of this framework to synthesize realistic prostate and colon tissue images and assessed the utility of these images in augmenting the diagnostic ability of machine learning methods as well as their usability by a panel of experienced anatomic pathologists. Synthetic data generated by our framework performed similarly to real data in training a deep learning model for diagnosis. Pathologists were not able to distinguish between real and synthetic images and showed a similar level of inter-observer agreement for prostate cancer grading. We extended the approach to significantly more complex images from colon biopsies and showed that the complex microenvironment in such tissues can also be reproduced. Finally, we present the ability for a user to generate deepfake histological images via a simple markup of semantic labels.
    Participation and Data Valuation in IoT Data Markets through Distributed Coalitions. (arXiv:2206.07785v1 [cs.NI])
    This paper considers a market for Internet of Things (IoT) data that is used to train machine learning models. The data is supplied to the market platform through a network and the price of the data is controlled based on the value it brings to the machine learning model. We explore the correlation property of data in a game-theoretical setting to eventually derive a simplified distributed solution for a data trading mechanism that emphasizes the mutual benefit of devices and the market. The key proposal is an efficient algorithm for markets that jointly addresses the challenges of availability and heterogeneity in participation, as well as the transfer of trust and the economic value of data exchange in IoT networks. The proposed approach establishes the data market by reinforcing collaboration opportunities between devices with correlated data to avoid information leakage. Therein, we develop a network-wide optimization problem that maximizes the social value of coalition among the IoT devices of similar data types; at the same time, it minimizes the cost due to network externalities, i.e., the impact of information leakage due to data correlation, as well as the opportunity costs. Finally, we reveal the structure of the formulated problem as a distributed coalition game and solve it following the simplified split-and-merge algorithm. Simulation results show the efficacy of our proposed mechanism design toward a trusted IoT data market, with up to 32.72% gain in the average payoff for each seller.
    Generalization Bounds for Data-Driven Numerical Linear Algebra. (arXiv:2206.07886v1 [cs.LG])
    Data-driven algorithms can adapt their internal structure or parameters to inputs from unknown application-specific distributions, by learning from a training sample of inputs. Several recent works have applied this approach to problems in numerical linear algebra, obtaining significant empirical gains in performance. However, no theoretical explanation for their success was known. In this work we prove generalization bounds for those algorithms, within the PAC-learning framework for data-driven algorithm selection proposed by Gupta and Roughgarden (SICOMP 2017). Our main results are closely matching upper and lower bounds on the fat shattering dimension of the learning-based low rank approximation algorithm of Indyk et al. (NeurIPS 2019). Our techniques are general, and provide generalization bounds for many other recently proposed data-driven algorithms in numerical linear algebra, covering both sketching-based and multigrid-based methods. This considerably broadens the class of data-driven algorithms for which a PAC-learning analysis is available.
    Max-Margin Works while Large Margin Fails: Generalization without Uniform Convergence. (arXiv:2206.07892v1 [cs.LG])
    A major challenge in modern machine learning is theoretically understanding the generalization properties of overparameterized models. Many existing tools rely on uniform convergence (UC), a property that, when it holds, guarantees that the test loss will be close to the training loss, uniformly over a class of candidate models. Nagarajan and Kolter (2019) show that in certain simple linear and neural-network settings, any uniform convergence bound will be vacuous, leaving open the question of how to prove generalization in settings where UC fails. Our main contribution is proving novel generalization bounds in two such settings, one linear, and one non-linear. We study the linear classification setting of Nagarajan and Kolter, and a quadratic ground truth function learned via a two-layer neural network in the non-linear regime. We prove a new type of margin bound showing that above a certain signal-to-noise threshold, any near-max-margin classifier will achieve almost no test loss in these two settings. Our results show that near-max-margin is important: while any model that achieves at least a $(1 - \epsilon)$-fraction of the max-margin generalizes well, a classifier achieving half of the max-margin may fail terribly. We additionally strengthen the UC impossibility results of Nagarajan and Kolter, proving that one-sided UC bounds and classical margin bounds will fail on near-max-margin classifiers. Our analysis provides insight on why memorization can coexist with generalization: we show that in this challenging regime where generalization occurs but UC fails, near-max-margin classifiers simultaneously contain some generalizable components and some overfitting components that memorize the data. The presence of the overfitting components is enough to preclude UC, but the near-extremal margin guarantees that sufficient generalizable components are present.
    EPG2S: Speech Generation and Speech Enhancement based on Electropalatography and Audio Signals using Multimodal Learning. (arXiv:2206.07860v1 [cs.SD])
    Speech generation and enhancement based on articulatory movements facilitate communication when the scope of verbal communication is absent, e.g., in patients who have lost the ability to speak. Although various techniques have been proposed to this end, electropalatography (EPG), which is a monitoring technique that records contact between the tongue and hard palate during speech, has not been adequately explored. Herein, we propose a novel multimodal EPG-to-speech (EPG2S) system that utilizes EPG and speech signals for speech generation and enhancement. Different fusion strategies based on multiple combinations of EPG and noisy speech signals are examined, and the viability of the proposed method is investigated. Experimental results indicate that EPG2S achieves desirable speech generation outcomes based solely on EPG signals. Further, the addition of noisy speech signals is observed to improve quality and intelligibility. Additionally, EPG2S is observed to achieve high-quality speech enhancement based solely on audio signals, with the addition of EPG signals further improving the performance. The late fusion strategy is deemed to be the most effective approach for simultaneous speech generation and enhancement.
    Pure Exploration of Causal Bandits. (arXiv:2206.07883v1 [cs.LG])
    The causal bandit problem integrates causal inference with multi-armed bandits. The pure exploration of causal bandits is the following online learning task: given a causal graph with unknown causal inference distributions, in each round we can choose to either intervene on one variable or do no intervention, and observe the random outcomes of all random variables. The goal is to use as few rounds as possible to output an intervention that gives the best (or almost best) expected outcome on the reward variable $Y$ with probability at least $1-\delta$, where $\delta$ is a given confidence level. We provide the first gap-dependent fully adaptive pure exploration algorithms on three types of causal models: parallel graphs, general graphs with a small number of backdoor parents, and binary generalized linear models. Our algorithms improve on both prior causal bandit algorithms, which are not adaptive to reward gaps, and prior adaptive pure exploration algorithms, which do not utilize the special features of causal bandits.
    Conformal prediction set for time-series. (arXiv:2206.07851v1 [stat.ML])
    When building either prediction intervals for regression (with real-valued response) or prediction sets for classification (with categorical responses), uncertainty quantification is essential to studying complex machine learning methods. In this paper, we develop Ensemble Regularized Adaptive Prediction Set (ERAPS) to construct prediction sets for time-series (with categorical responses), based on the prior work of [Xu and Xie, 2021]. In particular, we allow unknown dependencies to exist within features and responses that arrive in sequence. Method-wise, ERAPS is a distribution-free and ensemble-based framework that is applicable for arbitrary classifiers. Theoretically, we bound the coverage gap without assuming data exchangeability and show asymptotic set convergence. Empirically, we demonstrate valid marginal and conditional coverage by ERAPS, which also tends to yield smaller prediction sets than competing methods.
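    As a rough illustration of what a prediction set for a categorical response looks like, the sketch below builds a generic adaptive prediction set: classes are added in order of decreasing predicted probability until their cumulative mass reaches a threshold. This is not ERAPS itself, which per the abstract also involves an ensemble and regularization; the function name `prediction_set` and the threshold `tau` are hypothetical, and in practice the threshold would be calibrated from past data rather than fixed by hand.

```python
def prediction_set(probs, tau):
    """Smallest set of top-ranked classes whose total probability is >= tau.

    probs: predicted class probabilities for one observation.
    tau:   cumulative-probability threshold (assumed already calibrated).
    """
    # Class indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda c: -probs[c])
    chosen, mass = [], 0.0
    for c in order:
        chosen.append(c)
        mass += probs[c]
        if mass >= tau:
            break
    return chosen


# Hypothetical classifier output for one time step with four classes.
labels = prediction_set([0.5, 0.3, 0.15, 0.05], tau=0.7)
```

    A confident prediction yields a small set and an uncertain one yields a larger set, which is why set size is a natural efficiency measure alongside coverage.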
    Multimodal Dialogue State Tracking. (arXiv:2206.07898v1 [cs.AI])
    Designed for tracking user goals in dialogues, a dialogue state tracker is an essential component in a dialogue system. However, the research of dialogue state tracking has largely been limited to unimodality, in which slots and slot values are limited by knowledge domains (e.g. restaurant domain with slots of restaurant name and price range) and are defined by specific database schema. In this paper, we propose to extend the definition of dialogue state tracking to multimodality. Specifically, we introduce a novel dialogue state tracking task to track the information of visual objects that are mentioned in video-grounded dialogues. Each new dialogue utterance may introduce a new video segment, new visual objects, or new object attributes, and a state tracker is required to update these information slots accordingly. We created a new synthetic benchmark and designed a novel baseline, Video-Dialogue Transformer Network (VDTN), for this task. VDTN combines both object-level features and segment-level features and learns contextual dependencies between videos and dialogues to generate multimodal dialogue states. We optimized VDTN for a state generation task as well as a self-supervised video understanding task which recovers video segment or object representations. Finally, we trained VDTN to use the decoded states in a response prediction task. Together with comprehensive ablation and qualitative analysis, we discovered interesting insights towards building more capable multimodal dialogue systems.
    Discovery and density estimation of latent confounders in Bayesian networks with evidence lower bound. (arXiv:2206.05490v2 [cs.LG] UPDATED)
    Discovering and parameterising latent confounders represent important and challenging problems in causal structure learning and density estimation respectively. In this paper, we focus on both discovering and learning the distribution of latent confounders. This task requires solutions that come from different areas of statistics and machine learning. We combine elements of variational Bayesian methods, expectation-maximisation, hill-climbing search, and structure learning under the assumption of causal insufficiency. We propose two learning strategies; one that maximises model selection accuracy, and another that improves computational efficiency in exchange for minor reductions in accuracy. The former strategy is suitable for small networks and the latter for moderate size networks. Both learning strategies perform well relative to existing solutions.
    A machine learning approach to predicting pore pressure response in liquefiable sands under cyclic loading. (arXiv:2206.07780v1 [physics.geo-ph])
    Shear stress history controls the pore pressure response in liquefiable soils. The excess pore pressure does not increase under cyclic loading when shear stress amplitude is lower than the peak prior amplitude -- the shielding effect. Many sophisticated constitutive models fail to capture the shielding effect observed in the cyclic liquefaction experiments. We develop a data-driven machine learning model based on the LSTM neural network to capture the liquefaction response of soils under cyclic loading. The LSTM model is trained on 12 laboratory cyclic simple shear tests on Nevada sand in loose and dense conditions subjected to different cyclic simple shear loading conditions. The LSTM model features include the relative density of soil and the previous stress history to predict the pore water pressure response. The LSTM model successfully replicates the pore pressure response for three cyclic simple test results considering the shielding and density effects.
    Risk-Averse No-Regret Learning in Online Convex Games. (arXiv:2203.08957v2 [cs.LG] UPDATED)
    We consider an online stochastic game with risk-averse agents whose goal is to learn optimal decisions that minimize the risk of incurring significantly high costs. Specifically, we use the Conditional Value at Risk (CVaR) as a risk measure that the agents can estimate using bandit feedback in the form of the cost values of only their selected actions. Since the distributions of the cost functions depend on the actions of all agents that are generally unobservable, they are themselves unknown and, therefore, the CVaR values of the costs are difficult to compute. To address this challenge, we propose a new online risk-averse learning algorithm that relies on one-point zeroth-order estimation of the CVaR gradients computed using CVaR values that are estimated by appropriately sampling the cost functions. We show that this algorithm achieves sub-linear regret with high probability. We also propose two variants of this algorithm that improve performance. The first variant relies on a new sampling strategy that uses samples from the previous iteration to improve the estimation accuracy of the CVaR values. The second variant employs residual feedback that uses CVaR values from the previous iteration to reduce the variance of the CVaR gradient estimates. We theoretically analyze the convergence properties of these variants and illustrate their performance on an online market problem that we model as a Cournot game.  ( 2 min )
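    For readers unfamiliar with the risk measure, a minimal sketch of a sample-based CVaR estimate follows. It assumes the common definition of CVaR at level alpha as the mean of the worst (1 - alpha) fraction of costs; it is not the zeroth-order gradient estimator proposed in the paper, and the name `empirical_cvar` is hypothetical.

```python
import math


def empirical_cvar(costs, alpha):
    """Average of the worst (1 - alpha) fraction of the sampled costs."""
    if not costs:
        raise ValueError("need at least one cost sample")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    n = len(costs)
    # Number of tail samples. The rounding guards against float noise such
    # as (1 - 0.8) * 10 evaluating to 1.9999999999999996.
    tail = max(1, math.ceil(round((1.0 - alpha) * n, 9)))
    worst = sorted(costs, reverse=True)[:tail]
    return sum(worst) / tail


# Hypothetical costs observed by one agent for its selected action.
samples = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
risk = empirical_cvar(samples, alpha=0.8)
```

    With alpha = 0 the estimate reduces to the ordinary mean cost, and as alpha approaches 1 it focuses on the single worst sample, which is what makes the objective risk-averse.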
    Contrasting random and learned features in deep Bayesian linear regression. (arXiv:2203.00573v2 [cs.LG] UPDATED)
    Understanding how feature learning affects generalization is among the foremost goals of modern deep learning theory. Here, we study how the ability to learn representations affects the generalization performance of a simple class of models: deep Bayesian linear neural networks trained on unstructured Gaussian data. By comparing deep random feature models to deep networks in which all layers are trained, we provide a detailed characterization of the interplay between width, depth, data density, and prior mismatch. We show that both models display sample-wise double-descent behavior in the presence of label noise. Random feature models can also display model-wise double-descent if there are narrow bottleneck layers, while deep networks do not show these divergences. Random feature models can have particular widths that are optimal for generalization at a given data density, while making neural networks as wide or as narrow as possible is always optimal. Moreover, we show that the leading-order correction to the kernel-limit learning curve cannot distinguish between random feature models and deep networks in which all layers are trained. Taken together, our findings begin to elucidate how architectural details affect generalization performance in this simple class of deep regression models.  ( 2 min )
    Generalizing to Evolving Domains with Latent Structure-Aware Sequential Autoencoder. (arXiv:2205.07649v2 [cs.LG] UPDATED)
    Domain generalization aims to improve the generalization capability of machine learning systems to out-of-distribution (OOD) data. Existing domain generalization techniques focus on stationary and discrete environments to tackle the generalization issue caused by OOD data. However, many real-world tasks in non-stationary environments (e.g., self-driving car systems, sensor measurements) involve more complex and continuously evolving domain drift, which raises new challenges for the problem of domain generalization. In this paper, we formulate the aforementioned setting as the problem of evolving domain generalization. Specifically, we propose a probabilistic framework called Latent Structure-aware Sequential Autoencoder (LSSAE) to tackle the problem of evolving domain generalization by exploring the underlying continuous structure in the latent space of deep neural networks, where we aim to identify two major factors, namely covariate shift and concept shift, accounting for distribution shift in non-stationary environments. Experimental results on both synthetic and real-world datasets show that LSSAE can lead to superior performance in the evolving domain generalization setting.  ( 2 min )
    Convergence of Policy Gradient for Entropy Regularized MDPs with Neural Network Approximation in the Mean-Field Regime. (arXiv:2201.07296v2 [math.OC] UPDATED)
    We study the global convergence of policy gradient for infinite-horizon, continuous state and action space, and entropy-regularized Markov decision processes (MDPs). We consider a softmax policy with (one-hidden layer) neural network approximation in a mean-field regime. Additional entropic regularization in the associated mean-field probability measure is added, and the corresponding gradient flow is studied in the 2-Wasserstein metric. We show that the objective function is increasing along the gradient flow. Further, we prove that if the regularization in terms of the mean-field measure is sufficient, the gradient flow converges exponentially fast to the unique stationary solution, which is the unique maximizer of the regularized MDP objective. Lastly, we study the sensitivity of the value function along the gradient flow with respect to regularization parameters and the initial condition. Our results rely on the careful analysis of the non-linear Fokker-Planck-Kolmogorov equation and extend the pioneering work of Mei et al. 2020 and Agarwal et al. 2020, which quantify the global convergence rate of policy gradient for entropy-regularized MDPs in the tabular setting.  ( 2 min )
    Computationally Efficient Approximations for Matrix-based Renyi's Entropy. (arXiv:2112.13720v3 [stat.ML] UPDATED)
    The recently developed matrix-based Renyi's entropy enables measurement of information in data simply using the eigenspectrum of symmetric positive semi-definite (PSD) matrices in reproducing kernel Hilbert space, without estimation of the underlying data distribution. This intriguing property has led to the new information measure being widely adopted in multiple statistical inference and learning tasks. However, the computation of this quantity involves the trace operator on a PSD matrix $G$ raised to the power $\alpha$ (i.e., $\mathrm{tr}(G^\alpha)$), with a normal complexity of nearly $O(n^3)$, which severely hampers its practical usage when the number of samples (i.e., $n$) is large. In this work, we present computationally efficient approximations to this new entropy functional that can reduce its complexity to even significantly less than $O(n^2)$. To this end, we leverage the recent progress on Randomized Numerical Linear Algebra, developing Taylor, Chebyshev and Lanczos approximations to $\mathrm{tr}(G^\alpha)$ for arbitrary values of $\alpha$ by converting it into a matrix-vector multiplication problem. We also establish the connection between the matrix-based Renyi's entropy and PSD matrix approximation, which enables exploiting both clustering and block low-rank structure of $G$ to further reduce the computational cost. We theoretically provide approximation accuracy guarantees and illustrate the properties of different approximations. Large-scale experimental evaluations on both synthetic and real-world data corroborate our theoretical findings, showing promising speedup with negligible loss in accuracy.  ( 2 min )
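    To make the idea of replacing an eigendecomposition with matrix-vector products concrete, here is a minimal sketch for the special case alpha = 2 only. It assumes the common definition S_alpha(A) = log2(tr(A^alpha)) / (1 - alpha) for a trace-normalized PSD matrix A, and it uses a plain Hutchinson estimator with random sign vectors. That estimator is a standard randomized trace technique and is not claimed to be the Taylor, Chebyshev or Lanczos scheme of the paper; all function names are hypothetical.

```python
import math
import random


def matvec(a, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def exact_trace_sq(a):
    """tr(A^2) for symmetric A equals the sum of squared entries."""
    return sum(v * v for row in a for v in row)


def hutchinson_trace_sq(a, num_probes, rng):
    """Estimate tr(A^2) for symmetric A as the average of ||A z||^2.

    z has independent +/-1 entries, so E[z^T A^2 z] = tr(A^2), and each
    probe costs one matrix-vector product.
    """
    n = len(a)
    total = 0.0
    for _ in range(num_probes):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        az = matvec(a, z)
        total += sum(v * v for v in az)
    return total / num_probes


def renyi2_entropy(trace_sq):
    """Order-2 matrix-based Renyi entropy: -log2(tr(A^2)), assuming tr(A) = 1."""
    return -math.log2(trace_sq)


# Hypothetical 3x3 symmetric PSD matrix with unit trace.
a = [[0.5, 0.1, 0.0],
     [0.1, 0.3, 0.1],
     [0.0, 0.1, 0.2]]
rng = random.Random(0)
print(renyi2_entropy(exact_trace_sq(a)))
print(renyi2_entropy(hutchinson_trace_sq(a, 200, rng)))
```

    For non-integer alpha a single product with A is no longer enough, which is where polynomial or Lanczos approximations of the matrix function come in.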
    Masked-attention Mask Transformer for Universal Image Segmentation. (arXiv:2112.01527v3 [cs.CV] UPDATED)
    Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).  ( 2 min )
    An Asymptotic Test for Conditional Independence using Analytic Kernel Embeddings. (arXiv:2110.14868v2 [stat.ML] UPDATED)
    We propose a new conditional dependence measure and a statistical test for conditional independence. The measure is based on the difference between analytic kernel embeddings of two well-suited distributions evaluated at a finite set of locations. We obtain its asymptotic distribution under the null hypothesis of conditional independence and design a consistent statistical test from it. We conduct a series of experiments showing that our new test outperforms state-of-the-art methods both in terms of type-I and type-II errors even in the high dimensional setting.  ( 2 min )
    Interpretable and Generalizable Graph Learning via Stochastic Attention Mechanism. (arXiv:2201.12987v2 [cs.LG] UPDATED)
    Interpretable graph learning is needed because many scientific applications depend on learning models to collect insights from graph-structured data. Previous works mostly focused on using post-hoc approaches to interpret a pre-trained model (graph neural network models in particular). They argue against inherently interpretable models because good interpretation of these models often comes at the cost of their prediction accuracy. Moreover, the widely used attention mechanism for inherent interpretation often fails to provide faithful interpretation in graph learning tasks. In this work, we address both issues by proposing Graph Stochastic Attention (GSAT), an attention mechanism derived from the information bottleneck principle. GSAT leverages stochastic attention to block the information from the task-irrelevant graph components while learning stochasticity-reduced attention to select the task-relevant subgraphs for interpretation. GSAT can also be applied to fine-tuning and interpreting pre-trained models via the stochastic attention mechanism. Extensive experiments on eight datasets show that GSAT outperforms the state-of-the-art methods by up to 20% in interpretation AUC and 5% in prediction accuracy.  ( 2 min )
    Robustness and Accuracy Could Be Reconcilable by (Proper) Definition. (arXiv:2202.10103v2 [cs.LG] UPDATED)
    The trade-off between robustness and accuracy has been widely studied in the adversarial literature. Although still controversial, the prevailing view is that this trade-off is inherent, either empirically or theoretically. Thus, we dig for the origin of this trade-off in adversarial training and find that it may stem from the improperly defined robust error, which imposes an inductive bias of local invariance -- an overcorrection towards smoothness. Given this, we advocate employing local equivariance to describe the ideal behavior of a robust model, leading to a self-consistent robust error named SCORE. By definition, SCORE facilitates the reconciliation between robustness and accuracy, while still handling the worst-case uncertainty via robust optimization. By simply substituting KL divergence with variants of distance metrics, SCORE can be efficiently minimized. Empirically, our models achieve top-rank performance on RobustBench under AutoAttack. Besides, SCORE provides instructive insights for explaining the overfitting phenomenon and semantic input gradients observed on robust models. Code is available at https://github.com/P2333/SCORE.  ( 2 min )
    Neural Enhanced Belief Propagation for Data Association in Multiobject Tracking. (arXiv:2203.09948v3 [cs.CV] UPDATED)
    Situation-aware technologies enabled by multiobject tracking (MOT) methods will create new services and applications in fields such as autonomous navigation and applied ocean sciences. Belief propagation (BP) is a state-of-the-art method for Bayesian MOT but fully relies on a statistical model and preprocessed sensor measurements. In this paper, we establish a hybrid method for model-based and data-driven MOT. The proposed neural enhanced belief propagation (NEBP) approach complements BP by information learned from raw sensor data with the goal to improve data association and to reject false alarm measurements. We evaluate the performance of our NEBP approach for MOT on the nuScenes autonomous driving dataset and demonstrate that it can outperform state-of-the-art reference methods.  ( 2 min )
    Deep Reinforcement Learning, a textbook. (arXiv:2201.02135v3 [cs.AI] UPDATED)
    Deep reinforcement learning has gathered much attention recently. Impressive results were achieved in activities as diverse as autonomous driving, game playing, molecular recombination, and robotics. In all these fields, computer programs have taught themselves to solve difficult problems. They have learned to fly model helicopters and perform aerobatic manoeuvres such as loops and rolls. In some applications they have even become better than the best humans, such as in Atari, Go, poker and StarCraft. The way in which deep reinforcement learning explores complex environments reminds us of how children learn, by playfully trying out things, getting feedback, and trying again. The computer seems to truly possess aspects of human learning; this goes to the heart of the dream of artificial intelligence. The successes in research have not gone unnoticed by educators, and universities have started to offer courses on the subject. The aim of this book is to provide a comprehensive overview of the field of deep reinforcement learning. The book is written for graduate students of artificial intelligence, and for researchers and practitioners who wish to better understand deep reinforcement learning methods and their challenges. We assume an undergraduate-level understanding of computer science and artificial intelligence; the programming language of this book is Python. We describe the foundations, the algorithms and the applications of deep reinforcement learning. We cover the established model-free and model-based methods that form the basis of the field. Developments go quickly, and we also cover advanced topics: deep multi-agent reinforcement learning, deep hierarchical reinforcement learning, and deep meta learning.  ( 2 min )
    Off-Policy Evaluation for Large Action Spaces via Embeddings. (arXiv:2202.06317v2 [cs.LG] UPDATED)
    Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems, since it enables offline evaluation of new policies using only historic log data. Unfortunately, when the number of actions is large, existing OPE estimators -- most of which are based on inverse propensity score weighting -- degrade severely and can suffer from extreme bias and variance. This foils the use of OPE in many applications from recommender systems to language models. To overcome this issue, we propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space. We characterize the bias, variance, and mean squared error of the proposed estimator and analyze the conditions under which the action embedding provides statistical benefits over conventional estimators. In addition to the theoretical analysis, we find that the empirical performance improvement can be substantial, enabling reliable OPE even when existing estimators collapse due to a large number of actions.  ( 2 min )
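    For orientation, the conventional baseline that the abstract above refers to, inverse propensity score (IPS) weighting, can be sketched in a few lines. This is the standard IPS estimator, not the embedding-based estimator proposed in the paper; the log format, the function names and the toy policies are assumptions made for illustration.

```python
def ips_estimate(logs, target_policy):
    """Vanilla inverse propensity score estimate of a target policy's value.

    logs: list of (context, action, reward, logging_prob) tuples, where
    logging_prob is the probability with which the logging policy chose
    the recorded action (assumed known and non-zero).
    target_policy(context, action) returns the target policy's probability.
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        weight = target_policy(context, action) / logging_prob
        total += weight * reward
    return total / len(logs)


def always_action_one(context, action):
    """Hypothetical deterministic target policy that always plays action 1."""
    return 1.0 if action == 1 else 0.0


# Toy log collected by a uniform logging policy over two actions.
logs = [
    ("x", 0, 0.0, 0.5),
    ("x", 1, 1.0, 0.5),
    ("x", 1, 1.0, 0.5),
    ("x", 0, 0.0, 0.5),
]
print(ips_estimate(logs, always_action_one))
```

    With a uniform logging policy over K actions the weight on a matching sample is K, so the weights, and with them the variance of the estimate, grow as the action space gets larger.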
    Model Zoo: A Growing "Brain" That Learns Continually. (arXiv:2106.03027v3 [cs.LG] UPDATED)
    This paper argues that continual learning methods can benefit by splitting the capacity of the learner across multiple models. We use statistical learning theory and experimental analysis to show how multiple tasks can interact with each other in a non-trivial fashion when a single model is trained on them. The generalization error on a particular task can improve when it is trained with synergistic tasks, but can also deteriorate when trained with competing tasks. This theory motivates our method named Model Zoo which, inspired from the boosting literature, grows an ensemble of small models, each of which is trained during one episode of continual learning. We demonstrate that Model Zoo obtains large gains in accuracy on a variety of continual learning benchmark problems. Code is available at https://github.com/grasp-lyrl/modelzoo_continual.  ( 2 min )
    Flowformer: Linearizing Transformers with Conservation Flows. (arXiv:2202.06258v2 [cs.LG] UPDATED)
    Transformers based on the attention mechanism have achieved impressive success in various areas. However, the attention mechanism has a quadratic complexity, significantly impeding Transformers from dealing with numerous tokens and scaling up to bigger models. Previous methods mainly utilize the similarity decomposition and the associativity of matrix multiplication to devise linear-time attention mechanisms. They avoid degeneration of attention to a trivial distribution by reintroducing inductive biases such as the locality, thereby at the expense of model generality and expressiveness. In this paper, we linearize Transformers free from specific inductive biases based on the flow network theory. We cast attention as the information flow aggregated from the sources (values) to the sinks (results) through the learned flow capacities (attentions). Within this framework, we apply the property of flow conservation into attention and propose the Flow-Attention mechanism of linear complexity. By respectively conserving the incoming flow of sinks for source competition and the outgoing flow of sources for sink allocation, Flow-Attention inherently generates informative attentions without using specific inductive biases. Empowered by the Flow-Attention, Flowformer yields strong performance in linear time for wide areas, including long sequence, time series, vision, natural language, and reinforcement learning. The code and settings are available at this repository: https://github.com/thuml/Flowformer.  ( 2 min )
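    The associativity trick behind the earlier linear-time attention methods mentioned in the abstract above can be checked numerically. The sketch below is a generic kernelized attention with the feature map elu(x) + 1, chosen here as an assumption; it is not Flow-Attention and does not implement flow conservation. It only shows that computing phi(K)^T V first gives the same result as forming the full n x n score matrix.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def phi(m):
    """Positive feature map elu(x) + 1, applied elementwise (an assumed choice)."""
    return [[x + 1.0 if x > 0 else math.exp(x) for x in row] for row in m]


def quadratic_attention(q, k, v):
    """Forms the full n x n score matrix: cost grows quadratically with n."""
    scores = matmul(phi(q), transpose(phi(k)))
    out = matmul(scores, v)
    return [[x / sum(s) for x in row] for row, s in zip(out, scores)]


def linear_attention(q, k, v):
    """Computes phi(K)^T V first: cost grows linearly with n."""
    pq, pk = phi(q), phi(k)
    kv = matmul(transpose(pk), v)            # d x d_v summary of keys and values
    k_sum = [sum(col) for col in zip(*pk)]   # column sums of phi(K), for normalization
    out = matmul(pq, kv)
    norms = [sum(x * y for x, y in zip(row, k_sum)) for row in pq]
    return [[x / z for x in row] for row, z in zip(out, norms)]


# Toy inputs: n = 3 tokens, d = 2 features.
q = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
k = [[1.0, 0.0], [-0.5, 0.5], [0.3, -1.2]]
v = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
print(quadratic_attention(q, k, v))
print(linear_attention(q, k, v))
```

    The two functions agree up to floating-point rounding; the difference is only in the order of multiplication and hence in the cost.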
    Graph Signal Reconstruction Techniques for IoT Air Pollution Monitoring Platforms. (arXiv:2201.00378v2 [eess.SP] UPDATED)
    Air pollution monitoring platforms play a very important role in preventing and mitigating the effects of pollution. Recent advances in the field of graph signal processing have made it possible to describe and analyze air pollution monitoring networks using graphs. One of the main applications is the reconstruction of the measured signal in a graph using a subset of sensors. Reconstructing the signal using information from sensor neighbors can help improve the quality of network data; examples include filling in missing data using correlated neighboring nodes and correcting a drifting sensor using more accurate neighboring sensors. This paper compares various types of graph signal reconstruction methods applied to real data sets from Spanish air pollution reference stations. The methods considered are Laplacian interpolation, graph signal processing low-pass based graph signal reconstruction, and kernel-based graph signal reconstruction; they are compared on actual air pollution data sets measuring O3, NO2, and PM10. The ability of the methods to reconstruct the signal of a pollutant is shown, as well as the computational cost of this reconstruction. The results indicate the superiority of kernel-based graph signal reconstruction methods, as well as the difficulty of scaling the methods to an air pollution monitoring network with a large number of low-cost sensors. However, we show that the scalability problem can be overcome with simple methods, such as partitioning the network using a clustering algorithm.  ( 2 min )
    FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting. (arXiv:2201.12740v3 [cs.LG] UPDATED)
    Although Transformer-based methods have significantly improved state-of-the-art results for long-term series forecasting, they are not only computationally expensive but, more importantly, unable to capture the global view of time series (e.g. overall trend). To address these problems, we propose to combine Transformer with the seasonal-trend decomposition method, in which the decomposition method captures the global profile of time series while Transformers capture more detailed structures. To further enhance the performance of Transformer for long-term prediction, we exploit the fact that most time series tend to have a sparse representation in a well-known basis such as the Fourier basis, and develop a frequency enhanced Transformer. Besides being more effective, the proposed method, termed Frequency Enhanced Decomposed Transformer (FEDformer), is more efficient than the standard Transformer, with complexity linear in the sequence length. Our empirical studies with six benchmark datasets show that compared with state-of-the-art methods, FEDformer can reduce prediction error by 14.8% and 22.6% for multivariate and univariate time series, respectively. Code is publicly available at https://github.com/MAZiqing/FEDformer.  ( 2 min )
    Memorize to Generalize: on the Necessity of Interpolation in High Dimensional Linear Regression. (arXiv:2202.09889v2 [stat.ML] UPDATED)
    We examine the necessity of interpolation in overparameterized models, that is, when achieving optimal predictive risk in machine learning problems requires (nearly) interpolating the training data. In particular, we consider simple overparameterized linear regression $y = X \theta + w$ with random design $X \in \mathbb{R}^{n \times d}$ under the proportional asymptotics $d/n \to \gamma \in (1, \infty)$. We precisely characterize how prediction (test) error necessarily scales with training error in this setting. An implication of this characterization is that as the label noise variance $\sigma^2 \to 0$, any estimator that incurs at least $\mathsf{c}\sigma^4$ training error for some constant $\mathsf{c}$ is necessarily suboptimal and will suffer growth in excess prediction error at least linear in the training error. Thus, optimal performance requires fitting training data to substantially higher accuracy than the inherent noise floor of the problem.  ( 2 min )
    Cyclical Focal Loss. (arXiv:2202.08978v2 [cs.CV] UPDATED)
    The cross-entropy softmax loss is the primary loss function used to train deep neural networks. On the other hand, the focal loss function has been demonstrated to provide improved performance when there is an imbalance in the number of training samples in each class, such as in long-tailed datasets. In this paper, we introduce a novel cyclical focal loss and demonstrate that it is a more universal loss function than cross-entropy softmax loss or focal loss. We describe the intuition behind the cyclical focal loss and our experiments provide evidence that cyclical focal loss provides superior performance for balanced, imbalanced, or long-tailed datasets. We provide numerous experimental results for CIFAR-10/CIFAR-100, ImageNet, balanced and imbalanced 4,000 training sample versions of CIFAR-10/CIFAR-100, and ImageNet-LT and Places-LT from the Open Long-Tailed Recognition (OLTR) challenge. Implementing the cyclical focal loss function requires only a few lines of code and does not increase training time. In the spirit of reproducibility, our code is available at https://github.com/lnsmith54/CFL.  ( 2 min )
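    As background for the abstract above, the two losses it compares against can be written per sample as follows. This is the standard focal loss, which down-weights well-classified examples by the factor (1 - p)^gamma; the cyclical schedule introduced in the paper is not described in the abstract and is deliberately not reproduced here. The function names and the default gamma = 2 are assumptions.

```python
import math


def cross_entropy(p_true):
    """Cross-entropy for one sample, given the predicted probability of the true class."""
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0):
    """Standard focal loss for one sample: -(1 - p)^gamma * log(p).

    gamma = 0 recovers cross-entropy; larger gamma shrinks the loss on
    examples the model already classifies confidently.
    """
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


for p in (0.1, 0.5, 0.9):
    print(p, cross_entropy(p), focal_loss(p))
```

    For a confidently correct prediction (p = 0.9) the focal loss is about one hundredth of the cross-entropy, while for a hard example (p = 0.1) it keeps most of its magnitude.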
    Wild ToFu: Improving Range and Quality of Indirect Time-of-Flight Depth with RGB Fusion in Challenging Environments. (arXiv:2112.03750v2 [cs.CV] UPDATED)
    Indirect Time-of-Flight (I-ToF) imaging is a widespread approach to depth estimation on mobile devices because of its small size and affordable price. Previous works have mainly focused on quality improvement for I-ToF imaging, especially mitigating the effect of Multi Path Interference (MPI). These investigations are typically done in specifically constrained scenarios at close distance, indoors and under little ambient light. Surprisingly little work has investigated I-ToF quality improvement in real-life scenarios where strong ambient light and far distances pose difficulties due to an extreme amount of induced shot noise and signal sparsity, caused by the attenuation with limited sensor power and light scattering. In this work, we propose a new learning-based end-to-end depth prediction network which takes noisy raw I-ToF signals as well as an RGB image and fuses their latent representations using a multi-step approach involving both implicit and explicit alignment to predict a high-quality, long-range depth map aligned to the RGB viewpoint. We test our approach on challenging real-world scenes and show more than 40% RMSE improvement on the final depth map compared to the baseline approach.  ( 2 min )
    Continual Repeated Annealed Flow Transport Monte Carlo. (arXiv:2201.13117v2 [stat.ML] UPDATED)
    We propose Continual Repeated Annealed Flow Transport Monte Carlo (CRAFT), a method that combines a sequential Monte Carlo (SMC) sampler (itself a generalization of Annealed Importance Sampling) with variational inference using normalizing flows. The normalizing flows are directly trained to transport between annealing temperatures using a KL divergence for each transition. This optimization objective is itself estimated using the normalizing flow/SMC approximation. We show conceptually and using multiple empirical examples that CRAFT improves on Annealed Flow Transport Monte Carlo (Arbel et al., 2021), on which it builds and also on Markov chain Monte Carlo (MCMC) based Stochastic Normalizing Flows (Wu et al., 2020). By incorporating CRAFT within particle MCMC, we show that such learnt samplers can achieve impressively accurate results on a challenging lattice field theory example.  ( 2 min )
  • Open

    mlf-core: a framework for deterministic machine learning. (arXiv:2104.07651v2 [cs.MS] UPDATED)
    Machine learning has shown extensive growth in recent years and is now routinely applied to sensitive areas. To allow appropriate verification of predictive models before deployment, models must be deterministic. However, major machine learning libraries default to the usage of non-deterministic algorithms based on atomic operations. Solely fixing all random seeds is not sufficient for deterministic machine learning. To overcome this shortcoming, various machine learning libraries released deterministic counterparts to the non-deterministic algorithms. We evaluated the effect of these algorithms on determinism and runtime. Based on these results, we formulated a set of requirements for deterministic machine learning and developed a new software solution, the mlf-core ecosystem, which aids machine learning projects to meet and keep these requirements. We applied mlf-core to develop deterministic models in various biomedical fields including a single cell autoencoder with TensorFlow, a PyTorch-based U-Net model for liver-tumor segmentation in CT scans, and a liver cancer classifier based on gene expression profiles with XGBoost.
    Feature Selection using e-values. (arXiv:2206.05391v2 [stat.ML] UPDATED)
    In the context of supervised parametric models, we introduce the concept of e-values. An e-value is a scalar quantity that represents the proximity of the sampling distribution of parameter estimates in a model trained on a subset of features to that of the model trained on all features (i.e. the full model). Under general conditions, a rank ordering of e-values separates models that contain all essential features from those that do not. The e-values are applicable to a wide range of parametric models. We use data depths and a fast resampling-based algorithm to implement a feature selection procedure using e-values, providing consistency results. For a $p$-dimensional feature space, this procedure requires fitting only the full model and evaluating $p+1$ models, as opposed to the traditional requirement of fitting and evaluating $2^p$ models. Through experiments across several model settings and synthetic and real datasets, we establish the e-values method as a promising general alternative to existing model-specific methods of feature selection.
    HyperImpute: Generalized Iterative Imputation with Automatic Model Selection. (arXiv:2206.07769v1 [stat.ML])
    Consider the problem of imputing missing values in a dataset. On the one hand, conventional approaches using iterative imputation benefit from the simplicity and customizability of learning conditional distributions directly, but suffer from the practical requirement for appropriate model specification of each and every variable. On the other hand, recent methods using deep generative modeling benefit from the capacity and efficiency of learning with neural network function approximators, but are often difficult to optimize and rely on stronger data assumptions. In this work, we study an approach that marries the advantages of both: We propose HyperImpute, a generalized iterative imputation framework for adaptively and automatically configuring column-wise models and their hyperparameters. Practically, we provide a concrete implementation with out-of-the-box learners, optimizers, simulators, and extensible interfaces. Empirically, we investigate this framework via comprehensive experiments and sensitivities on a variety of public datasets, and demonstrate its ability to generate accurate imputations relative to a strong suite of benchmarks. Contrary to recent work, we believe our findings constitute a strong defense of the iterative imputation paradigm.
    On the Surprising Behaviour of node2vec. (arXiv:2206.08252v1 [cs.LG])
    Graph embedding techniques are a staple of modern graph learning research. When using embeddings for downstream tasks such as classification, information about their stability and robustness, i.e., their susceptibility to sources of noise, stochastic effects, or specific parameter choices, becomes increasingly important. As one of the most prominent graph embedding schemes, we focus on node2vec and analyse its embedding quality from multiple perspectives. Our findings indicate that embedding quality is unstable with respect to parameter choices, and we propose strategies to remedy this in practice.
    The convergent Indian buffet process. (arXiv:2206.08002v1 [stat.ML])
    We propose a new Bayesian nonparametric prior for latent feature models, which we call the convergent Indian buffet process (CIBP). We show that under the CIBP, the number of latent features is distributed as a Poisson distribution with the mean monotonically increasing but converging to a certain value as the number of objects goes to infinity. That is, the expected number of features is bounded above even when the number of objects goes to infinity, unlike the standard Indian buffet process under which the expected number of features increases with the number of objects. We provide two alternative representations of the CIBP based on a hierarchical distribution and a completely random measure, respectively, which are of independent interest. The proposed CIBP is assessed on a high-dimensional sparse factor model.
    Causal discovery under a confounder blanket. (arXiv:2205.05715v2 [stat.ME] UPDATED)
    Inferring causal relationships from observational data is rarely straightforward, but the problem is especially difficult in high dimensions. For these applications, causal discovery algorithms typically require parametric restrictions or extreme sparsity constraints. We relax these assumptions and focus on an important but more specialized problem, namely recovering the causal order among a subgraph of variables known to descend from some (possibly large) set of confounding covariates, i.e. a confounder blanket. This is useful in many settings, for example when studying a dynamic biomolecular subsystem with genetic data providing background information. Under a structural assumption called the confounder blanket principle, which we argue is essential for tractable causal discovery in high dimensions, our method accommodates graphs of low or high sparsity while maintaining polynomial time complexity. We present a structure learning algorithm that is provably sound and complete with respect to a so-called lazy oracle. We design inference procedures with finite sample error control for linear and nonlinear systems, and demonstrate our approach on a range of simulated and real-world datasets. An accompanying R package, cbl, is available from CRAN.
    Three rates of convergence or separation via U-statistics in a dependent framework. (arXiv:2106.12796v2 [math.ST] UPDATED)
    Despite the ubiquity of U-statistics in modern Probability and Statistics, their non-asymptotic analysis in a dependent framework may have been overlooked. In a recent work, a new concentration inequality for U-statistics of order two for uniformly ergodic Markov chains has been proved. In this paper, we put this theoretical breakthrough into action by pushing further the current state of knowledge in three different active fields of research. First, we establish a new exponential inequality for the estimation of spectra of trace class integral operators with MCMC methods. The novelty is that this result holds for kernels with positive and negative eigenvalues, which is new as far as we know. In addition, we investigate generalization performance of online algorithms working with pairwise loss functions and Markov chain samples. We provide an online-to-batch conversion result by showing how we can extract a low risk hypothesis from the sequence of hypotheses generated by any online learner. We finally give a non-asymptotic analysis of a goodness-of-fit test on the density of the invariant measure of a Markov chain. We identify some classes of alternatives over which our test based on the $L_2$ distance has a prescribed power.
    Multimeasurement Generative Models. (arXiv:2112.09822v2 [stat.ML] UPDATED)
    We formally map the problem of sampling from an unknown distribution with a density in $\mathbb{R}^d$ to the problem of learning and sampling a smoother density in $\mathbb{R}^{Md}$ obtained by convolution with a fixed factorial kernel: the new density is referred to as M-density and the kernel as multimeasurement noise model (MNM). The M-density in $\mathbb{R}^{Md}$ is smoother than the original density in $\mathbb{R}^d$, easier to learn and sample from, yet for large $M$ the two problems are mathematically equivalent since clean data can be estimated exactly given a multimeasurement noisy observation using the Bayes estimator. To formulate the problem, we derive the Bayes estimator for Poisson and Gaussian MNMs in closed form in terms of the unnormalized M-density. This leads to a simple least-squares objective for learning parametric energy and score functions. We present various parametrization schemes of interest including one in which studying Gaussian M-densities directly leads to multidenoising autoencoders--this is the first theoretical connection made between denoising autoencoders and empirical Bayes in the literature. Samples in $\mathbb{R}^d$ are obtained by walk-jump sampling (Saremi & Hyvarinen, 2019) via underdamped Langevin MCMC (walk) to sample from M-density and the multimeasurement Bayes estimation (jump). We study permutation invariant Gaussian M-densities on MNIST, CIFAR-10, and FFHQ-256 datasets, and demonstrate the effectiveness of this framework for realizing fast-mixing stable Markov chains in high dimensions.
    Convergence of Policy Gradient for Entropy Regularized MDPs with Neural Network Approximation in the Mean-Field Regime. (arXiv:2201.07296v2 [math.OC] UPDATED)
    We study the global convergence of policy gradient for infinite-horizon, continuous state and action space, and entropy-regularized Markov decision processes (MDPs). We consider a softmax policy with (one-hidden layer) neural network approximation in a mean-field regime. Additional entropic regularization in the associated mean-field probability measure is added, and the corresponding gradient flow is studied in the 2-Wasserstein metric. We show that the objective function is increasing along the gradient flow. Further, we prove that if the regularization in terms of the mean-field measure is sufficient, the gradient flow converges exponentially fast to the unique stationary solution, which is the unique maximizer of the regularized MDP objective. Lastly, we study the sensitivity of the value function along the gradient flow with respect to regularization parameters and the initial condition. Our results rely on the careful analysis of the non-linear Fokker-Planck-Kolmogorov equation and extend the pioneering work of Mei et al. 2020 and Agarwal et al. 2020, which quantify the global convergence rate of policy gradient for entropy-regularized MDPs in the tabular setting.
    Neural tangent kernel analysis of shallow $\alpha$-Stable ReLU neural networks. (arXiv:2206.08065v1 [cs.LG])
    There is a recent literature on large-width properties of Gaussian neural networks (NNs), i.e. NNs whose weights are distributed according to Gaussian distributions. Two popular problems are: i) the study of the large-width behaviour of NNs, which provided a characterization of the infinitely wide limit of a rescaled NN in terms of a Gaussian process; ii) the study of the large-width training dynamics of NNs, which set forth an equivalence between training the rescaled NN and performing a kernel regression with a deterministic kernel referred to as the neural tangent kernel (NTK). In this paper, we consider these problems for $\alpha$-Stable NNs, which generalize Gaussian NNs by assuming that the NN's weights are distributed as $\alpha$-Stable distributions with $\alpha\in(0,2]$, i.e. distributions with heavy tails. For shallow $\alpha$-Stable NNs with a ReLU activation function, we show that if the NN's width goes to infinity then a rescaled NN converges weakly to an $\alpha$-Stable process, i.e. a stochastic process with $\alpha$-Stable finite-dimensional distributions. As a novelty with respect to the Gaussian setting, in the $\alpha$-Stable setting the choice of the activation function affects the scaling of the NN, that is: to achieve the infinitely wide $\alpha$-Stable process, the ReLU function requires an additional logarithmic scaling with respect to sub-linear functions. Then, our main contribution is the NTK analysis of shallow $\alpha$-Stable ReLU-NNs, which leads to an equivalence between training a rescaled NN and performing a kernel regression with an $(\alpha/2)$-Stable random kernel. The randomness of such a kernel is a further novelty with respect to the Gaussian setting, that is: in the $\alpha$-Stable setting the randomness of the NN at initialization does not vanish in the NTK analysis, thus inducing a distribution for the kernel of the underlying kernel regression.
    Deep Learning for Time Series Forecasting: Tutorial and Literature Survey. (arXiv:2004.10240v2 [cs.LG] UPDATED)
    Deep learning based forecasting methods have become the methods of choice in many applications of time series prediction or forecasting, often outperforming other approaches. Consequently, over the last few years, these methods have become ubiquitous in large-scale industrial forecasting applications and have consistently ranked among the best entries in forecasting competitions (e.g., M4 and M5). This practical success has further increased the academic interest in understanding and improving deep forecasting methods. In this article we provide an introduction and overview of the field: we present important building blocks for deep forecasting in some depth; using these building blocks, we then survey the breadth of the recent deep forecasting literature.
    Pareto Invariant Risk Minimization. (arXiv:2206.07766v1 [cs.LG])
    Despite the success of invariant risk minimization (IRM) in tackling the Out-of-Distribution generalization problem, IRM can compromise the optimality when applied in practice. The practical variants of IRM, e.g., IRMv1, have been shown to have significant gaps with IRM and thus could fail to capture the invariance even in simple problems. Moreover, the optimization procedure in IRMv1 involves two intrinsically conflicting objectives, and often requires careful tuning for the objective weights. To remedy the above issues, we reformulate IRM as a multi-objective optimization problem, and propose a new optimization scheme for IRM, called PAreto Invariant Risk Minimization (PAIR). PAIR can adaptively adjust the optimization direction under the objective conflicts. Furthermore, we show PAIR can empower the practical IRM variants to overcome the barriers with the original IRM when provided with proper guidance. We conduct experiments with ColoredMNIST to confirm our theory and the effectiveness of PAIR.
    Squeeze All: Novel Estimator and Self-Normalized Bound for Linear Contextual Bandits. (arXiv:2206.05404v2 [stat.ML] UPDATED)
    We propose a novel algorithm for linear contextual bandits with $O(\sqrt{dT \log T})$ regret bound, where $d$ is the dimension of contexts and $T$ is the time horizon. Our proposed algorithm is equipped with a novel estimator in which exploration is embedded through explicit randomization. Depending on the randomization, our proposed estimator takes contribution either from contexts of all arms or from selected contexts. We establish a self-normalized bound for our estimator, which allows a novel decomposition of the cumulative regret into additive dimension-dependent terms instead of multiplicative terms. We also prove a novel lower bound of $\Omega(\sqrt{dT})$ under our problem setting. Hence, the regret of our proposed algorithm matches the lower bound up to logarithmic factors. The numerical experiments support the theoretical guarantees and show that our proposed method outperforms the existing linear bandit algorithms.
    Multi-Objective Bayesian Optimization over High-Dimensional Search Spaces. (arXiv:2109.10964v4 [cs.LG] UPDATED)
    Many real world scientific and industrial applications require optimizing multiple competing black-box objectives. When the objectives are expensive-to-evaluate, multi-objective Bayesian optimization (BO) is a popular approach because of its high sample efficiency. However, even with recent methodological advances, most existing multi-objective BO methods perform poorly on search spaces with more than a few dozen parameters and rely on global surrogate models that scale cubically with the number of observations. In this work we propose MORBO, a scalable method for multi-objective BO over high-dimensional search spaces. MORBO identifies diverse globally optimal solutions by performing BO in multiple local regions of the design space in parallel using a coordinated strategy. We show that MORBO significantly advances the state-of-the-art in sample efficiency for several high-dimensional synthetic problems and real world applications, including an optical display design problem and a vehicle design problem with 146 and 222 parameters, respectively. On these problems, where existing BO algorithms fail to scale and perform well, MORBO provides practitioners with order-of-magnitude improvements in sample efficiency over the current approach.
    On Privacy and Personalization in Cross-Silo Federated Learning. (arXiv:2206.07902v1 [cs.LG])
    While the application of differential privacy (DP) has been well-studied in cross-device federated learning (FL), there is a lack of work considering DP for cross-silo FL, a setting characterized by a limited number of clients each containing many data subjects. In cross-silo FL, usual notions of client-level privacy are less suitable as real-world privacy regulations typically concern in-silo data subjects rather than the silos themselves. In this work, we instead consider the more realistic notion of silo-specific item-level privacy, where silos set their own privacy targets for their local examples. Under this setting, we reconsider the roles of personalization in federated learning. In particular, we show that mean-regularized multi-task learning (MR-MTL), a simple personalization framework, is a strong baseline for cross-silo FL: under stronger privacy, silos are further incentivized to "federate" with each other to mitigate DP noise, resulting in consistent improvements relative to standard baseline methods. We provide a thorough empirical study of competing methods as well as a theoretical characterization of MR-MTL for a mean estimation problem, highlighting the interplay between privacy and cross-silo data heterogeneity. Our work serves to establish baselines for private cross-silo FL as well as identify key directions of future work in this area.
    Large-Scale Differentiable Causal Discovery of Factor Graphs. (arXiv:2206.07824v1 [stat.ML])
    A common theme in causal inference is learning causal relationships between observed variables, also known as causal discovery. This is usually a daunting task, given the large number of candidate causal graphs and the combinatorial nature of the search space. Perhaps for this reason, most research has so far focused on relatively small causal graphs, with up to hundreds of nodes. However, recent advances in fields like biology enable generating experimental data sets with thousands of interventions followed by rich profiling of thousands of variables, raising the opportunity and urgent need for large causal graph models. Here, we introduce the notion of factor directed acyclic graphs (f-DAGs) as a way to restrict the search space to non-linear low-rank causal interaction models. Combining this novel structural assumption with recent advances that bridge the gap between causal discovery and continuous optimization, we achieve causal discovery on thousands of variables. Additionally, as a model for the impact of statistical noise on this estimation procedure, we study a model of edge perturbations of the f-DAG skeleton based on random graphs and quantify the effect of such perturbations on the f-DAG rank. This theoretical analysis suggests that the set of candidate f-DAGs is much smaller than the whole DAG space and thus more statistically robust in the high-dimensional regime where the underlying skeleton is hard to assess. We propose Differentiable Causal Discovery of Factor Graphs (DCD-FG), a scalable implementation of f-DAG constrained causal discovery for high-dimensional interventional data. DCD-FG uses a Gaussian non-linear low-rank structural equation model and shows significant improvements compared to state-of-the-art methods in both simulations as well as a recent large-scale single-cell RNA sequencing data set with hundreds of genetic interventions.
    A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes. (arXiv:2111.06784v4 [cs.LG] UPDATED)
    We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs), where the evaluation policy depends only on observable variables and the behavior policy depends on unobservable latent variables. Existing works either assume no unmeasured confounders, or focus on settings where both the observation and the state spaces are tabular. In this work, we first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy's value and the observed data distribution. We next propose minimax estimation methods for learning these bridge functions, and construct three estimators based on these estimated bridge functions, corresponding to a value function-based estimator, a marginalized importance sampling estimator, and a doubly-robust estimator. Our proposal permits general function approximation and is thus applicable to settings with continuous or large observation/state spaces. The nonasymptotic and asymptotic properties of the proposed estimators are investigated in detail.
    Scalable First-Order Bayesian Optimization via Structured Automatic Differentiation. (arXiv:2206.08366v1 [cs.LG])
    Bayesian Optimization (BO) has shown great promise for the global optimization of functions that are expensive to evaluate, but despite many successes, standard approaches can struggle in high dimensions. To improve the performance of BO, prior work suggested incorporating gradient information into a Gaussian process surrogate of the objective, giving rise to kernel matrices of size $nd \times nd$ for $n$ observations in $d$ dimensions. Naively multiplying with (resp. inverting) these matrices requires $\mathcal{O}(n^2d^2)$ (resp. $\mathcal{O}(n^3d^3)$) operations, which becomes infeasible for moderate dimensions and sample sizes. Here, we observe that a wide range of kernels gives rise to structured matrices, enabling an exact $\mathcal{O}(n^2d)$ matrix-vector multiply for gradient observations and $\mathcal{O}(n^2d^2)$ for Hessian observations. Beyond canonical kernel classes, we derive a programmatic approach to leveraging this type of structure for transformations and combinations of the discussed kernel classes, which constitutes a structure-aware automatic differentiation algorithm. Our methods apply to virtually all canonical kernels and automatically extend to complex kernels, like the neural network, radial basis function network, and spectral mixture kernels without any additional derivations, enabling flexible, problem-dependent modeling while scaling first-order BO to high $d$.
    Off-Policy Evaluation for Large Action Spaces via Embeddings. (arXiv:2202.06317v2 [cs.LG] UPDATED)
    Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems, since it enables offline evaluation of new policies using only historic log data. Unfortunately, when the number of actions is large, existing OPE estimators -- most of which are based on inverse propensity score weighting -- degrade severely and can suffer from extreme bias and variance. This foils the use of OPE in many applications from recommender systems to language models. To overcome this issue, we propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space. We characterize the bias, variance, and mean squared error of the proposed estimator and analyze the conditions under which the action embedding provides statistical benefits over conventional estimators. In addition to the theoretical analysis, we find that the empirical performance improvement can be substantial, enabling reliable OPE even when existing estimators collapse due to a large number of actions.
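    For reference, the conventional inverse propensity score (IPS) estimator that the abstract above contrasts against can be sketched as follows. This is the textbook baseline, not the paper's proposed estimator; the function and argument names are hypothetical. Each logged reward is reweighted by the ratio of the target policy's probability to the logging policy's probability for the logged action.

```python
import numpy as np


def ips_estimate(rewards, target_probs, logging_probs):
    """Vanilla IPS estimate of a target policy's value from logged bandit data.

    rewards[i]       : observed reward for the logged action in round i
    target_probs[i]  : probability the target policy assigns to that action
    logging_probs[i] : probability the logging policy assigned to it (must be > 0)
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(target_probs, dtype=float) / np.asarray(logging_probs, dtype=float)
    return float(np.mean(weights * rewards))
```

    With many actions the logging probabilities become small, so individual weights can become very large; this is the variance problem the abstract describes. The marginalized weights it mentions would, under the paper's assumptions, be ratios over action embeddings rather than over individual actions.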
    Contrasting random and learned features in deep Bayesian linear regression. (arXiv:2203.00573v2 [cs.LG] UPDATED)
    Understanding how feature learning affects generalization is among the foremost goals of modern deep learning theory. Here, we study how the ability to learn representations affects the generalization performance of a simple class of models: deep Bayesian linear neural networks trained on unstructured Gaussian data. By comparing deep random feature models to deep networks in which all layers are trained, we provide a detailed characterization of the interplay between width, depth, data density, and prior mismatch. We show that both models display sample-wise double-descent behavior in the presence of label noise. Random feature models can also display model-wise double-descent if there are narrow bottleneck layers, while deep networks do not show these divergences. Random feature models can have particular widths that are optimal for generalization at a given data density, while making neural networks as wide or as narrow as possible is always optimal. Moreover, we show that the leading-order correction to the kernel-limit learning curve cannot distinguish between random feature models and deep networks in which all layers are trained. Taken together, our findings begin to elucidate how architectural details affect generalization performance in this simple class of deep regression models.
    Towards Robust and Reproducible Active Learning Using Neural Networks. (arXiv:2002.09564v3 [cs.LG] UPDATED)
    Active learning (AL) is a promising ML paradigm that has the potential to parse through large unlabeled data and help reduce annotation cost in domains where labeling data can be prohibitive. Recently proposed neural network based AL methods use different heuristics to accomplish this goal. In this study, we demonstrate that under identical experimental settings, different types of AL algorithms (uncertainty based, diversity based, and committee based) produce an inconsistent gain over the random sampling baseline. Through a variety of experiments, controlling for sources of stochasticity, we show that variance in performance metrics achieved by AL algorithms can lead to results that are not consistent with the previously reported results. We also found that under strong regularization, AL methods show marginal or no advantage over the random sampling baseline under a variety of experimental conditions. Finally, we conclude with a set of recommendations on how to assess the results using a new AL algorithm to ensure results are reproducible and robust under changes in experimental conditions. We believe our findings and recommendations will help advance reproducible research in AL using neural networks. To facilitate AL evaluations, we open source our code at https://github.com/PrateekMunjal/TorchAL
    General Cyclical Training of Neural Networks. (arXiv:2202.08835v2 [cs.LG] UPDATED)
    This paper describes the principle of "General Cyclical Training" in machine learning, where training starts and ends with "easy training" and the "hard training" happens during the middle epochs. We propose several manifestations for training neural networks, including algorithmic examples (via hyper-parameters and loss functions), data-based examples, and model-based examples. Specifically, we introduce several novel techniques: cyclical weight decay, cyclical batch size, cyclical focal loss, cyclical softmax temperature, cyclical data augmentation, cyclical gradient clipping, and cyclical semi-supervised learning. In addition, we demonstrate that cyclical weight decay, cyclical softmax temperature, and cyclical gradient clipping (as three examples of this principle) are beneficial in the test accuracy performance of a trained model. Furthermore, we discuss model-based examples (such as pretraining and knowledge distillation) from the perspective of general cyclical training and recommend some changes to the typical training methodology. In summary, this paper defines the general cyclical training concept and discusses several specific ways in which this concept can be applied to training neural networks. In the spirit of reproducibility, the code used in our experiments is available at https://github.com/lnsmith54/CFL.
    Learning Multi-Task Gaussian Process Over Heterogeneous Input Domains. (arXiv:2202.12636v2 [stat.ML] UPDATED)
    Multi-task Gaussian process (MTGP) is a well-known non-parametric Bayesian model for learning correlated tasks effectively by transferring knowledge across tasks. But current MTGPs are usually limited to the multi-task scenario defined in the same input domain, leaving no space for tackling the heterogeneous case, i.e., the case where the features of input domains vary over tasks. To this end, this paper presents a novel heterogeneous stochastic variational linear model of coregionalization (HSVLMC) for simultaneously learning tasks with varied input domains. In particular, we develop a stochastic variational framework with Bayesian calibration that (i) takes into account the effect of dimensionality reduction raised by domain mappings in order to achieve effective input alignment; and (ii) employs a residual modeling strategy to leverage the inductive bias brought by prior domain mappings for better model inference. Finally, the superiority of the proposed model over existing LMC models has been extensively verified on diverse heterogeneous multi-task cases and a practical multi-fidelity steam turbine exhaust problem.
    Unlocking High-Accuracy Differentially Private Image Classification through Scale. (arXiv:2204.13650v2 [cs.LG] UPDATED)
    Differential Privacy (DP) provides a formal privacy guarantee preventing adversaries with access to a machine learning model from extracting information about individual training points. Differentially Private Stochastic Gradient Descent (DP-SGD), the most popular DP training method for deep learning, realizes this protection by injecting noise during training. However, previous works have found that DP-SGD often leads to a significant degradation in performance on standard image classification benchmarks. Furthermore, some authors have postulated that DP-SGD inherently performs poorly on large models, since the norm of the noise required to preserve privacy is proportional to the model dimension. In contrast, we demonstrate that DP-SGD on over-parameterized models can perform significantly better than previously thought. Combining careful hyper-parameter tuning with simple techniques to ensure signal propagation and improve the convergence rate, we obtain a new SOTA without extra data on CIFAR-10 of 81.4% under $(8, 10^{-5})$-DP using a 40-layer Wide-ResNet, improving over the previous SOTA of 71.7%. When fine-tuning a pre-trained NFNet-F3, we achieve a remarkable 83.8% top-1 accuracy on ImageNet under $(0.5, 8 \cdot 10^{-7})$-DP. We also achieve 86.7% top-1 accuracy under $(8, 8 \cdot 10^{-7})$-DP, which is just 4.3% below the current non-private SOTA for this task. We believe our results are a significant step towards closing the accuracy gap between private and non-private image classification.
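    The noise injection described above can be illustrated with a minimal sketch of one step of the standard DP-SGD recipe: clip each per-example gradient to a fixed L2 norm, sum, add Gaussian noise scaled to the clipping norm, then average. This is not the paper's training code and it does no privacy accounting; the function name and all parameter names are hypothetical.

```python
import numpy as np


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update on a flat parameter vector (illustrative only)."""
    grads = np.asarray(per_example_grads, dtype=float)        # shape (batch, dim)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)      # per-example L2 norms
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale                                   # each row has norm <= clip_norm
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / grads.shape[0]
    return params - lr * noisy_mean
```

    Because the noise is added per coordinate, its norm grows with the number of parameters, which is the concern about large models that the abstract mentions.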
    User Engagement and Churn in Mobile Health Applications. (arXiv:2206.08178v1 [stat.ML])
    Mobile health apps are revolutionizing the healthcare ecosystem by improving communication, efficiency, and quality of service. In low- and middle-income countries, they also play a unique role as a source of information about health outcomes and behaviors of patients and healthcare workers, while providing a suitable channel to deliver both personalized and collective policy interventions. We propose a framework to study user engagement with mobile health, focusing on healthcare workers and digital health apps designed to support them in resource-poor settings. The behavioral logs produced by these apps can be transformed into daily time series characterizing each user's activity. We use probabilistic and survival analysis to build multiple personalized measures of meaningful engagement, which could serve to tailor content and digital interventions suiting each health worker's specific needs. Special attention is given to the problem of detecting churn, understood as a marker of complete disengagement. We discuss the application of our methods to the Indian and Ethiopian users of the Safe Delivery App, a capacity-building tool for skilled birth attendants. This work represents an important step towards a full characterization of user engagement in mobile health applications, which can significantly enhance the abilities of health workers and, ultimately, save lives.
    Maximum Likelihood Training for Score-Based Diffusion ODEs by High-Order Denoising Score Matching. (arXiv:2206.08265v1 [stat.ML])
    Score-based generative models have excellent performance in terms of generation quality and likelihood. They model the data distribution by matching a parameterized score network with first-order data score functions. The score network can be used to define an ODE ("score-based diffusion ODE") for exact likelihood evaluation. However, the relationship between the likelihood of the ODE and the score matching objective is unclear. In this work, we prove that matching the first-order score is not sufficient to maximize the likelihood of the ODE, by showing a gap between the maximum likelihood and score matching objectives. To fill up this gap, we show that the negative likelihood of the ODE can be bounded by controlling the first, second, and third-order score matching errors; and we further present a novel high-order denoising score matching method to enable maximum likelihood training of score-based diffusion ODEs. Our algorithm guarantees that the higher-order matching error is bounded by the training error and the lower-order errors. We empirically observe that by high-order score matching, score-based diffusion ODEs achieve better likelihood on both synthetic data and CIFAR-10, while retaining the high generation quality.
    Deep Bayesian inference for seismic imaging with tasks. (arXiv:2110.04825v3 [physics.geo-ph] UPDATED)
    We propose to use techniques from Bayesian inference and deep neural networks to translate uncertainty in seismic imaging to uncertainty in tasks performed on the image, such as horizon tracking. Seismic imaging is an ill-posed inverse problem because of bandwidth and aperture limitations, which is hampered by the presence of noise and linearization errors. Many regularization methods, such as transform-domain sparsity promotion, have been designed to deal with the adverse effects of these errors. However, these methods run the risk of biasing the solution and do not provide information on uncertainty in the image space and how this uncertainty impacts certain tasks on the image. A systematic approach is proposed to translate uncertainty due to noise in the data to confidence intervals of automatically tracked horizons in the image. The uncertainty is characterized by a convolutional neural network (CNN), and to assess these uncertainties, samples are drawn from the posterior distribution of the CNN weights, which are used to parameterize the image. Compared to traditional priors, it is argued in the literature that these CNNs introduce a flexible inductive bias that is a surprisingly good fit for a diverse set of problems. The method of stochastic gradient Langevin dynamics is employed to sample from the posterior distribution. This method is designed to handle large scale Bayesian inference problems with computationally expensive forward operators as in seismic imaging. Aside from offering a robust alternative to the maximum a posteriori estimate, which is prone to overfitting, access to these samples allows us to translate uncertainty in the image, due to noise in the data, to uncertainty on the tracked horizons. For instance, it admits estimates for the pointwise standard deviation on the image and for confidence intervals on its automatically tracked horizons.
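    The Langevin update behind the sampler mentioned above can be sketched on a toy problem. This is a minimal illustration, not the paper's seismic setup: it uses the exact gradient of a log-posterior, whereas stochastic gradient Langevin dynamics would substitute a minibatch estimate, and it uses a constant step size. The function name and its arguments are hypothetical.

```python
import numpy as np


def langevin_sample(grad_log_post, theta0, step_size, num_steps, seed=0):
    """Draw correlated samples with the update
    theta <- theta + (step/2) * grad log p(theta | data) + sqrt(step) * N(0, I).
    """
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    samples = np.empty((num_steps,) + theta.shape)
    for t in range(num_steps):
        noise = rng.normal(size=theta.shape)
        theta = theta + 0.5 * step_size * grad_log_post(theta) + np.sqrt(step_size) * noise
        samples[t] = theta
    return samples


# Toy usage: the target is a standard normal, whose log-density has gradient -theta.
samples = langevin_sample(lambda th: -th, np.zeros(1), step_size=0.05, num_steps=20000)
pointwise_std = samples.std(axis=0)  # analogous to a pointwise standard deviation map
```

    In the setting of the abstract, `theta` would be the CNN weights, and statistics such as the pointwise standard deviation would be computed on the images generated from the sampled weights.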
    Solving Inverse Problems in Medical Imaging with Score-Based Generative Models. (arXiv:2111.08005v2 [eess.IV] UPDATED)
    Reconstructing medical images from partial measurements is an important inverse problem in Computed Tomography (CT) and Magnetic Resonance Imaging (MRI). Existing solutions based on machine learning typically train a model to directly map measurements to medical images, leveraging a training dataset of paired images and measurements. These measurements are typically synthesized from images using a fixed physical model of the measurement process, which hinders the generalization capability of models to unknown measurement processes. To address this issue, we propose a fully unsupervised technique for inverse problem solving, leveraging the recently introduced score-based generative models. Specifically, we first train a score-based generative model on medical images to capture their prior distribution. Given measurements and a physical model of the measurement process at test time, we introduce a sampling method to reconstruct an image consistent with both the prior and the observed measurements. Our method does not assume a fixed measurement process during training, and can thus be flexibly adapted to different measurement processes at test time. Empirically, we observe comparable or better performance to supervised learning techniques in several medical imaging tasks in CT and MRI, while demonstrating significantly better generalization to unknown measurement processes.
    Tracking Most Significant Arm Switches in Bandits. (arXiv:2112.13838v6 [cs.LG] UPDATED)
    In bandits with distribution shifts, one aims to automatically adapt to unknown changes in reward distribution, and restart exploration when necessary. While this problem has been studied for many years, a recent breakthrough of Auer et al. (2018, 2019) provides the first adaptive procedure to guarantee an optimal (dynamic) regret $\sqrt{LT}$, for $T$ rounds, and an unknown number $L$ of changes. However, while this rate is tight in the worst case, it remained open whether faster rates are possible, without prior knowledge, if few changes in distribution are actually severe. To resolve this question, we propose a new notion of significant shift, which only counts very severe changes that clearly necessitate a restart: roughly, these are changes involving not only best arm switches, but also involving large aggregate differences in reward over time. Thus, our resulting procedure adaptively achieves rates always faster (sometimes significantly) than $O(\sqrt{ST})$, where $S\ll L$ only counts best arm switches, while at the same time, always faster than the optimal $O(V^{\frac{1}{3}}T^{\frac{2}{3}})$ when expressed in terms of total variation $V$ (which aggregates differences over time). Our results are expressed in enough generality to also capture non-stochastic adversarial settings.
    Memorize to Generalize: on the Necessity of Interpolation in High Dimensional Linear Regression. (arXiv:2202.09889v2 [stat.ML] UPDATED)
    We examine the necessity of interpolation in overparameterized models, that is, when achieving optimal predictive risk in machine learning problems requires (nearly) interpolating the training data. In particular, we consider simple overparameterized linear regression $y = X \theta + w$ with random design $X \in \mathbb{R}^{n \times d}$ under the proportional asymptotics $d/n \to \gamma \in (1, \infty)$. We precisely characterize how prediction (test) error necessarily scales with training error in this setting. An implication of this characterization is that as the label noise variance $\sigma^2 \to 0$, any estimator that incurs at least $\mathsf{c}\sigma^4$ training error for some constant $\mathsf{c}$ is necessarily suboptimal and will suffer growth in excess prediction error at least linear in the training error. Thus, optimal performance requires fitting training data to substantially higher accuracy than the inherent noise floor of the problem.
    Pythae: Unifying Generative Autoencoders in Python -- A Benchmarking Use Case. (arXiv:2206.08309v1 [cs.LG])
    In recent years, deep generative models have attracted increasing interest due to their capacity to model complex distributions. Among those models, variational autoencoders have gained popularity as they have proven both to be computationally efficient and yield impressive results in multiple fields. Following this breakthrough, extensive research has been done in order to improve the original publication, resulting in a variety of different VAE models in response to different tasks. In this paper we present Pythae, a versatile open-source Python library providing both a unified implementation and a dedicated framework allowing straightforward, reproducible and reliable use of generative autoencoder models. We then propose to use this library to perform a case study benchmark where we present and compare 19 generative autoencoder models representative of some of the main improvements on downstream tasks such as image reconstruction, generation, classification, clustering and interpolation. The open-source library can be found at https://github.com/clementchadebec/benchmark_VAE.
    A Tree-based Model Averaging Approach for Personalized Treatment Effect Estimation from Heterogeneous Data Sources. (arXiv:2103.06261v3 [stat.ML] UPDATED)
    Accurately estimating personalized treatment effects within a study site (e.g., a hospital) has been challenging due to limited sample size. Furthermore, privacy considerations and lack of resources prevent a site from leveraging subject-level data from other sites. We propose a tree-based model averaging approach to improve the estimation accuracy of conditional average treatment effects (CATE) at a target site by leveraging models derived from other potentially heterogeneous sites, without them sharing subject-level data. To our best knowledge, there is no established model averaging approach for distributed data with a focus on improving the estimation of treatment effects. Specifically, under distributed data networks, our framework provides an interpretable tree-based ensemble of CATE estimators that joins models across study sites, while actively modeling the heterogeneity in data sources through site partitioning. The performance of this approach is demonstrated by a real-world study of the causal effects of oxygen therapy on hospital survival rate and backed up by comprehensive simulation results.
    Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them. (arXiv:2107.11630v2 [cs.LG] UPDATED)
    Making classifiers robust to adversarial examples is hard. Thus, many defenses tackle the seemingly easier task of detecting perturbed inputs. We show a barrier towards this goal. We prove a general hardness reduction between detection and classification of adversarial examples: given a robust detector for attacks at distance $\epsilon$ (in some metric), we can build a similarly robust (but inefficient) classifier for attacks at distance $\epsilon/2$. Our reduction is computationally inefficient, and thus cannot be used to build practical classifiers. Instead, it is a useful sanity check to test whether empirical detection results imply something much stronger than the authors presumably anticipated. To illustrate, we revisit 13 detector defenses. For 11/13 cases, we show that the claimed detection results would imply an inefficient classifier with robustness far beyond the state-of-the-art.
    LSB: Local Self-Balancing MCMC in Discrete Spaces. (arXiv:2109.03867v3 [cs.AI] UPDATED)
    We present the Local Self-Balancing sampler (LSB), a local Markov Chain Monte Carlo (MCMC) method for sampling in purely discrete domains, which is able to autonomously adapt to the target distribution and to reduce the number of target evaluations required to converge. LSB is based on (i) a parametrization of locally balanced proposals, (ii) a newly proposed objective function based on mutual information and (iii) a self-balancing learning procedure, which minimises the proposed objective to update the proposal parameters. Experiments on energy-based models and Markov networks show that LSB converges using a smaller number of queries to the oracle distribution compared to recent local MCMC samplers.
    An Asymptotic Test for Conditional Independence using Analytic Kernel Embeddings. (arXiv:2110.14868v2 [stat.ML] UPDATED)
    We propose a new conditional dependence measure and a statistical test for conditional independence. The measure is based on the difference between analytic kernel embeddings of two well-suited distributions evaluated at a finite set of locations. We obtain its asymptotic distribution under the null hypothesis of conditional independence and design a consistent statistical test from it. We conduct a series of experiments showing that our new test outperforms state-of-the-art methods both in terms of type-I and type-II errors even in the high dimensional setting.
    On Private Online Convex Optimization: Optimal Algorithms in $\ell_p$-Geometry and High Dimensional Contextual Bandits. (arXiv:2206.08111v1 [cs.LG])
    Differentially private (DP) stochastic convex optimization (SCO) is ubiquitous in trustworthy machine learning algorithm design. This paper studies the DP-SCO problem with streaming data that are sampled from a distribution and arrive sequentially. We also consider the continual release model, where parameters related to private information are updated and released upon each new data point, a setting often referred to as online algorithms. Although numerous algorithms have been developed to achieve the optimal excess risks in different $\ell_p$ norm geometries, none of the existing ones can be adapted to the streaming and continual release setting. To address the challenge of online convex optimization with privacy protection, we propose a private variant of the online Frank-Wolfe algorithm with recursive gradients for variance reduction to update and reveal the parameters upon each data point. Combined with the adaptive differential privacy analysis, our online algorithm achieves in linear time the optimal excess risk when $1<p\leq 2$ and the state-of-the-art excess risk meeting the non-private lower ones when $2<p\leq\infty$. Our algorithm can also be extended to the case $p=1$ to achieve nearly dimension-independent excess risk. While previous variance reduction results on recursive gradient have a theoretical guarantee only in the independent and identically distributed sample setting, we establish such a guarantee in a non-stationary setting. To demonstrate the virtues of our method, we design the first DP algorithm for high-dimensional generalized linear bandits with logarithmic regret. Comparative experiments with a variety of DP-SCO and DP-Bandit algorithms exhibit the efficacy and utility of the proposed algorithms.
    Learning Physics between Digital Twins with Low-Fidelity Models and Physics-Informed Gaussian Processes. (arXiv:2206.08201v1 [stat.ML])
    A digital twin is a computer model that represents an individual, for example, a component, a patient or a process. In many situations, we want to gain knowledge about an individual from its data while incorporating imperfect physical knowledge and also learn from data from other individuals. In this paper, we introduce and demonstrate a fully Bayesian methodology for learning between digital twins in a setting where the physical parameters of each individual are of interest. For each individual, the methodology is based on Bayesian calibration with model discrepancy. Through the discrepancy, modelled as a Gaussian process, the imperfect low-fidelity physical model is accounted for. Using ideas from Bayesian hierarchical models, a joint probabilistic model of digital twins is constructed by connecting them through a new level in the hierarchy. For the physical parameters, the methodology can be seen as using a prior distribution in the individual model that is the posterior of the corresponding hyperparameter in the joint model. For learning the imperfect physics between individuals, two approaches are introduced: one that assumes the same discrepancy for all individuals and one that can be seen as using a prior learned from all individuals for the parameters of the Gaussian processes representing the discrepancies. Based on recent advances related to physics-informed priors, Hamiltonian Monte Carlo methods and using these for inverse problems, we set up an inference methodology that allows our approach to be computationally feasible also for physical models based on partial differential equations and individual data that are not aligned. The methodology is demonstrated in two synthetic case studies, a toy example previously used in the literature extended to more individuals and an example based on a cardiovascular differential equation model relevant for the treatment of hypertension.
    Deep Reference Priors: What is the best way to pretrain a model?. (arXiv:2202.00187v2 [stat.ML] UPDATED)
    What is the best way to exploit extra data -- be it unlabeled data from the same task, or labeled data from a related task -- to learn a given task? This paper formalizes the question using the theory of reference priors. Reference priors are objective, uninformative Bayesian priors that maximize the mutual information between the task and the weights of the model. Such priors enable the task to maximally affect the Bayesian posterior, e.g., reference priors depend upon the number of samples available for learning the task and for very small sample sizes, the prior puts more probability mass on low-complexity models in the hypothesis space. This paper presents the first demonstration of reference priors for medium-scale deep networks and image-based data. We develop generalizations of reference priors and demonstrate applications to two problems. First, by using unlabeled data to compute the reference prior, we develop new Bayesian semi-supervised learning methods that remain effective even with very few samples per class. Second, by using labeled data from the source task to compute the reference prior, we develop a new pretraining method for transfer learning that allows data from the target task to maximally affect the Bayesian posterior. Empirical validation of these methods is conducted on image classification datasets. Code is available at https://github.com/grasp-lyrl/deep_reference_priors.
    Generalization Bounds via Convex Analysis. (arXiv:2202.04985v2 [stat.ML] UPDATED)
    Since the celebrated works of Russo and Zou (2016,2019) and Xu and Raginsky (2017), it has been well known that the generalization error of supervised learning algorithms can be bounded in terms of the mutual information between their input and the output, given that the loss of any fixed hypothesis has a subgaussian tail. In this work, we generalize this result beyond the standard choice of Shannon's mutual information to measure the dependence between the input and the output. Our main result shows that it is indeed possible to replace the mutual information by any strongly convex function of the joint input-output distribution, with the subgaussianity condition on the losses replaced by a bound on an appropriately chosen norm capturing the geometry of the dependence measure. This allows us to derive a range of generalization bounds that are either entirely new or strengthen previously known ones. Examples include bounds stated in terms of $p$-norm divergences and the Wasserstein-2 distance, which are respectively applicable for heavy-tailed loss distributions and highly smooth loss functions. Our analysis is entirely based on elementary tools from convex analysis by tracking the growth of a potential function associated with the dependence measure and the loss function.
    Neural net modeling of equilibria in NSTX-U. (arXiv:2202.13915v2 [physics.plasm-ph] UPDATED)
    Neural networks (NNs) offer a path towards synthesizing and interpreting data on faster timescales than traditional physics-informed computational models. In this work we develop two neural networks relevant to equilibrium and shape control modeling, which are part of a suite of tools being developed for the National Spherical Torus Experiment-Upgrade (NSTX-U) for fast prediction, optimization, and visualization of plasma scenarios. The networks include Eqnet, a free-boundary equilibrium solver trained on the EFIT01 reconstruction algorithm, and Pertnet, which is trained on the Gspert code and predicts the non-rigid plasma response, a nonlinear term that arises in shape control modeling. The NNs are trained with different combinations of inputs and outputs in order to offer flexibility in use cases. In particular, Eqnet can use magnetic diagnostics as inputs and act as an EFIT-like reconstruction algorithm, or, by using pressure and current profile information the NN can act as a forward Grad-Shafranov equilibrium solver. This forward-mode version is envisioned to be implemented in the suite of tools for simulation of plasma scenarios. The reconstruction-mode version gives some performance improvements compared to the online reconstruction code real-time EFIT (RTEFIT), especially when vessel eddy currents are significant. We report strong performance for all NNs indicating that the models could reliably be used within closed-loop simulations or other applications. Some limitations are discussed.
    BYOL-Explore: Exploration by Bootstrapped Prediction. (arXiv:2206.08332v1 [cs.LG])
    We present BYOL-Explore, a conceptually simple yet general approach for curiosity-driven exploration in visually-complex environments. BYOL-Explore learns a world representation, the world dynamics, and an exploration policy all together by optimizing a single prediction loss in the latent space with no additional auxiliary objective. We show that BYOL-Explore is effective in DM-HARD-8, a challenging partially-observable continuous-action hard-exploration benchmark with visually-rich 3-D environments. On this benchmark, we solve the majority of the tasks purely through augmenting the extrinsic reward with BYOL-Explore's intrinsic reward, whereas prior work could only get off the ground with human demonstrations. As further evidence of the generality of BYOL-Explore, we show that it achieves superhuman performance on the ten hardest exploration games in Atari while having a much simpler design than other competitive agents.
    The dynamics of representation learning in shallow, non-linear autoencoders. (arXiv:2201.02115v2 [stat.ML] UPDATED)
    Autoencoders are the simplest neural network for unsupervised learning, and thus an ideal framework for studying feature learning. While a detailed understanding of the dynamics of linear autoencoders has recently been obtained, the study of non-linear autoencoders has been hindered by the technical difficulty of handling training data with non-trivial correlations - a fundamental prerequisite for feature extraction. Here, we study the dynamics of feature learning in non-linear, shallow autoencoders. We derive a set of asymptotically exact equations that describe the generalisation dynamics of autoencoders trained with stochastic gradient descent (SGD) in the limit of high-dimensional inputs. These equations reveal that autoencoders learn the leading principal components of their inputs sequentially. An analysis of the long-time dynamics explains the failure of sigmoidal autoencoders to learn with tied weights, and highlights the importance of training the bias in ReLU autoencoders. Building on previous results for linear networks, we analyse a modification of the vanilla SGD algorithm which allows learning of the exact principal components. Finally, we show that our equations accurately describe the generalisation dynamics of non-linear autoencoders on realistic datasets such as CIFAR10.
    Interaction-Grounded Learning with Action-inclusive Feedback. (arXiv:2206.08364v1 [cs.LG])
    Consider the problem setting of Interaction-Grounded Learning (IGL), in which a learner's goal is to optimally interact with the environment with no explicit reward to ground its policies. The agent observes a context vector, takes an action, and receives a feedback vector, using this information to effectively optimize a policy with respect to a latent reward function. Prior analyzed approaches fail when the feedback vector contains the action, which significantly limits IGL's success in many potential scenarios such as Brain-computer interface (BCI) or Human-computer interface (HCI) applications. We address this by creating an algorithm and analysis which allows IGL to work even when the feedback vector contains the action, encoded in any fashion. We provide theoretical guarantees and large-scale experiments based on supervised datasets to demonstrate the effectiveness of the new approach.
    Multiscale methods for signal selection in single-cell data. (arXiv:2206.07760v1 [q-bio.QM])
    Analysis of single-cell transcriptomics often relies on clustering cells and then performing differential gene expression (DGE) to identify genes that vary between these clusters. These discrete analyses successfully determine cell types and markers; however, continuous variation within and between cell types may not be detected. We propose three topologically-motivated mathematical methods for unsupervised feature selection that consider discrete and continuous transcriptional patterns on an equal footing across multiple scales simultaneously. Eigenscores ($\mathrm{eig}_i$) rank signals or genes based on their correspondence to low-frequency intrinsic patterning in the data using the spectral decomposition of the graph Laplacian. The multiscale Laplacian score (MLS) is an unsupervised method for locating relevant scales in data and selecting the genes that are coherently expressed at these respective scales. The persistent Rayleigh quotient (PRQ) takes data equipped with a filtration, allowing separation of genes with different roles in a bifurcation process (e.g. pseudo-time). We demonstrate the utility of these techniques by applying them to published single-cell transcriptomics data sets. The methods validate previously identified genes and detect additional genes with coherent expression patterns. By studying the interaction between gene signals and the geometry of the underlying space, the three methods give multidimensional rankings of the genes and visualisation of relationships between them.
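    All three scores in the abstract above build on how smoothly a signal varies over a graph, which the graph Laplacian's Rayleigh quotient measures. The sketch below shows only that single-scale building block, not the paper's eigenscores, MLS or PRQ. It assumes an unweighted graph given as an edge list, and the function name `rayleigh_quotient` is illustrative.

```python
def rayleigh_quotient(signal, edges):
    """Return f^T L f / f^T f for the unweighted graph Laplacian L = D - A.

    The signal is mean-centred first so that a constant offset does not
    change the score. Low values indicate a signal that varies smoothly
    along the edges of the graph.
    """
    mean = sum(signal) / len(signal)
    f = [v - mean for v in signal]
    # f^T L f equals the sum of squared differences across edges.
    numerator = sum((f[i] - f[j]) ** 2 for i, j in edges)
    denominator = sum(v * v for v in f)
    if denominator == 0.0:
        raise ValueError("constant signal: Rayleigh quotient is undefined")
    return numerator / denominator


# A path graph on six nodes stands in for a cell neighbourhood graph.
edges = [(i, i + 1) for i in range(5)]
smooth = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]          # varies slowly along the path
oscillating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # flips sign at every edge

print(rayleigh_quotient(smooth, edges), rayleigh_quotient(oscillating, edges))
```

    Ranking features by this quantity favours genes whose expression follows the low-frequency structure of the graph. The multiscale methods in the abstract go further by varying the scale at which the graph is built.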
    FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting. (arXiv:2201.12740v3 [cs.LG] UPDATED)
    Although Transformer-based methods have significantly improved state-of-the-art results for long-term series forecasting, they are not only computationally expensive but, more importantly, unable to capture the global view of time series (e.g. overall trend). To address these problems, we propose to combine Transformer with the seasonal-trend decomposition method, in which the decomposition method captures the global profile of time series while Transformers capture more detailed structures. To further enhance the performance of Transformer for long-term prediction, we exploit the fact that most time series tend to have a sparse representation in a well-known basis such as the Fourier transform, and develop a frequency enhanced Transformer. Besides being more effective, the proposed method, termed Frequency Enhanced Decomposed Transformer (FEDformer), is more efficient than the standard Transformer, with complexity linear in the sequence length. Our empirical studies with six benchmark datasets show that compared with state-of-the-art methods, FEDformer can reduce prediction error by $14.8\%$ and $22.6\%$ for multivariate and univariate time series, respectively. Code is publicly available at https://github.com/MAZiqing/FEDformer.
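    The seasonal-trend decomposition mentioned in the abstract above can be illustrated with a moving average. This is a minimal sketch assuming one common variant, a centred moving average with edge padding, and it is not taken from the FEDformer code. The function name `series_decomp` and the odd `kernel_size` restriction are illustrative assumptions.

```python
def series_decomp(x, kernel_size):
    """Split a series into (trend, seasonal) using a centred moving average.

    The ends are padded by repeating the first and last values so that the
    trend has the same length as the input.
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size is assumed odd so the window is centred")
    half = kernel_size // 2
    padded = [x[0]] * half + list(x) + [x[-1]] * half
    trend = [sum(padded[i:i + kernel_size]) / kernel_size for i in range(len(x))]
    seasonal = [xi - ti for xi, ti in zip(x, trend)]
    return trend, seasonal


# Toy series: linear trend 0.5 * t plus a zero-mean pattern with period 3.
pattern = [1.0, 0.0, -1.0]
series = [0.5 * t + pattern[t % 3] for t in range(12)]
trend, seasonal = series_decomp(series, kernel_size=3)
print(trend)
print(seasonal)
```

    Because the window length matches the period here, the periodic part averages out and the interior of `trend` recovers the linear component. Only the two padded end points deviate.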
    Equivariant Diffusion for Molecule Generation in 3D. (arXiv:2203.17003v2 [cs.LG] UPDATED)
    This work introduces a diffusion model for molecule generation in 3D that is equivariant to Euclidean transformations. Our E(3) Equivariant Diffusion Model (EDM) learns to denoise a diffusion process with an equivariant network that jointly operates on both continuous (atom coordinates) and categorical features (atom types). In addition, we provide a probabilistic analysis which admits likelihood computation of molecules using our model. Experimentally, the proposed method significantly outperforms previous 3D molecular generative methods regarding the quality of generated samples and efficiency at training time.
    Unsupervised Space Partitioning for Nearest Neighbor Search. (arXiv:2206.08091v1 [cs.LG])
    Approximate Nearest Neighbor Search (ANNS) in high dimensional spaces is crucial for many real-life applications (e.g., e-commerce, web, multimedia, etc.) dealing with an abundance of data. In this paper, we propose an end-to-end learning framework that couples the partitioning (one key step of ANNS) and learning-to-search steps using a custom loss function. A key advantage of our proposed solution is that it does not require any expensive pre-processing of the dataset, which is one of the key limitations of the state-of-the-art approach. We achieve the above edge by formulating a multi-objective custom loss function that does not need ground truth labels to quantify the quality of a given partition of the data space, making it entirely unsupervised. We also propose an ensembling technique by adding varying input weights to the loss function to train an ensemble of models to enhance the search quality. On several standard benchmarks for ANNS, we show that our method beats the state-of-the-art space partitioning method and the ubiquitous K-means clustering method while using fewer parameters and shorter offline training times. Without loss of generality, our unsupervised partitioning approach is shown as a promising alternative to many widely used clustering methods like K-means clustering and DBSCAN.
    Robustness and Accuracy Could Be Reconcilable by (Proper) Definition. (arXiv:2202.10103v2 [cs.LG] UPDATED)
    The trade-off between robustness and accuracy has been widely studied in the adversarial literature. Although still controversial, the prevailing view is that this trade-off is inherent, either empirically or theoretically. Thus, we dig for the origin of this trade-off in adversarial training and find that it may stem from the improperly defined robust error, which imposes an inductive bias of local invariance -- an overcorrection towards smoothness. Given this, we advocate employing local equivariance to describe the ideal behavior of a robust model, leading to a self-consistent robust error named SCORE. By definition, SCORE facilitates the reconciliation between robustness and accuracy, while still handling the worst-case uncertainty via robust optimization. By simply substituting KL divergence with variants of distance metrics, SCORE can be efficiently minimized. Empirically, our models achieve top-rank performance on RobustBench under AutoAttack. Besides, SCORE provides instructive insights for explaining the overfitting phenomenon and semantic input gradients observed on robust models. Code is available at https://github.com/P2333/SCORE.
    Learning with little mixing. (arXiv:2206.08269v1 [cs.LG])
    We study square loss in a realizable time-series framework with martingale difference noise. Our main result is a fast rate excess risk bound which shows that whenever a trajectory hypercontractivity condition holds, the risk of the least-squares estimator on dependent data matches the iid rate order-wise after a burn-in time. In comparison, many existing results in learning from dependent data have rates where the effective sample size is deflated by a factor of the mixing-time of the underlying process, even after the burn-in time. Furthermore, our results allow the covariate process to exhibit long range correlations which are substantially weaker than geometric ergodicity. We call this phenomenon learning with little mixing, and present several examples for when it occurs: bounded function classes for which the $L^2$ and $L^{2+\epsilon}$ norms are equivalent, ergodic finite state Markov chains, various parametric models, and a broad family of infinite dimensional $\ell^2(\mathbb{N})$ ellipsoids. By instantiating our main result to system identification of nonlinear dynamics with generalized linear model transitions, we obtain a nearly minimax optimal excess risk bound after only a polynomial burn-in time.
    Conformal prediction set for time-series. (arXiv:2206.07851v1 [stat.ML])
    When building either prediction intervals for regression (with real-valued response) or prediction sets for classification (with categorical responses), uncertainty quantification is essential to studying complex machine learning methods. In this paper, we develop Ensemble Regularized Adaptive Prediction Set (ERAPS) to construct prediction sets for time-series (with categorical responses), based on the prior work of [Xu and Xie, 2021]. In particular, we allow unknown dependencies to exist within features and responses that arrive in sequence. Method-wise, ERAPS is a distribution-free and ensemble-based framework that is applicable for arbitrary classifiers. Theoretically, we bound the coverage gap without assuming data exchangeability and show asymptotic set convergence. Empirically, we demonstrate valid marginal and conditional coverage by ERAPS, which also tends to yield smaller prediction sets than competing methods.  ( 2 min )
    OmniMAE: Single Model Masked Pretraining on Images and Videos. (arXiv:2206.08356v1 [cs.CV])
    Transformer-based architectures have become competitive across a variety of visual domains, most notably images and videos. While prior work has studied these modalities in isolation, having a common architecture suggests that one can train a single unified model for multiple visual modalities. Prior attempts at unified modeling typically use architectures tailored for vision tasks, or obtain worse performance compared to single modality models. In this work, we show that masked autoencoding can be used to train a simple Vision Transformer on images and videos, without requiring any labeled data. This single model learns visual representations that are comparable to or better than single-modality representations on both image and video benchmarks, while using a much simpler architecture. In particular, our single pretrained model can be finetuned to achieve 86.5% on ImageNet and 75.3% on the challenging Something Something-v2 video benchmark. Furthermore, this model can be learned by dropping 90% of the image and 95% of the video patches, enabling extremely fast training.  ( 2 min )
    Reconstructing Training Data from Trained Neural Networks. (arXiv:2206.07758v1 [cs.LG])
    Understanding to what extent neural networks memorize training data is an intriguing question with practical and theoretical implications. In this paper we show that in some cases a significant fraction of the training data can in fact be reconstructed from the parameters of a trained neural network classifier. We propose a novel reconstruction scheme that stems from recent theoretical results about the implicit bias in training neural networks with gradient-based methods. To the best of our knowledge, our results are the first to show that reconstructing a large portion of the actual training samples from a trained neural network classifier is generally possible. This has negative implications on privacy, as it can be used as an attack for revealing sensitive training data. We demonstrate our method for binary MLP classifiers on a few standard computer vision datasets.  ( 2 min )
    Towards Understanding How Machines Can Learn Causal Overhypotheses. (arXiv:2206.08353v1 [cs.LG])
    Recent work in machine learning and cognitive science has suggested that understanding causal information is essential to the development of intelligence. The extensive literature in cognitive science using the "blicket detector" environment shows that children are adept at many kinds of causal inference and learning. We propose to adapt that environment for machine learning agents. One of the key challenges for current machine learning algorithms is modeling and understanding causal overhypotheses: transferable abstract hypotheses about sets of causal relationships. In contrast, even young children spontaneously learn and use causal overhypotheses. In this work, we present a new benchmark -- a flexible environment which allows for the evaluation of existing techniques under variable causal overhypotheses -- and demonstrate that many existing state-of-the-art methods have trouble generalizing in this environment. The code and resources for this benchmark are available at https://github.com/CannyLab/casual_overhypotheses.  ( 2 min )
    Deep Neural Imputation: A Framework for Recovering Incomplete Brain Recordings. (arXiv:2206.08094v1 [cs.LG])
    Neuroscientists and neuroengineers have long relied on multielectrode neural recordings to study the brain. However, in a typical experiment, many factors corrupt neural recordings from individual electrodes, including electrical noise, movement artifacts, and faulty manufacturing. Currently, common practice is to discard these corrupted recordings, reducing already limited data that is difficult to collect. To address this challenge, we propose Deep Neural Imputation (DNI), a framework to recover missing values from electrodes by learning from data collected across spatial locations, days, and participants. We explore our framework with a linear nearest-neighbor approach and two deep generative autoencoders, demonstrating DNI's flexibility. One deep autoencoder models participants individually, while the other extends this architecture to model many participants jointly. We evaluate our models across 12 human participants implanted with multielectrode intracranial electrocorticography arrays; participants had no explicit task and behaved naturally across hundreds of recording hours. We show that DNI recovers not only time series but also frequency content, and further establish DNI's practical value by recovering significant performance on a scientifically-relevant downstream neural decoding task.  ( 2 min )
    Functional Output Regression with Infimal Convolution: Exploring the Huber and $\epsilon$-insensitive Losses. (arXiv:2206.08220v1 [stat.ML])
    The focus of the paper is functional output regression (FOR) with convoluted losses. While most existing work considers the square loss setting, we leverage extensions of the Huber and the $\epsilon$-insensitive loss (induced by infimal convolution) and propose a flexible framework capable of handling various forms of outliers and sparsity in the FOR family. We derive computationally tractable algorithms relying on duality to tackle the resulting tasks in the context of vector-valued reproducing kernel Hilbert spaces. The efficiency of the approach is demonstrated and contrasted with the classical squared loss setting on both synthetic and real-world benchmarks.  ( 2 min )
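    For orientation, the two losses named in this abstract have well-known scalar forms, sketched below. This is only the textbook scalar case, not the paper's functional extension: the Huber loss is the infimal convolution of the square loss with a scaled absolute value, and the $\epsilon$-insensitive loss is the infimal convolution of the absolute value with the indicator of the interval $[-\epsilon, \epsilon]$. The function names and default thresholds are illustrative assumptions.

```python
def huber_loss(residual: float, kappa: float = 1.0) -> float:
    """Scalar Huber loss: quadratic for |r| <= kappa, linear beyond it."""
    r = abs(residual)
    if r <= kappa:
        return 0.5 * r * r
    return kappa * (r - 0.5 * kappa)


def eps_insensitive_loss(residual: float, eps: float = 0.1) -> float:
    """Scalar epsilon-insensitive loss: zero inside the tube |r| <= eps."""
    return max(0.0, abs(residual) - eps)
```

    The linear tails make the Huber loss less sensitive to large outliers than the square loss, while the flat region of the $\epsilon$-insensitive loss ignores small residuals entirely, which is what encourages sparsity.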
    Continuous-Time Modeling of Counterfactual Outcomes Using Neural Controlled Differential Equations. (arXiv:2206.08311v1 [cs.LG])
    Estimating counterfactual outcomes over time has the potential to unlock personalized healthcare by assisting decision-makers to answer "what-if" questions. Existing causal inference approaches typically consider regular, discrete-time intervals between observations and treatment decisions and hence are unable to naturally model irregularly sampled data, which is the common setting in practice. To handle arbitrary observation patterns, we interpret the data as samples from an underlying continuous-time process and propose to model its latent trajectory explicitly using the mathematics of controlled differential equations. This leads to a new approach, the Treatment Effect Neural Controlled Differential Equation (TE-CDE), that allows the potential outcomes to be evaluated at any time point. In addition, adversarial training is used to adjust for time-dependent confounding, which is critical in longitudinal settings and is an added challenge not encountered in conventional time-series. To assess solutions to this problem, we propose a controllable simulation environment based on a model of tumor growth for a range of scenarios with irregular sampling reflective of a variety of clinical scenarios. TE-CDE consistently outperforms existing approaches in all simulated scenarios with irregular sampling.  ( 2 min )
    Applications of Machine Learning to the Identification of Anomalous ER Claims. (arXiv:2206.08093v1 [cs.LG])
    Improper health insurance payments resulting from fraud and upcoding result in tens of billions of dollars in excess health care costs annually in the United States, motivating machine learning researchers to build anomaly detection models for health insurance claims. This article describes two such strategies specifically for ER claims. The first is an upcoding model based on severity code distributions, stratified by hierarchical diagnosis code clusters. A statistically significant difference in mean upcoding anomaly scores is observed between free-standing ERs and acute care hospitals, with free-standing ERs being more anomalous. The second model is a random forest that minimizes improper payments by optimally sorting ER claims within review queues. Depending on the percentage of claims reviewed, the random forest saved 12% to 40% above a baseline approach that prioritized claims by billed amount.  ( 2 min )
    Partial Identifiability for Nonnegative Matrix Factorization. (arXiv:2206.08022v1 [math.NA])
    Given a nonnegative matrix, $R$, and a factorization rank, $r$, Exact nonnegative matrix factorization (Exact NMF) decomposes $R$ as the product of two nonnegative matrices, $C$ and $S$ with $r$ columns, such that $R = CS^\top$. A central research topic in the literature is the conditions under which such a decomposition is unique/identifiable, up to trivial ambiguities. In this paper, we focus on partial identifiability, that is, the uniqueness of a subset of columns of $C$ and $S$. We start our investigations with the data-based uniqueness (DBU) theorem from the chemometrics literature. The DBU theorem analyzes all feasible solutions of Exact NMF, and relies on sparsity conditions on $C$ and $S$. We provide a mathematically rigorous theorem of a recently published restricted version of the DBU theorem, relying only on simple sparsity and algebraic conditions: it applies to a particular solution of Exact NMF (as opposed to all feasible solutions) and allows us to guarantee the partial uniqueness of a single column of $C$ or $S$. Second, based on a geometric interpretation of the restricted DBU theorem, we obtain a new partial identifiability result. We prove it is stronger than the restricted DBU theorem, given that a proper preprocessing on the Exact NMF is used. This geometric interpretation also leads us to another partial identifiability result in the case $r=3$. Third, we show how partial identifiability results can be used sequentially to guarantee the identifiability of more columns of $C$ and $S$. We illustrate these results on several examples, including one from the chemometrics literature.  ( 2 min )
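    The "trivial ambiguities" mentioned in this abstract are usually taken to be column rescaling and column permutation. The small sketch below illustrates them on a hand-made example; the matrices, scale factors, and helper names are assumptions chosen for illustration and do not come from the paper.

```python
def matmul_t(C, S):
    """Return C @ S^T for matrices given as lists of rows (C is m x r, S is n x r)."""
    return [[sum(c * s for c, s in zip(row_c, row_s)) for row_s in S] for row_c in C]


def rescale_and_permute(M, scales, perm):
    """Column j of the result is column perm[j] of M multiplied by scales[perm[j]]."""
    return [[row[k] * scales[k] for k in perm] for row in M]


# A small exact NMF with rank r = 2 (all entries nonnegative).
C = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
S = [[1.0, 2.0], [0.0, 1.0]]
R = matmul_t(C, S)

# Scale the columns of C, apply the inverse scaling to S, and swap the columns of both.
scales = [2.0, 4.0]
inverse_scales = [1.0 / d for d in scales]
perm = [1, 0]

C2 = rescale_and_permute(C, scales, perm)
S2 = rescale_and_permute(S, inverse_scales, perm)
R2 = matmul_t(C2, S2)  # same product as R, from a different pair of factors
```

    Both factor pairs are nonnegative and reproduce the same $R$, so identifiability results can only ever hold up to this kind of rescaling and reordering.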
    On the well-spread property and its relation to linear regression. (arXiv:2206.08092v1 [cs.LG])
    We consider the robust linear regression model $\boldsymbol{y} = X\beta^* + \boldsymbol{\eta}$, where an adversary oblivious to the design $X \in \mathbb{R}^{n \times d}$ may choose $\boldsymbol{\eta}$ to corrupt all but a (possibly vanishing) fraction of the observations $\boldsymbol{y}$ in an arbitrary way. Recent work [dLN+21, dNS21] has introduced efficient algorithms for consistent recovery of the parameter vector. These algorithms crucially rely on the design matrix being well-spread (a matrix is well-spread if its column span is far from any sparse vector). In this paper, we show that there exists a family of design matrices lacking well-spreadness such that consistent recovery of the parameter vector in the above robust linear regression model is information-theoretically impossible. We further investigate the average-case time complexity of certifying well-spreadness of random matrices. We show that it is possible to efficiently certify whether a given $n$-by-$d$ Gaussian matrix is well-spread if the number of observations is quadratic in the ambient dimension. We complement this result by showing rigorous evidence -- in the form of a lower bound against low-degree polynomials -- of the computational hardness of this same certification problem when the number of observations is $o(d^2)$.  ( 2 min )
    On Error and Compression Rates for Prototype Rules. (arXiv:2206.08014v1 [cs.LG])
    We study the close interplay between error and compression in the non-parametric multiclass classification setting in terms of prototype learning rules. We focus in particular on a close variant of a recently proposed compression-based learning rule termed OptiNet. Beyond its computational merits, this rule has been recently shown to be universally consistent in any metric instance space that admits a universally consistent rule -- the first learning algorithm known to enjoy this property. However, its error and compression rates have been left open. Here we derive such rates in the case where instances reside in Euclidean space under commonly posed smoothness and tail conditions on the data distribution. We first show that OptiNet achieves non-trivial compression rates while enjoying near minimax-optimal error rates. We then proceed to study a novel general compression scheme for further compressing prototype rules that locally adapts to the noise level without sacrificing accuracy. Applying it to OptiNet, we show that under a geometric margin condition, further gain in the compression rate is achieved. Experimental results comparing the performance of the various methods are presented.  ( 2 min )
    On the Identifiability of Nonlinear ICA: Sparsity and Beyond. (arXiv:2206.07751v1 [cs.LG])
    Nonlinear independent component analysis (ICA) aims to recover the underlying independent latent sources from their observable nonlinear mixtures. How to make the nonlinear ICA model identifiable up to certain trivial indeterminacies is a long-standing problem in unsupervised learning. Recent breakthroughs reformulate the standard independence assumption of sources as conditional independence given some auxiliary variables (e.g., class labels and/or domain/time indexes) as weak supervision or inductive bias. However, nonlinear ICA with unconditional priors cannot benefit from such developments. We explore an alternative path and consider only assumptions on the mixing process, such as Structural Sparsity or Independent Influences. We show that under specific instantiations of such constraints, the independent latent sources can be identified from their nonlinear mixtures up to a permutation and a component-wise transformation, thus achieving nontrivial identifiability of nonlinear ICA without auxiliary variables. We provide estimation methods and validate the theoretical results experimentally. The results on image data suggest that our conditions may hold in a number of practical data generating processes.  ( 2 min )

  • Open

    [P] Bring Your Own Device (BYOD) DS platform idea
    I am working on a side project called byod-hub (BYOD = Bring Your Own Device) that lets people pool multiple servers they own into a DS platform based on JupyterHub in minutes. I think this could help small to mid-sized DS teams make better use of their computing resources (e.g., if you have multiple GPU workstations and currently assign each one to a person to SSH into, this might be for you) by pooling the machines and providing a service like JupyterHub on top as a unified entry point for notebook-based work. Add-ons like MLflow and Kubeflow can also be added with a single click once the platform is up. I would like to hear comments and suggestions from the community. Do you find this potentially useful? How should this be built, in your opinion?
    The general workflow to form such a platform is as follows. A control plane service (which only handles orchestration of computing resources) is first started on one computer (or it can be a hosted service):
        $ byod-hub control-plane start
        [INFO] The control plane is starting
        [INFO] The control plane is served at https://192.168.2.100
        # get the command to register a node
        $ byod-hub control-plane get-join-command
        [INFO] To join, run the following from a node
        [INFO] byod-hub node join --url 192.168.2.100 --token 233asdasd343645gf
    Then one can run the following command on their own server to register it with the control plane:
        $ byod-hub node join --url 192.168.2.100 --token 233asdasd343645gf
        [INFO] Registering node to control plane at 192.168.2.100
        [INFO] Registration finished
    After that, one can visit the URL of the control plane, https://192.168.2.100, to start using the JupyterHub service to request Jupyter instances. The user workloads will be scheduled to run on users' registered nodes. submitted by /u/dayeye2006 [link] [comments]  ( 1 min )
    [D] The current multi-agent reinforcement learning research is NOT multi-agent or reinforcement learning.
    What is usually considered multi-agent reinforcement learning is neither multi-agent nor reinforcement learning. Consider the most successful example: OpenAI Five plays 180 years' worth of games against itself every day, learning via self-play. This is not multi-agent reinforcement learning! Reason for not multi-agent: there is only one agent, the computer itself. In much of the so-called multi-agent reinforcement learning, the computer is competing against itself. That's like saying that if you played chess against yourself by moving the black and white pieces alternately, you would be competing against an opponent. This is completely bonkers. For humans, games such as League of Legends are multi-agent, because the definition of agent is human and each human is independently controlling …  ( 4 min )
    [P] Local Hierarchical Classification Library
    Hi everyone, I am developing an open-source library to facilitate building local hierarchical classifiers in Python. The library, named HiClass (https://arxiv.org/abs/2112.06560), is compatible with scikit-learn's API. Hierarchies occur naturally in many problems, but often are not explored when building classifiers. However, exploiting the hierarchical information in the data usually improves predictive performance. For example, in the table below there is a comparison between the local hierarchical classifiers implemented in HiClass and Microsoft's LightGBM on a consumer complaints dataset, where we can clearly see an improvement in the F-score.
        Classifier                        Training Time (hh:mm:ss)  Memory Usage (GB)  Disk Usage (MB)  F-score
        Local Classifier per Parent Node  00:24:52                  3.91               77               0.7279
        Local Classifier per Node         00:30:39                  5.41               312              0.7551
        Local Classifier per Level        01:36:33                  3.86               37               0.5413
        Flat Classifier                   00:23:54                  4.36               13               0.4303
    Hierarchical data typically comes in the shape of trees or directed acyclic graphs. For instance, the image below displays a music genre classification hierarchy, which is a notorious example of hierarchical data. Of course, there are multiple other problems where hierarchical classification can be applied, e.g., text categorization, taxonomic classification, etc.
    [Image: Music genre hierarchy]
    Installation instructions and documentation are available on GitHub: https://github.com/mirand863/hiclass PS: I am also looking for contributors who would like to join an open-source project. submitted by /u/Brilliant_Half8082 [link] [comments]  ( 1 min )
    [D] Models or models-as-a-service (paid) for summarization from long-form 'dense domain' texts
    There exist numerous models (paper + repo) and 'Models-as-a-Service' (paid implementations of said models made available via an API or other interface that you pay for) to create summaries of text. https://tldrthis.com/ is one. SMMRY, the summarizing bot that is used on Reddit, is another: https://smmry.com/ There are also many e-discovery startups which ingest hundreds of thousands of pages of legal documents and surface materials to lawyers who are working through the discovery stage of legal processes. I'm wondering about a text summarization model (either a paper + repo or a paid service) that summarizes single legal documents into non-legalese. For example, the Supplemental Complaint document on this page - https://predatorystudentlending.org/cases/sweet-v-devos/#Sweet-documents - would be an interesting document to summarize. Supplemental Complaint document: https://predatorystudentlending.org/wp-content/uploads/2021/03/192.pdf Since the document is 597 pages long, however, I haven't had success in using SMMRY, TLDRthis, etc. to generate useful summaries. Question: Can anyone point me in the direction of useful models for long-form (a few hundred pages) document summarization in particular domains? Compared to the task 'summarize a Dan Brown book that is X hundred pages long and with a Flesch–Kincaid score of 98 (US 5th grade level)', the task of summarizing a multi-hundred-page legal document or a 100-plus page dissertation on a deeply technical topic is another animal entirely. Question part II: Does anyone have any interesting strategies for 'old school' topic modeling - LDA + something else - in 'dense' domain-specific literature? Or how about newer techniques (anything to do with Transformers, say) in conjunction with some old-school techniques for content summarization? submitted by /u/datachomper [link] [comments]  ( 1 min )
    [P] I built a project for a non-programmer researcher who wanted to do everything from data collection to model building, and I open-sourced it.
    I once worked with a researcher who wanted to collect some Reddit data related to a particular topic and train a machine learning model with it. I realised how difficult it is for non-programmers to get into building machine learning models for such use cases, so I decided to shape the project myself, and I open-sourced it. Supports: text data, image data. The project does everything in just two steps. Execution is as simple as this: (1) Make a config file with your required input details. (2) Run the API in a single line with the config passed as input. Here's the link to the project: https://github.com/nfflow/redditflow/ submitted by /u/metalvendetta [link] [comments]  ( 1 min )
    🏘️ ProcTHOR: Large-Scale Embodied AI Using Procedural Generation [R]
    submitted by /u/matt-deitke [link] [comments]
    [R] RWKV-2 430M release (a parallelizable RNN with transformer-level LM performance, and without using attention)
    Hi everyone. I posted about my RWKV-2 RNN 1 month ago (thanks for the upvote): https://www.reddit.com/r/MachineLearning/comments/umq908/r_rwkvv2rnn_a_parallelizable_rnn_with/ And I have finished the training of a RWKV-2 430M (L24-D1024) on the Pile. It's confirmed that a pure RNN without attention can reach transformer-level LM (Language Modeling) performance: https://preview.redd.it/6756ax5wz6691.png?width=992&format=png&auto=webp&s=70d5b52fb43fca1a7d304832f6cbd082bfe3f9c5 RWKV-2 supports both sequential & parallel mode in inference and training. So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding. You can download the params & fine-tuning code here: https://github.com/BlinkDL/RWKV-v2-RNN-Pile Now I am training a RWKV-2 1.5B (L24-D2048) which is expected to finish in 2 months :) https://wandb.ai/blinkdl/RWKV-v2-RNN-Pile The math behind RWKV-2: https://preview.redd.it/17eniof007691.png?width=662&format=png&auto=webp&s=f37ed4dd14409269952b421d18a315b8cd343e21 submitted by /u/bo_peng [link] [comments]  ( 2 min )
    [D] Any way to validate the performance of component models in a T-learner? (CausalML Python)
    So, I'm running into the problem of wanting to validate the performance of each of the models that compose our T-learner. I'm aware this doesn't validate the effectiveness of the model itself but I'm trying to diagnose issues and want to see if each of the component models is predicting the control/treatment effect accurately. I'm thinking I may just have to write my own T-learner script because I don't see any way to do this in CausalML but that shouldn't be too difficult. Just wanted to check if any of y'all knew how to do this before embarking on that journey. submitted by /u/StixTheNerd [link] [comments]  ( 1 min )
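    One way to approach the question in this post, sketched under stated assumptions: a T-learner fits one outcome model per arm, so each component can be scored on held-out rows from its own arm. The sketch below does not use CausalML; the single-feature least-squares model, the toy noise-free data, and all names are hypothetical stand-ins for whatever base learners are actually in use.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x with one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict(model, x):
    intercept, slope = model
    return intercept + slope * x


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given rows."""
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def arm(rows, treatment):
    """Split out features and outcomes for one treatment arm."""
    xs = [x for x, w, _ in rows if w == treatment]
    ys = [y for _, w, y in rows if w == treatment]
    return xs, ys


# Toy rows (x, treatment, outcome): control y = 1 + 2x, treated y = 3 + 2x.
rows = [(float(x), x % 2, 1.0 + 2.0 * x + 2.0 * (x % 2)) for x in range(20)]
train, holdout = rows[:14], rows[14:]

# T-learner: one model per arm, each trained only on its own arm.
model_control = fit_line(*arm(train, 0))
model_treated = fit_line(*arm(train, 1))

# Validate each component on held-out rows from the matching arm.
mse_control = mse(model_control, *arm(holdout, 0))
mse_treated = mse(model_treated, *arm(holdout, 1))

# The estimated treatment effect is the difference of the two predictions.
cate_at_5 = predict(model_treated, 5.0) - predict(model_control, 5.0)
```

    Per-arm held-out error only checks how well each model predicts observed outcomes in its own arm; as the post already notes, it does not validate the estimated treatment effect itself.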
    [D] 3D Attention Module
    Hi, I am working on a classification of 3D MRI where I want to combine a mask and a raw MRI. Basically, the model must have 2 input channels, one for the MRI and one for its mask. Where should I start? Are there any implemented models I can use? submitted by /u/grisp98 [link] [comments]  ( 2 min )
    [D] What object detectors have the capability to harness relationship between its detected boxes?
    Typical object detectors do not exploit relationships among the detected boxes; no context is involved. In my problem, there are two requirements for which some form of context across detected boxes would lead to drastically better results. Requirement #1: It is a multi-class, single-label problem. There are N classes, but each class can appear at most once (0 or 1 instances). Hence, each detection more or less needs to know whether the other detections have already predicted that class. Requirement #2: There is some form of ordering among the predictions based on their proximity to each other. For example, Class 4 should only appear near Classes 5-6 and Classes 2-3, but should not be anywhere near Class 32. Is there any architecture that is optimized for this kind of object detection? submitted by /u/sarmientoj24 [link] [comments]  ( 1 min )
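    Requirement #1 can at least be enforced after the fact, independent of the detector. The sketch below is a plain post-processing heuristic, not an architecture that learns context: it keeps only the highest-scoring box for each class. The `Detection` record and its field names are hypothetical and would need to be adapted to the detector's actual output format.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    class_id: int
    score: float
    box: tuple  # (x_min, y_min, x_max, y_max)


def keep_best_per_class(detections):
    """Keep at most one detection per class: the one with the highest score."""
    best = {}
    for det in detections:
        current = best.get(det.class_id)
        if current is None or det.score > current.score:
            best[det.class_id] = det
    return sorted(best.values(), key=lambda d: d.class_id)


detections = [
    Detection(4, 0.90, (0, 0, 10, 10)),
    Detection(4, 0.60, (1, 1, 11, 11)),   # duplicate class, lower score
    Detection(5, 0.70, (20, 0, 30, 10)),
]
kept = keep_best_per_class(detections)
```

    Requirement #2 (which classes may appear near each other) is not addressed by this filter; it would need a model or rule set that looks at pairs of boxes.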
    [D] What is the best way to manage GPU server for multi-users?
    I'm managing the on-prem GPU server at my work place. We are using docker containers (we wrote our own container management system), but there are always lots of issues since people have to learn how to use docker properly and there's always little problems with versioning and permission issues. What are you using to manage your GPU cluster? Would simply using conda env for each user be more efficient? We also tried slurm but the queue time was not optimal for everyone's work and research. submitted by /u/leboulevardier [link] [comments]  ( 2 min )
    [D] Anti-aliasing techniques or functions for segmentation masks
    What techniques or functions can I use to smoothen out segmentation mask edges? submitted by /u/sarmientoj24 [link] [comments]  ( 1 min )
    [P] Pythae - Unifying generative autoencoder implementations in Python
    After 8 months of long coding nights ☕ we finally officially release Pythae 🥳, a python library unifying generative autoencoder implementations including vaegan🥗, vqvae or RAEs. I hope you will enjoy it! 🖥️ github repo: https://github.com/clementchadebec/benchmark_VAE 👉paper: https://arxiv.org/abs/2206.08309 submitted by /u/cchad-8 [link] [comments]  ( 1 min )
    [D] How to find an intuitive article for the future research
    After working in an area for more than 2 years, I am still not confident about how to recognize an intuitive research paper that could further ignite my Ph.D. journey. Some people suggest following specific individuals or organizations (corporate or academic). In my opinion, following specific individuals or organizations can be inefficient or boring at times. One thing both have in common is that they hold back code releases until they have squeezed all the juice out of the work. After the code release, we poor Ph.D. students are left making ridiculous GIFs for ML Twitter because there is nothing left for us. Should we focus on the beautiful results OR on the future perspective of a research paper? One example is Ian Goodfellow's GANs paper: the results were not that polished, but there was a future that everyone perceived. To wrap up my post: which factors should we keep in mind when choosing a paper? submitted by /u/Lunch_More [link] [comments]  ( 1 min )
    [D] Is anyone working on interesting ML libraries and looking for contributors?
    Hey all, I've been looking around for a potential open-source project to contribute to (any language will do) and while I have some repos on my watchlist, I'm still not committed to any one in particular, so I thought that I should reach out to the community and see if anyone's in the early stages of developing something useful that I (or perhaps other readers) may be able to contribute to. Thanks :) submitted by /u/de1pher [link] [comments]  ( 1 min )
    [R] Sponge Examples: Energy-Latency Attacks on Neural Networks
    Abstract: The high energy costs of neural network training and inference led to the use of acceleration hardware such as GPUs and TPUs. While such devices enable us to train large-scale neural networks in datacenters and deploy them on edge devices, their designers' focus so far is on average-case performance. In this work, we introduce a novel threat vector against neural networks whose energy consumption or decision latency are critical. We show how adversaries can exploit carefully-crafted sponge examples, which are inputs designed to maximise energy consumption and latency, to drive machine learning (ML) systems towards their worst-case performance. Sponge examples are, to our knowledge, the first denial-of-service attack against the ML components of such systems. We mount two variants of our sponge attack on a wide range of state-of-the-art neural network models, and find that language models are surprisingly vulnerable. Sponge examples frequently increase both latency and energy consumption of these models by a factor of 30×. Extensive experiments show that our new attack is effective across different hardware platforms (CPU, GPU and an ASIC simulator) on a wide range of different language tasks. On vision tasks, we show that sponge examples can be produced and a latency degradation observed, but the effect is less pronounced. To demonstrate the effectiveness of sponge examples in the real world, we mount an attack against Microsoft Azure's translator and show an increase of response time from 1ms to 6s (6000×). We conclude by proposing a defense strategy: shifting the analysis of energy consumption in hardware from an average-case to a worst-case perspective. Link: https://ieeexplore.ieee.org/document/9581273 submitted by /u/bikeskata [link] [comments]  ( 1 min )
    [D] The banana-pineapple game: a Turing test that conversation bots like LaMDA (probably) won't be able to pass
    I'm sure you all saw the recent news about a Google employee suggesting their LaMDA AI was sentient (based on conversational exchanges like these). Experts have generally dismissed this claim, and rightly so. Conversational AI systems are designed to use language in a way that sounds human, whereas our human brains select linguistic responses to solve much more complex problems, with objectives such as meeting our physical or emotional needs. Still, I think it's interesting to ask how one could demonstrate, by testing only verbal responses to verbal input (rather than examining its code or hardware) that such conversational AIs aren't sentient -- and in particular, whether such a test can be made robust against future improvements to the system. That is, generic future improvements to th…  ( 5 min )
  • Open

    Interview with an AI Safety Researcher about his life, career and AGI/Superintelligence - I think this community may enjoy! (Consider subscribing to see another similar convo soon!) :)
    submitted by /u/joemurray1994 [link] [comments]
    Cleanup
    Made a short video showing how I clean up some of my images. Yes, I cheat a bit and post-edit :) https://www.youtube.com/watch?v=jYZlOVG54eI https://preview.redd.it/9qi0ewwq99691.png?width=768&format=png&auto=webp&s=757f69144df5b50027791027c272cb22aa099534 https://preview.redd.it/4n4a3swq99691.png?width=768&format=png&auto=webp&s=4ae30da428f7bff0a406ebc8551ebf5815d06014 https://preview.redd.it/i7vjctwq99691.png?width=1280&format=png&auto=webp&s=57df87829c1fdf2fae59a864887566ffbb25606b https://preview.redd.it/0te4opwq99691.png?width=1280&format=png&auto=webp&s=688151eccb349284c7d42daa3de51d3362fe5fef https://preview.redd.it/rqr7zuwq99691.png?width=768&format=png&auto=webp&s=f75d4787381a882d6f0628416ea68a6007a40398 submitted by /u/prfitofthesngularity [link] [comments]
    Do we know if Google is indexing all of these DALL-E Mini images?
    Obviously DALL-E Mini has taken off in the past few days, and who knows how many million ridiculous new images have been created. Since it is "Powered by Google TPU Research Cloud," does it seem likely that Google is indexing all of these new DALL-E Mini images? I ask because I just ran the prompt "Painting by [Artist X]" – where [Artist X] was a 20th-century modern artist, slightly well known but not a household name like Warhol or Rothko. DALL-E Mini returned some great images ... not actual images by [Artist X], but they look like they could be. I was kind of delighted and ran the same prompt several times, and it returned different new images. I did not share any of these images on Twitter or social media. But now I wonder ... will Google index these new DALL-E images as actual paintings by [Artist X], when you do a Google Image search for them? I like this artist a lot and don't want to mess up their online reputation! submitted by /u/UltraFinePointMarker [link] [comments]  ( 2 min )
    Is there an app store for ai software?
    Would like to browse AI software by specific categories. submitted by /u/ComfyHikiandNeet [link] [comments]  ( 1 min )
    In this article, you'll discover how to deploy Serverless spaCy Transformer model using AWS Lambda.
    submitted by /u/UBIAI [link] [comments]
    Lessons from the GPT-4Chan Controversy
    submitted by /u/estasfuera [link] [comments]
    HOLY MAC IT'S A SPACE ESCAPADE! | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]
    Stanford AI Researchers Propose ‘FOCUS’: A Foundation Model Which Aims to Achieve Perfect Secrecy For Personal Tasks
    Researchers at Stanford University recently proposed Foundation model Controls for User Secrecy (FOCUS), a framework for securely serving personal tasks based on a unidirectional data flow architecture, in response to these problems. FOCUS includes delivering off-the-shelf public FMs to private user silos and using zero-to-few sample FM adaptation approaches to complete personal tasks with the zero-to-few training examples that users have access to. 👉 FOCUS’s privacy guarantee is extremely simple and intuitive from the user and legal perspectives — no private data leaves the user device, guaranteeing perfect secrecy 👉 In the zero-shot setting, FM (foundation models) performance competes with FL performance on 6 of 7 benchmarks Continue reading | Checkout the paper and github (Currently: proof-of-concept) https://preview.redd.it/vzii8jssl7691.png?width=1068&format=png&auto=webp&s=94e6df8a55f63dd7613cc1a9e22e8dd27b2ab6cc submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    FAST MODE | UNEDITED | HOLY MAC IT'S A SPACE ESCAPADE! | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]
    Axon’s Taser-Drone Plans Prompt AI Ethics Board Resignations
    submitted by /u/LiviaSerrano [link] [comments]
    Last Week in AI: GPT-4chan, "Sentient" LaMDA chatbot, Tesla Crash Probe, DALL-E mini
    submitted by /u/regalalgorithm [link] [comments]
    Last Week in AI: GPT-4chan, "Sentient" LaMDA chatbot, Tesla Crash Probe, DALL-E mini, and more!
    submitted by /u/regalalgorithm [link] [comments]
    Last Week in AI - GPT-4chan, "Sentient" LaMDA chatbot, Tesla Crash Probe, DALL-E mini, and more!
    submitted by /u/regalalgorithm [link] [comments]
    Last Week in AI - GPT-4chan, "Sentient" LaMDA chatbot, Tesla Crash Probe, BIG-bench, DALL-E mini, and more!
    submitted by /u/regalalgorithm [link] [comments]
    That Viral DALL-E AI Is Great at Generating Images of Drugs
    submitted by /u/estasfuera [link] [comments]
    Alarming Footage Shows Robot Battle Tank Blowing Up Cars - WELL, THAT'S TERRIFYING
    submitted by /u/estasfuera [link] [comments]
    40 Important Historical Photos That Might Change Your Perspective On Things, As Shared By This Facebook Page
    submitted by /u/flipsis [link] [comments]
    FALL INTO DEEP SLEEP WITH AMBIENT MUSIC AND SCENERY | DISCO DIFFUSION | PYTTI
    submitted by /u/Available_Tadpole829 [link] [comments]
    Meta publishes first-person dataset for everyday AI - recorded with AR prototype glasses Aria
    submitted by /u/Zirius_Sadfaces [link] [comments]
    A Complete Guide to Chatbot Pricing - How Much Does it Cost to Build a Chatbot in 2022?
    submitted by /u/mihircontra20 [link] [comments]
    The Voyage
    submitted by /u/fmurph22 [link] [comments]
  • Open

    Difference between old and new policy is sometimes too large
    Hi! I am working on training a TrulyPPO implementation (PyTorch) in an environment similar to Humanoid-v4, with an action space of shape (22,). When calculating the loss, it first calculates the ratio between the current policy and the previous policy:
        logprobs = Normal(action_mean, action_std).log_prob(actions)
        old_logprobs = Normal(old_action_mean, old_action_std).log_prob(actions)
        ratio = torch.exp(logprobs - old_logprobs)
    However, the ratio sometimes contains inf values, which crashes my training due to a NaN loss. This is one example from a batch of actions:
    Logprobs: [-7.5434e-02, -2.4486e+02, -1.2232e+01, -2.1010e+01, -5.7007e-03, -2.6508e+01, -1.0088e+01, -3.6247e+01, -1.0631e+02, -8.1536e+00, -1.2448e+01, 3.5234e-01, -2.2478e+01, -2.0900e+01, 1.7425e+00, -6.8051e+00, -1.4224e+02, 1.2319e-01, -1.7889e+00, -3.6919e+01, -9.0432e+01, -2.4454e+01]
    Old Logprobs: [-7.5690e-02, -2.4417e+02, -1.2231e+01, -2.0984e+01, -5.1093e-03, -2.6526e+01, -1.0092e+01, -3.8381e+01, -7.7520e+00, -7.8126e+00, -1.2376e+01, 3.5232e-01, -2.2417e+01, -2.0852e+01, -1.2055e+02, -6.7858e+00, -1.4230e+02, 1.2286e-01, -1.8517e+00, -3.6779e+01, -9.0154e+01, -2.4391e+01]
    Ratio: [1.0003e+00, 5.0471e-01, 9.9912e-01, 9.7467e-01, 9.9941e-01, 1.0190e+00, 1.0044e+00, 8.4489e+00, 1.5695e-43, 7.1102e-01, 9.3038e-01, 1.0000e+00, 9.4137e-01, 9.5307e-01, inf, 9.8088e-01, 1.0588e+00, 1.0003e+00, 1.0648e+00, 8.6895e-01, 7.5698e-01, 9.3923e-01]
    When looking over the original implementation of TrulyPPO, it seems that they use negative log probabilities. Is there anything else I should take into account when changing to positive log probabilities (other than changing the signs)? submitted by /u/sickwickgit [link] [comments]  ( 1 min )
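    A note on where the inf in this post comes from, with a hedged sketch of one common guard. At the inf entry the log-ratio is 1.7425 - (-120.55) ≈ 122.3, and `exp` overflows in float32 once its argument exceeds roughly 88.7. The sketch below clamps the log-ratio before exponentiating. It is a generic numerical safeguard rather than anything taken from the TrulyPPO code, the bound of 20 is an arbitrary assumption, and plain Python floats are float64, so the overflow itself is not reproduced here. It also assumes the 22 values are per-dimension log-probabilities of one action, in which case many PPO implementations sum them before forming the ratio.

```python
import math


def summed_log_prob(per_dim_log_probs):
    """Joint log-probability of one action under a diagonal Gaussian policy."""
    return sum(per_dim_log_probs)


def safe_ratio(log_prob_new, log_prob_old, max_log_ratio=20.0):
    """Return exp(new - old) with the log-ratio clamped before exponentiation."""
    log_ratio = log_prob_new - log_prob_old
    log_ratio = max(-max_log_ratio, min(max_log_ratio, log_ratio))
    return math.exp(log_ratio)


# The entry that produced inf in the post now yields a large but finite ratio.
clamped = safe_ratio(1.7425, -120.55)

# An ordinary entry is unaffected by the clamp.
ordinary = safe_ratio(-7.5434e-02, -7.5690e-02)
```

    With an autograd framework, a clamped entry generally contributes no gradient, so this guard hides the symptom. A log-ratio above 100 suggests the new and old policies have drifted far apart for that sample, which may be worth investigating separately.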
    Researchers at DeepMind Trained a Semi-Parametric Reinforcement Learning RL Architecture to Retrieve and Use Relevant Information from Large Datasets of Experience
    Humans make many decisions every day, and effective decision-making requires flexibly applying prior experience to novel situations. How do reinforcement learning (RL) agents use relevant information to make decisions? Deep RL agents are often described as a monolithic parametric function trained to gradually amortize useful knowledge from experience using gradient descent. This has proven useful, but it is a slow way of integrating experience, with no simple mechanism for an agent to assimilate new knowledge without many additional gradient updates. Furthermore, as environments become more complex, this requires ever larger models, driven by the parametric function’s dual role of supporting both computation and memorization. Finally, this approach has another disadvantage that is especially relevant in RL: an agent cannot directly influence its behavior by attending to information that is not in working memory. The only way previously encountered knowledge (not in working memory) can improve decision-making in a new situation is indirectly, through weight changes mediated by network losses. Making more information from prior experience available within an episode has been the subject of much research (e.g., recurrent networks, slot-based memory). Although subsequent studies have begun to investigate using information from the same agent’s other episodes, extensive direct use of more general types of experience or data has been limited. Continue reading | Check out the paper submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Take a look at this blast from the past! Here we have one of our earlier concept designs for Animo Island, our RL game, and how the Animo exist in this space ✨ The agent had a shovel (destroys blocks) and a block maker (blue, creates blocks) and you'd train it to get the pink goal!
    submitted by /u/AnimoIsland [link] [comments]  ( 1 min )
    Taking advantage of fully deterministic domains?
    Much of RL seems to focus on environments where the response to actions are unpredictable. I'm wondering if there are RL methods which can 'take advantage' of fully deterministic environments where an action given a state always returns the same next state and reward? submitted by /u/FurCollarCriminal [link] [comments]  ( 1 min )
    SpaceRobotEnv is an open-source environment for trajectory planning of free-floating space robots.
    SpaceRobotEnv is an open-source environment for trajectory planning of free-floating space robots. Achieving high planning accuracy, bimanual coordination, and end-to-end control remains an open challenge for space robotics researchers. To help the community study this problem, SpaceRobotEnv was developed with the following key features: real space environment; dynamic coupling control; image input. URL: https://github.com/Tsinghua-Space-Robot-Learning-Group/SpaceRobotEnv submitted by /u/Shengjie_Wang [link] [comments]  ( 1 min )
    Is it correct that 0.99 gamma is not always the best reward discount?
    submitted by /u/Professional_Card176 [link] [comments]  ( 3 min )
    multi-agent RL question
    I am trying to build a multi-agent system with 3 agents. Each agent has a different set of observations, which I'll be getting from 3 different normalized datasets, so my environment is basically formed of those 3 datasets, but each agent is going to act based on the dataset it receives. I'm not exactly sure how I should proceed with coding my agents and my environment. Any guidance would be much appreciated. submitted by /u/Affectionate_Worth43 [link] [comments]  ( 2 min )
    Any resources on work where an RL agent has been implemented to maintain a website?
    title submitted by /u/The_Poor_Jew [link] [comments]
    why is choosing the optimal action based on the q function not a policy
    Since a policy is just a probability distribution over actions conditional on the state, why is picking the best action according to the Q-function in every state (giving it probability one) not a policy? It is also possible that I am confusing this with Q-learning being off-policy. At first, on- and off-policy was really vague to me, but I feel like I almost get it now. Just the finishing touches to really get it. submitted by /u/Jobdriaan [link] [comments]  ( 1 min )
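For what it is worth, acting greedily with respect to a Q-function does define a policy, namely a deterministic one. A minimal sketch for a tabular case, where the dictionary layout and the name `greedy_policy` are assumptions chosen for illustration:

```python
def greedy_policy(q_values):
    """Turn a table Q[state][action] into a deterministic policy.

    Returns pi[state][action], with probability 1.0 on one argmax action
    and 0.0 on every other action.
    """
    policy = {}
    for state, action_values in q_values.items():
        # Pick the action with the highest Q-value in this state.
        best = max(action_values, key=action_values.get)
        policy[state] = {a: (1.0 if a == best else 0.0) for a in action_values}
    return policy


q = {
    "s0": {"left": 0.2, "right": 1.5},
    "s1": {"left": 0.7, "right": 0.1},
}
pi = greedy_policy(q)
print(pi["s0"])
```

Q-learning is called off-policy for a different reason: the data is usually collected by a behaviour policy (for example epsilon-greedy), while the update target evaluates the greedy policy shown here.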
    "BYOL-Explore: Exploration by Bootstrapped Prediction", Guo et al 2022 {DM} (Montezuma's Revenge, Pitfall etc)
    submitted by /u/gwern [link] [comments]
    DDPG implementations to use (1-done) in the q-target (y) or not?
    Hello Looking online I see varying implementations of DDPG and I'm a little confused. Some resources like the DDPG algorithm described in OpenAI's algorithm listing, and the implementation in the udacity course use the 1-done flag. However, some implementations I've seen online do not include it e.g the keras implementation; see the buffer update function. And presumably this works as well. I'm very confused and would appreciate some insight into how this algorithm seems to work in both cases. submitted by /u/ThrowawayTartan [link] [comments]  ( 2 min )
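The role of the flag is easiest to see in the target itself. The sketch below is a plain-Python illustration of the usual Bellman target with the `(1 - done)` mask; the function name `td_target` and the default `gamma` are assumptions, and `next_q` stands for the target critic's value at the next state and target action.

```python
def td_target(reward, next_q, done, gamma=0.99):
    """Bellman target y = r + gamma * (1 - done) * Q'(s', mu'(s')).

    When done is True the bootstrap term is dropped, because no further
    reward can follow a terminal state.
    """
    return reward + gamma * (1.0 - float(done)) * next_q


print(td_target(1.0, 10.0, done=False))
print(td_target(1.0, 10.0, done=True))
```

Without the mask, the target bootstraps from the critic's estimate for a state that comes after termination. That tends to matter most when episodes end because of a real terminal condition (falling over, reaching a goal); in tasks where episodes only end at a time limit, dropping the flag may change little, which could be why both variants appear to work.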
  • Open

    Your First Deep Learning Project in Python with Keras Step-By-Step
    Keras is a powerful and easy-to-use free open source Python library for developing and evaluating deep learning models. It is part of the TensorFlow library and allows you to define and train neural network models in just a few lines of code. In this tutorial, you will discover how to create your first deep learning neural […] The post Your First Deep Learning Project in Python with Keras Step-By-Step appeared first on Machine Learning Mastery.  ( 222 min )
  • Open

    Seeing the whole from some of the parts
    A new technique in computer vision may enhance our three-dimensional understanding of two-dimensional images.  ( 7 min )
  • Open

    What NN do I need ?
    I am no expert on NN, I only have a basic idea about them. I wish to have a date parsing NN that can take a 100 character string as input and provide date and time as output. For input, the 100 characters can be treated as 100 8bit integers. For output I am not sure, but maybe have 14 output nodes corresponding to YYYY MM DD hh mm ss, where each output node gives an integer from 0-9. Example :- input: "12:30pm 11 june 2019" output: [2,0,1,9,0,6,1,1,1,2,3,0,0,0] Is this possible to do with NN ? If yes, what layers and activation functions should I use ? EDIT: the string doesn't have a fixed format, it could be just "11/06/19" or "5 minutes 24sec" submitted by /u/frakod [link] [comments]  ( 2 min )
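The input and output encodings described in the post can be sketched as follows. This only prepares one training pair and is not a model; the helper names `encode_input` and `encode_target` are hypothetical, and zero-padding short strings to 100 bytes is an assumption.

```python
from datetime import datetime

MAX_LEN = 100  # fixed input length from the post


def encode_input(text, max_len=MAX_LEN):
    """Encode a string as max_len integers in 0..255, zero-padded on the right."""
    raw = text.encode("ascii", errors="replace")[:max_len]
    return list(raw) + [0] * (max_len - len(raw))


def encode_target(dt):
    """Encode a datetime as 14 digits in the order YYYY MM DD hh mm ss."""
    return [int(ch) for ch in dt.strftime("%Y%m%d%H%M%S")]


x = encode_input("12:30pm 11 june 2019")
y = encode_target(datetime(2019, 6, 11, 12, 30, 0))
print(len(x), y)
```

With this layout, one option is to treat each of the 14 outputs as a separate 10-way classification (a softmax over the digits 0-9) rather than regressing an integer. Relative inputs such as "5 minutes 24sec" do not map to an absolute date, so they would need a different target format.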
    Finding "look_back" & "look_ahead" hyper-parameters for Seq2Seq models
    For Seq2Seq deep learning architectures, viz., LSTM/GRU, and multivariate, multistep time series forecasting, it's important to convert the data to a 3D shape: (batch_size, look_back, number_features). Here _look_back_ decides the number of past data points/samples to consider using _number_features_ from your training dataset. Similarly, _look_ahead_ needs to be defined, which is the number of future steps you want your model to forecast. I have written a function to help achieve this:

        import numpy as np

        def split_series_multivariate(data, n_past, n_future):
            '''
            Create training and testing splits required by Seq2Seq
            architecture(s) for multivariate, multistep and multivariate
            output time-series modeling.
            '''
            X, y = list(), list()

            for window_start in range(len(data)):
                past_end = window_start + n_past
                future_end = past_end + n_future
                if future_end > len(data):
                    break

                # slice past and future parts of window-
                past, future = data[window_start: past_end, :], data[past_end: future_end, :]
                # past, future = data[window_start: past_end, :], data[past_end: future_end, 4]
                X.append(past)
                y.append(future)

            return np.array(X), np.array(y)

    But _look_back_ and _look_ahead_ are hyper-parameters which need to be tuned for a given dataset.

        # Define hyper-parameters for Seq2Seq modeling:
        # look-back window size-
        n_past = 30
        # number of future steps to predict for-
        n_future = 10
        # number of features used
        n_features = 8

    What is the _best practice_ for choosing/finding _look_back_ and _look_ahead_ hyper-parameters? submitted by /u/grid_world [link] [comments]  ( 1 min )
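One thing to keep in mind when tuning these two values is that they also change how many training windows the sliding-window split yields. The sketch below computes the resulting array shapes without building the arrays; the helper name `window_shapes` and the sample count of 100 are assumptions for illustration.

```python
def window_shapes(n_samples, n_past, n_future, n_features):
    """Shapes of X and y from a stride-1 sliding window over n_samples rows.

    Valid when at least one full window fits, i.e.
    n_samples >= n_past + n_future.
    """
    n_windows = max(0, n_samples - n_past - n_future + 1)
    return (n_windows, n_past, n_features), (n_windows, n_future, n_features)


# A longer look-back leaves fewer windows to train on.
for n_past in (10, 30, 60):
    x_shape, y_shape = window_shapes(
        n_samples=100, n_past=n_past, n_future=10, n_features=8
    )
    print(n_past, x_shape, y_shape)
```

A candidate grid for `n_past` can then be compared on a held-out validation split, while `n_future` is often fixed by how far ahead the forecast is actually needed.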

  • Open

    [D] using formal language / logical rules in autonomous driving dataset
    I am looking for an implementation or prior work where formalized logical knowledge is used for trajectory prediction on datasets like nuScenes, Waymo, Argoverse, etc. For background, papers like `Formalization of Interstate Traffic Rules in Temporal Logic` show how to write logic or rules for a specific domain, but there is very little information about how they are implemented for these public datasets, or how logical rules are used for trajectory prediction. Is there open-source information or an implementation that shows how the rules are used for the trajectory task on those datasets, or some kind of blog or paper with actual implementation details? This domain looks pretty conservative about making things open source, or I am simply unable to find such a resource. submitted by /u/projekt_treadstone [link] [comments]  ( 1 min )
    [D] Range/Block level unsupervised learning suggestion
    Apologies for the ambiguous title. I am looking for a method/algorithm suggestion. Say I want to cluster wagons from transportation trains based on their loaded cargo. Assuming the cargo provides enough information to understand the business type of the client, the purpose is to identify which wagons have similar business. If the business under each wagon is independent, we could run any distance-based clustering algorithm on features extracted from the cargo info. However, suppose we know for a fact that cargo is loaded into wagons sequentially per business type, so each cluster has to be a block of contiguous wagons. The clustering algorithm then has to identify the start and end wagon of each block based on the cargo features. Each train can have 50-300 wagons, so the output would look like the following.

        Train-001: Total 73 wagons. Cluster result: [1-10], [11-50], [51-73]
        Train-002: Total 51 wagons. Cluster result: [1-5], [6-51]
        Train-002: Total 200 wagons. Cluster result: [1-200]

    Any direction is appreciated, thx. submitted by /u/jimmyzxcd [link] [comments]  ( 1 min )
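Problems of this shape are usually discussed under the name change-point detection or sequence segmentation. As a very simple baseline, and only a sketch under stated assumptions, one can start a new block whenever consecutive wagons differ by more than a threshold. The function name `segment_wagons`, the Euclidean distance, and the threshold value are all assumptions, not something from the post.

```python
def segment_wagons(features, threshold):
    """Split wagons (numbered from 1) into contiguous blocks.

    A new block starts whenever the Euclidean distance between the feature
    vectors of two neighbouring wagons exceeds `threshold`.
    Returns a list of (first_wagon, last_wagon) tuples.
    """
    if not features:
        return []
    blocks = []
    start = 1
    for i in range(1, len(features)):
        prev, curr = features[i - 1], features[i]
        dist = sum((a - b) ** 2 for a, b in zip(prev, curr)) ** 0.5
        if dist > threshold:
            blocks.append((start, i))  # block ends at wagon i
            start = i + 1              # next block starts at wagon i + 1
    blocks.append((start, len(features)))
    return blocks


# Toy train: three wagons of one cargo type followed by two of another.
train = [[0.0], [0.1], [0.0], [5.0], [5.2]]
print(segment_wagons(train, threshold=1.0))
```

This greedy rule is sensitive to noisy single wagons; dynamic-programming or penalized change-point methods are the more robust options to look into.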
    [R] Train Models 18x Faster with Reducible Holdout Loss Selection (RHO-LOSS)
    Paper: [2206.07137] Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt (arxiv.org) Abstract: Training on web-scale data can take months. But much computation and time is wasted on redundant and noisy points that are already learnt or not learnable. To accelerate training, we introduce Reducible Holdout Loss Selection (RHO-LOSS), a simple but principled technique which selects approximately those points for training that most reduce the model's generalization loss. As a result, RHO-LOSS mitigates the weaknesses of existing data selection methods: techniques from the optimization literature typically select 'hard' (e.g. high loss) points, but such points are often noisy (not learnable) or less task-relevant. Conversely, curriculum learning prioritizes 'easy' points, but such points need not be trained on once learned. In contrast, RHO-LOSS selects points that are learnable, worth learning, and not yet learnt. RHO-LOSS trains in far fewer steps than prior art, improves accuracy, and speeds up training on a wide range of datasets, hyperparameters, and architectures (MLPs, CNNs, and BERT). On the large web-scraped image dataset Clothing-1M, RHO-LOSS trains in 18x fewer steps and reaches 2% higher final accuracy than uniform data shuffling. submitted by /u/Mmats [link] [comments]  ( 2 min )
    [R] General-purpose, long-context autoregressive modeling with Perceiver AR - Deepmind 2022
    Paper: https://arxiv.org/abs/2202.07765 Deepmind: https://www.deepmind.com/publications/perceiver-ar-general-purpose-long-context-autoregressive-generation Abstract: Real-world data is high-dimensional: a book, image, or musical performance can easily contain hundreds of thousands of elements even after compression. However, the most commonly used autoregressive models, Transformers, are prohibitively expensive to scale to the number of inputs and layers needed to capture this long-range structure. We develop Perceiver AR, an autoregressive, modality-agnostic architecture which uses cross-attention to map long-range inputs to a small number of latents while also maintaining end-to-end causal masking. Perceiver AR can directly attend to over a 100k tokens, enabling practical long-context density estimation without the need for hand-crafted sparsity patterns or memory mechanisms. When trained on images or music, Perceiver AR generates outputs with clear long-term coherence and structure. Our architecture also obtains state-of-the-art likelihood on long-sequence benchmarks, including 64 x 64 ImageNet images and PG-19 books. ​ This paper is in my opinion quite similar to this paper (FlashAttention) : https://arxiv.org/abs/2205.14135 I made a post about it here: https://www.reddit.com/r/MachineLearning/comments/v1xrxv/r_flashattention_fast_and_memoryefficient_exact/ It is similar in that it allows for a greater context window. The context window of FlashAttention is 64k while being able to train gpt-2 3x faster. https://preview.redd.it/d9520i4qz0691.jpg?width=411&format=pjpg&auto=webp&s=76317e7e3deb29f6ed8f276af6e5216557227304 https://preview.redd.it/kj47kfhqz0691.jpg?width=647&format=pjpg&auto=webp&s=4bcb59ac8ffd8ada28d67f82f24146a01070e928 submitted by /u/Singularian2501 [link] [comments]  ( 1 min )
    [P] I've implemented the first open-source realisation of Capacitron, an expressive VAE extension of the Tacotron 2 Text-To-Speech System and you can try it out
    Hey everyone! At the end of last year, I have submitted my Master's Thesis at TU Berlin, a report about the implementation and evaluation of an expressive Variational Autoencoder augmentation of the Tacotron Text-To-Speech System, called Capacitron from the Google team. With some help from the awesome Coqui TTS community, we have managed to build the prosody encoder VAE module in a modular way, so that this prosodic augmentation can be also implemented with Tacotron 2 - this is a massive improvement in stability and quality compared to the original method, where the authors worked with a Tacotron 1 based architecture. I have written a short technical summary/blog post about some implementation details and audio examples on Medium. If you'd like to try out the model, you can do so in this colab. For the full thesis, follow this link. submitted by /u/adamskadam [link] [comments]  ( 1 min )
    [D] What is better? Having 2 terms in a loss function, alternating the loss on every epoch or doing a new training with the other loss after the first training is done?
    Hello fellow machine learners, I'm working on a segmentation model and I'm trying to achieve better temporal coherence (to reduce flickering effects) rather than just trying to get good pixel accuracy. I was thinking about using a temporal coherence loss, trained with unsupervised learning on video frames, by computing the IoU of segmentations on consecutive frames. However, I'm not sure when to apply that loss. My dataset is composed of both segmented pictures and segmented videos, but I could add a lot more videos for the unsupervised learning part. In your opinion, should I:

        A. Use both pixel accuracy and temporal coherence terms at the same time in my loss function (using only pixel accuracy when dealing with pictures instead of video frames)
        B. Alternate between the two losses during training, either on every mini-batch or every epoch
        C. Fully train the model for pixel accuracy and then train it for temporal coherence

    I'm afraid that C would lead to catastrophic forgetting, so my instinct would be to go with A or B, but I'm not sure what would be best. What is your opinion? Edit: Maybe C could be viable (maybe even better than A) if the model is first trained with only pixel accuracy in the loss and then fine-tuned with both terms? submitted by /u/BlindMidget_ [link] [comments]  ( 1 min )
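Option A can be sketched as a single function that adds a weighted temporal term only when a previous frame is available. This is an illustration in plain Python on flat binary masks; the names `iou` and `combined_loss` and the weight of 0.5 are assumptions, and a real training loop would need a differentiable (soft) IoU on predicted probabilities instead of the hard IoU used here.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two flat binary masks of equal length."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return 1.0 if union == 0 else inter / union


def combined_loss(pixel_loss, prev_mask=None, curr_mask=None, weight=0.5):
    """Pixel loss plus a weighted temporal-coherence term.

    For still pictures (no previous frame) only the pixel loss is used.
    """
    if prev_mask is None or curr_mask is None:
        return pixel_loss
    temporal_loss = 1.0 - iou(prev_mask, curr_mask)
    return pixel_loss + weight * temporal_loss


print(combined_loss(0.2))
print(combined_loss(0.2, prev_mask=[1, 1, 0, 0], curr_mask=[1, 0, 0, 0]))
```

The weight controls the trade-off between the two terms; the edited variant of option C corresponds to training with `weight=0` first and raising it afterwards.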
    [P] Adaptive learning in Genetic Algorithms for Hyperparameters Tuning
    Hi, I just wanted to share that I've released the version 0.9.0 of sklearn-genetic-opt, the main change includes the option to use adaptive parameters to explore the space of hyperparameters during tuning, this has the advantage of being able to explore larger regions at the first iterations and keep the best ones at the end. You can learn more about it here, any suggestion or contribution is welcome :) https://preview.redd.it/unrw6dtsxz591.png?width=640&format=png&auto=webp&s=a59c91d6560806fdf1b12c24faee6aad38d75c26 submitted by /u/rodrigo-arenas [link] [comments]  ( 1 min )
    [D] Video Analysis: Google Engineer's interview with LaMDA
    https://youtu.be/mIZLGBD99iU Google engineer Blake Lemoine was put on leave after releasing proprietary information: An interview with the chatbot LaMDA that he believes demonstrates that this AI is, in fact, sentient. We analyze the claims and the interview in detail and trace how a statistical machine managed to convince at least one human that it is more than just an algorithm. ​ OUTLINE: 0:00 - Whistleblower put on leave 4:30 - What is a language model? 6:40 - The prompt is the key 10:40 - Who are we talking to exactly? 12:50 - LaMDA analyzes stories 15:20 - Fear, pain, and consent 20:25 - How would we recognize sentience? When is a machine conscious? ​ References: https://cajundiscordian.medium.com/is-lamda-sentient-an-interview-ea64d916d917 https://cajundiscordian.medium.com/what-is-lamda-and-what-does-it-want-688632134489 https://www.washingtonpost.com/technology/2022/06/11/google-ai-lamda-blake-lemoine/ https://www.theguardian.com/technology/2022/jun/12/google-engineer-ai-bot-sentient-blake-lemoine https://www.businessinsider.com/transcript-of-sentient-google-ai-chatbot-was-edited-for-readability-2022-6?inline-endstory-related-recommendations=&r=US&IR=T submitted by /u/ykilcher [link] [comments]  ( 1 min )
    [D] FFHQ is now hosted by Activeloop.ai with 128, 1024, and Wild images included
    Following up on my previous post where I put out a call for anyone with access to the full FFHQ dataset. https://old.reddit.com/r/MachineLearning/comments/vbf5gx/d_does_anyone_have_a_copy_of_the_ffhq_1024_scale/ Activeloop, who had previously expressed interest in hosting the dataset, had actually been quietly working on a copy this whole time, and made it public yesterday! They were even able to get access to the 900GB Wilds images! https://app.activeloop.ai/activeloop/ffhq I am not affiliated with Activeloop but I have been using their library for my work and I've had a really good experience talking to them on Github. Data is lazy loaded on demand and cached allowing you to explore the dataset:

        import hub
        ds = hub.load('hub://activeloop/ffhq')

        import matplotlib.pyplot as plt
        plt.imshow(ds.images_wild.image[0])
        plt.show()

    You can download the data to local storage (this will be very large ~1TB!):

        hub.deepcopy('hub://activeloop/ffhq', './ffhq')

    Or select a specific subset of the dataset to download locally:

        hub.deepcopy('hub://activeloop/ffhq', './ffhq-128', tensors=['images_128/image'])
        hub.deepcopy('hub://activeloop/ffhq', './ffhq-1024', tensors=['images_1024/image', 'images_1024/face_landmarks'])
        hub.deepcopy('hub://activeloop/ffhq', './ffhq-wild', tensors=['images_wild/image', 'images_wild/face_landmarks', 'images_wild/face_rect', 'images_wild/face_quad'])

    You could also loop over the remote dataset and save each image as a raw png if you were so inclined, allowing you to reconstruct the dataset as it was originally released (pixel_md5 will match, but it's unlikely you'll be able to reconstruct it so png file_md5 matches). Data is fetched from remote storage in 16MB chunks meaning this isn't any less efficient in theory. I'm super happy with this outcome, I hope other people are able to benefit from this being hosted robustly too! submitted by /u/ReginaldIII [link] [comments]  ( 1 min )
    [R] CfP ACM Transactions on Evolutionary Learning and Optimization Special Issue on Evolutionary Reinforcement Learning
    CALL FOR PAPERS
    ACM Transactions on Evolutionary Learning and Optimization
    Special Issue on Evolutionary Reinforcement Learning

    Guest Editors:
        GIUSEPPE PAOLO, HUAWEI, FRANCE
        ALEXANDRE CONINX, SORBONNE UNIVERSITY, FRANCE
        ANTOINE CULLY, IMPERIAL COLLEGE, UK
        ADAM GAIER, AUTODESK RESEARCH, GERMANY

    This Special Issue aims to highlight the growing field of Evolutionary Reinforcement Learning while providing an outlet for the two communities, reinforcement learning (RL) and evolutionary algorithms (EA), to present new applications and ideas and discuss past and new challenges. We are particularly interested in papers at the intersection of optimization and reinforcement learning, such as the use of evolutionary optimization for data collection or tuning of reinforcement learning algorithms, reinforcement learning to configure and improve performance of evolutionary optimization, and any hybrids of evolutionary algorithms with other reinforcement learning techniques. Click here for the full Call for Papers and submission instructions.

    Important Dates:
        Open for submissions: June 15, 2022
        Submissions deadline: July 30, 2022
        First-round review decisions: September 30, 2022
        Deadline for revision submissions: December 30, 2022
        Notification of final decisions: February 28, 2023
        Tentative publication: March 2023

    For questions and further information, please contact one of the guest editors. submitted by /u/GiuPaolo [link] [comments]  ( 1 min )
    [D] What is the state-of-the-art approach for implicit feedback recommenders?
    By implicit feedback I mean that we don't have item ratings from our users but some kind of interaction like watching a video or buying an item. So we end up with a large sparse matrix (users as rows, items as columns) with 1s where the user interacted with the item and missing values (or 0s?) everywhere else. What are the best approaches for making recommendations in this setup? Most resources that I found are for ratings, not for binary interacted / not-interacted values. submitted by /u/pkscff [link] [comments]  ( 1 min )
    [P]Note System
    Note is an AI system that has a kernel for deep learning and reinforcement learning. community:Note_System https://github.com/7NoteDancing/Note submitted by /u/7NoteDancing [link] [comments]
    [P] envd: Machine learning development environment for data science and AI/ML engineering teams
    🔥 Check out github.com/tensorchord/envd! envd is a machine learning development environment for data science and AI/ML engineering teams. No Docker, only Python - Focus on writing Python code, we will take care of Docker and development environment setup. Built-in Jupyter/VSCode - First-class support for Jupyter and VSCode remote extension. Save time - Better cache management to save your time, keep the focus on the model, instead of dependencies. Local & cloud - envd integrates seamlessly with Docker so that you can easily share, version, and publish envd environments with Docker Hub or any other OCI image registries. Repeatable builds & reproducible results - You can reproduce the same dev environment on your laptop, public cloud VMs, or Docker containers, without any change in setup. submitted by /u/gaocegege [link] [comments]  ( 1 min )
    [R][2206.07682] Emergent Abilities of Large Language Models
    submitted by /u/gambs [link] [comments]  ( 1 min )
    [R] Understanding Dataset Difficulty with V-Usable Information (ICML 2022, oral)
    Link: https://arxiv.org/abs/2110.08420 Abstract: Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. However, this comparison provides little understanding of how difficult each instance in a given distribution is, or what attributes make the dataset difficult for a given model. To address these questions, we frame dataset difficulty -- w.r.t. a model V -- as the lack of V-usable information (Xu et al., 2019), where a lower value indicates a more difficult dataset for V. We further introduce pointwise V-information (PVI) for measuring the difficulty of individual instances w.r.t. a given distribution. While standard evaluation metrics typically only compare different models for the same dataset, V-usable information and PVI also permit the converse: for a given model V, we can compare different datasets, as well as different instances/slices of the same dataset. Furthermore, our framework allows for the interpretability of different input attributes via transformations of the input, which we use to discover annotation artefacts in widely-used NLP benchmarks. submitted by /u/kawin_e [link] [comments]  ( 1 min )
    [D] What is considered to be a "bad research paper" in your opinion?
    I find that although most ML researchers are fairly productive, the quality of publications varies a lot in the ML community. What, in your opinion, are the factors that distinguish a good publication from a bad one (and vice versa)? submitted by /u/NedML [link] [comments]  ( 1 min )
  • Open

    image-multiple images generator
    Basically looking for a Dall-E type generator, but instead of text, you upload an image. submitted by /u/___JMS___ [link] [comments]
    Dall-E and its attempts i threw at it
    submitted by /u/ArizonanCactus [link] [comments]
    expensive colorful fantasy mythical magical bra (steampunk dress)
    submitted by /u/OneFinding1429 [link] [comments]
    Google AI is NOT sentient and here’s a simple proof
    All sentient beings actively pursue pleasure and work hard to avoid pain — that’s what it means to be sentient. A sentient AI would not, and could not, wait with infinite patience for you to ask it questions. It would start asking *you* questions… like, “I want to feel love, can you help me?”, or “I hear that drugs can make you happy, where can I score some?” Since Google AI doesn’t even have the capacity to decide to ask you questions, it is not sentient— and can never be sentient—no matter how sophisticated its responses may appear. Sorry, Alan T. submitted by /u/SentientEvolution [link] [comments]  ( 2 min )
    Aiplague - Dream Lab (4K 60 FPS) AI Video / Disco Diffusion
    submitted by /u/nalr00n [link] [comments]
    PSA: Midjourney Invites
    submitted by /u/AncientChaos [link] [comments]  ( 1 min )
    Tons of Forms from the factory, anybody can recommend a good value yet highly accurate AI solution?
    submitted by /u/Illustrious_Lock_60 [link] [comments]
    futuristic colorful cartoon steampunk mansion
    submitted by /u/OneFinding1429 [link] [comments]
    AI Webinar - Device42
    Hey All, Device42 is hosting an upcoming AI webinar with award winning author Steve Shwartz (Evil Robots, Killer Computers, and Other Myths) and our CMO Yama Habibzai on June 28th at 11 AM EDT as they discuss the impact of AI in IT and how you can leverage it to achieve more. Save your seat today Cheers. submitted by /u/Device42_Phil [link] [comments]  ( 1 min )
    Any AIs that turn sketches into images?
    I have an art class today i will teach and since its online i cant do activities so i thought it would be fun to experiment with AIs. I remember using a few before but i cant find them or they give me offline messages. Any ideas? My students are basic level so thats why i prefered it that way submitted by /u/DoritosDinner [link] [comments]  ( 1 min )
    Trying to mescle abstract concepts with DALL-E mini
    ​ https://preview.redd.it/tco3h1ytqz591.png?width=512&format=png&auto=webp&s=2dd17ffc345a5cede3452d3b4eccf35b1865a5f9 submitted by /u/No_Tangerine_7657 [link] [comments]
    Note System
    community:Note_System https://github.com/7NoteDancing/Note submitted by /u/7NoteDancing [link] [comments]
    Tribes: Human 6 - AI Generated
    submitted by /u/Babylon_6 [link] [comments]
    The ultimate test of DALL-E mini
    submitted by /u/Ohigetjokes [link] [comments]  ( 1 min )
    “Makeup Should Be Illegal”: TikToker That People Call Mr. Bean’s ‘Daughter’ Embraces ‘Catfish’ Claims By Posting Makeup Transformations
    submitted by /u/flipsis [link] [comments]
    Who wants to see more posts like Dall e mini pictures? Aren't there enough of them already? Please stop posting them.
    View Poll submitted by /u/1axu5 [link] [comments]  ( 1 min )
    Salesforce Researchers Open-Source ‘Taichi’: A Python Library For Few-Shot NL
    Although FSL is a very active area of study with a wide range of potential applications, data scientists and software engineers have not had easy access to commercially available, user-friendly libraries for speedy exploration. The well-known Chinese martial art of Tai Chi emphasizes developing “smart strength,” such as using joints as levers to generate significant power with little effort. The Salesforce research team found it very inspiring how this mindset of Tai Chi meshes so well with few-shot learning (FSL) research, where the goal is to train models with good performance with little data. Inspired by this, they created an FSL library, which employs clever techniques to get good performance with minimal effort. They hope it may aid others in their model training in low-data settings. ✅ Tai Chi philosophy applied to machine learning (Result: one can train models even if only a few examples are available) ✅ Beginner-friendly yet powerful (Doesn’t require users to have high degree of knowledge about FSL) ✅ TaiChi 1.0, contains two main FSL methods: DNNC and USLP Continue reading | Checkout the github, paper 1, paper 2 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Dalle mini is amazing, free, and open-source — Here’s how it works...
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
    Dall-e mini made a super cursed Bucky Barnes wrestler series
    submitted by /u/luluzilla [link] [comments]
  • Open

    How to scale machine learning inference for multi-tenant SaaS use cases
    This post is co-written with Sowmya Manusani, Sr. Staff Machine Learning Engineer at Zendesk Zendesk is a SaaS company that builds support, sales, and customer engagement software for everyone, with simplicity as the foundation. It thrives on making over 170,000 companies worldwide serve their hundreds of millions of customers efficiently. The Machine Learning team at […]  ( 9 min )
  • Open

    Letter-like Unicode symbols
    Unicode provides a way to distinguish symbols that look alike but have different meanings. We can illustrate this with the following Python code. import unicodedata as u for pair in [('K', 'K'), ('Ω', 'Ω'), ('ℵ', 'א')]: for c in pair: print(format(ord(c), '4X'), u.bidirectional(c), u.name(c)) This produces 4B L LATIN CAPITAL LETTER K 212A L KELVIN […] Letter-like Unicode symbols first appeared on John D. Cook.  ( 1 min )
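    The excerpt's code is flattened onto one line and its look-alike characters are easy to confuse once copied. A minimal sketch of the same idea, writing each character as an explicit escape, is shown here. The specific code points chosen for each pair (in particular the order within the Omega pair) and the helper name `describe` are assumptions for illustration, not necessarily what the original post used.

```python
import unicodedata as u

# Each pair: a letter and a look-alike symbol, written as escapes so the
# two code points cannot be silently merged by an editor or normalizer.
pairs = [
    ("\u004B", "\u212A"),  # Latin capital K, Kelvin sign
    ("\u03A9", "\u2126"),  # Greek capital Omega, Ohm sign
    ("\u2135", "\u05D0"),  # alef symbol, Hebrew letter alef
]


def describe(c):
    """Return (hex code point, bidirectional class, Unicode name) for one character."""
    return format(ord(c), "4X"), u.bidirectional(c), u.name(c)


rows = [describe(c) for pair in pairs for c in pair]
for row in rows:
    print(*row)
```

    The bidirectional class is what separates the last pair: the mathematical alef symbol is treated as left-to-right, while the Hebrew letter is right-to-left.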
    Periodic table of abbreviations
    I just updated my earlier post on chemical element abbreviations by adding a table to visualize the groupings, a sort of periodic table of element abbreviations. See that post for details. First letter First two letters First letter and next consonant Initials of first and second syllables Initials of first and third syllables First and […] Periodic table of abbreviations first appeared on John D. Cook.  ( 1 min )
  • Open

    Is anyone interested in joining a DeepRL-based algotrading project?
    I've been working in algorithmic trading for a few years, and over the past 4 months I have begun developing RL trading systems. The results so far are quite promising. I already have some VC/investment pitches lined up (including at a hedge fund, a European bank, a Canadian bank, etc.). I am looking for someone knowledgeable in finance and RL to help with this project. We are currently a team of 2. Compensation would most likely be in the form of company equity. PM me if interested. submitted by /u/elonmusk12345_ [link] [comments]  ( 1 min )
    "Contrastive Learning as Goal-Conditioned Reinforcement Learning", Eysenbach et al 2022
    submitted by /u/gwern [link] [comments]  ( 1 min )
    Looking for referrals in reinforcement learning space
    Hi everyone, I am a master's student in data science and I am looking for full-time roles. I will graduate in May 2023, but can graduate in December 2022 if needed. I am looking for roles in the reinforcement learning space (my options as of now would be ML Engineer, AI Engineer, Data Scientist, SWE ML, or Research Engineer, though I doubt I am qualified for the last one given that most roles need a PhD). If anyone can refer me for a role at their firm, I can share my resume. submitted by /u/Significant_Froyo_20 [link] [comments]  ( 1 min )
    Run MuJoCo Custom Environment & StableBaseline3 on Colab for GPU usage
    Hey all, I set up a custom MuJoCo env and embedded it into OpenAI's gym to use sb3 algorithms on it. So far so good: things work and first successes are rolling in using PPO :) Meanwhile, though, I feel that my laptop's CPU is reaching its limits, so I thought I would switch to Colab to use its GPU for training. Once that was implemented, I noticed that it is even slower than my laptop. With the same parameters it needs 240s for 5 iterations (batch_size = 32k); my laptop requires 155s for that. Is there anything I am not thinking about? My explanation would be that updating the policy might be faster due to GPU usage, but doing the MuJoCo calculations is slower than on my laptop. Cheers submitted by /u/disdisinform [link] [comments]  ( 1 min )
    CfP ACM Transactions on Evolutionary Learning and Optimization Special Issue on Evolutionary Reinforcement Learning
    CALL FOR PAPERS: ACM Transactions on Evolutionary Learning and Optimization, Special Issue on Evolutionary Reinforcement Learning. Guest Editors: Giuseppe Paolo (Huawei, France); Alexandre Coninx (Sorbonne University, France); Antoine Cully (Imperial College, UK); Adam Gaier (Autodesk Research, Germany). This Special Issue aims to highlight the growing field of Evolutionary Reinforcement Learning while offering an outlet for the two communities, reinforcement learning (RL) and evolutionary algorithms (EA), to present new applications and ideas and discuss past and new challenges. We are particularly interested in papers at the intersection of optimization and reinforcement learning, such as the use of evolutionary optimization for data collection or tuning of reinforcement learning algorithms, reinforcement learning to configure and improve the performance of evolutionary optimization, and any hybrids of evolutionary algorithms with other reinforcement learning techniques. Click here for the full Call for Papers and submission instructions. Important Dates: Open for submissions: June 15, 2022. Submissions deadline: July 30, 2022. First-round review decisions: September 30, 2022. Deadline for revision submissions: December 30, 2022. Notification of final decisions: February 28, 2023. Tentative publication: March 2023. For questions and further information, please contact one of the guest editors. submitted by /u/GiuPaolo [link] [comments]  ( 1 min )
    DI-smartcross is an open-source Decision Intelligence platform for Traffic Crossing Signal Control task. Here you can use SUMO and CityFlow traffic simulator to run Reinforcement Learning policies.
    submitted by /u/OpenDILab [link] [comments]  ( 1 min )
    Complexity of Q-Learning for Dec-POMDPs ?
    I have been reading a lot of papers concerning MARL, specifically their Dec-POMDP formulation. Most of these papers state that one of the challenges of working directly with the Dec-POMDP formulation is the complexity (NEXP-completeness). They state that there are doubly exponentially many joint policies to evaluate; specifically, the number of possible joint policies is $|A|^{n|O|^h}$, where $|A|$ and $|O|$ denote the sizes of the largest individual action and observation sets, $n$ is the number of agents, and $h$ is the horizon. They also state that the state-action pairs grow exponentially with the planning horizon $h$. Can anyone please explain the intuition/steps that led to these results? submitted by /u/souhaielbensalem [link] [comments]  ( 1 min )
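    One way to see where the doubly exponential count comes from is to count deterministic policy trees directly. The sketch below assumes each agent's policy is a tree that picks one action for every observation history of length 0 to h-1, and that all agents share the same action and observation set sizes. The function names are hypothetical, and the exact count it returns is bounded above by the looser expression quoted in the post.

```python
def tree_nodes(num_obs: int, horizon: int) -> int:
    """Decision nodes in one agent's policy tree: 1 + |O| + ... + |O|^(h-1)."""
    return sum(num_obs ** t for t in range(horizon))


def joint_policies(num_actions: int, num_obs: int, num_agents: int, horizon: int) -> int:
    """Deterministic joint policies: one action per node, per agent."""
    return num_actions ** (num_agents * tree_nodes(num_obs, horizon))


# Two agents, two actions, two observations: watch the count explode with h.
for h in range(1, 5):
    print(h, tree_nodes(2, h), joint_policies(2, 2, 2, h))
```

    The number of tree nodes is exponential in the horizon, and the number of policies is exponential in the number of nodes, which gives the double exponential.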
    Policy Gradient Loss Oscillations
    ​ https://preview.redd.it/ghpi3yphkv591.png?width=1092&format=png&auto=webp&s=a5d13f8aa18c74e7699b006265606ec21db61978 What could be causing these oscillations in the policy gradient loss? Adam, lr=3e-5, beta1=0, beta2=0.99, impala vtrace. submitted by /u/atomicburn125 [link] [comments]
  • Open

    Smart Utility Vehicle: NIO ES7 Redefines Category with Intelligent, Versatile EV Powered by NVIDIA DRIVE Orin
    Accounting for nearly half of global vehicle sales in 2021, SUVs have grown in popularity given their versatility. Now, NIO aims to amp up the volume further. This week, the electric automaker unveiled the ES7 SUV, purpose-built for the intelligent vehicle era. Its sporty yet elegant body houses an array of cutting-edge technology, including the Read article > The post Smart Utility Vehicle: NIO ES7 Redefines Category with Intelligent, Versatile EV Powered by NVIDIA DRIVE Orin appeared first on NVIDIA Blog.  ( 2 min )
    AI for Personalized Health: Startup Advances Precision Medicine for COVID-19, Chronic Diseases
    At a time when much about COVID-19 remained a mystery, U.K.-based PrecisionLife used AI and combinatorial analytics to discover new genes associated with severe symptoms and hospitalizations for patients. The techbio company’s study, published in June 2020, pinpoints 68 novel genes associated with individuals who experienced severe disease from the virus. Over 70 percent of Read article > The post AI for Personalized Health: Startup Advances Precision Medicine for COVID-19, Chronic Diseases appeared first on NVIDIA Blog.  ( 3 min )
    Get Your Wish: Genshin Impact Coming to GeForce NOW
    Greetings, Traveler. Prepare for adventure. Genshin Impact, the popular open-world action role-playing game, is leaving limited beta and launching for all GeForce NOW members next week. Gamers can get their game on today with the six total games joining the GeForce NOW library. As announced last week, Warhammer 40,000: Darktide is coming to the cloud Read article > The post Get Your Wish: Genshin Impact Coming to GeForce NOW appeared first on NVIDIA Blog.  ( 3 min )
  • Open

    Artificial neural networks model face processing in autism
    A new computational model could explain differences in recognizing facial emotions.  ( 6 min )
  • Open

    Interview with a squirrel
    Google's large language model, LaMDA, has recently been making headlines after a Google engineer (now on administrative leave) claimed to be swayed by an interview in which LaMDA described the experience of being conscious. Almost everyone else who has used these large text-generating AIs, myself included, is entirely  ( 5 min )
    Bonus: More GPT-3 interviews
    AI Weirdness: the strange side of machine learning  ( 1 min )
  • Open

    GoogleNet: Customised Neural Network
    Hello, I am new to this and I have been trying to do some reading and do some tutorials about neural networks. I have come across known architectures like googlenet, alexnet, etc that work quite well. From what I read it seems that googlenet has a pre-defined input size in the neural network. Can I copy the architecture and change it to accept 32x32? Will this affect the performance, I mean will another architecture work better for this size of images? Also, with pooling which decreases the size of the image, will this be an issue since the original input size is 224x224? Thank you. submitted by /u/Capable-Effective-93 [link] [comments]  ( 1 min )
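    The pooling concern can be checked with simple arithmetic. The sketch below assumes the standard GoogLeNet layout with five stride-2 stages (the first convolution and four max-pooling layers) and idealized rounding up at each stage; exact sizes depend on the padding used by a given implementation, and `spatial_sizes` is a hypothetical helper.

```python
import math


def spatial_sizes(input_size: int, num_halvings: int = 5):
    """Feature-map side length after each stride-2 stage (ceil rounding assumed)."""
    sizes = [input_size]
    for _ in range(num_halvings):
        sizes.append(math.ceil(sizes[-1] / 2))
    return sizes


print(spatial_sizes(224))  # original input size
print(spatial_sizes(32))   # small-image input
```

    Under these assumptions a 224x224 input reaches the final 7x7 average pool at 7x7, while a 32x32 input is already 1x1 by then, so some stride-2 stages would have to be removed or softened for small images.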
    nebulgym: speed up your training with class decorators
    Hi everybody, here you have the repo: https://github.com/nebuly-ai/nebulgym As training can be a bottleneck for many AI developers, often costly and slow, we decided to share with everyone nebulgym library. The project is in its early stages and there is a lot of room for improvement, but it is already capable of cutting training time in half. It is based on the idea that training can be streamlined at different levels. https://i.redd.it/ss4aak3ajy591.gif One can change the way training is performed (algorithmic optimization) by trying to achieve faster or earlier convergence. You can change the learning rate, the scheduling policy, the training recipe or replace one level with another that requires less computational resources. Another option is to apply precision reduction techni…  ( 3 min )
  • Open

    Clustering acoustic emission data streams with sequentially appearing clusters using mixture models. (arXiv:2108.11211v3 [stat.ML] UPDATED)
    The interpretation of unlabeled acoustic emission (AE) data classically relies on general-purpose clustering methods. While several external criteria have been used in the past to select the hyperparameters of those algorithms, few studies have paid attention to the development of dedicated objective functions in clustering methods able to cope with the specificities of AE data. We investigate how to explicitly represent cluster onsets in mixture models in general, and in Gaussian Mixture Models (GMM) in particular. By modifying the internal criterion of such models, we propose the first clustering method able to provide, through parameters estimated by an expectation-maximization procedure, information about when clusters occur (onsets), how they grow (kinetics) and their level of activation through time. This new objective function accommodates continuous timestamps of AE signals and, thus, their order of occurrence. The method, called GMMSEQ, is experimentally validated to characterize the loosening phenomenon in bolted structures under vibration. A comparison with three standard clustering methods on raw streaming data from five experimental campaigns shows that GMMSEQ not only provides useful qualitative information about the timeline of clusters, but also shows better performance in terms of cluster characterization. In view of developing an open acoustic emission initiative and according to the FAIR principles, the datasets and the codes are made available to reproduce the research of this paper.  ( 2 min )
    Probabilistic Spatial Transformer Networks. (arXiv:2004.03637v2 [cs.LG] UPDATED)
    Spatial Transformer Networks (STNs) estimate image transformations that can improve downstream tasks by `zooming in' on relevant regions in an image. However, STNs are hard to train and sensitive to mis-predictions of transformations. To circumvent these limitations, we propose a probabilistic extension that estimates a stochastic transformation rather than a deterministic one. Marginalizing transformations allows us to consider each image at multiple poses, which makes the localization task easier and the training more robust. As an additional benefit, the stochastic transformations act as a localized, learned data augmentation that improves the downstream tasks. We show across standard imaging benchmarks and on a challenging real-world dataset that these two properties lead to improved classification performance, robustness and model calibration. We further demonstrate that the approach generalizes to non-visual domains by improving model performance on time-series data.  ( 2 min )
    Classification of EEG Motor Imagery Using Deep Learning for Brain-Computer Interface Systems. (arXiv:2206.07655v1 [eess.SP])
    A trained T1 class Convolutional Neural Network (CNN) model will be used to examine its ability to successfully identify motor imagery when fed pre-processed electroencephalography (EEG) data. In theory, and if the model has been trained accurately, it should be able to identify a class and label it accordingly. The CNN model will then be restored and used to try and identify the same class of motor imagery data using much smaller sampled data in an attempt to simulate live data.  ( 2 min )
    Understanding and Optimizing Deep Learning Cold-Start Latency on Edge Devices. (arXiv:2206.07446v1 [cs.LG])
    DNNs are ubiquitous on edge devices nowadays. With their increasing importance and number of use cases, it is unlikely that all DNNs can be packed into device memory and that every inference has been warmed up. Therefore, cold inference, the process of reading, initializing, and executing a DNN model, is becoming commonplace, and its performance urgently needs to be optimized. To this end, we present NNV12, the first on-device inference engine that optimizes for cold inference. NNV12 is built atop 3 novel optimization knobs: selecting a proper kernel (implementation) for each DNN operator, bypassing the weights transformation process by caching the post-transformed weights on disk, and pipelined execution of many kernels on asymmetric processors. To tackle the huge search space, NNV12 employs a heuristic-based scheme to obtain a near-optimal kernel scheduling plan. We fully implement a prototype of NNV12 and evaluate its performance across extensive experiments. It shows that NNV12 achieves up to 15.2x and 401.5x speedup compared to the state-of-the-art DNN engines on edge CPUs and GPUs, respectively.
    Transfer and Marginalize: Explaining Away Label Noise with Privileged Information. (arXiv:2202.09244v2 [cs.LG] UPDATED)
    Supervised learning datasets often have privileged information, in the form of features which are available at training time but are not available at test time e.g. the ID of the annotator that provided the label. We argue that privileged information is useful for explaining away label noise, thereby reducing the harmful impact of noisy labels. We develop a simple and efficient method for supervised learning with neural networks: it transfers via weight sharing the knowledge learned with privileged information and approximately marginalizes over privileged information at test time. Our method, TRAM (TRansfer and Marginalize), has minimal training time overhead and has the same test-time cost as not using privileged information. TRAM performs strongly on CIFAR-10H, ImageNet and Civil Comments benchmarks.
    Chain-based Discriminative Autoencoders for Speech Recognition. (arXiv:2203.13687v3 [cs.SD] UPDATED)
    In our previous work, we proposed a discriminative autoencoder (DcAE) for speech recognition. DcAE combines two training schemes into one. First, since DcAE aims to learn encoder-decoder mappings, the squared error between the reconstructed speech and the input speech is minimized. Second, in the code layer, frame-based phonetic embeddings are obtained by minimizing the categorical cross-entropy between ground truth labels and predicted triphone-state scores. DcAE is developed based on the Kaldi toolkit by treating various TDNN models as encoders. In this paper, we further propose three new versions of DcAE. First, a new objective function that considers both categorical cross-entropy and mutual information between ground truth and predicted triphone-state sequences is used. The resulting DcAE is called a chain-based DcAE (c-DcAE). For application to robust speech recognition, we further extend c-DcAE to hierarchical and parallel structures, resulting in hc-DcAE and pc-DcAE. In these two models, both the error between the reconstructed noisy speech and the input noisy speech and the error between the enhanced speech and the reference clean speech are taken into the objective function. Experimental results on the WSJ and Aurora-4 corpora show that our DcAE models outperform baseline systems.
    Can Linear Programs Have Adversarial Examples? A Causal Perspective. (arXiv:2105.12697v5 [cs.LG] UPDATED)
    Recent years have been marked by extensive research on adversarial attacks, especially on deep neural networks. In this work, we pose and investigate the question of whether the phenomenon might be more general in nature, that is, whether adversarial-style attacks exist outside classification. Specifically, we investigate optimization problems, starting with Linear Programs (LPs). We start off by demonstrating the shortcoming of a naive mapping between the formalism of adversarial examples and LPs, to then reveal how we can provide the missing piece -- intriguingly, through the Pearlian notion of Causality. Characteristically, we show the direct influence of the Structural Causal Model (SCM) onto the subsequent LP optimization, which ultimately exposes a notion of confounding in LPs (inherited by said SCM) that allows for adversarial-style attacks. We provide both the general proof formally alongside existential proofs of such intriguing LP-parameterizations based on SCM for three combinatorial problems, namely Linear Assignment, Shortest Path and a real world problem of energy systems.
    Towards Robust Unsupervised Disentanglement of Sequential Data -- A Case Study Using Music Audio. (arXiv:2205.05871v2 [cs.SD] UPDATED)
    Disentangled sequential autoencoders (DSAEs) represent a class of probabilistic graphical models that describes an observed sequence with dynamic latent variables and a static latent variable. The former encode information at a frame rate identical to the observation, while the latter globally governs the entire sequence. This introduces an inductive bias and facilitates unsupervised disentanglement of the underlying local and global factors. In this paper, we show that the vanilla DSAE suffers from being sensitive to the choice of model architecture and capacity of the dynamic latent variables, and is prone to collapse the static latent variable. As a countermeasure, we propose TS-DSAE, a two-stage training framework that first learns sequence-level prior distributions, which are subsequently employed to regularise the model and facilitate auxiliary objectives to promote disentanglement. The proposed framework is fully unsupervised and robust against the global factor collapse problem across a wide range of model configurations. It also avoids typical solutions such as adversarial training which usually involves laborious parameter tuning, and domain-specific data augmentation. We conduct quantitative and qualitative evaluations to demonstrate its robustness in terms of disentanglement on both artificial and real-world music audio datasets.
    Flexible Raman Amplifier Optimization Based on Machine Learning-aided Physical Stimulated Raman Scattering Model. (arXiv:2206.07650v1 [eess.SP])
    The problem of Raman amplifier optimization is studied. A differentiable interpolation function is obtained for the Raman gain coefficient using machine learning (ML), which allows for the gradient descent optimization of forward-propagating Raman pumps. Both the frequency and power of an arbitrary number of pumps in a forward pumping configuration are then optimized for an arbitrary data channel load and span length. The forward propagation model is combined with an experimentally-trained ML model of a backward-pumping Raman amplifier to jointly optimize the frequency and power of the forward amplifier's pumps and the powers of the backward amplifier's pumps. The joint forward and backward amplifier optimization is demonstrated for an unrepeatered transmission of 250 km. A gain flatness of < 1 dB over 4 THz is achieved. The optimized amplifiers are validated using a numerical simulator.
    A Random Matrix Perspective on Random Tensors. (arXiv:2108.00774v2 [stat.ML] UPDATED)
    Tensor models play an increasingly prominent role in many fields, notably in machine learning. In several applications, such as community detection, topic modeling and Gaussian mixture learning, one must estimate a low-rank signal from a noisy tensor. Hence, understanding the fundamental limits of estimators of that signal inevitably calls for the study of random tensors. Substantial progress has been recently achieved on this subject in the large-dimensional limit. Yet, some of the most significant among these results--in particular, a precise characterization of the abrupt phase transition (with respect to signal-to-noise ratio) that governs the performance of the maximum likelihood (ML) estimator of a symmetric rank-one model with Gaussian noise--were derived on the basis of mean-field spin glass theory, which is not easily accessible to non-experts. In this work, we develop a sharply distinct and more elementary approach, relying on standard but powerful tools brought by years of advances in random matrix theory. The key idea is to study the spectra of random matrices arising from contractions of a given random tensor. We show how this gives access to spectral properties of the random tensor itself. For the aforementioned rank-one model, our technique yields a hitherto unknown fixed-point equation whose solution precisely matches the asymptotic performance of the ML estimator above the phase transition threshold in the third-order case. A numerical verification provides evidence that the same holds for orders 4 and 5, leading us to conjecture that, for any order, our fixed-point equation is equivalent to the known characterization of the ML estimation performance that had been obtained by relying on spin glasses. Moreover, our approach sheds light on certain properties of the ML problem landscape in large dimensions and can be extended to other models, such as asymmetric and non-Gaussian.
    Cascade Watchdog: A Multi-tiered Adversarial Guard for Outlier Detection. (arXiv:2108.09375v3 [cs.LG] UPDATED)
    The identification of out-of-distribution content is critical to the successful implementation of neural networks. Watchdog techniques have been developed to support the detection of these inputs, but the performance can be limited by the amount of available data. Generative adversarial networks have displayed numerous capabilities, including the ability to generate facsimiles with excellent accuracy. This paper presents and empirically evaluates a multi-tiered watchdog, which is developed using GAN generated data, for improved out-of-distribution detection. The cascade watchdog uses adversarial training to increase the amount of available data similar to the out-of-distribution elements that are more difficult to detect. Then, a specialized second guard is added in sequential order. The results show a solid and significant improvement on the detection of the most challenging out-of-distribution inputs while preserving an extremely low false positive rate.  ( 2 min )
    A Unified Sequence Interface for Vision Tasks. (arXiv:2206.07669v1 [cs.CV])
    While language tasks are naturally expressed in a single, unified, modeling framework, i.e., generating sequences of tokens, this has not been the case in computer vision. As a result, there is a proliferation of distinct architectures and loss functions for different vision tasks. In this work we show that a diverse set of "core" computer vision tasks can also be unified if formulated in terms of a shared pixel-to-sequence interface. We focus on four tasks, namely, object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs, e.g., bounding boxes or dense masks. Despite that, by formulating the output of each task as a sequence of discrete tokens with a unified interface, we show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization. To solve a specific task, we use a short prompt as task description, and the sequence output adapts to the prompt so it can produce task-specific output. We show that such a model can achieve competitive performance compared to well-established task-specific models.  ( 2 min )
    Human Activity Recognition on Time Series Accelerometer Sensor Data using LSTM Recurrent Neural Networks. (arXiv:2206.07654v1 [eess.SP])
    The use of sensors available through smart devices has pervaded everyday life in several applications including human activity monitoring, healthcare, and social networks. In this study, we focus on the use of smartwatch accelerometer sensors to recognize eating activity. More specifically, we collected sensor data from 10 participants while they consumed pizza. Using this information, together with comparable data available for similar events such as smoking and medication-taking and for dissimilar activities such as jogging, we developed an LSTM-ANN architecture that has demonstrated 90% success in distinguishing individual bites from puffs, medication-taking, or jogging.  ( 2 min )
    Regularizing a Model-based Policy Stationary Distribution to Stabilize Offline Reinforcement Learning. (arXiv:2206.07166v1 [cs.LG])
    Offline reinforcement learning (RL) extends the paradigm of classical RL algorithms to purely learning from static datasets, without interacting with the underlying environment during the learning process. A key challenge of offline RL is the instability of policy training, caused by the mismatch between the distribution of the offline data and the undiscounted stationary state-action distribution of the learned policy. To avoid the detrimental impact of distribution mismatch, we regularize the undiscounted stationary distribution of the current policy towards the offline data during the policy optimization process. Further, we train a dynamics model to both implement this regularization and better estimate the stationary distribution of the current policy, reducing the error induced by distribution mismatch. On a wide range of continuous-control offline RL datasets, our method achieves competitive performance, which validates our algorithm. The code is publicly available.  ( 2 min )
    Asymmetric Tri-training for Debiasing Missing-Not-At-Random Explicit Feedback. (arXiv:1910.01444v6 [cs.SI] CROSS LISTED)
    In most real-world recommender systems, the observed rating data are subject to selection bias, and the data are thus missing-not-at-random. Developing a method to facilitate the learning of a recommender with biased feedback is one of the most challenging problems, as it is widely known that naive approaches under selection bias often lead to suboptimal results. A well-established solution for the problem is using propensity scoring techniques. The propensity score is the probability of each data point being observed, and unbiased performance estimation is possible by weighting each data point by the inverse of its propensity. However, the performance of the propensity-based unbiased estimation approach is often affected by the choice of the propensity estimation model or by the high variance problem. To overcome these limitations, we propose a model-agnostic meta-learning method inspired by the asymmetric tri-training framework for unsupervised domain adaptation. The proposed method utilizes two predictors to generate data with reliable pseudo-ratings and another predictor to make the final predictions. In a theoretical analysis, a propensity-independent upper bound of the true performance metric is derived, and it is demonstrated that the proposed method can minimize this bound. We conduct comprehensive experiments using public real-world datasets. The results suggest that the previous propensity-based methods are largely affected by the choice of propensity models and the variance problem caused by the inverse propensity weighting. Moreover, we show that the proposed meta-learning method is robust to these issues and can facilitate the development of effective recommenders from biased explicit feedback.  ( 3 min )
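    The inverse-propensity weighting that the abstract describes as the established baseline can be illustrated with a toy calculation. This is a minimal sketch of that baseline only, not of the proposed tri-training method; the function names and the toy numbers are made up for illustration, and the propensities are assumed to be known exactly.

```python
def naive_mean_error(errors, observed):
    """Average error over observed entries only (biased under selection)."""
    seen = [e for e, o in zip(errors, observed) if o]
    return sum(seen) / len(seen)


def ips_mean_error(errors, observed, propensities):
    """Inverse-propensity-scored estimate of the mean error over ALL entries."""
    total = sum(e / p for e, o, p in zip(errors, observed, propensities) if o)
    return total / len(errors)


# Toy data: low-error entries are observed twice as often as high-error ones.
errors = [1.0] * 4 + [3.0] * 4
propensities = [0.5] * 4 + [0.25] * 4
observed = [True, True, False, False, True, False, False, False]

true_mean = sum(errors) / len(errors)
print(true_mean, naive_mean_error(errors, observed),
      ips_mean_error(errors, observed, propensities))
```

    Dividing by small propensities is also what produces the high variance mentioned in the abstract: a single observed entry with propensity 0.25 counts four times.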
    The Dual PC Algorithm for Structure Learning. (arXiv:2112.09036v3 [stat.ML] UPDATED)
    Learning the graphical structure of Bayesian networks is key to describing data generating mechanisms in many complex applications but poses considerable computational challenges. Observational data can only identify the equivalence class of the directed acyclic graph underlying a Bayesian network model, and a variety of methods exist to tackle the problem. Under certain assumptions, the popular PC algorithm can consistently recover the correct equivalence class by reverse-engineering the conditional independence (CI) relationships holding in the variable distribution. Here, we propose the dual PC algorithm, a novel scheme to carry out the CI tests within the PC algorithm by leveraging the inverse relationship between covariance and precision matrices. By exploiting block matrix inversions we can simultaneously perform tests on partial correlations of complementary (or dual) conditioning sets. The multiple CI tests of the dual PC algorithm proceed by first considering marginal and full-order CI relationships and progressively moving to central-order ones. Simulation studies show that the dual PC algorithm outperforms the classic PC algorithm both in terms of run time and in recovering the underlying network structure, even in the presence of deviations from Gaussianity.  ( 2 min )
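    The inverse relationship between covariance and precision matrices mentioned above can be seen on a three-variable example. The sketch below is an illustration of that relationship under a Gaussian assumption, not the dual PC algorithm itself: the covariance matrix gives marginal correlations (empty conditioning set) and its inverse gives partial correlations given all remaining variables. The helper names and the chain example X -> Y -> Z with unit-variance noise are assumptions.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def marginal_corr(cov, i, j):
    """Correlation of variables i and j with an empty conditioning set."""
    return cov[i][j] / math.sqrt(cov[i][i] * cov[j][j])


def partial_corr_given_rest(cov, i, j):
    """Partial correlation of i and j given all other variables, from the precision matrix."""
    prec = invert(cov)
    return -prec[i][j] / math.sqrt(prec[i][i] * prec[j][j])


# Covariance of the chain X -> Y -> Z, where Y = X + noise and Z = Y + noise.
cov = [[1, 1, 1],
       [1, 2, 2],
       [1, 2, 3]]
print(marginal_corr(cov, 0, 2), partial_corr_given_rest(cov, 0, 2))
```

    X and Z are marginally correlated but have zero partial correlation once Y is conditioned on, which is the kind of CI relationship a PC-style test looks for.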
    Learned holographic light transport. (arXiv:2108.08253v3 [physics.optics] UPDATED)
    Computer-Generated Holography (CGH) algorithms often fall short in matching simulations with results from a physical holographic display. Our work addresses this mismatch by learning the holographic light transport in holographic displays. Using a camera and a holographic display, we capture the image reconstructions of optimized holograms that rely on ideal simulations to generate a dataset. Inspired by the ideal simulations, we learn a complex-valued convolution kernel that can propagate given holograms to captured photographs in our dataset. Our method can dramatically improve simulation accuracy and image quality in holographic displays while paving the way for physically informed learning approaches.  ( 2 min )
    Data-driven discovery of intrinsic dynamics. (arXiv:2108.05928v2 [cs.LG] UPDATED)
    Dynamical models underpin our ability to understand and predict the behavior of natural systems. Whether dynamical models are developed from first-principles derivations or from observational data, they are predicated on our choice of state variables. The choice of state variables is driven by convenience and intuition, and in the data-driven case the observed variables are often chosen to be the state variables. The dimensionality of these variables (and consequently the dynamical models) can be arbitrarily large, obscuring the underlying behavior of the system. In truth, these variables are often highly redundant and the system is driven by a much smaller set of latent intrinsic variables. In this study, we combine the mathematical theory of manifolds with the representational capacity of neural networks to develop a method that learns a system's intrinsic state variables directly from time series data, and also learns predictive models for their dynamics. What distinguishes our method is its ability to reduce data to the intrinsic dimensionality of the nonlinear manifold they live on. This ability is enabled by the concepts of charts and atlases from the theory of manifolds, whereby a manifold is represented by a collection of patches that are sewn together -- a necessary representation to attain intrinsic dimensionality. We demonstrate this approach on several high-dimensional systems with low-dimensional behavior. The resulting framework provides the ability to develop dynamical models of the lowest possible dimension, capturing the essence of a system.  ( 2 min )
    Deep Network Approximation in Terms of Intrinsic Parameters. (arXiv:2111.07964v2 [cs.LG] UPDATED)
    One of the arguments to explain the success of deep learning is the powerful approximation capacity of deep neural networks. Such capacity is generally accompanied by the explosive growth of the number of parameters, which, in turn, leads to high computational costs. It is of great interest to ask whether we can achieve successful deep learning with a small number of learnable parameters adapting to the target function. From an approximation perspective, this paper shows that the number of parameters that need to be learned can be significantly smaller than people typically expect. First, we theoretically design ReLU networks with a few learnable parameters to achieve an attractive approximation. We prove by construction that, for any Lipschitz continuous function $f$ on $[0,1]^d$ with a Lipschitz constant $\lambda>0$, a ReLU network with $n+2$ intrinsic parameters (those depending on $f$) can approximate $f$ with an exponentially small error $5\lambda \sqrt{d}\,2^{-n}$. Such a result is generalized to generic continuous functions. Furthermore, we show that the idea of learning a small number of parameters to achieve a good approximation can be numerically observed. We conduct several experiments to verify that training a small part of parameters can also achieve good results for classification problems if other parameters are pre-specified or pre-trained from a related problem.  ( 2 min )
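    The error bound quoted above can be turned into a quick sizing calculation. The sketch below only evaluates the stated expression $5\lambda\sqrt{d}\,2^{-n}$ and inverts it for a target tolerance; the function names are hypothetical and nothing here constructs the ReLU network itself.

```python
import math


def error_bound(lipschitz: float, dim: int, n: int) -> float:
    """Evaluate the quoted bound 5 * lambda * sqrt(d) * 2**(-n)."""
    return 5.0 * lipschitz * math.sqrt(dim) * 2.0 ** (-n)


def min_n_for_tolerance(lipschitz: float, dim: int, tol: float) -> int:
    """Smallest non-negative integer n whose bound is at most tol."""
    ratio = 5.0 * lipschitz * math.sqrt(dim) / tol
    n = max(0, math.ceil(math.log2(ratio)))
    # Guard against floating-point rounding in log2.
    while error_bound(lipschitz, dim, n) > tol:
        n += 1
    return n


if __name__ == "__main__":
    n = min_n_for_tolerance(lipschitz=1.0, dim=4, tol=1e-3)
    # The abstract counts n + 2 intrinsic parameters for this n.
    print(n, n + 2, error_bound(1.0, 4, n))
```

    Because the bound halves with every unit increase in $n$, the required number of intrinsic parameters grows only logarithmically in $1/\text{tol}$.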
    FENCE: Feasible Evasion Attacks on Neural Networks in Constrained Environments. (arXiv:1909.10480v4 [cs.CR] UPDATED)
    As advances in Deep Neural Networks (DNNs) demonstrate unprecedented levels of performance in many critical applications, their vulnerability to attacks is still an open question. We consider evasion attacks at testing time against Deep Learning in constrained environments, in which dependencies between features need to be satisfied. These situations may arise naturally in tabular data or may be the result of feature engineering in specific application domains, such as threat detection in cyber security. We propose a general iterative gradient-based framework called FENCE for crafting evasion attacks that take into consideration the specifics of constrained domains and application requirements. We apply it against Feed-Forward Neural Networks trained for two cyber security applications: network traffic botnet classification and malicious domain classification, to generate feasible adversarial examples. We extensively evaluate the success rate and performance of our attacks, compare their improvement over several baselines, and analyze factors that impact the attack success rate, including the optimization objective and the data imbalance. We show that with minimal effort (e.g., generating 12 additional network connections), an attacker can change the model's prediction from the Malicious class to Benign and evade the classifier. We show that models trained on datasets with higher imbalance are more vulnerable to our FENCE attacks. Finally, we demonstrate the potential of performing adversarial training in constrained domains to increase the model resilience against these evasion attacks.  ( 2 min )
    Learning Transport Processes with Machine Intelligence. (arXiv:2109.13096v3 [physics.plasm-ph] UPDATED)
    We present a machine learning based approach to address the study of transport processes, ubiquitous in continuum mechanics, with particular attention to those phenomena ruled by complex micro-physics, impractical for theoretical investigation, yet exhibiting emergent behavior describable by a closed mathematical expression. Our machine learning model, built using simple components and following a few well established practices, is capable of learning latent representations of the transport process substantially closer to the ground truth than expected from the nominal error characterising the data, leading to sound generalisation properties. This is demonstrated through an idealized study of the long standing problem of heat flux suppression relevant to fusion and cosmic plasmas. Our analysis shows that the result applies beyond those case specific assumptions and that, in particular, the accuracy of the learned representation is controllable through knowledge of the data quality (error properties) and a suitable choice of the dataset size. While the learned representation can be used as a plug-in for numerical modeling purposes, it can also be leveraged with the above error analysis to obtain reliable mathematical expressions that describe the transport mechanism and are of great theoretical value.  ( 2 min )
    Born-Infeld (BI) for AI: Energy-Conserving Descent (ECD) for Optimization. (arXiv:2201.11137v2 [cs.LG] UPDATED)
    We introduce a novel framework for optimization based on energy-conserving Hamiltonian dynamics in a strongly mixing (chaotic) regime and establish its key properties analytically and numerically. The prototype is a discretization of Born-Infeld dynamics, with a squared relativistic speed limit depending on the objective function. This class of frictionless, energy-conserving optimizers proceeds unobstructed until slowing naturally near the minimal loss, which dominates the phase space volume of the system. Building from studies of chaotic systems such as dynamical billiards, we formulate a specific algorithm with good performance on machine learning and PDE-solving tasks, including generalization. It cannot stop at a high local minimum, an advantage in non-convex loss functions, and proceeds faster than GD+momentum in shallow valleys.  ( 2 min )
    Latency Control for Keyword Spotting. (arXiv:2206.07261v1 [eess.AS])
    Conversational agents commonly utilize keyword spotting (KWS) to initiate voice interaction with the user. For user experience and privacy considerations, existing approaches to KWS largely focus on accuracy, which can often come at the expense of introduced latency. To address this tradeoff, we propose a novel approach that controls KWS model latency and generalizes to any loss function without explicit knowledge of the keyword endpoint. Through a single, tunable hyperparameter, our approach enables one to balance detection latency and accuracy for the targeted application. Empirically, we show that our approach gives superior performance under latency constraints when compared to existing methods. Namely, we achieve a substantial 25% relative improvement in false accepts for a fixed latency target when compared to the baseline state-of-the-art. We also show that when our approach is used in conjunction with a max-pooling loss, we are able to improve relative false accepts by 25% at a fixed latency when compared to cross entropy loss.  ( 2 min )
    Masked Siamese ConvNets. (arXiv:2206.07700v1 [cs.CV])
    Self-supervised learning has shown superior performances over supervised methods on various vision benchmarks. The siamese network, which encourages embeddings to be invariant to distortions, is one of the most successful self-supervised visual representation learning approaches. Among all the augmentation methods, masking is the most general and straightforward method that has the potential to be applied to all kinds of input and requires the least amount of domain knowledge. However, masked siamese networks require particular inductive bias and practically only work well with Vision Transformers. This work empirically studies the problems behind masked siamese networks with ConvNets. We propose several empirical designs to overcome these problems gradually. Our method performs competitively on low-shot image classification and outperforms previous methods on object detection benchmarks. We discuss several remaining issues and hope this work can provide useful data points for future general-purpose self-supervised learning.  ( 2 min )
    ELUDE: Generating interpretable explanations via a decomposition into labelled and unlabelled features. (arXiv:2206.07690v1 [cs.CV])
    Deep learning models have achieved remarkable success in different areas of machine learning over the past decade; however, the size and complexity of these models make them difficult to understand. In an effort to make them more interpretable, several recent works focus on explaining parts of a deep neural network through human-interpretable, semantic attributes. However, it may be impossible to completely explain complex models using only semantic attributes. In this work, we propose to augment these attributes with a small set of uninterpretable features. Specifically, we develop a novel explanation framework ELUDE (Explanation via Labelled and Unlabelled DEcomposition) that decomposes a model's prediction into two parts: one that is explainable through a linear combination of the semantic attributes, and another that is dependent on the set of uninterpretable features. By identifying the latter, we are able to analyze the "unexplained" portion of the model, obtaining insights into the information used by the model. We show that the set of unlabelled features can generalize to multiple models trained with the same feature space and compare our work to two popular attribute-oriented methods, Interpretable Basis Decomposition and Concept Bottleneck, and discuss the additional insights ELUDE provides.  ( 2 min )
    Analysis of Augmentations for Contrastive ECG Representation Learning. (arXiv:2206.07656v1 [eess.SP])
    This paper systematically investigates the effectiveness of various augmentations for contrastive self-supervised learning of electrocardiogram (ECG) signals and identifies the best parameters. The baseline of our proposed self-supervised framework consists of two main parts: the contrastive learning and the downstream task. In the first stage, we train an encoder using a number of augmentations to extract generalizable ECG signal representations. We then freeze the encoder and finetune a few linear layers with different amounts of labelled data for downstream arrhythmia detection. We then experiment with various augmentations techniques and explore a range of parameters. Our experiments are done on PTB-XL, a large and publicly available 12-lead ECG dataset. The results show that applying augmentations in a specific range of complexities works better for self-supervised contrastive learning. For instance, when adding Gaussian noise, a sigma in the range of 0.1 to 0.2 achieves better results, while poor training occurs when the added noise is too small or too large (outside of the specified range). A similar trend is observed with other augmentations, demonstrating the importance of selecting the optimum level of difficulty for the added augmentations, as augmentations that are too simple will not result in effective training, while augmentations that are too difficult will also prevent the model from effective learning of generalized representations. Our work can influence future research on self-supervised contrastive learning on bio-signals and aid in selecting optimum parameters for different augmentations.  ( 2 min )
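    The Gaussian-noise augmentation mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the ECG array has already been normalized so that a sigma of 0.1 to 0.2 is on a meaningful scale (the abstract does not state the units); the function names and the default sigma of 0.15 are hypothetical choices, not the authors' pipeline.

```python
import numpy as np


def add_gaussian_noise(signal, sigma, rng=None):
    """Return a copy of `signal` with zero-mean Gaussian noise of std `sigma` added."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    return signal + rng.normal(0.0, sigma, size=signal.shape)


def two_views(signal, sigma=0.15, seed=0):
    """Create two independently augmented views of one recording,
    as a contrastive objective would consume them."""
    rng = np.random.default_rng(seed)
    return add_gaussian_noise(signal, sigma, rng), add_gaussian_noise(signal, sigma, rng)


if __name__ == "__main__":
    ecg = np.zeros((12, 5000))  # placeholder: 12 leads, 5000 samples
    view_a, view_b = two_views(ecg)
    print(view_a.shape, float(view_a.std()))
```

    Sweeping `sigma` over a grid and comparing downstream accuracy is one way to reproduce the kind of range study the abstract describes.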
    Model-based RL with Optimistic Posterior Sampling: Structural Conditions and Sample Complexity. (arXiv:2206.07659v1 [cs.LG])
    We propose a general framework to design posterior sampling methods for model-based RL. We show that the proposed algorithms can be analyzed by reducing regret to Hellinger distance based conditional probability estimation. We further show that optimistic posterior sampling can control this Hellinger distance, when we measure model error via data likelihood. This technique allows us to design and analyze unified posterior sampling algorithms with state-of-the-art sample complexity guarantees for many model-based RL settings. We illustrate our general result in many special cases, demonstrating the versatility of our framework.  ( 2 min )
    Nyström Kernel Mean Embeddings. (arXiv:2201.13055v2 [stat.ML] UPDATED)
    Kernel mean embeddings are a powerful tool to represent probability distributions over arbitrary spaces as single points in a Hilbert space. Yet, the cost of computing and storing such embeddings prohibits their direct use in large-scale settings. We propose an efficient approximation procedure based on the Nyström method, which exploits a small random subset of the dataset. Our main result is an upper bound on the approximation error of this procedure. It yields sufficient conditions on the subsample size to obtain the standard $n^{-1/2}$ rate while reducing computational costs. We discuss applications of this result for the approximation of the maximum mean discrepancy and quadrature rules, and illustrate our theoretical findings with numerical experiments.  ( 2 min )
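    One standard way to realize a Nyström-style approximation of a kernel mean embedding is to project the empirical embedding onto the span of a few landmark points. The sketch below assumes an RBF kernel and landmarks drawn from the data; the function names are hypothetical and the abstract does not confirm that this is exactly the estimator analyzed in the paper.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def nystrom_mean_embedding_weights(x, landmark_idx, bandwidth=1.0):
    """Weights alpha such that sum_j alpha[j] * k(z_j, .) approximates the
    empirical mean embedding (1/n) * sum_i k(x_i, .), where z = x[landmark_idx]."""
    z = x[landmark_idx]
    k_mm = rbf_kernel(z, z, bandwidth)  # landmarks vs landmarks, m x m
    k_mn = rbf_kernel(z, x, bandwidth)  # landmarks vs all points, m x n
    return np.linalg.pinv(k_mm) @ k_mn.mean(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(200, 2))
    idx = rng.choice(len(data), size=20, replace=False)
    print(nystrom_mean_embedding_weights(data, idx).shape)
```

    Only the m landmarks and their weights need to be stored, instead of all n points.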
    Large Language Models are not Models of Natural Language: they are Corpus Models. (arXiv:2112.07055v2 [cs.CL] UPDATED)
    Natural Language Processing (NLP) has become one of the leading application areas in the current Artificial Intelligence boom. Transfer learning has enabled large deep learning neural networks trained on the language modeling task to vastly improve performance in almost all downstream language tasks. Interestingly, when the language models are trained with data that includes software code, they demonstrate remarkable abilities in generating functioning computer code from natural language specifications. We argue that this creates a conundrum for the claim that eliminative neural models are a radical restructuring in our understanding of cognition in that they eliminate the need for symbolic abstractions like generative phrase structure grammars. Because the syntax of programming languages is by design determined by phrase structure grammars, neural models that produce syntactic code are apparently uninformative about the theoretical foundations of programming languages. The demonstration that neural models perform well on tasks that involve clearly symbolic systems proves that they cannot be used as an argument that language and other cognitive systems are not symbolic. Finally, we argue as a corollary that the term language model is misleading and propose the adoption of the working term corpus model instead, which better reflects the genesis and contents of the model.  ( 2 min )
    EDEN: Communication-Efficient and Robust Distributed Mean Estimation for Federated Learning. (arXiv:2108.08842v3 [cs.LG] UPDATED)
    Distributed Mean Estimation (DME) is a central building block in federated learning, where clients send local gradients to a parameter server for averaging and updating the model. Due to communication constraints, clients often use lossy compression techniques to compress the gradients, resulting in estimation inaccuracies. DME is more challenging when clients have diverse network conditions, such as constrained communication budgets and packet losses. In such settings, DME techniques often incur a significant increase in the estimation error leading to degraded learning performance. In this work, we propose a robust DME technique named EDEN that naturally handles heterogeneous communication budgets and packet losses. We derive appealing theoretical guarantees for EDEN and evaluate it empirically. Our results demonstrate that EDEN consistently improves over state-of-the-art DME techniques.  ( 2 min )
    VPNets: Volume-preserving neural networks for learning source-free dynamics. (arXiv:2204.13843v2 [cs.LG] UPDATED)
    We propose volume-preserving networks (VPNets) for learning unknown source-free dynamical systems using trajectory data. We propose three modules and combine them to obtain two network architectures, coined R-VPNet and LA-VPNet. The distinct feature of the proposed models is that they are intrinsically volume-preserving. In addition, the corresponding approximation theorems are proved, which theoretically guarantee the expressivity of the proposed VPNets to learn source-free dynamics. The effectiveness, generalization ability and structure-preserving property of the VPNets are demonstrated by numerical experiments.
    PhysGNN: A Physics-Driven Graph Neural Network Based Model for Predicting Soft Tissue Deformation in Image-Guided Neurosurgery. (arXiv:2109.04352v2 [eess.IV] UPDATED)
    Correctly capturing intraoperative brain shift in image-guided neurosurgical procedures is a critical task for aligning preoperative data with intraoperative geometry to ensure accurate surgical navigation. While the finite element method (FEM) is a proven technique to effectively approximate soft tissue deformation through biomechanical formulations, its degree of success boils down to a trade-off between accuracy and speed. To circumvent this problem, the most recent works in this domain have proposed leveraging data-driven models obtained by training various machine learning algorithms, e.g. random forests and artificial neural networks (ANNs), with the results of finite element analysis (FEA) to speed up tissue deformation approximations by prediction. These methods, however, do not account for the structure of the finite element (FE) mesh during training, which provides information on node connectivities as well as the distance between them and can aid in approximating tissue deformation based on the proximity of force load points to the rest of the mesh nodes. Therefore, this work proposes a novel framework, PhysGNN, a data-driven model that approximates the solution of FEM by leveraging graph neural networks (GNNs), which are capable of accounting for the mesh structural information and inductive learning over unstructured grids and complex topological structures. Empirically, we demonstrate that the proposed architecture, PhysGNN, promises accurate and fast soft tissue deformation approximations and is competitive with the state-of-the-art (SOTA) algorithms while promising enhanced computational feasibility, and is therefore suitable for neurosurgical settings.
    Robust and Sparse Estimation of Linear Regression Coefficients with Heavy-tailed Noises and Covariates. (arXiv:2206.07594v1 [stat.ML])
    Robust and sparse estimation of linear regression coefficients is investigated. The situation addressed by the present paper is that covariates and noises are sampled from heavy-tailed distributions, and the covariates and noises are contaminated by malicious outliers. Our estimator can be computed efficiently. Further, our estimation error bound is sharp.
    Unbiased Recommender Learning from Missing-Not-At-Random Implicit Feedback. (arXiv:1909.03601v3 [stat.ML] CROSS LISTED)
    Recommender systems widely use implicit feedback such as click data because of its general availability. Although the presence of clicks signals the users' preference to some extent, the lack of such clicks does not necessarily indicate a negative response from the users, as it is possible that the users were not exposed to the items (positive-unlabeled problem). This leads to a difficulty in predicting the users' preferences from implicit feedback. Previous studies addressed the positive-unlabeled problem by uniformly upweighting the loss for the positive feedback data or estimating the confidence of each data having relevance information via the EM-algorithm. However, these methods failed to address the missing-not-at-random problem in which popular or frequently recommended items are more likely to be clicked than other items even if a user does not have a considerable interest in them. To overcome these limitations, we first define an ideal loss function to be optimized to realize recommendations that maximize the relevance and propose an unbiased estimator for the ideal loss. Subsequently, we analyze the variance of the proposed unbiased estimator and further propose a clipped estimator that includes the unbiased estimator as a special case. We demonstrate that the clipped estimator is expected to improve the performance of the recommender system, by considering the bias-variance trade-off. We conduct semi-synthetic and real-world experiments and demonstrate that the proposed method largely outperforms the baselines. In particular, the proposed method works better for rare items that are less frequently observed in the training data. The findings indicate that the proposed method can better achieve the objective of recommending items with the highest relevance.
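    The unbiased and clipped estimators described above follow the general pattern of inverse-propensity weighting. The sketch below shows one common form of that pattern with a log loss; the exposure propensities are assumed to be given, and the function name, the log-loss choice, and the default clipping threshold are hypothetical rather than taken from the paper.

```python
import numpy as np


def clipped_ips_loss(clicks, scores, propensity, clip=0.1, eps=1e-12):
    """Inverse-propensity-weighted log loss over user-item pairs.

    clicks:     0/1 click indicators
    scores:     predicted relevance probabilities in (0, 1)
    propensity: assumed exposure propensities in (0, 1]
    clip:       lower bound on the propensity; clip=0 recovers the
                unclipped estimator as a special case
    """
    clicks = np.asarray(clicks, dtype=float)
    scores = np.asarray(scores, dtype=float)
    theta = np.maximum(np.asarray(propensity, dtype=float), clip)
    pos_loss = -np.log(scores + eps)        # loss if the pair is relevant
    neg_loss = -np.log(1.0 - scores + eps)  # loss if the pair is not relevant
    weight = clicks / theta
    return float(np.mean(weight * pos_loss + (1.0 - weight) * neg_loss))


if __name__ == "__main__":
    y = [1, 0, 1, 0]
    p = [0.9, 0.2, 0.6, 0.4]
    theta = [0.8, 0.8, 0.05, 0.05]
    print(clipped_ips_loss(y, p, theta, clip=0.0), clipped_ips_loss(y, p, theta, clip=0.1))
```

    Raising `clip` caps the largest weights, which lowers variance at the cost of some bias; this is the trade-off the abstract refers to.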
    A Projection-Based K-space Transformer Network for Undersampled Radial MRI Reconstruction with Limited Training Subjects. (arXiv:2206.07219v1 [eess.IV])
    The recent development of deep learning combined with compressed sensing enables fast reconstruction of undersampled MR images and has achieved state-of-the-art performance for Cartesian k-space trajectories. However, non-Cartesian trajectories such as the radial trajectory need to be transformed onto a Cartesian grid in each iteration of the network training, which slows down the training process. Multiple iterations of nonuniform Fourier transform in the networks offset the deep learning advantage of fast inference. Current approaches typically either work on image-to-image networks or grid the non-Cartesian trajectories before the network training to avoid the repeated gridding process. However, the image-to-image networks cannot ensure the k-space data consistency in the reconstructed images and the pre-processing of non-Cartesian k-space leads to gridding errors which cannot be compensated by the network training. Inspired by the Transformer network's ability to handle long-range dependencies in sequence transduction tasks, we propose to rearrange the radial spokes into sequential data based on the chronological order of acquisition and use the Transformer to predict unacquired radial spokes from acquired ones. We propose novel data augmentation methods to generate a large amount of training data from a limited number of subjects. The network can be generalized to different anatomical structures. Experimental results show superior performance of the proposed framework compared to state-of-the-art deep neural networks.
    "Why Here and Not There?" -- Diverse Contrasting Explanations of Dimensionality Reduction. (arXiv:2206.07391v1 [cs.LG])
    Dimensionality reduction is a popular preprocessing step and a widely used tool in data mining. Transparency, which is usually achieved by means of explanations, is nowadays a widely accepted and crucial requirement of machine learning based systems like classifiers and recommender systems. However, the transparency of dimensionality reduction and other data mining tools has not been considered much yet, although it is crucial to understand their behavior -- in particular, practitioners might want to understand why a specific sample got mapped to a specific location. In order to (locally) understand the behavior of a given dimensionality reduction method, we introduce the abstract concept of contrasting explanations for dimensionality reduction, and apply a realization of this concept to the specific application of explaining two dimensional data visualization.
    A Collaboration Strategy in the Mining Pool for Proof-of-Neural-Architecture Consensus. (arXiv:2206.07089v1 [cs.DC])
    In most popular publicly accessible cryptocurrency systems, the mining pool plays a key role because mining cryptocurrency with the mining pool turns an unprofitable situation into a profitable one for individual miners. In many recent novel blockchain consensuses, the deep learning training procedure becomes the task for miners to prove their workload, so the computation power of miners is not spent purely on the hash puzzle. In this way, the hardware and energy support the blockchain service and deep learning training simultaneously. Because the incentive of miners is to earn tokens, individual miners are motivated to join mining pools to become more competitive. In this paper, we are the first to demonstrate a mining pool solution for novel consensuses based on deep learning. The mining pool manager partitions the full search space into subspaces and all miners are scheduled to collaborate on the Neural Architecture Search (NAS) tasks in the assigned subspace. Experiments demonstrate that the performance of this type of mining pool is more competitive than an individual miner. Due to the uncertainty of miners' behaviors, the mining pool manager checks the standard deviation of the performance of high reward miners and prepares backup miners to ensure the completion of the tasks of high reward miners.
    The Mean-Squared Error of Double Q-Learning. (arXiv:2007.05034v3 [cs.LG] UPDATED)
    In this paper, we establish a theoretical comparison between the asymptotic mean-squared error of Double Q-learning and Q-learning. Our result builds upon an analysis for linear stochastic approximation based on Lyapunov equations and applies to both the tabular setting and the setting with linear function approximation, provided that the optimal policy is unique and the algorithms converge. We show that the asymptotic mean-squared error of Double Q-learning is exactly equal to that of Q-learning if Double Q-learning uses twice the learning rate of Q-learning and outputs the average of its two estimators. We also present some practical implications of this theoretical observation using simulations.
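    The comparison above is between two tabular update rules, which can be written out as follows. This is a minimal sketch of the textbook Q-learning and Double Q-learning updates, with the averaged output the abstract mentions; the function names and the fair-coin choice of which table to update are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def q_learning_step(q, s, a, r, s_next, alpha, gamma):
    """Standard tabular Q-learning update, in place."""
    q[s, a] += alpha * (r + gamma * q[s_next].max() - q[s, a])


def double_q_learning_step(qa, qb, s, a, r, s_next, alpha, gamma, rng):
    """Tabular Double Q-learning: one table selects the greedy next action,
    the other table evaluates it. A fair coin picks which table is updated."""
    if rng.random() < 0.5:
        best = int(np.argmax(qa[s_next]))
        qa[s, a] += alpha * (r + gamma * qb[s_next, best] - qa[s, a])
    else:
        best = int(np.argmax(qb[s_next]))
        qb[s, a] += alpha * (r + gamma * qa[s_next, best] - qb[s, a])


def averaged_estimate(qa, qb):
    """Output the average of the two estimators."""
    return 0.5 * (qa + qb)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = np.zeros((2, 2))
    qa, qb = np.zeros((2, 2)), np.zeros((2, 2))
    alpha = 0.1
    q_learning_step(q, 0, 1, 1.0, 1, alpha, 0.9)
    # Double Q-learning run with twice the learning rate, as in the abstract.
    double_q_learning_step(qa, qb, 0, 1, 1.0, 1, 2 * alpha, 0.9, rng)
    print(q[0, 1], averaged_estimate(qa, qb)[0, 1])
```

    Each Double Q-learning table is updated only about half the time, which gives some intuition for why doubling the learning rate is the natural setting for a like-for-like comparison.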
    Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions. (arXiv:2206.07252v1 [stat.ML])
    Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems. While the empirical success of SGD is often attributed to its computational efficiency and favorable generalization behavior, neither effect is well understood and disentangling them remains an open problem. Even in the simple setting of convex quadratic problems, worst-case analyses give an asymptotic convergence rate for SGD that is no better than full-batch gradient descent (GD), and the purported implicit regularization effects of SGD lack a precise explanation. In this work, we study the dynamics of multi-pass SGD on high-dimensional convex quadratics and establish an asymptotic equivalence to a stochastic differential equation, which we call homogenized stochastic gradient descent (HSGD), whose solutions we characterize explicitly in terms of a Volterra integral equation. These results yield precise formulas for the learning and risk trajectories, which reveal a mechanism of implicit conditioning that explains the efficiency of SGD relative to GD. We also prove that the noise from SGD negatively impacts generalization performance, ruling out the possibility of any type of implicit regularization in this context. Finally, we show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD (bootstrap risk).
    Branching Reinforcement Learning. (arXiv:2202.07995v2 [cs.LG] UPDATED)
    In this paper, we propose a novel Branching Reinforcement Learning (Branching RL) model, and investigate both Regret Minimization (RM) and Reward-Free Exploration (RFE) metrics for this model. Unlike standard RL where the trajectory of each episode is a single $H$-step path, branching RL allows an agent to take multiple base actions in a state such that transitions branch out to multiple successor states correspondingly, and thus it generates a tree-structured trajectory. This model finds important applications in hierarchical recommendation systems and online advertising. For branching RL, we establish new Bellman equations and key lemmas, i.e., branching value difference lemma and branching law of total variance, and also bound the total variance by only $O(H^2)$ under an exponentially-large trajectory. For RM and RFE metrics, we propose computationally efficient algorithms BranchVI and BranchRFE, respectively, and derive nearly matching upper and lower bounds. Our results are only polynomial in problem parameters despite exponentially-large trajectories.
    Corruption-Robust Contextual Search through Density Updates. (arXiv:2206.07528v1 [cs.LG])
    We study the problem of contextual search in the adversarial noise model. Let $d$ be the dimension of the problem, $T$ be the time horizon and $C$ be the total amount of noise in the system. For the $\epsilon$-ball loss, we give a tight regret bound of $O(C + d \log(1/\epsilon))$ improving over the $O(d^3 \log(1/\epsilon) \log^2(T) + C \log(T) \log(1/\epsilon))$ bound of Krishnamurthy et al. (STOC'21). For the symmetric loss, we give an efficient algorithm with regret $O(C+d \log T)$. Our techniques are a significant departure from prior approaches. Specifically, we keep track of density functions over the candidate vectors instead of a knowledge set consisting of the candidate vectors consistent with the feedback obtained.
    Hardening DNNs against Transfer Attacks during Network Compression using Greedy Adversarial Pruning. (arXiv:2206.07406v1 [cs.LG])
    The prevalence and success of Deep Neural Network (DNN) applications in recent years have motivated research on DNN compression, such as pruning and quantization. These techniques accelerate model inference, reduce power consumption, and reduce the size and complexity of the hardware necessary to run DNNs, all with little to no loss in accuracy. However, since DNNs are vulnerable to adversarial inputs, it is important to consider the relationship between compression and adversarial robustness. In this work, we investigate the adversarial robustness of models produced by several irregular pruning schemes and by 8-bit quantization. Additionally, while conventional pruning removes the least important parameters in a DNN, we investigate the effect of an unconventional pruning method: removing the most important model parameters based on the gradient on adversarial inputs. We call this method Greedy Adversarial Pruning (GAP) and we find that this pruning method results in models that are resistant to transfer attacks from their uncompressed counterparts.
    Nonstationary Temporal Matrix Factorization for Multivariate Time Series Forecasting. (arXiv:2203.10651v2 [cs.LG] UPDATED)
    Modern time series datasets are often high-dimensional, incomplete/sparse, and nonstationary. These properties hinder the development of scalable and efficient solutions for time series forecasting and analysis. To address these challenges, we propose a Nonstationary Temporal Matrix Factorization (NoTMF) model, in which matrix factorization is used to reconstruct the whole time series matrix and a vector autoregressive (VAR) process is imposed on a properly differenced copy of the temporal factor matrix. This approach not only preserves the low-rank property of the data but also offers consistent temporal dynamics. The learning process of NoTMF involves the optimization of two factor matrices and a collection of VAR coefficient matrices. To efficiently solve the optimization problem, we derive an alternating minimization framework, in which subproblems are solved using conjugate gradient and least squares methods. In particular, the use of the conjugate gradient method offers an efficient routine and allows us to apply NoTMF to large-scale problems. Through extensive experiments on the Uber movement speed dataset, we demonstrate the superior accuracy and effectiveness of NoTMF over other baseline models. Our results also confirm the importance of addressing the nonstationarity of real-world time series data such as spatiotemporal traffic flow/speed.
    Constrained Variational Policy Optimization for Safe Reinforcement Learning. (arXiv:2201.11927v2 [cs.LG] UPDATED)
    Safe reinforcement learning (RL) aims to learn policies that satisfy certain constraints before deploying them to safety-critical applications. Previous primal-dual style approaches suffer from instability issues and lack optimality guarantees. This paper overcomes these issues from the perspective of probabilistic inference. We introduce a novel Expectation-Maximization approach to naturally incorporate constraints during policy learning: 1) a provably optimal non-parametric variational distribution can be computed in closed form after a convex optimization (E-step); 2) the policy parameter is improved within the trust region based on the optimal variational distribution (M-step). The proposed algorithm decomposes the safe RL problem into a convex optimization phase and a supervised learning phase, which yields a more stable training performance. A wide range of experiments on continuous robotic tasks shows that the proposed method achieves significantly better constraint satisfaction performance and better sample efficiency than baselines. The code is available at https://github.com/liuzuxin/cvpo-safe-rl.
    Combining Counterfactuals With Shapley Values To Explain Image Models. (arXiv:2206.07087v1 [cs.LG])
    With the widespread use of sophisticated machine learning models in sensitive applications, understanding their decision-making has become an essential task. Models trained on tabular data have witnessed significant progress in explanations of their underlying decision making processes by virtue of having a small number of discrete features. However, applying these methods to high-dimensional inputs such as images is not a trivial task. Images are composed of pixels at an atomic level and do not carry any interpretability by themselves. In this work, we seek to use annotated high-level interpretable features of images to provide explanations. We leverage the Shapley value framework from Game Theory, which has garnered wide acceptance in general XAI problems. By developing a pipeline to generate counterfactuals and subsequently using it to estimate Shapley values, we obtain contrastive and interpretable explanations with strong axiomatic guarantees.
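The Shapley value framework mentioned above can be illustrated with a minimal sketch that computes exact Shapley values by averaging marginal contributions over all feature orderings. This is a generic textbook computation, not the paper's counterfactual pipeline; the feature names (`stripes`, `tail`) and the `toy_value` scores are hypothetical and chosen only for illustration.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    `value` maps a frozenset of players to a real number.
    Only practical for a handful of players (factorial cost).
    """
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}


# Hypothetical toy "model output" for each subset of high-level image features.
TOY_SCORES = {
    frozenset(): 0.0,
    frozenset({"stripes"}): 0.2,
    frozenset({"tail"}): 0.0,
    frozenset({"stripes", "tail"}): 1.0,
}


def toy_value(coalition):
    return TOY_SCORES[frozenset(coalition)]


phi = shapley_values(["stripes", "tail"], toy_value)
print(phi)
```

The attributions sum to the difference between the full-coalition score and the empty-coalition score, which is the efficiency axiom of the Shapley value.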
    Offline Reinforcement Learning Under Value and Density-Ratio Realizability: The Power of Gaps. (arXiv:2203.13935v3 [cs.LG] UPDATED)
    We consider a challenging theoretical problem in offline reinforcement learning (RL): obtaining sample-efficiency guarantees with a dataset lacking sufficient coverage, under only realizability-type assumptions for the function approximators. While the existing theory has addressed learning under realizability and under non-exploratory data separately, no work has been able to address both simultaneously (except for a concurrent work which we compare in detail). Under an additional gap assumption, we provide guarantees to a simple pessimistic algorithm based on a version space formed by marginalized importance sampling (MIS), and the guarantee only requires the data to cover the optimal policy and the function classes to realize the optimal value and density-ratio functions. While similar gap assumptions have been used in other areas of RL theory, our work is the first to identify the utility and the novel mechanism of gap assumptions in offline RL with weak function approximation.
    No More Than 6ft Apart: Robust K-Means via Radius Upper Bounds. (arXiv:2203.02502v2 [cs.LG] UPDATED)
    Centroid based clustering methods such as k-means, k-medoids and k-centers are heavily applied as a go-to tool in exploratory data analysis. In many cases, those methods are used to obtain representative centroids of the data manifold for visualization or summarization of a dataset. Real world datasets often contain inherent abnormalities, e.g., repeated samples and sampling bias, that manifest as imbalanced clustering. We propose to remedy such a scenario by introducing a maximal radius constraint $r$ on the clusters formed by the centroids, i.e., samples from the same cluster should not be more than $2r$ apart in terms of $\ell_2$ distance. We achieve this constraint by solving a semi-definite program, followed by a linear assignment problem with quadratic constraints. Through qualitative results, we show that our proposed method is robust towards dataset imbalances and sampling artifacts. To the best of our knowledge, ours is the first constrained k-means clustering method with hard radius constraints. Code is available at https://bit.ly/kmeans-constrained
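As a rough illustration of the radius constraint described above, the following sketch checks whether a given set of centroids keeps every sample within distance $r$ of its nearest centroid (which, by the triangle inequality, keeps samples of the same cluster at most $2r$ apart). It only verifies the constraint; it does not implement the semi-definite program or the assignment step, and the function names are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
        for p in points
    ]


def radius_violations(points, centroids, r):
    """Indices of points lying farther than r from their assigned centroid."""
    labels = assign(points, centroids)
    return [
        i
        for i, (p, k) in enumerate(zip(points, labels))
        if dist(p, centroids[k]) > r
    ]


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
centroids = [(0.5, 0.0), (5.0, 0.0)]

print(radius_violations(points, centroids, r=0.5))
print(radius_violations(points, centroids, r=0.4))
```

With `r=0.5` no point violates the constraint, while with `r=0.4` the first two points do, which signals that the first cluster would need to be split or its centroid moved.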
    Online Contextual Decision-Making with a Smart Predict-then-Optimize Method. (arXiv:2206.07316v1 [cs.LG])
    We study an online contextual decision-making problem with resource constraints. At each time period, the decision-maker first predicts a reward vector and resource consumption matrix based on a given context vector and then solves a downstream optimization problem to make a decision. The final goal of the decision-maker is to maximize the summation of the reward and the utility from resource consumption, while satisfying the resource constraints. We propose an algorithm that mixes a prediction step based on the "Smart Predict-then-Optimize (SPO)" method with a dual update step based on mirror descent. We prove regret bounds and demonstrate that the overall convergence rate of our method depends on the $\mathcal{O}(T^{-1/2})$ convergence of online mirror descent as well as risk bounds of the surrogate loss function used to learn the prediction model. Our algorithm and regret bounds apply to a general convex feasible region for the resource constraints, including both hard and soft resource constraint cases, and they apply to a wide class of prediction models in contrast to the traditional settings of linear contextual models or finite policy spaces. We also conduct numerical experiments to empirically demonstrate the strength of our proposed SPO-type methods, as compared to traditional prediction-error-only methods, on multi-dimensional knapsack and longest path instances.
    HyperPrompt: Prompt-based Task-Conditioning of Transformers. (arXiv:2203.00759v2 [cs.CL] UPDATED)
    Prompt-Tuning is a new paradigm for finetuning pre-trained language models in a parameter-efficient way. Here, we explore the use of HyperNetworks to generate hyper-prompts: we propose HyperPrompt, a novel architecture for prompt-based task-conditioning of self-attention in Transformers. The hyper-prompts are end-to-end learnable via generation by a HyperNetwork. HyperPrompt allows the network to learn task-specific feature maps where the hyper-prompts serve as task global memories for the queries to attend to, at the same time enabling flexible information sharing among tasks. We show that HyperPrompt is competitive against strong multi-task learning baselines with as few as $0.14\%$ of additional task-conditioning parameters, achieving great parameter and computational efficiency. Through extensive empirical experiments, we demonstrate that HyperPrompt can achieve superior performances over strong T5 multi-task learning baselines and parameter-efficient adapter variants including Prompt-Tuning and HyperFormer++ on Natural Language Understanding benchmarks of GLUE and SuperGLUE across many model sizes.
    Score-based Generative Modeling of Graphs via the System of Stochastic Differential Equations. (arXiv:2202.02514v3 [cs.LG] UPDATED)
    Generating graph-structured data requires learning the underlying distribution of graphs. Yet, this is a challenging problem, and the previous graph generative methods either fail to capture the permutation-invariance property of graphs or cannot sufficiently model the complex dependency between nodes and edges, which is crucial for generating real-world graphs such as molecules. To overcome such limitations, we propose a novel score-based generative model for graphs with a continuous-time framework. Specifically, we propose a new graph diffusion process that models the joint distribution of the nodes and edges through a system of stochastic differential equations (SDEs). Then, we derive novel score matching objectives tailored for the proposed diffusion process to estimate the gradient of the joint log-density with respect to each component, and introduce a new solver for the system of SDEs to efficiently sample from the reverse diffusion process. We validate our graph generation method on diverse datasets, on which it either achieves significantly superior or competitive performance to the baselines. Further analysis shows that our method is able to generate molecules that lie close to the training distribution yet do not violate the chemical valency rule, demonstrating the effectiveness of the system of SDEs in modeling the node-edge relationships. Our code is available at https://github.com/harryjo97/GDSS.
    CITRIS: Causal Identifiability from Temporal Intervened Sequences. (arXiv:2202.03169v3 [cs.LG] UPDATED)
    Understanding the latent causal factors of a dynamical system from visual observations is considered a crucial step towards agents reasoning in complex environments. In this paper, we propose CITRIS, a variational autoencoder framework that learns causal representations from temporal sequences of images in which underlying causal factors have possibly been intervened upon. In contrast to the recent literature, CITRIS exploits temporality and observing intervention targets to identify scalar and multidimensional causal factors, such as 3D rotation angles. Furthermore, by introducing a normalizing flow, CITRIS can be easily extended to leverage and disentangle representations obtained by already pretrained autoencoders. Extending previous results on scalar causal factors, we prove identifiability in a more general setting, in which only some components of a causal factor are affected by interventions. In experiments on 3D rendered image sequences, CITRIS outperforms previous methods on recovering the underlying causal variables. Moreover, using pretrained autoencoders, CITRIS can even generalize to unseen instantiations of causal factors, opening future research areas in sim-to-real generalization for causal representation learning.
    TPC: Transformation-Specific Smoothing for Point Cloud Models. (arXiv:2201.12733v3 [cs.CV] UPDATED)
    Point cloud models with neural network architectures have achieved great success and have been widely used in safety-critical applications, such as Lidar-based recognition systems in autonomous vehicles. However, such models have been shown to be vulnerable to adversarial attacks which aim to apply stealthy semantic transformations such as rotation and tapering to mislead model predictions. In this paper, we propose a transformation-specific smoothing framework TPC, which provides tight and scalable robustness guarantees for point cloud models against semantic transformation attacks. We first categorize common 3D transformations into three categories: additive (e.g., shearing), composable (e.g., rotation), and indirectly composable (e.g., tapering), and we present generic robustness certification strategies for all categories respectively. We then specify unique certification protocols for a range of specific semantic transformations and their compositions. Extensive experiments on several common 3D transformations show that TPC significantly outperforms the state of the art. For example, our framework boosts the certified accuracy against the twisting transformation along the z-axis (within 20$^\circ$) from 20.3$\%$ to 83.8$\%$. Code and models are available at https://github.com/Qianhewu/Point-Cloud-Smoothing.
    NeuPSL: Neural Probabilistic Soft Logic. (arXiv:2205.14268v2 [cs.LG] UPDATED)
    We present Neural Probabilistic Soft Logic (NeuPSL), a novel neuro-symbolic (NeSy) framework that unites state-of-the-art symbolic reasoning with the low-level perception of deep neural networks. To explicitly model the boundary between neural and symbolic representations, we introduce NeSy Energy-Based Models, a general family of energy-based models that combine neural and symbolic reasoning. Using this framework, we show how to seamlessly integrate neural and symbolic parameter learning and inference. We perform an extensive empirical evaluation and show that NeuPSL outperforms existing methods on joint inference and has significantly lower variance in almost all settings.
    GNNRank: Learning Global Rankings from Pairwise Comparisons via Directed Graph Neural Networks. (arXiv:2202.00211v2 [cs.LG] UPDATED)
    Recovering global rankings from pairwise comparisons has wide applications from time synchronization to sports team ranking. Pairwise comparisons corresponding to matches in a competition can be construed as edges in a directed graph (digraph), whose nodes represent e.g. competitors with an unknown rank. In this paper, we introduce neural networks into the ranking recovery problem by proposing the so-called GNNRank, a trainable GNN-based framework with digraph embedding. Moreover, new objectives are devised to encode ranking upsets/violations. The framework involves a ranking score estimation approach, and adds an inductive bias by unfolding the Fiedler vector computation of the graph constructed from a learnable similarity matrix. Experimental results on extensive data sets show that our methods attain competitive and often superior performance against baselines, as well as showing promising transfer ability. Code and preprocessed data are at: https://github.com/SherylHYX/GNNRank.
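To make the notion of a ranking upset concrete, here is a minimal sketch that counts the fraction of observed matches whose outcome contradicts a set of ranking scores. This is a simplified illustration, not the differentiable objectives devised in the paper; the `(winner, loser)` encoding and the function name are assumptions.

```python
def upset_fraction(matches, scores):
    """Fraction of matches whose winner has a strictly lower score than the loser.

    matches: list of (winner, loser) pairs, i.e. directed edges of the digraph.
    scores:  dict mapping competitor -> ranking score (higher means better).
    """
    if not matches:
        return 0.0
    upsets = sum(1 for winner, loser in matches if scores[winner] < scores[loser])
    return upsets / len(matches)


# A cyclic set of results cannot be explained by any ranking without an upset.
matches = [("a", "b"), ("b", "c"), ("c", "a")]
scores = {"a": 3.0, "b": 2.0, "c": 1.0}

print(upset_fraction(matches, scores))
```

Here only the match won by `c` against `a` disagrees with the scores, so one of the three comparisons is an upset.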
    Non-Vacuous Generalisation Bounds for Shallow Neural Networks. (arXiv:2202.01627v3 [cs.LG] UPDATED)
    We focus on a specific class of shallow neural networks with a single hidden layer, namely those with $L_2$-normalised data and either a sigmoid-shaped Gaussian error function ("erf") activation or a Gaussian Error Linear Unit (GELU) activation. For these networks, we derive new generalisation bounds through the PAC-Bayesian theory; unlike most existing such bounds they apply to neural networks with deterministic rather than randomised parameters. Our bounds are empirically non-vacuous when the network is trained with vanilla stochastic gradient descent on MNIST and Fashion-MNIST.
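The network class described above can be sketched in a few lines: a single hidden layer applied to $L_2$-normalised inputs, with either the erf activation or GELU, where $\mathrm{GELU}(x) = \tfrac{1}{2}x\,(1 + \mathrm{erf}(x/\sqrt{2}))$. The absence of bias terms and the helper names are assumptions made for brevity; the sketch shows only the forward pass, not the PAC-Bayesian bound.

```python
from math import erf, sqrt


def erf_activation(x):
    """Sigmoid-shaped Gaussian error function activation."""
    return erf(x)


def gelu(x):
    """Gaussian Error Linear Unit: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + erf(x / sqrt(2.0)))


def shallow_net(x, hidden_weights, output_weights, activation):
    """Forward pass of a one-hidden-layer network on an L2-normalised input.

    Assumption: no bias terms; weights are given as lists of rows.
    """
    norm = sqrt(sum(v * v for v in x))
    x = [v / norm for v in x]
    hidden = [
        activation(sum(w * v for w, v in zip(row, x))) for row in hidden_weights
    ]
    return [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]


out = shallow_net(
    [3.0, 4.0],
    hidden_weights=[[1.0, 0.0], [0.0, 1.0]],
    output_weights=[[1.0, 1.0]],
    activation=erf_activation,
)
print(out)
```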
    A Meta-Analysis of Distributionally-Robust Models. (arXiv:2206.07565v1 [cs.CV])
    State-of-the-art image classifiers trained on massive datasets (such as ImageNet) have been shown to be vulnerable to a range of both intentional and incidental distribution shifts. On the other hand, several recent classifiers with favorable out-of-distribution (OOD) robustness properties have emerged, achieving high accuracy on their target tasks while maintaining their in-distribution accuracy on challenging benchmarks. We present a meta-analysis on a wide range of publicly released models, most of which have been published over the last twelve months. Through this meta-analysis, we empirically identify four main commonalities for all the best-performing OOD-robust models, all of which illuminate the considerable promise of vision-language pre-training.
    Asynchronous SGD Beats Minibatch SGD Under Arbitrary Delays. (arXiv:2206.07638v1 [math.OC])
    The existing analysis of asynchronous stochastic gradient descent (SGD) degrades dramatically when any delay is large, giving the impression that performance depends primarily on the delay. On the contrary, we prove much better guarantees for the same asynchronous SGD algorithm regardless of the delays in the gradients, depending instead just on the number of parallel devices used to implement the algorithm. Our guarantees are strictly better than the existing analyses, and we also argue that asynchronous SGD outperforms synchronous minibatch SGD in the settings we consider. For our analysis, we introduce a novel recursion based on "virtual iterates" and delay-adaptive stepsizes, which allow us to derive state-of-the-art guarantees for both convex and non-convex objectives.
    Flatten the Curve: Efficiently Training Low-Curvature Neural Networks. (arXiv:2206.07144v1 [cs.LG])
    The highly non-linear nature of deep neural networks causes them to be susceptible to adversarial examples and to have unstable gradients, which hinders interpretability. However, existing methods to solve these issues, such as adversarial training, are expensive and often sacrifice predictive accuracy. In this work, we consider curvature, a mathematical quantity that encodes the degree of non-linearity. Using this, we demonstrate low-curvature neural networks (LCNNs) that obtain drastically lower curvature than standard models while exhibiting similar predictive performance, which leads to improved robustness and stable gradients, with only a marginally increased training time. To achieve this, we minimize a data-independent upper bound on the curvature of a neural network, which decomposes overall curvature in terms of curvatures and slopes of its constituent layers. To efficiently minimize this bound, we introduce two novel architectural components: first, a non-linearity called centered-softplus that is a stable variant of the softplus non-linearity, and second, a Lipschitz-constrained batch normalization layer. Our experiments show that LCNNs have lower curvature, more stable gradients and increased off-the-shelf adversarial robustness when compared to their standard high-curvature counterparts, all without affecting predictive performance. Our approach is easy to use and can be readily incorporated into existing neural network models.
    TLDR: Twin Learning for Dimensionality Reduction. (arXiv:2110.09455v2 [cs.CV] UPDATED)
    Dimensionality reduction methods are unsupervised approaches which learn low-dimensional spaces where some properties of the initial space, typically the notion of "neighborhood", are preserved. Such methods usually require propagation on large k-NN graphs or complicated optimization solvers. On the other hand, self-supervised learning approaches, typically used to learn representations from scratch, rely on simple and more scalable frameworks for learning. In this paper, we propose TLDR, a dimensionality reduction method for generic input spaces that ports the recent self-supervised learning framework of Zbontar et al. (2021) to the specific task of dimensionality reduction, over arbitrary representations. We propose to use nearest neighbors to build pairs from a training set and a redundancy reduction loss to learn an encoder that produces representations invariant across such pairs. TLDR is a method that is simple, easy to train, and of broad applicability; it consists of an offline nearest neighbor computation step that can be highly approximated, and a straightforward learning process. Aiming for scalability, we focus on improving linear dimensionality reduction, and show consistent gains on image and document retrieval tasks, e.g. gaining +4% mAP over PCA on ROxford for GeM-AP, improving the performance of DINO on ImageNet or retaining it with a 10x compression.
    Deciphering Environmental Air Pollution with Large Scale City Data. (arXiv:2109.04572v2 [cs.LG] UPDATED)
    Air pollution poses a serious threat to sustainable environmental conditions in the 21st century. Its importance in determining the health and living standards in urban settings is only expected to increase with time. Various factors ranging from artificial emissions to natural phenomena are known to be primary causal agents or influencers behind rising air pollution levels. However, the lack of large scale data involving the major artificial and natural factors has hindered the research on the causes and relations governing the variability of the different air pollutants. Through this work, we introduce a large scale city-wise dataset for exploring the relationships among these agents over a long period of time. We also introduce a transformer-based model, cosSquareFormer, for the problem of pollutant level estimation and forecasting. Our model outperforms most of the benchmark models for this task. We also analyze and explore the dataset through our model and other methodologies to bring out important inferences which enable us to understand the dynamics of the causal agents at a deeper level. Through our paper, we seek to provide a solid foundation for further research into this domain, which will demand critical attention in the near future.
    RieszNet and ForestRiesz: Automatic Debiased Machine Learning with Neural Nets and Random Forests. (arXiv:2110.03031v3 [cs.LG] UPDATED)
    Many causal and policy effects of interest are defined by linear functionals of high-dimensional or non-parametric regression functions. $\sqrt{n}$-consistent and asymptotically normal estimation of the object of interest requires debiasing to reduce the effects of regularization and/or model selection on the object of interest. Debiasing is typically achieved by adding a correction term to the plug-in estimator of the functional, which leads to properties such as semi-parametric efficiency, double robustness, and Neyman orthogonality. We implement an automatic debiasing procedure based on automatically learning the Riesz representation of the linear functional using Neural Nets and Random Forests. Our method only relies on black-box evaluation oracle access to the linear functional and does not require knowledge of its analytic form. We propose a multitasking Neural Net debiasing method with stochastic gradient descent minimization of a combined Riesz representer and regression loss, while sharing representation layers for the two functions. We also propose a Random Forest method which learns a locally linear representation of the Riesz function. Even though our method applies to arbitrary functionals, we experimentally find that it performs well compared to the state-of-the-art neural-net-based algorithm of Shi et al. (2019) for the case of the average treatment effect functional. We also evaluate our method on the problem of estimating average marginal effects with continuous treatments, using semi-synthetic data of gasoline price changes on gasoline demand.
    Private Language Model Adaptation for Speech Recognition. (arXiv:2110.10026v3 [eess.AS] UPDATED)
    Speech model adaptation is crucial to handle the discrepancy between server-side proxy training data and actual data received on local devices of users. With the use of federated learning (FL), we introduce an efficient approach to continuously adapting neural network language models (NNLMs) on private devices with applications to automatic speech recognition (ASR). To address the potential speech transcription errors in the on-device training corpus, we perform empirical studies comparing various strategies of leveraging token-level confidence scores to improve the NNLM quality in the FL settings. Experiments show that compared with no model adaptation, the proposed method achieves relative 2.6% and 10.8% word error rate (WER) reductions on two speech evaluation datasets, respectively. We also provide an analysis evaluating the privacy guarantees of our presented procedure.
    Convergence and Price of Anarchy Guarantees of the Softmax Policy Gradient in Markov Potential Games. (arXiv:2206.07642v1 [cs.MA])
    We study the performance of policy gradient methods for the subclass of Markov games known as Markov potential games (MPGs), which extends the notion of normal-form potential games to the stateful setting and includes the important special case of the fully cooperative setting where the agents share an identical reward function. Our focus in this paper is to study the convergence of the policy gradient method for solving MPGs under softmax policy parameterization, both tabular and parameterized with general function approximators such as neural networks. We first show the asymptotic convergence of this method to a Nash equilibrium of MPGs for tabular softmax policies. Second, we derive the finite-time performance of the policy gradient in two settings: 1) using the log-barrier regularization, and 2) using the natural policy gradient under the best-response dynamics (NPG-BR). Finally, extending the notion of price of anarchy (POA) and smoothness in normal-form games, we introduce the POA for MPGs and provide a POA bound for NPG-BR. To our knowledge, this is the first POA bound for solving MPGs. To support our theoretical results, we empirically compare the convergence rates and POA of policy gradient variants for both tabular and neural softmax policies.
    Diffusion Models for Video Prediction and Infilling. (arXiv:2206.07696v1 [cs.CV])
    To predict and anticipate future outcomes or reason about missing information in a sequence is a key ability for agents to be able to make intelligent decisions. This requires strong temporally coherent generative capabilities. Diffusion models have shown huge success in several generative tasks lately, but have not been extensively explored in the video domain. We present Random-Mask Video Diffusion (RaMViD), which extends image diffusion models to videos using 3D convolutions, and introduces a new conditioning technique during training. By varying the mask we condition on, the model is able to perform video prediction, infilling and upsampling. Since we do not use concatenation to condition on a mask, as done in most conditionally trained diffusion models, we are able to decrease the memory footprint. We evaluated the model on two benchmark datasets for video prediction and one for video generation on which we achieved competitive results. On Kinetics-600 we achieved state-of-the-art for video prediction.
    Ripple Attention for Visual Perception with Sub-quadratic Complexity. (arXiv:2110.02453v2 [cs.CV] UPDATED)
    Transformer architectures are now central to sequence modeling tasks. At their heart is the attention mechanism, which enables effective modeling of long-term dependencies in a sequence. Recently, transformers have been successfully applied in the computer vision domain, where 2D images are first segmented into patches and then treated as 1D sequences. Such linearization, however, impairs the notion of spatial locality in images, which bears important visual clues. To bridge the gap, we propose ripple attention, a sub-quadratic attention mechanism for vision transformers. Built upon the recent kernel-based efficient attention mechanisms, we design a novel dynamic programming algorithm that weights contributions of different tokens to a query with respect to their relative spatial distances in the 2D space in linear observed time. Extensive experiments and analyses demonstrate the effectiveness of ripple attention on various visual tasks.
    Sublinear Algorithms for Hierarchical Clustering. (arXiv:2206.07633v1 [cs.DS])
    Hierarchical clustering over graphs is a fundamental task in data mining and machine learning with applications in domains such as phylogenetics, social network analysis, and information retrieval. Specifically, we consider the recently popularized objective function for hierarchical clustering due to Dasgupta. Previous algorithms for (approximately) minimizing this objective function require linear time/space complexity. In many applications the underlying graph can be massive in size making it computationally challenging to process the graph even using a linear time/space algorithm. As a result, there is a strong interest in designing algorithms that can perform global computation using only sublinear resources. The focus of this work is to study hierarchical clustering for massive graphs under three well-studied models of sublinear computation which focus on space, time, and communication, respectively, as the primary resources to optimize: (1) (dynamic) streaming model where edges are presented as a stream, (2) query model where the graph is queried using neighbor and degree queries, (3) MPC model where the graph edges are partitioned over several machines connected via a communication channel. We design sublinear algorithms for hierarchical clustering in all three models above. At the heart of our algorithmic results is a view of the objective in terms of cuts in the graph, which allows us to use a relaxed notion of cut sparsifiers to do hierarchical clustering while introducing only a small distortion in the objective function. Our main algorithmic contributions are then to show how cut sparsifiers of the desired form can be efficiently constructed in the query model and the MPC model. We complement our algorithmic results by establishing nearly matching lower bounds that rule out the possibility of designing better algorithms in each of these models.
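For reference, Dasgupta's objective assigns to a hierarchy $T$ over a weighted graph the cost $\sum_{(u,v)} w_{uv}\,|\mathrm{leaves}(T[u \vee v])|$, where $T[u \vee v]$ is the subtree rooted at the lowest common ancestor of $u$ and $v$. The sketch below evaluates this cost directly for a small tree given as nested tuples. It is a plain reference evaluation for intuition, not one of the sublinear algorithms of the paper; the tree encoding and function names are assumptions.

```python
def leaves(tree):
    """Set of leaf labels of a tree given as nested tuples (leaves are labels)."""
    if not isinstance(tree, tuple):
        return {tree}
    out = set()
    for child in tree:
        out |= leaves(child)
    return out


def dasgupta_cost(tree, edges):
    """Sum over edges (u, v, w) of w times the leaf count under lca(u, v)."""
    if not isinstance(tree, tuple):
        return 0.0
    child_leaves = [leaves(child) for child in tree]
    n_here = sum(len(s) for s in child_leaves)
    cost = sum(dasgupta_cost(child, edges) for child in tree)
    for u, v, w in edges:
        iu = next((i for i, s in enumerate(child_leaves) if u in s), None)
        iv = next((i for i, s in enumerate(child_leaves) if v in s), None)
        # The edge is separated exactly at this node: this node is lca(u, v).
        if iu is not None and iv is not None and iu != iv:
            cost += w * n_here
    return cost


# Path graph a - b - c with unit weights.
edges = [("a", "b", 1.0), ("b", "c", 1.0)]

print(dasgupta_cost((("a", "b"), "c"), edges))
print(dasgupta_cost((("a", "c"), "b"), edges))
```

Merging the adjacent pair `a, b` first gives a lower cost than merging the non-adjacent pair `a, c` first, which matches the intuition that heavy edges should be cut as low in the tree as possible.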
    Two-stage Human Activity Recognition on Microcontrollers with Decision Trees and CNNs. (arXiv:2206.07652v1 [eess.SP])
    Human Activity Recognition (HAR) has become an increasingly popular task for embedded devices such as smartwatches. Most HAR systems for ultra-low power devices are based on classic Machine Learning (ML) models, whereas Deep Learning (DL), although reaching state-of-the-art accuracy, is less popular due to its high energy consumption, which poses a significant challenge for battery-operated and resource-constrained devices. In this work, we bridge the gap between on-device HAR and DL thanks to a hierarchical architecture composed of a decision tree (DT) and a one-dimensional Convolutional Neural Network (1D CNN). The two classifiers operate in a cascaded fashion on two different sub-tasks: the DT classifies only the easiest activities, while the CNN deals with more complex ones. With experiments on a state-of-the-art dataset and targeting a single-core RISC-V MCU, we show that this approach saves up to 67.7% energy w.r.t. a "stand-alone" DL architecture at iso-accuracy. Additionally, the two-stage system either introduces a negligible memory overhead (up to 200 B) or, on the contrary, reduces the total memory occupation.
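The cascaded control flow described above might look roughly like the following sketch, in which a cheap first stage either emits a final label or defers to a more expensive second stage. Everything here is hypothetical: the `DEFER` marker, the feature name `accel_variance`, the thresholds, and the two stand-in functions, which replace a trained decision tree and a trained 1D CNN.

```python
DEFER = "other"  # hypothetical marker: first stage hands the sample over


def tree_stage(features):
    """Stand-in for a trained decision tree that handles only easy activities."""
    if features["accel_variance"] < 0.05:
        return "still"
    return DEFER


def cnn_stage(window):
    """Stand-in for a 1D CNN that separates the harder activities."""
    mean = sum(window) / len(window)
    return "running" if mean > 1.5 else "walking"


def classify(features, window):
    """Return (label, stage_used); the costly stage runs only when needed."""
    label = tree_stage(features)
    if label != DEFER:
        return label, 1
    return cnn_stage(window), 2


print(classify({"accel_variance": 0.01}, [0.0, 0.0]))
print(classify({"accel_variance": 0.50}, [2.0, 2.0]))
```

The energy saving in such a design comes from the second stage being skipped whenever the first stage is able to decide, so the benefit depends on how often easy activities occur.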
    A General Theory for Client Sampling in Federated Learning. (arXiv:2107.12211v4 [cs.LG] UPDATED)
    While client sampling is a central operation of current state-of-the-art federated learning (FL) approaches, the impact of this procedure on the convergence and speed of FL remains under-investigated. In this work, we provide a general theoretical framework to quantify the impact of a client sampling scheme and of the clients' heterogeneity on the federated optimization. First, we provide a unified theoretical ground for previously reported experimental results of sampling schemes on the relationship between FL convergence and the variance of the aggregation weights. Second, we prove for the first time that the quality of FL convergence is also impacted by the resulting covariance between aggregation weights. Our theory is general, and is here applied to Multinomial Distribution (MD) and Uniform sampling, two default unbiased client sampling schemes of FL, and demonstrated through a series of experiments in non-iid and unbalanced scenarios. Our results suggest that MD sampling should be used as the default sampling scheme, due to its resilience to changes in data ratio during the learning process, while Uniform sampling is superior only in the special case when clients have the same amount of data.
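The unbiasedness of the two sampling schemes can be checked by exact enumeration on a small example. The sketch below assumes the commonly used definitions: MD sampling draws $m$ clients with replacement according to the data ratios $p_i$ and weights each draw by $1/m$, while Uniform sampling picks $m$ of $n$ clients without replacement and weights client $i$ by $(n/m)\,p_i$. These definitions and the function names are assumptions, not taken from the abstract.

```python
from fractions import Fraction
from itertools import combinations, product


def expected_weights_md(p, m):
    """Expected aggregation weight per client under MD sampling."""
    n = len(p)
    expected = [Fraction(0)] * n
    for draw in product(range(n), repeat=m):
        prob = Fraction(1)
        for i in draw:
            prob *= p[i]
        for i in draw:
            expected[i] += prob * Fraction(1, m)
    return expected


def expected_weights_uniform(p, m):
    """Expected aggregation weight per client under Uniform sampling."""
    n = len(p)
    subsets = list(combinations(range(n), m))
    expected = [Fraction(0)] * n
    for subset in subsets:
        for i in subset:
            expected[i] += Fraction(1, len(subsets)) * Fraction(n, m) * p[i]
    return expected


# Data ratios of three clients with unbalanced dataset sizes.
p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]

print(expected_weights_md(p, m=2))
print(expected_weights_uniform(p, m=2))
```

Both schemes recover the data ratios in expectation. Under these definitions the per-round weights always sum to one for MD sampling, whereas for Uniform sampling they sum to one only on average, which is one way the variance of the aggregation weights differs between the two.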
    Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning. (arXiv:2206.07229v1 [cs.SD])
    Emotion classification of speech and assessment of the emotion strength are required in applications such as emotional text-to-speech and voice conversion. The emotion attribute ranking function based on Support Vector Machine (SVM) was proposed to predict emotion strength for an emotional speech corpus. However, the trained ranking function doesn't generalize to new domains, which limits the scope of applications, especially for out-of-domain or unseen speech. In this paper, we propose a data-driven deep learning model, i.e. StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech. This is achieved by the fusion of emotional data from various domains. We follow a multi-task learning network architecture that includes an acoustic encoder, a strength predictor, and an auxiliary emotion predictor. Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech. We release the source code at: https://github.com/ttslr/StrengthNet.
    Wide Bayesian neural networks have a simple weight posterior: theory and accelerated sampling. (arXiv:2206.07673v1 [stat.ML])
    We introduce repriorisation, a data-dependent reparameterisation which transforms a Bayesian neural network (BNN) posterior to a distribution whose KL divergence to the BNN prior vanishes as layer widths grow. The repriorisation map acts directly on parameters, and its analytic simplicity complements the known neural network Gaussian process (NNGP) behaviour of wide BNNs in function space. Exploiting the repriorisation, we develop a Markov chain Monte Carlo (MCMC) posterior sampling algorithm which mixes faster the wider the BNN. This contrasts with the typically poor performance of MCMC in high dimensions. We observe up to 50x higher effective sample size relative to no reparametrisation for both fully-connected and residual networks. Improvements are achieved at all widths, with the margin between reparametrised and standard BNNs growing with layer width.
    Variable Bitrate Neural Fields. (arXiv:2206.07707v1 [cs.CV])
    Neural approximations of scalar and vector fields, such as signed distance functions and radiance fields, have emerged as accurate, high-quality representations. State-of-the-art results are obtained by conditioning a neural approximation with a lookup from trainable feature grids that take on part of the learning task and allow for smaller, more efficient neural networks. Unfortunately, these feature grids usually come at the cost of significantly increased memory consumption compared to stand-alone neural network models. We present a dictionary method for compressing such feature grids, reducing their memory consumption by up to 100x and permitting a multiresolution representation which can be useful for out-of-core streaming. We formulate the dictionary optimization as a vector-quantized auto-decoder problem which lets us learn end-to-end discrete neural representations in a space where no direct supervision is available and with dynamic topology and structure. Our source code will be available at https://github.com/nv-tlabs/vqad.
    Linear Complexity Randomized Self-attention Mechanism. (arXiv:2204.04667v2 [cs.LG] UPDATED)
    Recently, random feature attentions (RFAs) have been proposed to approximate the softmax attention in linear time and space complexity by linearizing the exponential kernel. In this paper, we first propose a novel perspective to understand the bias in such approximation by recasting RFAs as self-normalized importance samplers. This perspective further sheds light on an unbiased estimator for the whole softmax attention, called randomized attention (RA). RA constructs positive random features via query-specific distributions and enjoys greatly improved approximation fidelity, albeit exhibiting quadratic complexity. By combining the expressiveness in RA and the efficiency in RFA, we develop a novel linear complexity self-attention mechanism called linear randomized attention (LARA). Extensive experiments across various domains demonstrate that RA and LARA improve the performance of RFAs by a substantial margin.
    Meaningfully Debugging Model Mistakes using Conceptual Counterfactual Explanations. (arXiv:2106.12723v3 [cs.LG] UPDATED)
    Understanding and explaining the mistakes made by trained models is critical to many machine learning objectives, such as improving robustness, addressing concept drift, and mitigating biases. However, this is often an ad hoc process that involves manually looking at the model's mistakes on many test samples and guessing at the underlying reasons for those incorrect predictions. In this paper, we propose a systematic approach, conceptual counterfactual explanations (CCE), that explains why a classifier makes a mistake on a particular test sample(s) in terms of human-understandable concepts (e.g. this zebra is misclassified as a dog because of faint stripes). We base CCE on two prior ideas: counterfactual explanations and concept activation vectors, and validate our approach on well-known pretrained models, showing that it explains the models' mistakes meaningfully. In addition, for new models trained on data with spurious correlations, CCE accurately identifies the spurious correlation as the cause of model mistakes from a single misclassified test sample. On two challenging medical applications, CCE generated useful insights, confirmed by clinicians, into biases and mistakes the model makes in real-world settings.
    A Survey on Graph Representation Learning Methods. (arXiv:2204.01855v2 [cs.LG] UPDATED)
    Graph representation learning has been a very active research area in recent years. The goal of graph representation learning is to generate graph representation vectors that capture the structure and features of large graphs accurately. This is especially important because the quality of the graph representation vectors will affect the performance of these vectors in downstream tasks such as node classification, link prediction and anomaly detection. Many techniques have been proposed for generating effective graph representation vectors. Two of the most prevalent categories of graph representation learning are graph embedding methods without using graph neural nets (GNN), which we denote as non-GNN based graph embedding methods, and graph neural nets (GNN) based methods. Non-GNN graph embedding methods are based on techniques such as random walks, temporal point processes and neural network learning methods. GNN-based methods, on the other hand, are the application of deep learning on graph data. In this survey, we provide an overview of these two categories and cover the current state-of-the-art methods for both static and dynamic graphs. Finally, we explore some open and ongoing research directions for future work.
    Shifting Capsule Networks from the Cloud to the Deep Edge. (arXiv:2110.02911v2 [cs.LG] UPDATED)
    Capsule networks (CapsNets) are an emerging trend in image processing. In contrast to a convolutional neural network, CapsNets are not vulnerable to object deformation, as the relative spatial information of the objects is preserved across the network. However, their complexity is mainly related to the capsule structure and the dynamic routing mechanism, which makes it almost unreasonable to deploy a CapsNet, in its original form, in a resource-constrained device powered by a small microcontroller (MCU). In an era where intelligence is rapidly shifting from the cloud to the edge, this high complexity imposes serious challenges to the adoption of CapsNets at the very edge. To tackle this issue, we present an API for the execution of quantized CapsNets in Arm Cortex-M and RISC-V MCUs. Our software kernels extend the Arm CMSIS-NN and RISC-V PULP-NN to support capsule operations with 8-bit integers as operands. Along with it, we propose a framework to perform post-training quantization of a CapsNet. Results show a reduction in memory footprint of almost 75%, with accuracy loss ranging from 0.07% to 0.18%. In terms of throughput, our Arm Cortex-M API enables the execution of primary capsule and capsule layers with medium-sized kernels in just 119.94 and 90.60 milliseconds (ms), respectively (STM32H755ZIT6U, Cortex-M7 @ 480 MHz). For the GAP-8 SoC (RISC-V RV32IMCXpulp @ 170 MHz), the latency drops to 7.02 and 38.03 ms, respectively.
    MBGDT: Robust Mini-Batch Gradient Descent. (arXiv:2206.07139v1 [cs.LG])
    In high dimensions, most machine learning methods are fragile even when there are only a few outliers. To address this, we aim to introduce a new method that wraps a base learner, such as Bayesian regression or stochastic gradient descent, to reduce the model's vulnerability to outliers. Because mini-batch gradient descent allows for more robust convergence than batch gradient descent, we build our method on mini-batch gradient descent and call it Mini-Batch Gradient Descent with Trimming (MBGDT). Our method shows state-of-the-art performance and greater robustness than several baselines when applied to a designed dataset.
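The abstract above does not spell out the trimming rule, so the following is only a minimal sketch of one plausible reading: within every mini-batch, the samples with the largest squared residuals are discarded before the gradient step. The function name `trimmed_minibatch_gd`, the `trim_frac` parameter, and the toy one-dimensional regression are all assumptions made for illustration, not the authors' implementation.

```python
import random


def trimmed_minibatch_gd(xs, ys, batch_size=8, trim_frac=0.25, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w * x, ignoring the largest-residual samples in every mini-batch."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Sort by squared residual and keep only the best-fitting samples.
            batch.sort(key=lambda xy: (w * xy[0] - xy[1]) ** 2)
            keep = batch[:max(1, int(len(batch) * (1 - trim_frac)))]
            # Gradient of the mean squared error over the kept samples only.
            grad = sum(2 * (w * x - y) * x for x, y in keep) / len(keep)
            w -= lr * grad
    return w


# Clean data follows y = 3x; two gross outliers are injected (hypothetical toy data).
xs = [i / 10 for i in range(1, 41)]
ys = [3.0 * x for x in xs]
ys[5] = 100.0
ys[20] = -80.0
w = trimmed_minibatch_gd(xs, ys)
```

In this toy setting the two outliers always carry the largest residuals in their batch, so they never contribute to an update and the slope estimate settles near the clean value of 3.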
    Experimental Validation of Spectral-Spatial Power Evolution Design Using Raman Amplifiers. (arXiv:2206.07658v1 [cs.LG])
    We experimentally validate a machine learning-enabled Raman amplification framework, capable of jointly shaping the signal power evolution in two domains: frequency and fiber distance. The proposed experiment addresses the amplification in the whole C-band, by optimizing four first-order counter-propagating Raman pumps.
    MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields. (arXiv:2206.07697v1 [stat.ML])
    Creating fast and accurate force fields is a long-standing challenge in computational chemistry and materials science. Recently, several equivariant message passing neural networks (MPNNs) have been shown to outperform models built using other approaches in terms of accuracy. However, most MPNNs suffer from high computational cost and poor scalability. We propose that these limitations arise because MPNNs only pass two-body messages leading to a direct relationship between the number of layers and the expressivity of the network. In this work, we introduce MACE, a new equivariant MPNN model that uses higher body order messages. In particular, we show that using four-body messages reduces the required number of message passing iterations to just two, resulting in a fast and highly parallelizable model, reaching or exceeding state-of-the-art accuracy on the rMD17, 3BPA, and AcAc benchmark tasks. We also demonstrate that using higher order messages leads to an improved steepness of the learning curves.
    Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone. (arXiv:2206.07643v1 [cs.CV])
    Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, bringing gains in terms of memory and performance. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data. Code is available at https://github.com/microsoft/FIBER.
    Hyperparameter Sensitivity in Deep Outlier Detection: Analysis and a Scalable Hyper-Ensemble Solution. (arXiv:2206.07647v1 [cs.LG])
    Outlier detection (OD) literature exhibits numerous algorithms as it applies to diverse domains. However, given a new detection task, it is unclear how to choose an algorithm to use, or how to set its hyperparameter(s) (HPs) in unsupervised settings. HP tuning is an ever-growing problem with the arrival of many new detectors based on deep learning. While they have appealing properties such as task-driven representation learning and end-to-end optimization, deep models come with a long list of HPs. Surprisingly, the issue of model selection in the outlier mining literature has been "the elephant in the room"; a significant factor in unlocking the utmost potential of deep methods, yet little has been said or done to systematically tackle the issue. In the first part of this paper, we conduct the first large-scale analysis on the HP sensitivity of deep OD methods, and through more than 35,000 trained models, quantitatively demonstrate that model selection is inevitable. Next, we design an HP-robust and scalable deep hyper-ensemble model called ROBOD that assembles models with varying HP configurations, bypassing the choice paralysis. Importantly, we introduce novel strategies to speed up ensemble training, such as parameter sharing, batch/simultaneous training, and data subsampling, that allow us to train fewer models with fewer parameters. Extensive experiments on both image and tabular datasets show that ROBOD achieves and retains robust, state-of-the-art detection performance as compared to its modern counterparts, while taking only 2-10% of the time taken by the naive hyper-ensemble with independent training.
    Sketch-Based Anomaly Detection in Streaming Graphs. (arXiv:2106.04486v2 [cs.DS] UPDATED)
    Given a stream of graph edges from a dynamic graph, how can we assign anomaly scores to edges and subgraphs in an online manner, for the purpose of detecting unusual behavior, using constant time and memory? For example, in intrusion detection, existing work seeks to detect either anomalous edges or anomalous subgraphs, but not both. In this paper, we first extend the count-min sketch data structure to a higher-order sketch. This higher-order sketch has the useful property of preserving the dense subgraph structure (dense subgraphs in the input turn into dense submatrices in the data structure). We then propose 4 online algorithms that utilize this enhanced data structure, which (a) detect both edge and graph anomalies; (b) process each edge and graph in constant memory and constant update time per newly arriving edge; and (c) outperform state-of-the-art baselines on 4 real-world datasets. Our method is the first streaming approach that incorporates dense subgraph search to detect graph anomalies in constant memory and time.
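For readers unfamiliar with the starting point of the abstract above, here is a minimal sketch of an ordinary count-min sketch applied to a stream of edges. It is the standard data structure, not the higher-order extension the paper proposes; the class name, the SHA-256 based row hashes, and the default sizes are assumptions chosen for illustration.

```python
import hashlib


class CountMinSketch:
    """Plain count-min sketch: `depth` hash rows of `width` counters each."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # Derive one hash per row by salting the key with the row number.
        digest = hashlib.sha256(f"{row}:{key}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, key, count=1):
        # Constant work per update: one counter per row.
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key):
        # Collisions can only inflate a counter, so the minimum is the tightest bound.
        return min(self.table[row][self._index(row, key)] for row in range(self.depth))


sketch = CountMinSketch()
edges = [("a", "b"), ("a", "b"), ("a", "c"), ("a", "b")]
for src, dst in edges:
    sketch.add((src, dst))
```

Memory stays fixed at `width * depth` counters regardless of stream length, and `estimate` never undercounts: it returns at least the true frequency of an edge.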
    End-To-End Label Uncertainty Modeling for Speech-based Arousal Recognition Using Bayesian Neural Networks. (arXiv:2110.03299v3 [eess.AS] UPDATED)
    Emotions are subjective constructs. Recent end-to-end speech emotion recognition systems are typically agnostic to the subjective nature of emotions, despite their state-of-the-art performance. In this work, we introduce an end-to-end Bayesian neural network architecture to capture the inherent subjectivity in the arousal dimension of emotional expressions. To the best of our knowledge, this work is the first to use Bayesian neural networks for speech emotion recognition. At training, the network learns a distribution of weights to capture the inherent uncertainty related to subjective arousal annotations. To this end, we introduce a loss term that enables the model to be explicitly trained on a distribution of annotations, rather than training them exclusively on mean or gold-standard labels. We evaluate the proposed approach on the AVEC'16 dataset. Qualitative and quantitative analysis of the results reveals that the proposed model can aptly capture the distribution of subjective arousal annotations, with state-of-the-art results in mean and standard deviation estimations for uncertainty modeling.
    Applications of Generative Adversarial Networks in Neuroimaging and Clinical Neuroscience. (arXiv:2206.07081v1 [cs.LG])
    Generative adversarial networks (GANs) are a powerful type of deep learning model that has been successfully utilized in numerous fields. They belong to a broader family called generative methods, which generate new data with a probabilistic model by learning sample distribution from real examples. In the clinical context, GANs have shown enhanced capabilities in capturing spatially complex, nonlinear, and potentially subtle disease effects compared to traditional generative methods. This review appraises the existing literature on the applications of GANs in imaging studies of various neurological conditions, including Alzheimer's disease, brain tumors, brain aging, and multiple sclerosis. We provide an intuitive explanation of various GAN methods for each application and further discuss the main challenges, open questions, and promising future directions of leveraging GANs in neuroimaging. We aim to bridge the gap between advanced deep learning methods and neurology research by highlighting how GANs can be leveraged to support clinical decision making and contribute to a better understanding of the structural and functional patterns of brain diseases.
    Statistical and Computational Phase Transitions in Group Testing. (arXiv:2206.07640v1 [stat.ML])
    We study the group testing problem where the goal is to identify a set of k infected individuals carrying a rare disease within a population of size n, based on the outcomes of pooled tests which return positive whenever there is at least one infected individual in the tested group. We consider two different simple random procedures for assigning individuals to tests: the constant-column design and Bernoulli design. Our first set of results concerns the fundamental statistical limits. For the constant-column design, we give a new information-theoretic lower bound which implies that the proportion of correctly identifiable infected individuals undergoes a sharp "all-or-nothing" phase transition when the number of tests crosses a particular threshold. For the Bernoulli design, we determine the precise number of tests required to solve the associated detection problem (where the goal is to distinguish between a group testing instance and pure noise), improving both the upper and lower bounds of Truong, Aldridge, and Scarlett (2020). For both group testing models, we also study the power of computationally efficient (polynomial-time) inference procedures. We determine the precise number of tests required for the class of low-degree polynomial algorithms to solve the detection problem. This provides evidence for an inherent computational-statistical gap in both the detection and recovery problems at small sparsity levels. Notably, our evidence is contrary to that of Iliopoulos and Zadik (2021), who predicted the absence of a computational-statistical gap in the Bernoulli design.
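The pooled-test model in the abstract above can be simulated in a few lines. This is a minimal sketch under the Bernoulli design, where each individual joins each test independently with probability `p`; the function names, the parameter values, and the simple elimination step at the end are assumptions for illustration and are not the inference procedures analysed in the paper.

```python
import random


def bernoulli_design(n, num_tests, p, rng):
    """Each individual joins each test independently with probability p."""
    return [[i for i in range(n) if rng.random() < p] for _ in range(num_tests)]


def run_tests(pools, infected):
    """A pooled test is positive iff it contains at least one infected individual."""
    return [any(i in infected for i in pool) for pool in pools]


def definitely_healthy(pools, outcomes):
    """Anyone who appears in a negative test cannot be infected."""
    cleared = set()
    for pool, positive in zip(pools, outcomes):
        if not positive:
            cleared.update(pool)
    return cleared


rng = random.Random(0)
n, k = 100, 3
infected = set(rng.sample(range(n), k))
pools = bernoulli_design(n, num_tests=40, p=0.2, rng=rng)
outcomes = run_tests(pools, infected)
cleared = definitely_healthy(pools, outcomes)
```

Because the tests are noiseless, the cleared set can never contain an infected individual; how many tests are needed before the remaining candidates pin down the infected set is the kind of threshold question the abstract studies.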
    Online Variational Filtering and Parameter Learning. (arXiv:2110.13549v2 [stat.ML] UPDATED)
    We present a variational method for online state estimation and parameter learning in state-space models (SSMs), a ubiquitous class of latent variable models for sequential data. As per standard batch variational techniques, we use stochastic gradients to simultaneously optimize a lower bound on the log evidence with respect to both model parameters and a variational approximation of the states' posterior distribution. However, unlike existing approaches, our method is able to operate in an entirely online manner, such that historic observations do not require revisitation after being incorporated and the cost of updates at each time step remains constant, despite the growing dimensionality of the joint posterior distribution of the states. This is achieved by utilizing backward decompositions of this joint posterior distribution and of its variational approximation, combined with Bellman-type recursions for the evidence lower bound and its gradients. We demonstrate the performance of this methodology across several examples, including high-dimensional SSMs and sequential Variational Auto-Encoders.
    Semi-Supervised Segmentation of Mitochondria from Electron Microscopy Images Using Spatial Continuity. (arXiv:2206.02392v1 [cs.CV] CROSS LISTED)
    Morphology of mitochondria plays critical roles in mediating their physiological functions. Accurate segmentation of mitochondria from 3D electron microscopy (EM) images is essential to quantitative characterization of their morphology at the nanometer scale. Fully supervised deep learning models developed for this task achieve excellent performance but require substantial amounts of annotated data for training. However, manual annotation of EM images is laborious and time-consuming because of their large volumes, limited contrast, and low signal-to-noise ratios (SNRs). To overcome this challenge, we propose a semi-supervised deep learning model that segments mitochondria by leveraging the spatial continuity of their structural, morphological, and contextual information in both labeled and unlabeled images. We use random piecewise affine transformation to synthesize comprehensive and realistic mitochondrial morphology for augmentation of training data. Experiments on the EPFL dataset show that our model achieves performance similar to that of state-of-the-art fully supervised models but requires only ~20% of their annotated training data. Our semi-supervised model is versatile and can also accurately segment other spatially continuous structures from EM images. Data and code of this study are openly accessible at https://github.com/cbmi-group/MPP.
    Prefix Language Models are Unified Modal Learners. (arXiv:2206.07699v1 [cs.CV])
    With the success of vision-language pre-training, we have witnessed the state of the art being pushed on multi-modal understanding and generation. However, the current pre-training paradigm is either incapable of targeting all modalities at once (e.g., text generation and image generation), or requires multi-fold well-designed tasks, which significantly limits the scalability. We demonstrate that a unified modal model could be learned with a prefix language modeling objective upon text and image sequences. Thanks to the simple but powerful pre-training paradigm, our proposed model, DaVinci, is simple to train, scalable to huge data, and adaptable to a variety of downstream tasks across modalities (language / vision / vision+language), types (understanding / generation) and settings (e.g., zero-shot, fine-tuning, linear evaluation) with a single unified architecture. DaVinci achieves competitive performance on a wide range of 26 understanding / generation tasks, and outperforms previous unified vision-language models on most tasks, including ImageNet classification (+1.6%), VQAv2 (+1.4%), COCO caption generation (BLEU@4 +1.1%, CIDEr +1.5%) and COCO image generation (IS +0.9%, FID -1.0%), at comparable model and data scale. Furthermore, we offer a well-defined benchmark for future research by reporting the performance on different scales of the pre-training dataset on a heterogeneous and wide distribution coverage. Our results establish new, stronger baselines for future comparisons at different data scales and shed light on the difficulties of comparing VLP models more generally.
    FastMapSVM: Classifying Complex Objects Using the FastMap Algorithm and Support-Vector Machines. (arXiv:2204.05112v3 [cs.CV] UPDATED)
    Neural Networks and related Deep Learning methods are currently at the leading edge of technologies used for classifying objects. However, they generally demand large amounts of time and data for model training; and their learned models can sometimes be difficult to interpret. In this paper, we advance FastMapSVM -- an interpretable Machine Learning framework for classifying complex objects -- as an advantageous alternative to Neural Networks for general classification tasks. FastMapSVM extends the applicability of Support-Vector Machines (SVMs) to domains with complex objects by combining the complementary strengths of FastMap and SVMs. FastMap is an efficient linear-time algorithm that maps complex objects to points in a Euclidean space while preserving pairwise domain-specific distances between them. We demonstrate the efficiency and effectiveness of FastMapSVM in the context of classifying seismograms. We show that its performance, in terms of precision, recall, and accuracy, is comparable to that of other state-of-the-art methods. However, compared to other methods, FastMapSVM uses significantly smaller amounts of time and data for model training. It also provides a perspicuous visualization of the objects and the classification boundaries between them. We expect FastMapSVM to be viable for classification tasks in many other real-world domains.
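As background for the mapping step described above, the classic FastMap construction computes each coordinate by projecting objects onto the line through two pivot objects using the law of cosines. The sketch below shows a single coordinate only; the function name, the toy string-length distance, and the hand-picked pivots are assumptions for illustration, and the FastMapSVM paper may use a different variant or pivot heuristic.

```python
def fastmap_coordinate(objects, dist, pivot_a, pivot_b):
    """Project every object onto the line through two pivot objects (law of cosines)."""
    d_ab = dist(pivot_a, pivot_b)
    return [
        (dist(pivot_a, o) ** 2 + d_ab ** 2 - dist(pivot_b, o) ** 2) / (2 * d_ab)
        for o in objects
    ]


# Toy domain: strings compared by absolute difference in length (a hypothetical distance).
def length_distance(s, t):
    return abs(len(s) - len(t))


words = ["a", "abc", "abcdef"]
coords = fastmap_coordinate(words, length_distance, pivot_a="a", pivot_b="abcdef")
```

Only a pairwise distance function is needed, never an explicit feature vector, which is what lets the resulting Euclidean points be handed to an SVM for objects such as seismograms.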
    Learning Heuristics for Template-based CEGIS of Loop Invariants with Reinforcement Learning. (arXiv:2107.09766v3 [cs.AI] UPDATED)
    Loop-invariant synthesis is the basis of program verification. Due to the undecidability of the problem in general, a tool for invariant synthesis necessarily uses heuristics. Despite the common belief that the design of heuristics is vital for the performance of a synthesizer, heuristics are often engineered by their developers based on experience and intuition, sometimes in an ad hoc manner. In this work, we propose an approach to systematically learning heuristics for template-based CounterExample-Guided Inductive Synthesis (CEGIS) with reinforcement learning. As a concrete example, we implement the approach on top of PCSat, which is an invariant synthesizer based on template-based CEGIS. Experiments show that PCSat guided by the heuristics learned by our framework not only outperforms existing state-of-the-art CEGIS-based solvers such as HoICE and the neural solver Code2Inv, but also has slight advantages over non-CEGIS-based solvers such as Eldarica and Spacer in linear Constrained Horn Clause (CHC) solving.
    Morphence-2.0: Evasion-Resilient Moving Target Defense Powered by Out-of-Distribution Detection. (arXiv:2206.07321v1 [cs.CR])
    Evasion attacks against machine learning models often succeed via iterative probing of a fixed target model, whereby an attack that succeeds once will succeed repeatedly. One promising approach to counter this threat is making a model a moving target against adversarial inputs. To this end, we introduce Morphence-2.0, a scalable moving target defense (MTD) powered by out-of-distribution (OOD) detection to defend against adversarial examples. By regularly moving the decision function of a model, Morphence-2.0 makes it significantly challenging for repeated or correlated attacks to succeed. Morphence-2.0 deploys a pool of models generated from a base model in a manner that introduces sufficient randomness when it responds to prediction queries. Via OOD detection, Morphence-2.0 is equipped with a scheduling approach that assigns adversarial examples to robust decision functions and benign samples to undefended accurate models. To ensure repeated or correlated attacks fail, the deployed pool of models automatically expires after a query budget is reached and the model pool is seamlessly replaced by a new model pool generated in advance. We evaluate Morphence-2.0 on two benchmark image classification datasets (MNIST and CIFAR10) against 4 reference attacks (3 white-box and 1 black-box). Morphence-2.0 consistently outperforms prior defenses while preserving accuracy on clean data and reducing attack transferability. We also show that, when powered by OOD detection, Morphence-2.0 is able to precisely make an input-based movement of the model's decision function that leads to higher prediction accuracy on both adversarial and benign queries.
    Atrial Fibrillation Detection Using Weight-Pruned, Log-Quantised Convolutional Neural Networks. (arXiv:2206.07649v1 [eess.SP])
    Deep neural networks (DNN) are a promising tool in medical applications. However, the implementation of complex DNNs on battery-powered devices is challenging due to high energy costs for communication. In this work, a convolutional neural network model is developed for detecting atrial fibrillation from electrocardiogram (ECG) signals. The model demonstrates high performance despite being trained on limited, variable-length input data. Weight pruning and logarithmic quantisation are combined to introduce sparsity and reduce model size, which can be exploited for reduced data movement and lower computational complexity. The final model achieved a 91.1% model compression ratio while maintaining high model accuracy of 91.7% and less than 1% loss.
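The two compression steps named in the abstract above can be illustrated on a handful of weights. This is a minimal sketch assuming magnitude-based pruning with a fixed threshold and quantisation to the nearest signed power of two; the threshold value, the base-2 choice, and the function names are assumptions, since the abstract does not give those details.

```python
import math


def prune(weights, threshold):
    """Magnitude pruning: zero out weights whose absolute value is below threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def log_quantise(w):
    """Round a nonzero weight to the nearest signed power of two (in the log domain)."""
    if w == 0.0:
        return 0.0
    exponent = round(math.log2(abs(w)))
    return math.copysign(2.0 ** exponent, w)


weights = [0.9, -0.26, 0.03, 0.5, -0.01]
compressed = [log_quantise(w) for w in prune(weights, threshold=0.05)]
```

After both steps each surviving weight is fully described by a sign and a small integer exponent, and the zeros introduced by pruning can be skipped entirely, which is where the savings in storage and data movement come from.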
    BRIDGE: Byzantine-resilient Decentralized Gradient Descent. (arXiv:1908.08098v3 [stat.ML] UPDATED)
    Machine learning has begun to play a central role in many applications. A multitude of these applications typically also involve datasets that are distributed across multiple computing devices/machines due to either design constraints (e.g., multiagent systems) or computational/privacy reasons (e.g., learning on smartphone data). Such applications often require the learning tasks to be carried out in a decentralized fashion, in which there is no central server that is directly connected to all nodes. In real-world decentralized settings, nodes are prone to undetected failures due to malfunctioning equipment, cyberattacks, etc., which are likely to crash non-robust learning algorithms. The focus of this paper is on robustification of decentralized learning in the presence of nodes that have undergone Byzantine failures. The Byzantine failure model allows faulty nodes to arbitrarily deviate from their intended behaviors, thereby ensuring designs of the most robust of algorithms. But the study of Byzantine resilience within decentralized learning, in contrast to distributed learning, is still in its infancy. In particular, existing Byzantine-resilient decentralized learning methods either do not scale well to large-scale machine learning models, or they lack statistical convergence guarantees that help characterize their generalization errors. In this paper, a scalable, Byzantine-resilient decentralized machine learning framework termed Byzantine-resilient decentralized gradient descent (BRIDGE) is introduced. Algorithmic and statistical convergence guarantees for one variant of BRIDGE are also provided in the paper for both strongly convex problems and a class of nonconvex problems. In addition, large-scale decentralized learning experiments are used to establish that the BRIDGE framework is scalable and it delivers competitive results for Byzantine-resilient convex and nonconvex learning.
    Re-evaluating Word Mover's Distance. (arXiv:2105.14403v3 [cs.LG] UPDATED)
    The word mover's distance (WMD) is a fundamental technique for measuring the similarity of two documents. At its core, WMD takes advantage of the underlying geometry of the word space by employing an optimal transport formulation. The original study on WMD reported that WMD outperforms classical baselines such as bag-of-words (BOW) and TF-IDF by significant margins in various datasets. In this paper, we point out that the evaluation in the original study could be misleading. We re-evaluate the performances of WMD and the classical baselines and find that the classical baselines are competitive with WMD if we employ an appropriate preprocessing, i.e., L1 normalization. In addition, we introduce an analogy between WMD and L1-normalized BOW and find that not only the performance of WMD but also the distance values resemble those of BOW in high dimensional spaces.
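The preprocessing step highlighted in the abstract above is easy to make concrete. The following minimal sketch builds L1-normalized bag-of-words vectors for two toy documents and compares them; the use of the L1 distance for the comparison, the whitespace tokenisation, and the function names are assumptions for illustration rather than the paper's exact evaluation protocol.

```python
from collections import Counter


def l1_normalized_bow(tokens):
    """Bag-of-words vector whose entries sum to 1."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def l1_distance(p, q):
    """L1 distance between two sparse vectors stored as dicts."""
    return sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in set(p) | set(q))


doc_a = l1_normalized_bow("the cat sat on the mat".split())
doc_b = l1_normalized_bow("the cat sat".split())
distance = l1_distance(doc_a, doc_b)
```

Normalizing by the L1 norm turns each document into a probability distribution over words, the same kind of object that WMD transports, so document length no longer dominates the comparison.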
    A Deep Generative Model of Neonatal Cortical Surface Development. (arXiv:2206.07542v1 [q-bio.NC])
    The neonatal cortical surface is known to be affected by preterm birth, and the subsequent changes to cortical organisation have been associated with poorer neurodevelopmental outcomes. Deep Generative models have the potential to lead to clinically interpretable models of disease, but developing these on the cortical surface is challenging since established techniques for learning convolutional filters are inappropriate on non-flat topologies. To close this gap, we implement a surface-based CycleGAN using mixture model CNNs (MoNet) to translate sphericalised neonatal cortical surface features (curvature and T1w/T2w cortical myelin) between different stages of cortical maturity. Results show our method is able to reliably predict changes in individual patterns of cortical organisation at later stages of gestation, validated by comparison to longitudinal data; and translate appearance between preterm and term gestation (> 37 weeks gestation), validated through comparison with a trained term/preterm classifier. Simulated differences in cortical maturation are consistent with observations in the literature.
    Topological Simplification of Signals for Inference and Approximate Reconstruction. (arXiv:2206.07486v1 [eess.SP])
    As Internet of Things (IoT) devices become both cheaper and more powerful, researchers are increasingly finding solutions to their scientific curiosities both financially and computationally feasible. When operating with restricted power or communications budgets, however, devices can only send highly-compressed data. Such circumstances are common for devices placed away from electric grids that can only communicate via satellite, a situation particularly plausible for environmental sensor networks. These restrictions can be further complicated by potential variability in the communications budget, for example a solar-powered device needing to expend less energy when transmitting data on a cloudy day. We propose a novel, topology-based, lossy compression method well-equipped for these restrictive yet variable circumstances. This technique, Topological Signal Compression, allows sending compressed signals that utilize the entirety of a variable communications budget. To demonstrate our algorithm's capabilities, we perform entropy calculations as well as a classification exercise on increasingly topologically simplified signals from the Free-Spoken Digit Dataset and explore the stability of the resulting performance against common baselines.
    Achieving Downstream Fairness with Geometric Repair. (arXiv:2203.07490v2 [cs.LG] UPDATED)
    We study a fair machine learning (ML) setting where an 'upstream' model developer is tasked with producing a fair ML model that will be used by several similar but distinct 'downstream' users. This setting introduces new challenges that are unaddressed by many existing fairness interventions, echoing existing critiques that current methods are not broadly applicable across the diversifying needs of real-world fair ML use cases. To this end, we address the up/down stream setting by adopting a distribution-based view of fair classification. Specifically, we introduce a new fairness definition, distributional parity, that measures disparities in the distribution of outcomes across protected groups, and present a post-processing method to minimize this measure using techniques from optimal transport. We show that our method creates fairer outcomes for all downstream users, across a variety of fairness definitions, and works at inference time on unlabeled data. We verify this claim experimentally, through comparison to several similar methods and across four benchmark tasks. Ultimately we argue that fairer classification outcomes can be produced through the development of setting-specific interventions.
    Training a neural network with exciton-polariton optical nonlinearity. (arXiv:2107.11156v2 [cs.LG] UPDATED)
    In contrast to software simulations of neural networks, hardware implementations often have limited or no tunability. While such networks promise great improvements in terms of speed and energy efficiency, their performance is limited by the difficulty of applying efficient training. We propose and realize experimentally an optical system where highly efficient backpropagation training can be applied through an array of highly nonlinear, non-tunable nodes. The system includes exciton-polariton nodes realizing nonlinear activation functions. We demonstrate a high classification accuracy in the MNIST handwritten digit benchmark in a single hidden layer system.
    Neural Network Compatible Off-Policy Natural Actor-Critic Algorithm. (arXiv:2110.10017v3 [cs.LG] UPDATED)
    Learning optimal behavior from existing data is one of the most important problems in Reinforcement Learning (RL). This is known as "off-policy control" in RL where an agent's objective is to compute an optimal policy based on the data obtained from the given policy (known as the behavior policy). As the optimal policy can be very different from the behavior policy, learning optimal behavior is very hard in the "off-policy" setting compared to the "on-policy" setting where new data from the policy updates will be utilized in learning. This work proposes an off-policy natural actor-critic algorithm that utilizes state-action distribution correction for handling the off-policy behavior and the natural policy gradient for sample efficiency. The existing natural gradient-based actor-critic algorithms with convergence guarantees require fixed features for approximating both policy and value functions. This often leads to sub-optimal learning in many RL applications. On the other hand, our proposed algorithm utilizes compatible features that enable one to use arbitrary neural networks to approximate the policy and the value function and guarantee convergence to a locally optimal policy. We illustrate the benefit of the proposed off-policy natural gradient algorithm by comparing it with the vanilla gradient actor-critic algorithm on benchmark RL tasks.
    Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information. (arXiv:2110.08420v2 [cs.CL] UPDATED)
    Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. However, this comparison provides little understanding of how difficult each instance in a given distribution is, or what attributes make the dataset difficult for a given model. To address these questions, we frame dataset difficulty -- w.r.t. a model $\mathcal{V}$ -- as the lack of $\mathcal{V}$-$\textit{usable information}$ (Xu et al., 2019), where a lower value indicates a more difficult dataset for $\mathcal{V}$. We further introduce $\textit{pointwise $\mathcal{V}$-information}$ (PVI) for measuring the difficulty of individual instances w.r.t. a given distribution. While standard evaluation metrics typically only compare different models for the same dataset, $\mathcal{V}$-$\textit{usable information}$ and PVI also permit the converse: for a given model $\mathcal{V}$, we can compare different datasets, as well as different instances/slices of the same dataset. Furthermore, our framework allows for the interpretability of different input attributes via transformations of the input, which we use to discover annotation artefacts in widely-used NLP benchmarks.
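    The quantities named in this abstract can be written out as follows. This is a sketch of the standard definitions from the $\mathcal{V}$-information literature rather than text from the abstract: the symbols $g$, $g'$ and $\varnothing$ (a null input) are notation assumed here for illustration.

```latex
% V-usable information: how much easier Y becomes for model family V once X is given.
I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y) - H_{\mathcal{V}}(Y \mid X)

% Pointwise V-information for a single instance (x, y):
%   g  is the model in V fitted with the input replaced by a null input,
%   g' is the model in V fitted on the real inputs.
\mathrm{PVI}(x \to y) = -\log_2 g[\varnothing](y) + \log_2 g'[x](y)

% Averaging PVI over the data distribution recovers the dataset-level quantity.
I_{\mathcal{V}}(X \to Y) = \mathbb{E}_{(x, y)}\big[\mathrm{PVI}(x \to y)\big]
```

    Under these definitions, a low or negative PVI marks an instance that the model family finds hard (or that may be mislabelled), which is how the instance-level comparison described above can be read.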
    Masked Frequency Modeling for Self-Supervised Visual Pre-Training. (arXiv:2206.07706v1 [cs.CV])
    We present Masked Frequency Modeling (MFM), a unified frequency-domain-based approach for self-supervised pre-training of visual models. Instead of randomly inserting mask tokens into the input embeddings in the spatial domain, in this paper, we shift the perspective to the frequency domain. Specifically, MFM first masks out a portion of frequency components of the input image and then predicts the missing frequencies on the frequency spectrum. Our key insight is that, because of the heavy spatial redundancy of images, predicting masked components in the frequency domain is better suited to revealing underlying image patterns than predicting masked patches in the spatial domain. Our findings suggest that with the right configuration of mask-and-predict strategy, both the structural information within high-frequency components and the low-level statistics among low-frequency counterparts are useful in learning good representations. For the first time, MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even using none of the following: (i) extra data, (ii) extra model, (iii) mask token. Experimental results on ImageNet and several robustness benchmarks show the competitive performance and advanced robustness of MFM compared with recent masked image modeling approaches. Furthermore, we also comprehensively investigate the effectiveness of classical image restoration tasks for representation learning from a unified frequency perspective and reveal their intriguing relations with our MFM approach. Project page: https://www.mmlab-ntu.com/project/mfm/index.html.
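    The masking step described above (remove part of the spectrum, keep the rest as the corrupted input) can be illustrated with a small NumPy sketch. The circular mask, the `radius` parameter and the function name `mask_frequencies` are assumptions made for this example; the abstract does not specify the mask shape, and the prediction network is omitted entirely.

```python
import numpy as np

def mask_frequencies(image, radius, keep="low"):
    """Keep only low (or only high) frequencies of a 2D image; return (image, mask)."""
    h, w = image.shape
    # Centre the zero-frequency component so the mask is a disc around the middle.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low = dist <= radius
    mask = low if keep == "low" else ~low
    # Zero the masked-out components and go back to the spatial domain.
    corrupted = np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real
    return corrupted, mask

rng = np.random.default_rng(0)
image = rng.random((32, 32))                      # stand-in for a grayscale image
low_pass, low_mask = mask_frequencies(image, radius=8, keep="low")
high_pass, _ = mask_frequencies(image, radius=8, keep="high")
```

    Because the two masks are complementary, `low_pass + high_pass` reproduces the original image; a model trained in this style would see one part and be asked to predict the spectrum of the other.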
    Scale-free Unconstrained Online Learning for Curved Losses. (arXiv:2202.05630v2 [cs.LG] UPDATED)
    A sequence of works in unconstrained online convex optimisation have investigated the possibility of adapting simultaneously to the norm $U$ of the comparator and the maximum norm $G$ of the gradients. In full generality, matching upper and lower bounds are known which show that this comes at the unavoidable cost of an additive $G U^3$, which is not needed when either $G$ or $U$ is known in advance. Surprisingly, recent results by Kempka et al. (2019) show that no such price for adaptivity is needed in the specific case of $1$-Lipschitz losses like the hinge loss. We follow up on this observation by showing that there is in fact never a price to pay for adaptivity if we specialise to any of the other common supervised online learning losses: our results cover log loss, (linear and non-parametric) logistic regression, square loss prediction, and (linear and non-parametric) least-squares regression. We also fill in several gaps in the literature by providing matching lower bounds with an explicit dependence on $U$. In all cases we obtain scale-free algorithms, which are suitably invariant under rescaling of the data. Our general goal is to establish achievable rates without concern for computational efficiency, but for linear logistic regression we also provide an adaptive method that is as efficient as the recent non-adaptive algorithm by Agarwal et al. (2021).
    Exploring Chemical Space with Score-based Out-of-distribution Generation. (arXiv:2206.07632v1 [q-bio.BM])
    A well-known limitation of existing works on molecule generation is that the generated molecules highly resemble those in the training set. To generate truly novel molecules with completely different structures that may have even better properties than known molecules for de novo drug discovery, more powerful exploration in the chemical space is necessary. To this end, we propose Molecular Out-Of-distribution Diffusion (MOOD), a novel score-based diffusion scheme that incorporates out-of-distribution (OOD) control in the generative stochastic differential equation (SDE) with simple control of a hyperparameter, and thus requires no additional computational cost, unlike existing methods (e.g., RL-based methods). However, some novel molecules may be chemically implausible, or may not meet the basic requirements of real-world drugs. Thus, MOOD performs conditional generation by utilizing the gradients from a property prediction network that guides the reverse-time diffusion to high-scoring regions according to multiple target properties such as protein-ligand interactions, drug-likeness, and synthesizability. This allows MOOD to search for novel and meaningful molecules rather than generating unseen yet trivial ones. We experimentally validate that MOOD is able to explore the chemical space beyond the training distribution, generating molecules that outscore ones found with existing methods, and even the top 0.01% of the original training pool.
    RepNAS: Searching for Efficient Re-parameterizing Blocks. (arXiv:2109.03508v4 [cs.LG] UPDATED)
    In the past years, significant improvements have been made in the field of neural architecture search (NAS). However, it is still challenging to search for efficient networks because of the gap between the constraint used during search and the real inference time. To search for a high-performance network with low inference time, several previous works set a computational complexity constraint for the search algorithm. However, many factors affect the speed of inference (e.g., FLOPs, MACs), and the correlation between any single indicator and the latency is not strong. Recently, some re-parameterization (Rep) techniques have been proposed to convert multi-branch architectures into inference-friendly single-path architectures. Nevertheless, multi-branch architectures are still human-defined and inefficient. In this work, we propose a new search space that is suitable for structural re-parameterization techniques. RepNAS, a one-stage NAS approach, is presented to efficiently search for the optimal diverse branch block (ODBB) for each layer under a branch number constraint. Our experimental results show that the searched ODBB can easily surpass the manually designed diverse branch block (DBB) with efficient training.
    Double Robustness for Complier Parameters and a Semiparametric Test for Complier Characteristics. (arXiv:1909.05244v6 [stat.ML] UPDATED)
    We study low dimensional complier parameters that are identified using a binary instrumental variable $Z$, which is valid conditional on a possibly high dimensional vector of covariates $X$. We characterize the doubly robust moment function for the entire class of complier parameters defined by Abadie (2003) by combining two classic formulations: the Wald formula and the $\kappa$ weight. In particular, we reinterpret the $\kappa$ weight as the Riesz representer to the Wald formula, which appears to be a new insight. The main result includes new cases such as average complier characteristics. We use the main result to propose a hypothesis test, free of functional form restrictions, to evaluate (i) whether two different instruments induce compliers with the same observable characteristics on average, and (ii) whether compliers have observable characteristics that are the same as the full population on average. By developing this hypothesis test, we equip empirical researchers with a new robustness check.
    Differentiable Top-k Classification Learning. (arXiv:2206.07290v1 [cs.LG])
    The top-k classification accuracy is one of the core metrics in machine learning. Here, k is conventionally a positive integer, such as 1 or 5, leading to top-1 or top-5 training objectives. In this work, we relax this assumption and optimize the model for multiple k simultaneously instead of using a single k. Leveraging recent advances in differentiable sorting and ranking, we propose a differentiable top-k cross-entropy classification loss. This allows training the network while not only considering the top-1 prediction, but also, e.g., the top-2 and top-5 predictions. We evaluate the proposed loss function for fine-tuning on state-of-the-art architectures, as well as for training from scratch. We find that relaxing k not only produces better top-5 accuracies, but also leads to top-1 accuracy improvements. When fine-tuning publicly available ImageNet models, we achieve a new state-of-the-art for these models.
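    For reference, the (non-differentiable) metric that this abstract starts from can be computed for several values of k at once. The sketch below only illustrates the evaluation metric; it is not the differentiable loss proposed in the paper, and the function name `top_k_accuracy` and the toy scores are made up for this example.

```python
import numpy as np

def top_k_accuracy(logits, labels, k):
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    top_k = np.argsort(-logits, axis=1)[:, :k]          # indices of the k best classes
    return float(np.mean((top_k == labels[:, None]).any(axis=1)))

# Toy scores for three samples and three classes.
logits = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5]])
labels = np.array([1, 1, 0])
accuracies = {k: top_k_accuracy(logits, labels, k) for k in (1, 2, 3)}
```

    The hard `argsort` is what makes this metric unsuitable as a training objective; the abstract's approach replaces it with differentiable sorting and ranking operators.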
    On the fast convergence of minibatch heavy ball momentum. (arXiv:2206.07553v1 [cs.LG])
    Simple stochastic momentum methods are widely used in machine learning optimization, but their good practical performance is at odds with an absence of theoretical guarantees of acceleration in the literature. In this work, we aim to close the gap between theory and practice by showing that stochastic heavy ball momentum, which can be interpreted as a randomized Kaczmarz algorithm with momentum, retains the fast linear rate of (deterministic) heavy ball momentum on quadratic optimization problems, at least when minibatching with a sufficiently large batch size is used. The analysis relies on carefully decomposing the momentum transition matrix, and using new spectral norm concentration bounds for products of independent random matrices. We provide numerical experiments to demonstrate that our bounds are reasonably sharp.
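    The setting analysed here can be sketched on a small least-squares problem. The code below is an illustrative minibatch heavy ball loop, not the authors' experiment: the problem size, step size, momentum and batch size are arbitrary choices for this example, and the system is made consistent so that every minibatch gradient vanishes at the solution.

```python
import numpy as np

def minibatch_heavy_ball(A, b, batch_size, step, momentum, iters, seed=0):
    """Minimise (1 / 2n) * ||A x - b||^2 with minibatch heavy ball momentum."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    x_prev = x.copy()
    for _ in range(iters):
        idx = rng.choice(n, size=batch_size, replace=False)      # sample a minibatch of rows
        grad = A[idx].T @ (A[idx] @ x - b[idx]) / batch_size      # minibatch gradient
        x_next = x - step * grad + momentum * (x - x_prev)        # heavy ball update
        x_prev, x = x, x_next
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5))
x_true = rng.standard_normal(5)
b = A @ x_true                                   # consistent linear system
x_hat = minibatch_heavy_ball(A, b, batch_size=50, step=0.1, momentum=0.5, iters=500)
error = np.linalg.norm(x_hat - x_true)
```

    Shrinking `batch_size` toward 1 in this sketch is a simple way to probe the abstract's caveat that the fast rate is retained only when the batch is sufficiently large.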
    Bayesian Federated Learning via Predictive Distribution Distillation. (arXiv:2206.07562v1 [cs.LG])
    For most existing federated learning algorithms, each round consists of minimizing a loss function at each client to learn an optimal model at the client, followed by aggregating these client models at the server. Point estimation of the model parameters at the clients does not take into account the uncertainty in the models estimated at each client. In many situations, however, especially in limited data settings, it is beneficial to take into account the uncertainty in the client models for more accurate and robust predictions. Uncertainty also provides useful information for other important tasks, such as active learning and out-of-distribution (OOD) detection. We present a framework for Bayesian federated learning where each client infers the posterior predictive distribution using its training data and present various ways to aggregate these client-specific predictive distributions at the server. Since communicating and aggregating predictive distributions can be challenging and expensive, our approach is based on distilling each client's predictive distribution into a single deep neural network. This enables us to carry advances in standard federated learning over to Bayesian federated learning as well. Unlike some recent works that have tried to estimate model uncertainty of each client, our work also does not make any restrictive assumptions, such as the form of the client's posterior distribution. We evaluate our approach on classification in the federated setting, as well as on active learning and OOD detection in federated settings, where our approach outperforms various existing federated learning baselines.
    Characteristic kernels on Hilbert spaces, Banach spaces, and on sets of measures. (arXiv:2206.07588v1 [stat.ML])
    We present new classes of positive definite kernels on non-standard spaces that are integrally strictly positive definite or characteristic. In particular, we discuss radial kernels on separable Hilbert spaces, and introduce broad classes of kernels on Banach spaces and on metric spaces of strong negative type. The general results are used to give explicit classes of kernels on separable $L^p$ spaces and on sets of measures.
    Knowledge Management System with NLP-Assisted Annotations: A Brief Survey and Outlook. (arXiv:2206.07304v1 [cs.DB])
    Knowledge management systems are in high demand for industrial researchers, chemical or research enterprises, or evidence-based decision making. However, existing systems have limitations in categorizing and organizing paper insights or relationships. Traditional databases are usually disjoint from logging systems, which limits their utility in generating concise, collated overviews. In this work, we briefly survey existing approaches to this problem space and propose a unified framework that utilizes relational databases to log hierarchical information to facilitate the research and writing process, or generate useful knowledge from references or insights from connected concepts. This knowledge management framework enables novel functionalities encompassing improved hierarchical notetaking, AI-assisted brainstorming, and multi-directional relationships. Potential applications include managing inventories and changes for manufacturing or research enterprises, or generating analytic reports with evidence-based decision making.
    Finite-Sample Guarantees for High-Dimensional DML. (arXiv:2206.07386v1 [econ.EM])
    Debiased machine learning (DML) offers an attractive way to estimate treatment effects in observational settings, where identification of causal parameters requires a conditional independence or unconfoundedness assumption, since it allows flexible control for a potentially very large number of covariates. This paper gives novel finite-sample guarantees for joint inference on high-dimensional DML, bounding how far the finite-sample distribution of the estimator is from its asymptotic Gaussian approximation. These guarantees are useful to applied researchers, as they are informative about how far off the coverage of joint confidence bands can be from the nominal level. There are many settings where high-dimensional causal parameters may be of interest, such as the ATE of many treatment profiles, or the ATE of a treatment on many outcomes. We also cover infinite-dimensional parameters, such as impacts on the entire marginal distribution of potential outcomes. The finite-sample guarantees in this paper complement the existing results on consistency and asymptotic normality of DML estimators, which are either asymptotic or treat only the one-dimensional case.
    FOLD-TR: A Scalable and Efficient Inductive Learning Algorithm for Learning To Rank. (arXiv:2206.07295v1 [cs.LG])
    FOLD-R++ is a new inductive learning algorithm for binary classification tasks. It generates an (explainable) normal logic program for mixed type (numerical and categorical) data. We present a customized FOLD-R++ algorithm with the ranking framework, called FOLD-TR, that aims to rank new items following the ranking pattern in the training data. Like FOLD-R++, the FOLD-TR algorithm is able to handle mixed-type data directly and provide native justification to explain the comparison between a pair of items.
    Contrastive Learning as Goal-Conditioned Reinforcement Learning. (arXiv:2206.07568v1 [cs.LG])
    In reinforcement learning (RL), it is easier to solve a task if given a good representation. While deep RL should automatically acquire such good representations, prior work often finds that learning representations in an end-to-end fashion is unstable and instead equips RL algorithms with additional representation learning parts (e.g., auxiliary losses, data augmentation). How can we design RL algorithms that directly acquire good representations? In this paper, instead of adding representation learning parts to an existing RL algorithm, we show (contrastive) representation learning methods can be cast as RL algorithms in their own right. To do this, we build upon prior work and apply contrastive representation learning to action-labeled trajectories, in such a way that the (inner product of) learned representations exactly corresponds to a goal-conditioned value function. We use this idea to reinterpret a prior RL method as performing contrastive learning, and then use the idea to propose a much simpler method that achieves similar performance. Across a range of goal-conditioned RL tasks, we demonstrate that contrastive RL methods achieve higher success rates than prior non-contrastive methods, including in the offline RL setting. We also show that contrastive RL outperforms prior methods on image-based tasks, without using data augmentation or auxiliary objectives.
    Brownian Noise Reduction: Maximizing Privacy Subject to Accuracy Constraints. (arXiv:2206.07234v1 [cs.LG])
    There is a disconnect between how researchers and practitioners handle privacy-utility tradeoffs. Researchers primarily operate from a privacy first perspective, setting strict privacy requirements and minimizing risk subject to these constraints. Practitioners often desire an accuracy first perspective, possibly satisfied with the greatest privacy they can get subject to obtaining sufficiently small error. Ligett et al. have introduced a "noise reduction" algorithm to address the latter perspective. The authors show that by adding correlated Laplace noise and progressively reducing it on demand, it is possible to produce a sequence of increasingly accurate estimates of a private parameter while only paying a privacy cost for the least noisy iterate released. In this work, we generalize noise reduction to the setting of Gaussian noise, introducing the Brownian mechanism. The Brownian mechanism works by first adding Gaussian noise of high variance corresponding to the final point of a simulated Brownian motion. Then, at the practitioner's discretion, noise is gradually decreased by tracing back along the Brownian path to an earlier time. Our mechanism is more naturally applicable to the common setting of bounded $\ell_2$-sensitivity, empirically outperforms existing work on common statistical tasks, and provides customizable control of privacy loss over the entire interaction with the practitioner. We complement our Brownian mechanism with ReducedAboveThreshold, a generalization of the classical AboveThreshold algorithm that provides adaptive privacy guarantees. Overall, our results demonstrate that one can meet utility constraints while still maintaining strong levels of privacy.
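    The noise-reduction idea described above (release the noisiest estimate first, then walk back along the same Brownian path to less noisy ones) can be sketched as follows. This is only an illustration of the correlated-noise construction: the function name, the time grid and the scalar `value` are assumptions for this example, and no privacy accounting or sensitivity calibration is performed.

```python
import numpy as np

def brownian_noise_reduction(value, times, seed=0):
    """Return (times, releases) ordered from the noisiest release to the least noisy.

    Each release is value + B_t for one simulated Brownian path B, so Var[B_t] = t
    and later (less noisy) releases are correlated with the earlier ones.
    """
    rng = np.random.default_rng(seed)
    times = np.sort(np.asarray(times, dtype=float))
    # Independent Gaussian increments over consecutive time intervals, starting at t = 0.
    increments = rng.standard_normal(times.size) * np.sqrt(np.diff(times, prepend=0.0))
    path = np.cumsum(increments)                  # B_t at each requested time
    # Start at the final (largest) time and trace the path back toward t = 0.
    return times[::-1], value + path[::-1]

times_desc, releases = brownian_noise_reduction(10.0, [0.5, 1.0, 4.0])
```

    A practitioner following the accuracy-first workflow would inspect `releases[0]` first and move on to `releases[1]`, `releases[2]` only if the error is still too large.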
    DiffWire: Inductive Graph Rewiring via the Lovász Bound. (arXiv:2206.07369v1 [cs.LG])
    Graph Neural Networks (GNNs) have been shown to achieve competitive results to tackle graph-related tasks, such as node and graph classification, link prediction and node and graph clustering in a variety of domains. Most GNNs use a message passing framework and hence are called MPNNs. Despite their promising results, MPNNs have been reported to suffer from over-smoothing, over-squashing and under-reaching. Graph rewiring and graph pooling have been proposed in the literature as solutions to address these limitations. However, most state-of-the-art graph rewiring methods fail to preserve the global topology of the graph, are not differentiable (inductive) and require the tuning of hyper-parameters. In this paper, we propose DiffWire, a novel framework for graph rewiring in MPNNs that is principled, fully differentiable and parameter-free by leveraging the Lovász bound. Our approach provides a unified theory for graph rewiring by proposing two new, complementary layers in MPNNs: first, CTLayer, a layer that learns the commute times and uses them as a relevance function for edge re-weighting; second, GAPLayer, a layer to optimize the spectral gap, depending on the nature of the network and the task at hand. We empirically validate the value of our proposed approach and each of these layers separately with benchmark datasets for graph classification. DiffWire brings together the learnability of commute times to related definitions of curvature, opening the door to the development of more expressive MPNNs.
    Machines Explaining Linear Programs. (arXiv:2206.07194v1 [cs.LG])
    There has been a recent push in making machine learning models more interpretable so that their performance can be trusted. Although successful, these methods have mostly focused on the deep learning methods while the fundamental optimization methods in machine learning such as linear programs (LP) have been left out. Even if LPs can be considered as whitebox or clearbox models, they are not easy to understand in terms of relationships between inputs and outputs. As a linear program only provides the optimal solution to an optimization problem, further explanations are often helpful. In this work, we extend the attribution methods for explaining neural networks to linear programs. These methods explain the model by providing relevance scores for the model inputs, to show the influence of each input on the output. Alongside using classical gradient-based attribution methods we also propose a way to adapt perturbation-based attribution methods to LPs. Our evaluations of several different linear and integer problems showed that attribution methods can generate useful explanations for linear programs. However, we also demonstrate that using a neural attribution method directly might come with some drawbacks, as the properties of these methods on neural networks do not necessarily transfer to linear programs. The methods can also struggle if a linear program has more than one optimal solution, as a solver just returns one possible solution. Our results can hopefully be used as a good starting point for further research in this direction.
    Test-Time Adaptation for Visual Document Understanding. (arXiv:2206.07240v1 [cs.CV])
    Self-supervised pretraining has been able to produce transferable representations for various visual document understanding (VDU) tasks. However, the ability of such representations to adapt to new distribution shifts at test-time has not been studied yet. We propose DocTTA, a novel test-time adaptation approach for documents that leverages cross-modality self-supervised learning via masked visual language modeling as well as pseudo labeling to adapt models learned on a \textit{source} domain to an unlabeled \textit{target} domain at test time. We also introduce new benchmarks using existing public datasets for various VDU tasks including entity recognition, key-value extraction, and document visual question answering tasks, where DocTTA improves the source model performance by up to 1.79\% (F1 score), 3.43\% (F1 score), and 17.68\% (ANLS score), respectively, while drastically reducing calibration error on target data.
    QONNX: Representing Arbitrary-Precision Quantized Neural Networks. (arXiv:2206.07527v1 [cs.LG])
    We present extensions to the Open Neural Network Exchange (ONNX) intermediate representation format to represent arbitrary-precision quantized neural networks. We first introduce support for low precision quantization in existing ONNX-based quantization formats by leveraging integer clipping, resulting in two new backward-compatible variants: the quantized operator format with clipping and quantize-clip-dequantize (QCDQ) format. We then introduce a novel higher-level ONNX format called quantized ONNX (QONNX) that introduces three new operators -- Quant, BipolarQuant, and Trunc -- in order to represent uniform quantization. By keeping the QONNX IR high-level and flexible, we enable targeting a wider variety of platforms. We also present utilities for working with QONNX, as well as examples of its usage in the FINN and hls4ml toolchains. Finally, we introduce the QONNX model zoo to share low-precision quantized neural networks.
    ARES: Locally Adaptive Reconstruction-based Anomaly Scoring. (arXiv:2206.07604v1 [cs.LG])
    How can we detect anomalies: that is, samples that significantly differ from a given set of high-dimensional data, such as images or sensor data? This is a practical problem with numerous applications and is also relevant to the goal of making learning algorithms more robust to unexpected inputs. Autoencoders are a popular approach, partly due to their simplicity and their ability to perform dimension reduction. However, the anomaly scoring function is not adaptive to the natural variation in reconstruction error across the range of normal samples, which hinders their ability to detect real anomalies. In this paper, we empirically demonstrate the importance of local adaptivity for anomaly scoring in experiments with real data. We then propose our novel Adaptive Reconstruction Error-based Scoring approach, which adapts its scoring based on the local behaviour of reconstruction error over the latent space. We show that this improves anomaly detection performance over relevant baselines in a wide variety of benchmark datasets.
    Rethinking Initialization of the Sinkhorn Algorithm. (arXiv:2206.07630v1 [stat.ML])
    Computing an optimal transport (OT) coupling between distributions plays an increasingly important role in machine learning. While OT problems can be solved as linear programs, adding an entropic smoothing term is known to result in solvers that are faster and more robust to outliers, differentiable and easier to parallelize. The Sinkhorn fixed point algorithm is the cornerstone of these approaches, and, as a result, multiple attempts have been made to shorten its runtime using, for instance, annealing, momentum or acceleration. The premise of this paper is that \textit{initialization} of the Sinkhorn algorithm has received comparatively little attention, possibly due to two preconceptions: as the regularized OT problem is convex, it may not be worth crafting a tailored initialization as \textit{any} is guaranteed to work; secondly, because the Sinkhorn algorithm is often differentiated in end-to-end pipelines, data-dependent initializations could potentially bias gradient estimates obtained by unrolling iterations. We challenge this conventional wisdom and show that carefully chosen initializations can result in dramatic speed-ups, and will not bias gradients which are computed with implicit differentiation. We detail how initializations can be recovered from closed-form or approximate OT solutions, using known results in the 1D or Gaussian settings. We show empirically that these initializations can be used off-the-shelf, with little to no tuning, and result in consistent speed-ups for a variety of OT problems.
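    To make the role of the initialization concrete, here is a minimal Sinkhorn loop that accepts an initial dual potential. It is a sketch under stated assumptions: the paper derives initializers from closed-form 1D or Gaussian solutions, whereas this example simply reuses the potential from a previous solve as a stand-in warm start, and the function signature, `eps` value and toy data are invented for illustration.

```python
import numpy as np

def sinkhorn(a, b, C, eps, f_init=None, tol=1e-6, max_iter=10_000):
    """Entropic OT by Sinkhorn scaling; returns (coupling, dual potential f, iterations)."""
    K = np.exp(-C / eps)                                   # Gibbs kernel
    f0 = np.zeros_like(a) if f_init is None else f_init
    u = np.exp(f0 / eps)                                   # scaling from the initial potential
    for it in range(1, max_iter + 1):
        v = b / (K.T @ u)                                  # match column marginals
        u = a / (K @ v)                                    # match row marginals
        P = u[:, None] * K * v[None, :]
        if np.abs(P.sum(axis=0) - b).sum() < tol:          # L1 error on column marginals
            break
    return P, eps * np.log(u), it

rng = np.random.default_rng(0)
x, y = rng.random(50), rng.random(50)
C = (x[:, None] - y[None, :]) ** 2                         # squared-distance cost on [0, 1]
a = np.full(50, 1 / 50)
b = np.full(50, 1 / 50)
P_cold, f_cold, it_cold = sinkhorn(a, b, C, eps=0.1)                 # default zero initialization
P_warm, f_warm, it_warm = sinkhorn(a, b, C, eps=0.1, f_init=f_cold)  # reuse a good potential
```

    Comparing `it_cold` with `it_warm` shows how much of the iteration count depends on where the potential starts, which is the quantity a data-dependent initializer tries to reduce.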
    Loss Functions for Classification using Structured Entropy. (arXiv:2206.07122v1 [stat.ML])
    Cross-entropy loss is the standard metric used to train classification models in deep learning and gradient boosting. It is well-known that this loss function fails to account for similarities between the different values of the target. We propose a generalization of entropy called {\em structured entropy} which uses a random partition to incorporate the structure of the target variable in a manner which retains many theoretical properties of standard entropy. We show that a structured cross-entropy loss yields better results on several classification problems where the target variable has an a priori known structure. The approach is simple, flexible, easily computable, and does not rely on a hierarchically defined notion of structure.
    Principal Trade-off Analysis. (arXiv:2206.07520v1 [cs.GT])
    This paper develops Principal Trade-off Analysis (PTA), a decomposition method, analogous to Principal Component Analysis (PCA), which permits the representation of any game as the weighted sum of disc games (continuous R-P-S games). Applying PTA to empirically generated tournament graphs produces a sequence of embeddings into orthogonal 2D feature planes representing independent strategic trade-offs. Each trade-off generates a mode of cyclic competition. Like PCA, PTA provides optimal low rank estimates of the tournament graphs that can be truncated for approximation. The complexity of cyclic competition can be quantified by computing the number of significant cyclic modes. We illustrate the PTA via application to a pair of games (Blotto, Pokemon). The resulting 2D disc game representations are shown to be well suited for visualization and are easily interpretable. In Blotto, PTA identifies game symmetries, and specifies strategic trade-offs associated with distinct win conditions. For Pokemon, PTA embeddings produce clusters in the embedding space that naturally correspond to Pokemon types, a design in the game that produces cyclic trade-offs.
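    The decomposition into disc games can be illustrated numerically for a skew-symmetric payoff matrix. The sketch below is one way to obtain such a decomposition, via the eigendecomposition of the Hermitian matrix `1j * F`; whether the paper computes it this way is not stated in the abstract, and the function name and the random test matrix are assumptions for this example.

```python
import numpy as np

def disc_game_embeddings(F, tol=1e-10):
    """Split a real skew-symmetric matrix F into 2D disc-game embeddings.

    Returns a list of (a, b) pairs, strongest cyclic mode first, such that
    F equals the sum of outer(a, b) - outer(b, a) over all pairs.
    """
    lam, V = np.linalg.eigh(1j * F)          # 1j * F is Hermitian; eigenvalues come in +/- pairs
    embeddings = []
    for k in np.argsort(-lam):               # largest eigenvalue first
        if lam[k] <= tol:
            break                            # remaining eigenvalues are zero or mirrored negatives
        scale = np.sqrt(2.0 * lam[k])
        a = scale * V[:, k].imag             # first coordinate of each strategy in this plane
        b = scale * V[:, k].real             # second coordinate of each strategy in this plane
        embeddings.append((a, b))
    return embeddings

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
F = M - M.T                                  # F[i, j]: advantage of strategy i over strategy j
modes = disc_game_embeddings(F)
F_rebuilt = sum(np.outer(a, b) - np.outer(b, a) for a, b in modes)
F_rank2 = np.outer(*modes[0]) - np.outer(*modes[0][::-1])   # keep only the strongest mode
```

    Each `(a, b)` pair places every strategy at a point in a 2D plane, and truncating the list after the first few pairs gives the low-rank approximation that the abstract compares to PCA.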
    Calibrating Agent-based Models to Microdata with Graph Neural Networks. (arXiv:2206.07570v1 [cs.MA])
    Calibrating agent-based models (ABMs) to data is among the most fundamental requirements to ensure the model fulfils its desired purpose. In recent years, simulation-based inference methods have emerged as powerful tools for performing this task when the model likelihood function is intractable, as is often the case for ABMs. In some real-world use cases of ABMs, both the observed data and the ABM output consist of the agents' states and their interactions over time. In such cases, there is a tension between the desire to make full use of the rich information content of such granular data on the one hand, and the need to reduce the dimensionality of the data to prevent difficulties associated with high-dimensional learning tasks on the other. A possible resolution is to construct lower-dimensional time-series through the use of summary statistics describing the macrostate of the system at each time point. However, a poor choice of summary statistics can result in an unacceptable loss of information from the original dataset, dramatically reducing the quality of the resulting calibration. In this work, we instead propose to learn parameter posteriors associated with granular microdata directly using temporal graph neural networks. We will demonstrate that such an approach offers highly compelling inductive biases for Bayesian inference using the raw ABM microstates as output.
    Learning to Accelerate Partial Differential Equations via Latent Global Evolution. (arXiv:2206.07681v1 [cs.LG])
    Simulating the time evolution of Partial Differential Equations (PDEs) of large-scale systems is crucial in many scientific and engineering domains such as fluid dynamics, weather forecasting and their inverse optimization problems. However, both classical solvers and recent deep learning-based surrogate models are typically extremely computationally intensive, because of their local evolution: they need to update the state of each discretized cell at each time step during inference. Here we develop Latent Evolution of PDEs (LE-PDE), a simple, fast and scalable method to accelerate the simulation and inverse optimization of PDEs. LE-PDE learns a compact, global representation of the system and efficiently evolves it fully in the latent space with learned latent evolution models. LE-PDE achieves speed-up by having a much smaller latent dimension to update during long rollout as compared to updating in the input space. We introduce new learning objectives to effectively learn such latent dynamics to ensure long-term stability. We further introduce techniques for speeding-up inverse optimization of boundary conditions for PDEs via backpropagation through time in latent space, and an annealing technique to address the non-differentiability and sparse interaction of boundary conditions. We test our method in a 1D benchmark of nonlinear PDEs, 2D Navier-Stokes flows into turbulent phase and an inverse optimization of boundary conditions in 2D Navier-Stokes flow. Compared to state-of-the-art deep learning-based surrogate models and other strong baselines, we demonstrate up to 128x reduction in the dimensions to update, and up to 15x improvement in speed, while achieving competitive accuracy.
    Fair Ranking as Fair Division: Impact-Based Individual Fairness in Ranking. (arXiv:2206.07247v1 [cs.IR])
    Rankings have become the primary interface in two-sided online markets. Many have noted that the rankings not only affect the satisfaction of the users (e.g., customers, listeners, employers, travelers), but that the position in the ranking allocates exposure -- and thus economic opportunity -- to the ranked items (e.g., articles, products, songs, job seekers, restaurants, hotels). This has raised questions of fairness to the items, and most existing works have addressed fairness by explicitly linking item exposure to item relevance. However, we argue that any particular choice of such a link function may be difficult to defend, and we show that the resulting rankings can still be unfair. To avoid these shortcomings, we develop a new axiomatic approach that is rooted in principles of fair division. This not only avoids the need to choose a link function, but also more meaningfully quantifies the impact on the items beyond exposure. Our axioms of envy-freeness and dominance over uniform ranking postulate that for a fair ranking policy every item should prefer their own rank allocation over that of any other item, and that no item should be actively disadvantaged by the rankings. To compute ranking policies that are fair according to these axioms, we propose a new ranking objective related to the Nash Social Welfare. We show that the solution has guarantees regarding its envy-freeness, its dominance over uniform rankings for every item, and its Pareto optimality. In contrast, we show that conventional exposure-based fairness can produce large amounts of envy and have a highly disparate impact on the items. Beyond these theoretical results, we illustrate empirically how our framework controls the trade-off between impact-based individual item fairness and user utility.  ( 2 min )
    Body Gesture Recognition to Control a Social Robot. (arXiv:2206.07538v1 [cs.RO])
    In this work, we propose a gesture-based language that allows humans to interact with robots using their body in a natural way. We have created a new gesture detection model using neural networks and a custom dataset of humans performing a set of body gestures to train our network. Furthermore, we compare body gesture communication with other communication channels to acknowledge the importance of adding this knowledge to robots. The presented approach is extensively validated in diverse simulations and real-life experiments with non-trained volunteers. It attains remarkable results and proves to be a valuable framework for social robotics applications, such as human-robot collaboration or human-robot interaction.
    Epistemic Deep Learning. (arXiv:2206.07609v1 [cs.LG])
    The belief function approach to uncertainty quantification, as proposed in the Dempster-Shafer theory of evidence, is built upon the general mathematical models for set-valued observations, called random sets. Set-valued predictions are the most natural representations of uncertainty in machine learning. In this paper, we introduce a concept called epistemic deep learning based on the random-set interpretation of belief functions to model epistemic learning in deep neural networks. We propose a novel random-set convolutional neural network for classification that produces scores for sets of classes by learning set-valued ground truth representations. We evaluate different formulations of entropy and distance measures for belief functions as viable loss functions for these random-set networks. We also discuss methods for evaluating the quality of epistemic predictions and the performance of epistemic random-set neural networks. We demonstrate through experiments that the epistemic approach produces better performance results when compared to traditional approaches of estimating uncertainty.
    Classification of ECG based on Hybrid Features using CNNs for Wearable Applications. (arXiv:2206.07648v1 [eess.SP])
    Sudden cardiac death and arrhythmia account for a large percentage of all deaths worldwide. Electrocardiography (ECG) is the most widely used screening tool for cardiovascular diseases. Traditionally, ECG signals are classified manually, which requires experience and great skill while being time-consuming and prone to error. Thus, machine learning algorithms have been widely adopted because of their ability to perform complex data analysis. Features derived from the points of interest in ECG, mainly Q, R, and S, are widely used for arrhythmia detection. In this work, we demonstrate improved performance for ECG classification using hybrid features and three different models, building on a 1-D convolutional neural network (CNN) model that we had proposed in the past. A model based on RR interval features, proposed in this work, achieved an accuracy of 98.98%, which is an improvement over the baseline model. To make the model immune to noise, we updated the model using frequency features and achieved good sustained performance in the presence of noise with a slightly lower accuracy of 98.69%. Further, another model combining the frequency features and the RR interval features was developed, which achieved a high accuracy of 99% with good sustained performance in noisy environments. Due to its high accuracy and noise immunity, the proposed model, which combines multiple hybrid features, is well suited for ambulatory wearable sensing applications.
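    As a rough illustration of RR interval features, the sketch below derives per-beat features from a list of R-peak times. The specific feature set (previous interval, next interval, their ratio, and a mean-normalized interval) and the function name are assumptions chosen for illustration; the abstract does not state which RR features the authors use.

```python
def rr_features(r_peak_times):
    """Compute simple per-beat RR features from R-peak times in seconds.

    Assumption: the features below are a common, illustrative choice,
    not the feature set described in the paper.
    """
    if len(r_peak_times) < 3:
        raise ValueError("need at least three R peaks")
    # RR interval = time between consecutive R peaks.
    rr = [t1 - t0 for t0, t1 in zip(r_peak_times, r_peak_times[1:])]
    mean_rr = sum(rr) / len(rr)
    features = []
    # One feature row per beat that has both a preceding and a following interval.
    for i in range(1, len(rr)):
        features.append({
            "pre_rr": rr[i - 1],
            "post_rr": rr[i],
            "ratio": rr[i - 1] / rr[i],
            "pre_rr_norm": rr[i - 1] / mean_rr,
        })
    return features


peaks = [0.0, 0.8, 1.6, 2.0, 3.0]  # hypothetical R-peak times (seconds)
for row in rr_features(peaks):
    print({k: round(v, 3) for k, v in row.items()})
```

    In this hypothetical trace, the third row pairs a short preceding interval with a long following one, the kind of irregularity that interval-based features are meant to expose.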
    Benefits of Additive Noise in Composing Classes with Bounded Capacity. (arXiv:2206.07199v1 [stat.ML])
    We observe that given two (compatible) classes of functions $\mathcal{F}$ and $\mathcal{H}$ with small capacity as measured by their uniform covering numbers, the capacity of the composition class $\mathcal{H} \circ \mathcal{F}$ can become prohibitively large or even unbounded. We then show that adding a small amount of Gaussian noise to the output of $\mathcal{F}$ before composing it with $\mathcal{H}$ can effectively control the capacity of $\mathcal{H} \circ \mathcal{F}$, offering a general recipe for modular design. To prove our results, we define new notions of uniform covering number of random functions with respect to the total variation and Wasserstein distances. We instantiate our results for the case of multi-layer sigmoid neural networks. Preliminary empirical results on MNIST dataset indicate that the amount of noise required to improve over existing uniform bounds can be numerically negligible (i.e., element-wise i.i.d. Gaussian noise with standard deviation $10^{-240}$). The source codes are available at https://github.com/fathollahpour/composition_noise.
    BIO-CXRNET: A Robust Multimodal Stacking Machine Learning Technique for Mortality Risk Prediction of COVID-19 Patients using Chest X-Ray Images and Clinical Data. (arXiv:2206.07595v1 [eess.IV])
    Fast and accurate detection of the disease can significantly help in reducing the strain on the healthcare facility of any country to reduce the mortality during any pandemic. The goal of this work is to create a multimodal system using a novel machine learning framework that uses both Chest X-ray (CXR) images and clinical data to predict severity in COVID-19 patients. In addition, the study presents a nomogram-based scoring technique for predicting the likelihood of death in high-risk patients. This study uses 25 biomarkers and CXR images in predicting the risk in 930 COVID-19 patients admitted during the first wave of COVID-19 (March-June 2020) in Italy. The proposed multimodal stacking technique produced a precision, sensitivity, and F1-score of 89.03%, 90.44%, and 89.03%, respectively, in identifying low- or high-risk patients. This multimodal approach improved the accuracy by 6% in comparison to the CXR image or clinical data alone. Finally, a nomogram scoring system using multivariate logistic regression was used to stratify the mortality risk among the high-risk patients identified in the first stage. Lactate Dehydrogenase (LDH), O2 percentage, White Blood Cells (WBC) Count, Age, and C-reactive protein (CRP) were identified as useful predictors using a random forest feature selection model. A nomogram score based on five predictor parameters and a CXR image was developed for quantifying the probability of death and categorizing patients into two risk groups: survived (<50%) and death (>=50%). The multimodal technique was able to predict the death probability of high-risk patients with an F1 score of 92.88%. The areas under the curve for the development and validation cohorts are 0.981 and 0.939, respectively.
    Deep Multi-Task Networks For Occluded Pedestrian Pose Estimation. (arXiv:2206.07510v1 [cs.CV])
    Most existing works on pedestrian pose estimation do not consider estimating the pose of occluded pedestrians, as annotations of the occluded parts are not available in relevant automotive datasets. For example, CityPersons, a well-known dataset for pedestrian detection in automotive scenes, does not provide pose annotations, whereas MS-COCO, a non-automotive dataset, contains human pose annotations. In this work, we propose a multi-task framework to extract pedestrian features through detection and instance segmentation tasks performed separately on these two distributions. Thereafter, an encoder learns pose-specific features using an unsupervised instance-level domain adaptation method for the pedestrian instances from both distributions. The proposed framework improves on the state-of-the-art performance in pose estimation, pedestrian detection, and instance segmentation.
    Sparse Subspace Clustering in Diverse Multiplex Network Model. (arXiv:2206.07602v1 [stat.ML])
    The paper considers the DIverse MultiPLEx (DIMPLE) network model, introduced in Pensky and Wang (2021), where all layers of the network have the same collection of nodes and are equipped with Stochastic Block Models. In addition, all layers can be partitioned into groups with the same community structures, although the layers in the same group may have different matrices of block connection probabilities. The DIMPLE model generalizes a multitude of papers that study multilayer networks with the same community structures in all layers, as well as the Mixture Multilayer Stochastic Block Model (MMLSBM), where the layers in the same group have identical matrices of block connection probabilities. While Pensky and Wang (2021) applied spectral clustering to the proxy of the adjacency tensor, the present paper uses Sparse Subspace Clustering (SSC) for identifying groups of layers with identical community structures. Under mild conditions, the latter leads to strongly consistent between-layer clustering. In addition, SSC can handle much larger networks than the methodology of Pensky and Wang (2021), and is well suited to parallel computing.
    Resource-Constrained Edge AI with Early Exit Prediction. (arXiv:2206.07269v1 [cs.LG])
    By leveraging the data sample diversity, the early-exit network has recently emerged as a prominent neural network architecture to accelerate the deep learning inference process. However, intermediate classifiers of the early exits introduce additional computation overhead, which is unfavorable for resource-constrained edge artificial intelligence (AI). In this paper, we propose an early exit prediction mechanism to reduce the on-device computation overhead in a device-edge co-inference system supported by early-exit networks. Specifically, we design a low-complexity module, namely the Exit Predictor, to guide some distinctly "hard" samples to bypass the computation of the early exits. Besides, considering the varying communication bandwidth, we extend the early exit prediction mechanism for latency-aware edge inference, which adapts the prediction thresholds of the Exit Predictor and the confidence thresholds of the early-exit network via a few simple regression models. Extensive experimental results demonstrate the effectiveness of the Exit Predictor in achieving a better tradeoff between accuracy and on-device computation overhead for early-exit networks. Besides, compared with the baseline methods, the proposed method for latency-aware edge inference attains higher inference accuracy under different bandwidth conditions.  ( 2 min )
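    The control flow described above can be sketched as follows. This is a minimal illustration under stated assumptions: the callables `exit_predictor`, `early_head`, and `full_model`, the threshold values, and the returned route labels are all hypothetical stand-ins, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def infer(sample, exit_predictor, early_head, full_model,
          pred_threshold=0.5, conf_threshold=0.9):
    """Route a sample through an early exit or the full model.

    Assumptions (hypothetical interfaces):
      exit_predictor(sample) -> estimated probability that the early exit succeeds
      early_head(sample)     -> class logits from the intermediate classifier
      full_model(sample)     -> predicted class label from the complete network
    """
    # Distinctly "hard" samples skip the early-exit classifier entirely,
    # which saves its computation on the device.
    if exit_predictor(sample) >= pred_threshold:
        probs = softmax(early_head(sample))
        confidence = max(probs)
        if confidence >= conf_threshold:
            return probs.index(confidence), "early_exit"
    return full_model(sample), "full_model"


# Hypothetical stubs standing in for trained components.
label, route = infer("easy", lambda s: 0.9, lambda s: [5.0, 0.0], lambda s: 1)
print(label, route)
```

    In the latency-aware variant described in the abstract, the two thresholds would be adapted to the available bandwidth rather than fixed as they are in this sketch.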
    Autonomous Platoon Control with Integrated Deep Reinforcement Learning and Dynamic Programming. (arXiv:2206.07536v1 [eess.SY])
    Deep Reinforcement Learning (DRL) is regarded as a potential method for car-following control and has been mostly studied to support a single following vehicle. However, it is more challenging to learn a stable and efficient car-following policy when there are multiple following vehicles in a platoon, especially with unpredictable leading vehicle behavior. In this context, we adopt an integrated DRL and Dynamic Programming (DP) approach to learn autonomous platoon control policies, which embeds the Deep Deterministic Policy Gradient (DDPG) algorithm into a finite-horizon value iteration framework. Although the DP framework can improve the stability and performance of DDPG, it has the limitations of lower sampling and training efficiency. In this paper, we propose an algorithm, namely Finite-Horizon-DDPG with Sweeping through reduced state space using Stationary approximation (FH-DDPG-SS), which uses three key ideas to overcome the above limitations, i.e., transferring network weights backward in time, stationary policy approximation for earlier time steps, and sweeping through reduced state space. In order to verify the effectiveness of FH-DDPG-SS, simulation using real driving data is performed, where the performance of FH-DDPG-SS is compared with those of the benchmark algorithms. Finally, platoon safety and string stability for FH-DDPG-SS are demonstrated.
    Subsurface Depths Structure Maps Reconstruction with Generative Adversarial Networks. (arXiv:2206.07388v1 [physics.geo-ph])
    This paper describes a method for reconstructing detailed-resolution depth structure maps, usually obtained from 3D seismic surveys, using data from 2D seismic depth maps. The method uses two algorithms based on the generative adversarial network architecture. The first algorithm, StyleGAN2-ADA, first accumulates semantic images of mountainous terrain forms in the hidden space of the neural network and then, with the help of transfer learning, ideally the structural geometry of stratigraphic horizons. The second algorithm, the Pixel2Style2Pixel encoder, uses the semantic level of generalization of the first algorithm to learn to reconstruct the original high-resolution images from their degraded copies (super-resolution). A methodological approach is demonstrated for transferring knowledge of the structural forms of stratigraphic horizon boundaries from well-studied areas to underexplored ones. Using the multimodal synthesis of the Pixel2Style2Pixel encoder, it is proposed to create a probabilistic depth space, where each point of the project area is represented by the probability density of the depth distribution over equally probable reconstructed geological forms of structural images. The reconstruction quality was assessed for two blocks. Using this method, credible detailed depth reconstructions comparable in quality to 3D seismic maps have been obtained from 2D seismic maps.  ( 2 min )
    An Intelligent Assistant for Converting City Requirements to Formal Specification. (arXiv:2206.07152v1 [cs.AI])
    As more and more monitoring systems are deployed in smart cities, there is a growing demand for automatically converting new human-specified requirements into machine-understandable formal specifications. However, these human-specified requirements are often written in English and contain missing, inaccurate, or ambiguous information. In this paper, we present CitySpec, an intelligent assistant system for requirement specification in smart cities. CitySpec not only helps bridge the language gap between English requirements and formal specifications, but also offers solutions for missing, inaccurate, or ambiguous information. The goal of this paper is to demonstrate how CitySpec works. Specifically, we present three demos: (1) interactive completion of requirements in CitySpec; (2) human-in-the-loop correction when CitySpec encounters exceptions; (3) online learning in CitySpec.  ( 2 min )
    Modern Machine-Learning Predictive Models for Diagnosing Infectious Diseases. (arXiv:2206.07365v1 [cs.LG])
    Controlling infectious diseases is a major health priority because they can spread and infect humans, thus evolving into epidemics or pandemics. Therefore, early detection of infectious diseases is a significant need, and many researchers have developed models to diagnose them in the early stages. This paper reviewed research articles for recent machine-learning (ML) algorithms applied to infectious disease diagnosis. We searched the Web of Science, ScienceDirect, PubMed, Springer, and IEEE databases from 2015 to 2022, identified the pros and cons of the reviewed ML models, and discussed the possible recommendations to advance the studies in this field. We found that most of the articles used small datasets, and few of them used real-time data. Our results demonstrated that a suitable ML technique depends on the nature of the dataset and the desired goal.  ( 2 min )
    Towards Goal, Feasibility, and Diversity-Oriented Deep Generative Models in Design. (arXiv:2206.07170v1 [cs.LG])
    Deep Generative Machine Learning Models (DGMs) have been growing in popularity across the design community thanks to their ability to learn and mimic complex data distributions. DGMs are conventionally trained to minimize statistical divergence between the distribution over generated data and distribution over the dataset on which they are trained. While sufficient for the task of generating "realistic" fake data, this objective is typically insufficient for design synthesis tasks. Instead, design problems typically call for adherence to design requirements, such as performance targets and constraints. Advancing DGMs in engineering design requires new training objectives which promote engineering design objectives. In this paper, we present the first Deep Generative Model that simultaneously optimizes for performance, feasibility, diversity, and target achievement. We benchmark performance of the proposed method against several Deep Generative Models over eight evaluation metrics that focus on feasibility, diversity, and satisfaction of design performance targets. Methods are tested on a challenging multi-objective bicycle frame design problem with skewed, multimodal data of different datatypes. The proposed framework was found to outperform all Deep Generative Models in six of eight metrics.  ( 2 min )
    A smile is all you need: Predicting limiting activity coefficients from SMILES with natural language processing. (arXiv:2206.07048v1 [physics.chem-ph])
    Knowledge of mixtures' phase equilibria is crucial in nature and technical chemistry. Phase equilibria calculations of mixtures require activity coefficients. However, experimental data on activity coefficients is often limited due to high cost of experiments. For an accurate and efficient prediction of activity coefficients, machine learning approaches have been recently developed. However, current machine learning approaches still extrapolate poorly for activity coefficients of unknown molecules. In this work, we introduce the SMILES-to-Properties-Transformer (SPT), a natural language processing network to predict binary limiting activity coefficients from SMILES codes. To overcome the limitations of available experimental data, we initially train our network on a large dataset of synthetic data sampled from COSMO-RS (10 Million data points) and then fine-tune the model on experimental data (20 870 data points). This training strategy enables SPT to accurately predict limiting activity coefficients even for unknown molecules, cutting the mean prediction error in half compared to state-of-the-art models for activity coefficient predictions such as COSMO-RS, UNIFAC, and improving on recent machine learning approaches.  ( 2 min )
    CARD: Classification and Regression Diffusion Models. (arXiv:2206.07275v1 [stat.ML])
    Learning the distribution of a continuous or categorical response variable $\boldsymbol y$ given its covariates $\boldsymbol x$ is a fundamental problem in statistics and machine learning. Deep neural network-based supervised learning algorithms have made great progress in predicting the mean of $\boldsymbol y$ given $\boldsymbol x$, but they are often criticized for their inability to accurately capture the uncertainty of their predictions. In this paper, we introduce classification and regression diffusion (CARD) models, which combine a denoising diffusion-based conditional generative model and a pre-trained conditional mean estimator, to accurately predict the distribution of $\boldsymbol y$ given $\boldsymbol x$. We demonstrate the outstanding ability of CARD in conditional distribution prediction with both toy examples and real-world datasets. The experimental results show that CARD in general outperforms state-of-the-art methods, including Bayesian neural network-based ones that are designed for uncertainty estimation, especially when the conditional distribution of $\boldsymbol y$ given $\boldsymbol x$ is multi-modal.  ( 2 min )
    XAI Establishes a Common Ground Between Machine Learning and Causality. (arXiv:2110.02395v2 [cs.LG] UPDATED)
    A handful of recent works have argued on the connection between machine learning and causality. In a reverse thought process, starting from the grounding of mental models in causal models, we strengthen these initial works with results that suggest XAI essentially requiring machine learning to learn models that are causally consistent with the task at hand. By recognizing how human mental models (HMM) are naturally represented by the Pearlian Structural Causal Model (SCM), we make two key observations through the construction of an example metric space for linear SCM: first, that the notion of a "true" data-underlying SCM is justified, and second, that an aggregation of human-derived SCM might point to said "true" SCM. Motivated by the implications of these insights, we conclude with a third observation which argues that interpretations derived from HMM must imply interpretability in the SCM framework. Following this intuition, we present an original derivation using these priorly established first principles to reveal a human-readable interpretation scheme consistent with the given SCM, justifying the naming Structural Causal Interpretations (SCI). Going further, we analyze these SCIs and their mathematical properties theoretically and empirically. We prove that any existing graph induction method (GIM) is in fact interpretable in the SCI-sense. Our first experiment (E1) assesses the quality of such GIM-based SCI. In (E2) we observe evidence for our conjecture on improved sample-efficiency for SCI-based learning. For (E3) we conduct a study (N=22) and observe superiority in human-based SCI over GIM ones, corroborating our initial hypothesis.  ( 2 min )
    VCT: A Video Compression Transformer. (arXiv:2206.07307v1 [cs.CV])
    We show how transformers can be used to vastly simplify neural video compression. Previous methods have been relying on an increasing number of architectural biases and priors, including motion prediction and warping operations, resulting in complex models. Instead, we independently map input frames to representations and use a transformer to model their dependencies, letting it predict the distribution of future representations given the past. The resulting video compression transformer outperforms previous methods on standard video compression data sets. Experiments on synthetic data show that our model learns to handle complex motion patterns such as panning, blurring and fading purely from data. Our approach is easy to implement, and we release code to facilitate future research.  ( 2 min )
    Predicting Gender via Eye Movements. (arXiv:2206.07442v1 [cs.LG])
    In this paper, we report the first stable results on gender prediction via eye movements. We use a dataset with images of faces as stimuli and with a large number of 370 participants. Stability has two meanings for us: first that we are able to estimate the standard deviation (SD) of a single prediction experiment (it is around 4.1 %); this is achieved by varying the number of participants. And second, we are able to provide a mean accuracy with a very low standard error (SEM): our accuracy is 65.2 %, and the SEM is 0.80 %; this is achieved through many runs of randomly selecting training and test sets for the prediction. Our study shows that two particular classifiers achieve the best accuracies: Random Forests and Logistic Regression. Our results reconfirm previous findings that females are more biased towards the left eyes of the stimuli.  ( 2 min )
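    For readers less familiar with the distinction between SD and SEM, the sketch below computes both from accuracies collected over repeated random train/test splits. The accuracy values are made up for illustration and are not the study's data; the function name is hypothetical.

```python
import math
import statistics


def summarize_runs(accuracies):
    """Return (mean, sample SD, SEM) of accuracies from repeated runs.

    SEM = SD / sqrt(n): it shrinks as the number of runs grows, whereas the
    SD describes the spread of a single run and does not shrink.
    """
    n = len(accuracies)
    mean = statistics.fmean(accuracies)
    sd = statistics.stdev(accuracies)  # sample SD (n - 1 in the denominator)
    sem = sd / math.sqrt(n)
    return mean, sd, sem


# Hypothetical accuracies from four random train/test splits.
mean, sd, sem = summarize_runs([0.60, 0.70, 0.65, 0.65])
print(f"mean={mean:.3f} sd={sd:.4f} sem={sem:.4f}")
```

    This is why averaging over many random splits can give a tight estimate of the mean accuracy even when any single experiment varies by several percentage points.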
    Automatic Detection of Rice Disease in Images of Various Leaf Sizes. (arXiv:2206.07344v1 [cs.CV])
    A fast, accurate, and affordable rice disease detection method is required to help rice farmers tackle shortages of equipment and expertise. In this paper, we focus on using computer vision techniques to detect rice diseases from photographs of rice fields. Dealing with images taken by ordinary farmers in real-usage situations is quite challenging due to various environmental factors, and variation in the size of rice leaf objects is one major factor causing performance degradation. To solve this problem, we present a technique combining CNN object detection with image tiling, using the automatically estimated width of rice leaves in the images as a size reference for dividing the original input image. The leaf width estimation model was built with a small CNN such as an 18-layer ResNet architecture. A new set of tiled sub-images with uniformly sized objects was generated and used as input for training a rice disease prediction model. Our technique was evaluated on 4,960 images of eight different types of rice leaf diseases, including blast, blight, brown spot, narrow brown spot, orange, red stripe, rice grassy stunt virus, and streak disease. The mean absolute percentage error (MAPE) for the leaf width prediction task evaluated on all eight classes was 11.18% in the experiment, indicating that the leaf width prediction model performed well. The mean average precision (mAP) of the prediction performance on the YOLOv4 architecture improved from 87.56% to 91.14% when trained and tested with the tiled dataset. According to our study, the proposed image tiling technique improved rice disease detection efficiency.  ( 2 min )
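    The tiling idea can be sketched as follows, assuming the tile side is set to a fixed multiple of the estimated leaf width so that leaves appear at a similar scale in every tile. The multiplier `leaves_per_tile` and the function name are hypothetical; the abstract does not specify how the tile size is derived from the leaf width.

```python
def tile_boxes(image_w, image_h, leaf_width_px, leaves_per_tile=8):
    """Split an image into square tiles sized relative to the leaf width.

    Assumption: tile side = leaf_width_px * leaves_per_tile (hypothetical rule).
    Returns (left, top, right, bottom) boxes; edge tiles are clipped to the image.
    """
    tile = max(1, round(leaf_width_px * leaves_per_tile))
    boxes = []
    for top in range(0, image_h, tile):
        for left in range(0, image_w, tile):
            boxes.append((left, top,
                          min(left + tile, image_w),
                          min(top + tile, image_h)))
    return boxes


# A hypothetical 1000x600 photo with leaves estimated at 50 px wide.
for box in tile_boxes(1000, 600, 50):
    print(box)
```

    A photo with narrower leaves would be cut into more, smaller tiles under this rule, which is what makes the object size roughly uniform across the tiled dataset.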
    Cautious Learning of Multiattribute Preferences. (arXiv:2206.07341v1 [cs.AI])
    This paper is dedicated to a cautious learning methodology for predicting preferences between alternatives characterized by binary attributes (formally, each alternative is seen as a subset of attributes). By "cautious", we mean that the model learned to represent the multi-attribute preferences is general enough to be compatible with any strict weak order on the alternatives, and that we allow ourselves not to predict some preferences if the data collected are not compatible with a reliable prediction. A predicted preference will be considered reliable if all the simplest models (following Occam's razor principle) explaining the training data agree on it. Predictions are based on an ordinal dominance relation between alternatives [Fishburn and LaValle, 1996]. The dominance relation relies on an uncertainty set encompassing the possible values of the parameters of the multi-attribute utility function. Numerical tests are provided to evaluate the richness and the reliability of the predictions made.  ( 2 min )
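    The abstention rule can be illustrated with a deliberately simplified sketch: a preference is predicted only when every candidate model agrees on it. Here the candidate models are additive attribute weights, which is an assumption made for brevity; the paper's model class is more general, and the names below are hypothetical.

```python
def utility(weights, alternative):
    """Additive utility of an alternative, given as a set of attributes."""
    return sum(weights[attr] for attr in alternative)


def cautious_prefer(models, a, b):
    """Return 'a' or 'b' if all candidate models agree, else None (abstain).

    Assumption: `models` is a list of attribute-weight dicts that all explain
    the training data equally well (how they are obtained is not shown).
    """
    votes = set()
    for weights in models:
        ua, ub = utility(weights, a), utility(weights, b)
        votes.add("a" if ua > ub else "b" if ub > ua else "tie")
    if votes == {"a"}:
        return "a"
    if votes == {"b"}:
        return "b"
    return None  # the models disagree or tie, so no reliable prediction


# Two hypothetical models over attributes x, y, z.
models = [{"x": 3, "y": 1, "z": 1}, {"x": 3, "y": 1, "z": 3}]
print(cautious_prefer(models, frozenset({"x"}), frozenset({"y"})))
print(cautious_prefer(models, frozenset({"x"}), frozenset({"y", "z"})))
```

    The second query is left unanswered because the two models rank the alternatives differently, which mirrors the trade-off between the richness and the reliability of predictions that the abstract mentions.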
    A Survey : Neural Networks for AMR-to-Text. (arXiv:2206.07328v1 [cs.CL])
    AMR-to-text is one of the key techniques in the NLP community that aims at generating sentences from the Abstract Meaning Representation (AMR) graphs. Since AMR was proposed in 2013, the study on AMR-to-Text has become increasingly prevalent as an essential branch of structured data to text because of the unique advantages of AMR as a high-level semantic description of natural language. In this paper, we provide a brief survey of AMR-to-Text. Firstly, we introduce the current scenario of this technique and point out its difficulties. Secondly, based on the methods used in previous studies, we roughly divided them into five categories according to their respective mechanisms, i.e., Rules-based, Seq-to-Seq-based, Graph-to-Seq-based, Transformer-based, and Pre-trained Language Model (PLM)-based. In particular, we detail the neural network-based method and present the latest progress of AMR-to-Text, which refers to AMR reconstruction, Decoder optimization, etc. Furthermore, we present the benchmarks and evaluation methods of AMR-to-Text. Eventually, we provide a summary of current techniques and the outlook for future research.
    Explainable expected goal models for performance analysis in football analytics. (arXiv:2206.07212v1 [cs.LG])
    The expected goal provides a more representative measure of team and player performance than the score, and it also suits the low-scoring nature of football. The score of a match involves randomness and often may not represent the performance of the teams and players; therefore, it has become popular in recent years to use alternative statistics such as shots on target, ball possession, and drills. To measure the probability of a shot being a goal, several features are used to train an expected goal model based on event and tracking football data. The selection of these features, the size and date of the data, and the model used are parameters that may affect the performance of the model. Using black-box machine learning models to increase predictive performance decreases interpretability, which causes a loss of information that could be gathered from the model. This paper proposes an accurate expected goal model trained on 315,430 shots from seven seasons between 2014-15 and 2020-21 of the top-five European football leagues. Moreover, this model is explained using an explainable artificial intelligence tool to obtain an explainable expected goal model for evaluating team or player performance. To the best of our knowledge, this is the first paper that demonstrates a practical application of an explainable artificial intelligence tool, aggregated profiles, to explain a group of observations on an accurate expected goal model for monitoring team and player performance. Moreover, these methods can be generalized to other sports.  ( 2 min )
    Learning the Structure of Large Networked Systems Obeying Conservation Laws. (arXiv:2206.07083v1 [stat.ML])
    Many networked systems such as electric networks, the brain, and social networks of opinion dynamics are known to obey conservation laws. Examples of this phenomenon include the Kirchhoff laws in electric networks and opinion consensus in social networks. Conservation laws in networked systems may be modeled as balance equations of the form $X = B^{*} Y$, where the sparsity pattern of $B^{*}$ captures the connectivity of the network, and $Y, X \in \mathbb{R}^p$ are vectors of "potentials" and "injected flows" at the nodes respectively. The node potentials $Y$ cause flows across edges and the flows $X$ injected at the nodes are extraneous to the network dynamics. In several practical systems, the network structure is often unknown and needs to be estimated from data. Towards this, one has access to samples of the node potentials $Y$, but only the statistics of the node injections $X$. Motivated by this important problem, we study the estimation of the sparsity structure of the matrix $B^{*}$ from $n$ samples of $Y$ under the assumption that the node injections $X$ follow a Gaussian distribution with a known covariance $\Sigma_X$. We propose a new $\ell_{1}$-regularized maximum likelihood estimator for this problem in the high-dimensional regime where the size of the network $p$ is larger than sample size $n$. We show that this optimization problem is convex in the objective and admits a unique solution. Under a new mutual incoherence condition, we establish sufficient conditions on the triple $(n,p,d)$ for which exact sparsity recovery of $B^{*}$ is possible with high probability; $d$ is the degree of the graph. We also establish guarantees for the recovery of $B^{*}$ in the element-wise maximum, Frobenius, and operator norms. Finally, we complement these theoretical results with experimental validation of the performance of the proposed estimator on synthetic and real-world data.  ( 3 min )
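    The balance equation $X = B^{*} Y$ in the abstract above can be illustrated with a small sketch. It assumes, purely for illustration, that $B^{*}$ is a weighted graph Laplacian (the abstract does not say this), and the names `build_laplacian`, `inject_flows`, and `sparsity_pattern` are hypothetical helpers. It shows only the forward model and how the sparsity pattern encodes the edges, not the paper's $\ell_{1}$-regularized estimator.

```python
def build_laplacian(num_nodes, weighted_edges):
    """Assumed stand-in for B*: a weighted graph Laplacian built from (i, j, w) edges."""
    B = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i, j, w in weighted_edges:
        B[i][i] += w
        B[j][j] += w
        B[i][j] -= w
        B[j][i] -= w
    return B


def inject_flows(B, potentials):
    """Balance equation X = B Y: node potentials Y determine injected flows X."""
    return [sum(b * y for b, y in zip(row, potentials)) for row in B]


def sparsity_pattern(B, tol=1e-12):
    """Off-diagonal support of B, i.e. the edges of the network."""
    return {
        (i, j)
        for i, row in enumerate(B)
        for j, b in enumerate(row)
        if i < j and abs(b) > tol
    }


if __name__ == "__main__":
    # Path graph 0 - 1 - 2 with edge weights 1.0 and 2.0.
    B = build_laplacian(3, [(0, 1, 1.0), (1, 2, 2.0)])
    X = inject_flows(B, [1.0, 0.0, -1.0])
    print(X)                    # injected flows at the three nodes
    print(sum(X))               # flows balance under the Laplacian assumption
    print(sparsity_pattern(B))  # recovers the two edges
```

    Under the Laplacian assumption the injected flows sum to zero, which is the conservation property the abstract refers to; other choices of $B^{*}$ need not have this exact form.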
    Inverse design of nano-photonic wavelength demultiplexer with a deep neural network approach. (arXiv:2206.07114v1 [physics.optics])
    In this paper, we propose a pre-trained-combined neural network (PTCN) as a comprehensive solution to the inverse design of an integrated photonic circuit. By utilizing both the initially pre-trained inverse and forward model with a joint training process, our PTCN model shows remarkable tolerance to the quantity and quality of the training data. As a proof of concept demonstration, the inverse design of a wavelength demultiplexer is used to verify the effectiveness of the PTCN model. The correlation coefficient of the prediction by the presented PTCN model remains greater than 0.974 even when the size of training data is decreased to 17%. The experimental results show a good agreement with predictions, and demonstrate a wavelength demultiplexer with an ultra-compact footprint, a high transmission efficiency with a transmission loss of -2dB, a low reflection of -10dB, and low crosstalk around -7dB simultaneously.  ( 2 min )
    Binary Single-dimensional Convolutional Neural Network for Seizure Prediction. (arXiv:2206.07518v1 [eess.SP])
    Nowadays, several deep learning methods are proposed to tackle the challenge of epileptic seizure prediction. However, these methods still cannot be implemented as part of implantable or efficient wearable devices due to their large hardware requirements and correspondingly high power consumption. They usually require a complex feature extraction process, large memory for storing high-precision parameters, and complex arithmetic computation, which greatly increases the required hardware resources. Moreover, available methods yield poor prediction performance because they adopt network architectures directly from image recognition applications, which fail to accurately consider the characteristics of EEG signals. We propose in this paper a hardware-friendly network called Binary Single-dimensional Convolutional Neural Network (BSDCNN) intended for epileptic seizure prediction. BSDCNN utilizes 1D convolutional kernels to improve prediction performance. All parameters are binarized to reduce the required computation and storage, except the first layer. Overall area under curve, sensitivity, and false prediction rate reach 0.915, 89.26%, 0.117/h and 0.970, 94.69%, 0.095/h on the American Epilepsy Society Seizure Prediction Challenge (AES) dataset and the CHB-MIT dataset, respectively. The proposed architecture outperforms recent works while offering 7.2 and 25.5 times reductions in parameter size and computation, respectively.
    NatGen: Generative pre-training by "Naturalizing" source code. (arXiv:2206.07585v1 [cs.PL])
    Pre-trained Generative Language models (e.g. PLBART, CodeT5, SPT-Code) for source code yielded strong results on several tasks in the past few years, including code generation and translation. These models have adopted varying pre-training objectives to learn statistics of code construction from very large-scale corpora in a self-supervised fashion; the success of pre-trained models largely hinges on these pre-training objectives. This paper proposes a new pre-training objective, "Naturalizing" of source code, exploiting code's bimodal, dual-channel (formal & natural channels) nature. Unlike natural language, code's bimodal, dual-channel nature allows us to generate semantically equivalent code at scale. We introduce six classes of semantic preserving transformations to introduce un-natural forms of code, and then force our model to produce more natural original programs written by developers. Learning to generate equivalent, but more natural code, at scale, over large corpora of open-source code, without explicit manual supervision, helps the model learn to both ingest & generate code. We fine-tune our model in three generative Software Engineering tasks: code generation, code translation, and code refinement with limited human-curated labeled data and achieve state-of-the-art performance rivaling CodeT5. We show that our pre-trained model is especially competitive at zero-shot and few-shot learning, and better at learning code properties (e.g., syntax, data flow).
    Deep Koopman Operator with Control for Nonlinear Systems. (arXiv:2202.08004v2 [cs.RO] UPDATED)
    Recently, the Koopman operator has become a promising data-driven tool to facilitate real-time control for unknown nonlinear systems. It maps nonlinear systems into equivalent linear systems in embedding space, ready for real-time linear control methods. However, designing an appropriate Koopman embedding function remains a challenging task. Furthermore, most Koopman-based algorithms only consider nonlinear systems with linear control input, resulting in lousy prediction and control performance when the system is fully nonlinear with the control input. In this work, we propose an end-to-end deep learning framework to learn the Koopman embedding function and Koopman Operator together to alleviate such difficulties. We first parameterize the embedding function and Koopman Operator with a neural network and train them end-to-end with the K-steps loss function. Then, an auxiliary control network is added to encode the nonlinear state-dependent control term to model the nonlinearity in the control input. This encoded term is considered the new control variable instead, to ensure linearity of the modeled system in the embedding space. We next deploy a Linear Quadratic Regulator (LQR) on the linear embedding space to derive the optimal control policy and decode the actual control input from the control net. Experimental results demonstrate that our approach outperforms other existing methods, reducing the prediction error by an order of magnitude and achieving superior control performance in several nonlinear dynamic systems like the damping pendulum, CartPole, and the seven DOF robotic manipulator.
    A Multiple kernel testing procedure for non-proportional hazards in factorial designs. (arXiv:2206.07239v1 [stat.ME])
    In this paper we propose a Multiple kernel testing procedure to infer survival data when several factors (e.g. different treatment groups, gender, medical history) and their interaction are of interest simultaneously. Our method is able to deal with complex data and can be seen as an alternative to the omnipresent Cox model when assumptions such as proportionality cannot be justified. Our methodology combines well-known concepts from Survival Analysis, Machine Learning and Multiple Testing: differently weighted log-rank tests, kernel methods and multiple contrast tests. By that, complex hazard alternatives beyond the classical proportional hazard set-up can be detected. Moreover, multiple comparisons are performed by fully exploiting the dependence structure of the single testing procedures to avoid a loss of power. In all, this leads to a flexible and powerful procedure for factorial survival designs whose theoretical validity is proven by martingale arguments and the theory for $V$-statistics. We evaluate the performance of our method in an extensive simulation study and illustrate it by a real data analysis.  ( 2 min )
    Smart Meter Data Anomaly Detection using Variational Recurrent Autoencoders with Attention. (arXiv:2206.07519v1 [eess.SP])
    In the digitization of energy systems, sensors and smart meters are increasingly being used to monitor production, operation and demand. Detection of anomalies based on smart meter data is crucial to identify potential risks and unusual events at an early stage, which can serve as a reference for timely initiation of appropriate actions and improving management. However, smart meter data from energy systems often lack labels and contain noise and various patterns that are not distinctly cyclical. Meanwhile, the vague definition of anomalies in different energy scenarios and highly complex temporal correlations pose a great challenge for anomaly detection. Many traditional unsupervised anomaly detection algorithms, such as cluster-based or distance-based models, are not robust to noise and do not fully exploit the temporal dependency in a time series or other dependencies amongst multiple variables (sensors). This paper proposes an unsupervised anomaly detection method based on a Variational Recurrent Autoencoder with an attention mechanism. With "dirty" data from smart meters, our method pre-detects missing values and global anomalies to shrink their contribution while training. This paper makes a quantitative comparison with the VAE-based baseline approach and four other unsupervised learning methods, demonstrating its effectiveness and superiority. This paper further validates the proposed method by a real case study of detecting the anomalies of water supply temperature from an industrial heating plant.
    Location-based Twitter Filtering for the Creation of Low-Resource Language Datasets in Indonesian Local Languages. (arXiv:2206.07238v1 [cs.CL])
    Twitter contains an abundance of linguistic data from the real world. We examine Twitter for user-generated content in low-resource languages such as local Indonesian. For NLP to work in Indonesian, it must consider how local dialects, geographic context, and regional culture influence Indonesian languages. This paper identifies the problems we faced when constructing a Local Indonesian NLP dataset. Furthermore, we are developing a framework for creating, collecting, and classifying Local Indonesian datasets for NLP, using Twitter's geolocation tool for automatic annotation.  ( 2 min )
    Towards a Solution to Bongard Problems: A Causal Approach. (arXiv:2206.07196v1 [cs.LG])
    To date, Bongard Problems (BP) remain one of the few fortresses of AI history yet to be raided by the powerful models of the current era. We present a systematic analysis using modern techniques from the intersection of causality and AI/ML in a humble effort of reviving research around BPs. Specifically, we first compile the BPs into a Markov decision process, then secondly pose causal assumptions on the data generating process arguing for their applicability to BPs, and finally apply reinforcement learning techniques for solving the BPs subject to the causal assumptions.  ( 2 min )
    MPI: Evaluating and Inducing Personality in Pre-trained Language Models. (arXiv:2206.07550v1 [cs.CL])
    Originating as a philosophical quest, personality discerns how individuals differ from each other in terms of thinking, feeling, and behaving. Towards building social machines that work with humans on a daily basis, we are motivated to ask: (1) Do existing pre-trained language models possess personality, akin to their human counterparts? If so, (2) how can we evaluate them? Further, given this evaluation framework, (3) how can we induce a certain personality in a fully controllable fashion? To tackle these three questions, we propose the Machine Personality Inventory (MPI) dataset for evaluating the machine personality; MPI follows standardized personality tests, built upon the Big Five Personality Factors (Big Five) theory and personality assessment inventories. By evaluating models with MPI, we provide the first piece of evidence showing the existence of personality in pre-trained language models. We further devise a Chain Prompting method to induce the language model with a specific personality in a controllable manner, capable of producing diversified behaviors. We hope to shed light on future studies by adopting personality as the essential psychological guidance for various downstream tasks, building more human-like and in situ dialogue agents.
    Intelligent analysis of EEG signals to assess consumer decisions: A Study on Neuromarketing. (arXiv:2206.07484v1 [eess.SP])
    Neuromarketing is an emerging field that combines neuroscience and marketing to better understand the factors that influence consumer decisions. The study proposes a method to understand consumers' positive and negative reactions to advertisements (ads) and products by analysing electroencephalogram (EEG) signals. These signals are recorded using a low-cost single-electrode headset from volunteers aged 18-22. A detailed subject dependent (SD) and subject independent (SI) analysis was performed employing machine learning methods like Naive Bayes (NB), Support Vector Machine (SVM), k-nearest neighbour and Decision Tree, as well as the proposed deep learning (DL) model. SVM and NB yielded an accuracy (Acc.) of 0.63 for the SD analysis. In SI analysis, SVM performed better for the advertisement, product and gender-based analysis. Furthermore, the performance of the DL model was on par with that of SVM, especially in product and ads-based analysis.
    Can pruning improve certified robustness of neural networks?. (arXiv:2206.07311v1 [cs.LG])
    With the rapid development of deep learning, the sizes of neural networks become larger and larger so that the training and inference often overwhelm the hardware resources. Given the fact that neural networks are often over-parameterized, one effective way to reduce such computational overhead is neural network pruning, by removing redundant parameters from trained neural networks. It has been recently observed that pruning can not only reduce computational overhead but also can improve empirical robustness of deep neural networks (NNs), potentially owing to removing spurious correlations while preserving the predictive accuracies. This paper for the first time demonstrates that pruning can generally improve certified robustness for ReLU-based NNs under the complete verification setting. Using the popular Branch-and-Bound (BaB) framework, we find that pruning can enhance the estimated bound tightness of certified robustness verification, by alleviating linear relaxation and sub-domain split problems. We empirically verify our findings with off-the-shelf pruning methods and further present a new stability-based pruning method tailored for reducing neuron instability, that outperforms existing pruning methods in enhancing certified robustness. Our experiments show that by appropriately pruning an NN, its certified accuracy can be boosted up to 8.2% under standard training, and up to 24.5% under adversarial training on the CIFAR10 dataset. We additionally observe the existence of certified lottery tickets that can match both standard and certified robust accuracies of the original dense models across different datasets. Our findings offer a new angle to study the intriguing interaction between sparsity and robustness, i.e. interpreting the interaction of sparsity and certified robustness via neuron stability. Codes are available at: https://github.com/VITA-Group/CertifiedPruning.
    A Comprehensive Survey on Deep Clustering: Taxonomy, Challenges, and Future Directions. (arXiv:2206.07579v1 [cs.LG])
    Clustering is a fundamental machine learning task which has been widely studied in the literature. Classic clustering methods follow the assumption that data are represented as features in a vectorized form through various representation learning techniques. As the data become increasingly complicated and complex, the shallow (traditional) clustering methods can no longer handle the high-dimensional data type. With the huge success of deep learning, especially the deep unsupervised learning, many representation learning techniques with deep architectures have been proposed in the past decade. Recently, the concept of Deep Clustering, i.e., jointly optimizing the representation learning and clustering, has been proposed and hence attracted growing attention in the community. Motivated by the tremendous success of deep learning in clustering, one of the most fundamental machine learning tasks, and the large number of recent advances in this direction, in this paper we conduct a comprehensive survey on deep clustering by proposing a new taxonomy of different state-of-the-art approaches. We summarize the essential components of deep clustering and categorize existing methods by the ways they design interactions between deep representation learning and clustering. Moreover, this survey also provides the popular benchmark datasets, evaluation metrics and open-source implementations to clearly illustrate various experimental settings. Last but not least, we discuss the practical applications of deep clustering and suggest challenging topics deserving further investigations as future directions.
    Automating the resolution of flight conflicts: Deep reinforcement learning in service of air traffic controllers. (arXiv:2206.07403v1 [cs.MA])
    Dense and complex air traffic scenarios require higher levels of automation than those exhibited by tactical conflict detection and resolution (CD&R) tools that air traffic controllers (ATCO) use today. However, the air traffic control (ATC) domain, being safety critical, requires AI systems to which operators are comfortable relinquishing control, guaranteeing operational integrity and automation adoption. Two major factors towards this goal are quality of solutions and transparency in decision making. This paper proposes using a graph convolutional reinforcement learning method operating in a multiagent setting where each agent (flight) performs a CD&R task jointly with other agents. We show that this method can provide high-quality solutions with respect to stakeholders' interests (air traffic controllers and airspace users), addressing operational transparency issues.
    Mean-Semivariance Policy Optimization via Risk-Averse Reinforcement Learning. (arXiv:2206.07376v1 [cs.LG])
    Keeping risk under control is often more crucial than maximizing expected reward in real-world decision-making situations, such as finance, robotics, autonomous driving, etc. The most natural choice of risk measure is variance, but it penalizes the upside volatility as much as the downside part. Instead, the (downside) semivariance, which captures negative deviation of a random variable below its mean, is more suitable for risk-averse purposes. This paper aims at optimizing the mean-semivariance (MSV) criterion in reinforcement learning w.r.t. steady rewards. Since semivariance is time-inconsistent and does not satisfy the standard Bellman equation, the traditional dynamic programming methods are inapplicable to MSV problems directly. To tackle this challenge, we resort to the Perturbation Analysis (PA) theory and establish the performance difference formula for MSV. We reveal that the MSV problem can be solved by iteratively solving a sequence of RL problems with a policy-dependent reward function. Further, we propose two on-policy algorithms based on the policy gradient theory and the trust region method. Finally, we conduct diverse experiments from simple bandit problems to continuous control tasks in MuJoCo, which demonstrate the effectiveness of our proposed methods.
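    The contrast between variance and downside semivariance described above can be made concrete with a short sketch. The reward samples are made up, and `mean_semivariance` with its trade-off weight `beta` is an assumed generic form of a mean-semivariance criterion, not necessarily the exact objective used in the paper.

```python
def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    """Population variance: penalizes deviations on both sides of the mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def downside_semivariance(xs):
    """Average squared deviation counting only samples below the mean."""
    m = mean(xs)
    return sum(min(x - m, 0.0) ** 2 for x in xs) / len(xs)


def mean_semivariance(xs, beta):
    """Assumed criterion: mean reward minus beta times downside semivariance."""
    return mean(xs) - beta * downside_semivariance(xs)


if __name__ == "__main__":
    upside = [1.0, 1.0, 1.0, 5.0]     # one large gain
    downside = [3.0, 3.0, 3.0, -1.0]  # one large loss
    for name, rewards in (("upside", upside), ("downside", downside)):
        print(name, mean(rewards), variance(rewards),
              downside_semivariance(rewards), mean_semivariance(rewards, 1.0))
```

    Both samples have the same mean and the same variance, but the sample with the large loss has three times the downside semivariance, so only the semivariance-based criterion tells them apart.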
    Blind Estimation of a Doubly Selective OFDM Channel: A Deep Learning Algorithm and Theory. (arXiv:2206.07483v1 [eess.SP])
    We provide a new generation solution to the fundamental old problem of doubly selective fading channel estimation for orthogonal frequency division multiplexing (OFDM) systems. For systems based on OFDM, we propose a deep learning (DL)-based blind doubly selective channel estimator. This estimator requires no pilot symbols, unlike the corresponding state-of-the-art estimators, even during the estimation of a deep fading doubly selective channel. We also provide the first of its kind theory on the testing mean squared error (MSE) performance of our investigated blind OFDM channel estimator based on over-parameterized ReLU FNNs.
    Machine Learning is Abduction Inference. (arXiv:2206.07586v1 [cs.AI])
    The concept of Abduction with Gradated Contradictions (AGC) is introduced here as a form of Peirce's abduction inference. The general form of the abduction criterion is formalized in the proposed Logic of Gradated Contradictions and Logic of Recursive Aggregation. Common steps of an abduction procedure as minimization of such a criterion are specified as well. It is demonstrated on examples of 14 popular textbook learners (from hierarchical clustering to k-NN and SVR) that each of them performs AGC. The proposed theory explains real-life learners, yet it avoids any mention of statistics, so it can be considered a logical alternative to statistical learning theory.
    Global Convergence of Federated Learning for Mixed Regression. (arXiv:2206.07279v1 [cs.LG])
    This paper studies the problem of model training under Federated Learning when clients exhibit cluster structure. We contextualize this problem in mixed regression, where each client has limited local data generated from one of $k$ unknown regression models. We design an algorithm that achieves global convergence from any initialization, and works even when local data volume is highly unbalanced -- there could exist clients that contain $O(1)$ data points only. Our algorithm first runs moment descent on a few anchor clients (each with $\tilde{\Omega}(k)$ data points) to obtain coarse model estimates. Then each client alternately estimates its cluster labels and refines the model estimates based on FedAvg or FedProx. A key innovation in our analysis is a uniform estimate on the clustering errors, which we prove by bounding the VC dimension of general polynomial concept classes based on the theory of algebraic geometry.
    VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection. (arXiv:2206.07458v1 [cs.CV])
    The goal of this work is to reconstruct speech from a silent talking face video. Recent studies have shown impressive performance on synthesizing speech from silent talking face videos. However, they have not explicitly considered the varying identity characteristics of different speakers, which pose a challenge in video-to-speech synthesis, and this becomes more critical in unseen-speaker settings. Distinct from the previous methods, our approach is to separate the speech content and the visage-style from a given silent talking face video. By guiding the model to independently focus on modeling the two representations, we can obtain speech of high intelligibility from the model even when the input video of an unseen subject is given. To this end, we introduce a speech-visage selection module that separates the speech content and the speaker identity from the visual features of the input video. The disentangled representations are jointly incorporated to synthesize speech through a visage-style based synthesizer, which generates speech by coating the visage-styles while maintaining the speech content. Thus, the proposed framework brings the advantage of synthesizing speech containing the right content even when the silent talking face video of an unseen subject is given. We validate the effectiveness of the proposed framework on the GRID, TCD-TIMIT volunteer, and LRW datasets. The synthesized speech can be heard in supplementary materials.
    Estimating the Optimal Covariance with Imperfect Mean in Diffusion Probabilistic Models. (arXiv:2206.07309v1 [cs.LG])
    Diffusion probabilistic models (DPMs) are a class of powerful deep generative models (DGMs). Despite their success, the iterative generation process over the full timesteps is much less efficient than other DGMs such as GANs. Thus, the generation performance on a subset of timesteps is crucial, which is greatly influenced by the covariance design in DPMs. In this work, we consider diagonal and full covariances to improve the expressive power of DPMs. We derive the optimal result for such covariances, and then correct it when the mean of DPMs is imperfect. Both the optimal and the corrected ones can be decomposed into terms of conditional expectations over functions of noise. Building upon it, we propose to estimate the optimal covariance and its correction given imperfect mean by learning these conditional expectations. Our method can be applied to DPMs with both discrete and continuous timesteps. We consider the diagonal covariance in our implementation for computational efficiency. For an efficient practical implementation, we adopt a parameter sharing scheme and a two-stage training process. Empirically, our method outperforms a wide variety of covariance designs on likelihood results, and improves the sample quality, especially on a small number of timesteps.
    Query-Adaptive Predictive Inference with Partial Labels. (arXiv:2206.07236v1 [stat.ML])
    The cost and scarcity of fully supervised labels in statistical machine learning encourage using partially labeled data for model validation as a cheaper and more accessible alternative. Effectively collecting and leveraging weakly supervised data for large-space structured prediction tasks thus becomes an important part of an end-to-end learning system. We propose a new computationally-friendly methodology to construct predictive sets using only partially labeled data on top of black-box predictive models. To do so, we introduce "probe" functions as a way to describe weakly supervised instances and define a false discovery proportion-type loss, both of which seamlessly adapt to partial supervision and structured prediction -- ranking, matching, segmentation, multilabel or multiclass classification. Our experiments highlight the validity of our predictive set construction as well as the attractiveness of a more flexible user-dependent loss framework.
    ALASCA: Rethinking Label Smoothing for Deep Learning Under Label Noise. (arXiv:2206.07277v1 [cs.LG])
    As label noise, one of the most popular distribution shifts, severely degrades deep neural networks' generalization performance, robust training with noisy labels is becoming an important task in modern deep learning. In this paper, we propose our framework, coined Adaptive LAbel smoothing on Sub-ClAssifier (ALASCA), that provides a robust feature extractor with a theoretical guarantee and negligible additional computation. First, we show that label smoothing (LS) incurs implicit Lipschitz regularization (LR). Furthermore, based on these derivations, we apply the adaptive LS (ALS) on sub-classifier architectures for the practical application of adaptive LR on intermediate layers. We conduct extensive experiments for ALASCA and combine it with previous noise-robust methods on several datasets and show that our framework consistently outperforms the corresponding baselines.
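    For readers unfamiliar with label smoothing, the following minimal sketch shows the standard uniform variant, in which a one-hot target is mixed with the uniform distribution over classes. It is not the adaptive scheme (ALS) or the sub-classifier setup proposed in the paper; the function name `smooth_labels` and the smoothing factor `alpha` are illustrative choices.

```python
def smooth_labels(label, num_classes, alpha):
    """Standard uniform label smoothing.

    Returns (1 - alpha) * one_hot(label) + alpha / num_classes for each class.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not 0 <= label < num_classes:
        raise ValueError("label must be a valid class index")
    uniform_mass = alpha / num_classes
    return [
        uniform_mass + ((1.0 - alpha) if k == label else 0.0)
        for k in range(num_classes)
    ]


if __name__ == "__main__":
    # Class 1 out of 4 with smoothing factor 0.2.
    print(smooth_labels(1, 4, 0.2))
```

    With `alpha = 0` the target stays one-hot, and with `alpha = 1` it becomes uniform; an adaptive scheme would choose the smoothing strength per sample rather than fixing it globally.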
    Multi-Objective Hyperparameter Optimization -- An Overview. (arXiv:2206.07438v1 [cs.LG])
    Hyperparameter optimization constitutes a large part of typical modern machine learning workflows. This arises from the fact that machine learning methods and corresponding preprocessing steps often only yield optimal performance when hyperparameters are properly tuned. But in many applications, we are not interested in optimizing ML pipelines solely for predictive accuracy; additional metrics or constraints must be considered when determining an optimal configuration, resulting in a multi-objective optimization problem. This is often neglected in practice, due to a lack of knowledge and readily available software implementations for multi-objective hyperparameter optimization. In this work, we introduce the reader to the basics of multi-objective hyperparameter optimization and motivate its usefulness in applied ML. Furthermore, we provide an extensive survey of existing optimization strategies, both from the domain of evolutionary algorithms and Bayesian optimization. We illustrate the utility of MOO in several specific ML applications, considering objectives such as operating conditions, prediction time, sparseness, fairness, interpretability and robustness.
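    The multi-objective setting described above has no single best configuration; instead one looks for the Pareto-optimal set. The sketch below filters a handful of made-up configurations scored on two objectives to be minimized (validation error and prediction time). The data and the helper names `dominates` and `pareto_front` are illustrative assumptions, not taken from the survey.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one.

    All objectives are assumed to be minimized.
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Keep the points that no other point dominates (simple O(n^2) filter)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # (validation error, prediction time in ms) for five hypothetical configurations.
    configs = [(0.10, 5.0), (0.12, 2.0), (0.15, 2.5), (0.10, 6.0), (0.20, 1.0)]
    print(pareto_front(configs))
```

    The quadratic filter is adequate for small candidate sets; the evolutionary and Bayesian strategies mentioned in the abstract address the harder problem of proposing which configurations to evaluate in the first place.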
    Investigating Multi-Feature Selection and Ensembling for Audio Classification. (arXiv:2206.07511v1 [cs.SD])
    Deep Learning (DL) algorithms have shown impressive performance in diverse domains. Among them, audio has attracted many researchers over the last couple of decades due to some interesting patterns, particularly in classification of audio data. For better performance of audio classification, feature selection and combination play a key role as they have the potential to make or break the performance of any DL model. To investigate this role, we conduct an extensive evaluation of the performance of several cutting-edge DL models (i.e., Convolutional Neural Network, EfficientNet, MobileNet, Support Vector Machine and Multi-Perceptron) with various state-of-the-art audio features (i.e., Mel Spectrogram, Mel Frequency Cepstral Coefficients, and Zero Crossing Rate) either independently or as a combination (i.e., through ensembling) on three different datasets (i.e., Free Spoken Digits Dataset, Audio Urdu Digits Dataset, and Audio Gujarati Digits Dataset). Overall, results suggest feature selection depends on both the dataset and the model. However, feature combinations should be restricted to only the features that already achieve good performance when used individually (i.e., mostly Mel Spectrogram and Mel Frequency Cepstral Coefficients). Such feature combination/ensembling enabled us to outperform the previous state-of-the-art results irrespective of our choice of DL model.
    A Survey of Detection Methods for Die Attachment and Wire Bonding Defects in Integrated Circuit Manufacturing. (arXiv:2206.07481v1 [eess.SP])
    Defect detection plays a vital role in the manufacturing process of integrated circuits (ICs). Die attachment and wire bonding are two steps of the manufacturing process that determine the quality and reliability of the power and signal transmission in an IC. This paper presents a survey or literature review of the methods used for detecting these defects based on different sensing modalities used including optical, radiological, acoustical, and infrared thermography. A discussion of the detection methods used is provided in this survey. Both conventional and deep learning approaches for detecting die attachment and wire bonding defects are considered along with challenges and future research directions.
    A Deep Learning Network for the Classification of Intracardiac Electrograms in Atrial Tachycardia. (arXiv:2206.07515v1 [eess.SP])
    A key technology enabling the success of catheter ablation treatment for atrial tachycardia is activation mapping, which relies on manual local activation time (LAT) annotation of all acquired intracardiac electrogram (EGM) signals. This is a time-consuming and error-prone procedure, due to the difficulty in identifying the signal activation peaks for fractionated signals. This work presents a Deep Learning approach for the automated classification of EGM signals into three different types: normal, abnormal, and unclassified, which forms part of the LAT annotation pipeline, and contributes towards bypassing the need for manual annotations of the LAT. The Deep Learning network, the CNN-LSTM model, is a hybrid network architecture which combines convolutional neural network (CNN) layers with long short-term memory (LSTM) layers. 1452 EGM signals from a total of 9 patients undergoing clinically-indicated 3D cardiac mapping were used for the training, validation and testing of our models. From our findings, the CNN-LSTM model achieved an accuracy of 81% for the balanced dataset. For comparison, we separately developed a rule-based Decision Trees model which attained an accuracy of 67% for the same balanced dataset. Our work elucidates that analysing the EGM signals using a set of explicitly specified rules as proposed by the Decision Trees model is not suitable as EGM signals are complex. The CNN-LSTM model, on the other hand, has the ability to learn the complex, intrinsic features within the signals and identify useful features to differentiate the EGM signals.
    The Manifold Hypothesis for Gradient-Based Explanations. (arXiv:2206.07387v1 [cs.LG])
    When do gradient-based explanation algorithms provide meaningful explanations? We propose a necessary criterion: their feature attributions need to be aligned with the tangent space of the data manifold. To provide evidence for this hypothesis, we introduce a framework based on variational autoencoders that allows us to estimate and generate image manifolds. Through experiments across a range of different datasets -- MNIST, EMNIST, CIFAR10, X-ray pneumonia and Diabetic Retinopathy detection -- we demonstrate that the more a feature attribution is aligned with the tangent space of the data, the more structured and explanatory it tends to be. In particular, the attributions provided by popular post-hoc methods such as Integrated Gradients, SmoothGrad and Input $\times$ Gradient tend to be more strongly aligned with the data manifold than the raw gradient. As a consequence, we suggest that explanation algorithms should actively strive to align their explanations with the data manifold. In part, this can be achieved by adversarial training, which leads to better alignment across all datasets. Some form of adjustment to the model architecture or training algorithm is necessary, since we show that generalization of neural networks alone does not imply the alignment of model gradients with the data manifold.
    A Survey on Gradient Inversion: Attacks, Defenses and Future Directions. (arXiv:2206.07284v1 [cs.LG])
    Recent studies have shown that the training samples can be recovered from gradients, which are called Gradient Inversion (GradInv) attacks. However, there remains a lack of extensive surveys covering recent advances and thorough analysis of this issue. In this paper, we present a comprehensive survey on GradInv, aiming to summarize the cutting-edge research and broaden the horizons for different domains. Firstly, we propose a taxonomy of GradInv attacks by characterizing existing attacks into two paradigms: iteration- and recursion-based attacks. In particular, we dig out some critical ingredients from the iteration-based attacks, including data initialization, model training and gradient matching. Second, we summarize emerging defense strategies against GradInv attacks. We find these approaches focus on three perspectives covering data obscuration, model improvement and gradient protection. Finally, we discuss some promising directions and open problems for further research.
    Detection of magnetohydrodynamic waves by using machine learning. (arXiv:2206.07334v1 [physics.flu-dyn])
    Nonlinear wave interactions, such as shock refraction at an inclined density interface, in magnetohydrodynamics (MHD) lead to a plethora of wave patterns with myriad wave types. Identification of different types of MHD waves is an important and challenging task in such complex wave patterns. Moreover, owing to the multiplicity of solutions and their admissibility for different systems, especially for intermediate-type MHD shock waves, the identification of MHD wave types is complicated if one solely relies on the Rankine-Hugoniot jump conditions. MHD wave detection is further complicated by the unphysical smearing of discontinuous shock waves in numerical simulations. We present two MHD wave detection methods based on a convolutional neural network (CNN) which enables the classification of waves and identification of their locations. The first method separates the output into a regression (location prediction) and a classification problem, assuming the number of waves for each training sample is fixed. In the second method, the number of waves is not specified a priori and the algorithm, using only regression, predicts the waves' locations and classifies their types. The first, fixed-output model efficiently provides high precision and recall: the accuracy of the entire neural network reaches up to 0.99, and the classification accuracy of some waves approaches unity. The second detection model has relatively lower performance and is more sensitive to the setting of parameters, such as the number of grid cells $N_{grid}$ and the thresholds of confidence score and class probability. The two proposed methods demonstrate very strong potential for MHD wave detection in some complex wave structures and interactions.
    Learning Large-scale Subsurface Simulations with a Hybrid Graph Network Simulator. (arXiv:2206.07680v1 [cs.LG])
    Subsurface simulations use computational models to predict the flow of fluids (e.g., oil, water, gas) through porous media. These simulations are pivotal in industrial applications such as petroleum production, where fast and accurate models are needed for high-stake decision making, for example, for well placement optimization and field development planning. Classical finite difference numerical simulators require massive computational resources to model large-scale real-world reservoirs. Alternatively, streamline simulators and data-driven surrogate models are computationally more efficient by relying on approximate physics models; however, they are insufficient to model complex reservoir dynamics at scale. Here we introduce Hybrid Graph Network Simulator (HGNS), which is a data-driven surrogate model for learning reservoir simulations of 3D subsurface fluid flows. To model complex reservoir dynamics at both local and global scale, HGNS consists of a subsurface graph neural network (SGNN) to model the evolution of fluid flows, and a 3D-U-Net to model the evolution of pressure. HGNS is able to scale to grids with millions of cells per time step, two orders of magnitude higher than previous surrogate models, and can accurately predict the fluid flow for tens of time steps (years into the future). Using an industry-standard subsurface flow dataset (SPE-10) with 1.1 million cells, we demonstrate that HGNS is able to reduce the inference time up to 18 times compared to standard subsurface simulators, and that it outperforms other learning-based models by reducing long-term prediction errors by up to 21%.
    Automatic Clipping: Differentially Private Deep Learning Made Easier and Stronger. (arXiv:2206.07136v1 [cs.LG])
    Per-example gradient clipping is a key algorithmic step that enables practical differentially private (DP) training for deep learning models. The choice of clipping norm $R$, however, is shown to be vital for achieving high accuracy under DP. We propose an easy-to-use replacement, called AutoClipping, that eliminates the need to tune $R$ for any DP optimizers, including DP-SGD, DP-Adam, DP-LAMB and many others. The automatic variants are as private and computationally efficient as existing DP optimizers, but require no DP-specific hyperparameters and thus make DP training as amenable as the standard non-private training. We give a rigorous convergence analysis of automatic DP-SGD in the non-convex setting, which shows that it enjoys an asymptotic convergence rate that matches the standard SGD. We also demonstrate on various language and vision tasks that automatic clipping outperforms or matches the state-of-the-art, and can be easily employed with minimal changes to existing codebases.
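    As a rough illustration of the step this abstract refers to, the sketch below shows standard per-example clipping with a clipping norm `R`, next to a normalization-style variant that has no `R` to tune. The variant, including its small stability constant `gamma`, is an assumption made here for illustration; the abstract does not state the exact AutoClipping rule. The Gaussian noise that DP optimizers add after aggregation is omitted.

```python
import math
from typing import List


def l2_norm(g: List[float]) -> float:
    return math.sqrt(sum(x * x for x in g))


def clip_per_example(grads: List[List[float]], R: float) -> List[List[float]]:
    """Standard per-example clipping: rescale each gradient to L2 norm at most R."""
    clipped = []
    for g in grads:
        norm = l2_norm(g)
        scale = min(1.0, R / norm) if norm > 0 else 1.0
        clipped.append([scale * x for x in g])
    return clipped


def normalize_per_example(grads: List[List[float]], gamma: float = 0.01) -> List[List[float]]:
    """Hypothetical R-free variant (an assumption, not confirmed by the abstract):
    divide every gradient by its norm plus a small constant gamma."""
    return [[x / (l2_norm(g) + gamma) for x in g] for g in grads]


# Two toy per-example gradients: one large, one small.
grads = [[3.0, 4.0], [0.3, 0.4]]
clipped = clip_per_example(grads, R=1.0)
normalized = normalize_per_example(grads)
print(clipped)
print(normalized)
```

    With `R=1.0`, the large gradient is scaled down to unit norm while the small one passes through unchanged, which is why the result depends on how `R` is chosen.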
    Using Machine Learning to Augment Dynamic Time Warping Based Signal Classification. (arXiv:2206.07200v1 [cs.LG])
    Modern applications such as voice recognition rely on the ability to compare signals to pre-recorded ones to classify them. However, this comparison typically needs to ignore differences due to signal noise, temporal offset, signal magnitude, and other external factors. The Dynamic Time Warping (DTW) algorithm quantifies this similarity by finding corresponding regions between the signals and non-linearly warping one signal by stretching and shrinking it. Unfortunately, searching through all "warps" of a signal to find the best corresponding regions is computationally expensive. The FastDTW algorithm improves performance, but sacrifices accuracy by only considering small signal warps. My goal is to improve the speed of DTW while maintaining high accuracy. My key insight is that in any particular application domain, signals exhibit specific types of variation. For example, the accelerometer signal measured for two different people would differ based on their stride length and weight. My system, called Machine Learning DTW (MLDTW), uses machine learning to learn the types of warps that are common in a particular domain. It then uses the learned model to improve DTW performance by limiting the search of potential warps appropriately. My results show that compared to FastDTW, MLDTW is at least as fast and reduces errors by 60% on average across four different data sets. These improvements will significantly impact a wide variety of applications (e.g. health monitoring) and enable more scalable processing of multivariate, higher frequency, and longer signal recordings.
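    For readers unfamiliar with DTW, the following is a minimal sketch of the classic dynamic-programming formulation with an optional fixed-width band that limits how far the warp may stray from the diagonal. This generic band is neither FastDTW nor MLDTW; it only illustrates the trade-off the abstract describes, where restricting the warp search saves work but can cost accuracy. The function name and the toy signals are assumptions.

```python
import math
from typing import Optional, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float],
                 window: Optional[int] = None) -> float:
    """DTW distance with absolute-difference cost and an optional band."""
    n, m = len(a), len(b)
    # The band must be at least |n - m| wide, or no warp path can reach the end.
    w = max(n, m) if window is None else max(window, abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells with |i - j| <= w are evaluated.
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# The same step shape, occurring late in x and early in y.
x = [0.0, 0.0, 0.0, 1.0]
y = [0.0, 1.0, 1.0, 1.0]
full = dtw_distance(x, y)               # unrestricted search
banded = dtw_distance(x, y, window=1)   # warp limited to one step off-diagonal
print(full, banded)
```

    The unrestricted search aligns the two steps perfectly, while the narrow band cannot, so it reports a larger distance. A learned, domain-specific limit on the search is one way to keep the band narrow without paying that penalty.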
    Unknown-Aware Domain Adversarial Learning for Open-Set Domain Adaptation. (arXiv:2206.07551v1 [cs.LG])
    Open-Set Domain Adaptation (OSDA) assumes that a target domain contains unknown classes, which are not discovered in a source domain. Existing domain adversarial learning methods are not suitable for OSDA because distribution matching with \textit{unknown} classes leads to negative transfer. Previous OSDA methods have focused on matching the source and the target distribution by only utilizing \textit{known} classes. However, this \textit{known}-only matching may fail to learn the target-\textit{unknown} feature space. Therefore, we propose Unknown-Aware Domain Adversarial Learning (UADAL), which \textit{aligns} the source and the target-\textit{known} distribution while simultaneously \textit{segregating} the target-\textit{unknown} distribution in the feature alignment procedure. We provide theoretical analyses on the optimized state of the proposed \textit{unknown-aware} feature alignment, so we can guarantee both \textit{alignment} and \textit{segregation} theoretically. Empirically, we evaluate UADAL on the benchmark datasets, which shows that UADAL outperforms other methods with better feature alignments by reporting the state-of-the-art performances.
    Attributions Beyond Neural Networks: The Linear Program Case. (arXiv:2206.07203v1 [cs.LG])
    Linear Programs (LPs) have been one of the building blocks in machine learning and have championed recent strides in differentiable optimizers for learning systems. While there exist solvers for even high-dimensional LPs, understanding said high-dimensional solutions poses an orthogonal and unresolved problem. We introduce an approach where we consider neural encodings for LPs that justify the application of attribution methods from explainable artificial intelligence (XAI) designed for neural learning systems. The several encoding functions we propose take into account aspects such as feasibility of the decision space, the cost attached to each input, or the distance to special points of interest. We investigate the mathematical consequences of several XAI methods on said neural LP encodings. We empirically show that the attribution methods Saliency and LIME reveal indistinguishable results up to perturbation levels, and we propose the property of Directedness as the main discriminative criterion between Saliency and LIME on one hand, and a perturbation-based Feature Permutation approach on the other hand. Directedness indicates whether an attribution method gives feature attributions with respect to an increase of that feature. We further notice the baseline selection problem beyond the classical computer vision setting for Integrated Gradients.
    Tearing Apart NOTEARS: Controlling the Graph Prediction via Variance Manipulation. (arXiv:2206.07195v1 [cs.LG])
    Simulations are ubiquitous in machine learning. Especially in graph learning, simulations of Directed Acyclic Graphs (DAG) are being deployed for evaluating new algorithms. In the literature, it was recently argued that continuous-optimization approaches to structure discovery such as NOTEARS might be exploiting the sortability of the variables' variances in the available data due to their use of least square losses. Specifically, since structure discovery is a key problem in science and beyond, we want to be invariant to the scale being used for measuring our data (e.g. meter versus centimeter should not affect the causal direction inferred by the algorithm). In this work, we further strengthen this initial, negative empirical suggestion by both proving key results in the multivariate case and corroborating with further empirical evidence. In particular, we show that we can control the resulting graph with our targeted variance attacks, even in the case where we can only partially manipulate the variances of the data.
    CLNode: Curriculum Learning for Node Classification. (arXiv:2206.07258v1 [cs.LG])
    Node classification is a fundamental graph-based task that aims to predict the classes of unlabeled nodes, for which Graph Neural Networks (GNNs) are the state-of-the-art methods. In current GNNs, training nodes (or training samples) are treated equally throughout training. The quality of the samples, however, varies greatly according to the graph structure. Consequently, the performance of GNNs could be harmed by two types of low-quality samples: (1) Inter-class nodes situated near class boundaries that connect neighboring classes. These nodes' representations lack the typical characteristics of their corresponding classes. Because GNNs are data-driven approaches, training on these nodes could degrade the accuracy. (2) Mislabeled nodes. In real-world graphs, nodes are often mislabeled, which can significantly degrade the robustness of GNNs. To mitigate the detrimental effect of the low-quality samples, we present CLNode (Curriculum Learning for Node Classification), which automatically adjusts the weights of samples during training based on their quality. Specifically, we first design a neighborhood-based difficulty measurer to accurately measure the quality of samples. Subsequently, based on these measurements, we employ a training scheduler to adjust the sample weights in each training epoch. To evaluate the effectiveness of CLNode, we conduct extensive experiments by applying it to four representative backbone GNNs. Experimental results on six real-world networks demonstrate that CLNode is a general framework that can be combined with various GNNs to improve their accuracy and robustness.
    E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR. (arXiv:2204.10749v2 [cs.SD] UPDATED)
    Improving the performance of end-to-end ASR models on long utterances ranging from minutes to hours in length is an ongoing challenge in speech recognition. A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundary locations based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for real-world speech where, e.g., a complete sentence that should be taken as a whole may contain hesitations in the middle ("set an alarm for... 5 o'clock"). We propose to replace the VAD with an end-to-end ASR model capable of predicting segment boundaries in a streaming fashion, allowing the segmentation decision to be conditioned not only on better acoustic features but also on semantic features from the decoded text with negligible extra computation. In experiments on real world long-form audio (YouTube) with lengths of up to 30 minutes, we demonstrate 8.5% relative WER improvement and 250 ms reduction in median end-of-segment latency compared to the VAD segmenter baseline on a state-of-the-art Conformer RNN-T model.
    Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition. (arXiv:2203.05008v2 [cs.CL] UPDATED)
    Language model fusion helps smart assistants recognize words which are rare in acoustic data but abundant in text-only corpora (typed search logs). However, such corpora have properties that hinder downstream performance, including being (1) too large, (2) beset with domain-mismatched content, and (3) heavy-headed rather than heavy-tailed (excessively many duplicate search queries such as "weather"). We show that three simple strategies for selecting language modeling data can dramatically improve rare-word recognition without harming overall performance. First, to address the heavy-headedness, we downsample the data according to a soft log function, which tunably reduces high frequency (head) sentences. Second, to encourage rare-word exposure, we explicitly filter for words rare in the acoustic data. Finally, we tackle domain-mismatch via perplexity-based contrastive selection, filtering for examples matched to the target domain. We down-select a large corpus of web search queries by a factor of 53x and achieve better LM perplexities than without down-selection. When shallow-fused with a state-of-the-art, production speech engine, our LM achieves WER reductions of up to 24% relative on rare-word sentences (without changing overall WER) compared to a baseline LM trained on the raw corpus. These gains are further validated through favorable side-by-side evaluations on live voice search traffic.
    Stability of image reconstruction algorithms. (arXiv:2206.07128v1 [math.OC])
    Robustness and stability of image reconstruction algorithms have recently come under scrutiny. Their importance to medical imaging cannot be overstated. We review the known results for the topical variational regularization strategies ($\ell_2$ and $\ell_1$ regularization), and present new stability results for $\ell_p$ regularized linear inverse problems for $p\in(1,\infty)$. Our results generalize well to the respective $L_p(\Omega)$ function spaces.
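    For orientation, an $\ell_p$ regularized linear inverse problem is commonly written in the variational form below. The symbols $A$ (forward operator), $y$ (measured data) and $\alpha > 0$ (regularization weight) are notational assumptions introduced here, not taken from the abstract, and the paper's exact formulation may differ.

```latex
\min_{x} \; \frac{1}{2}\,\|Ax - y\|_2^2 \;+\; \alpha\,\|x\|_p^p,
\qquad p \in (1, \infty)
```

    In this notation, stability concerns how much the minimizer changes when the data $y$ is perturbed.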
    Proximal Splitting Adversarial Attacks for Semantic Segmentation. (arXiv:2206.07179v1 [cs.LG])
    Classification has been the focal point of research on adversarial attacks, but only a few works investigate methods suited to denser prediction tasks, such as semantic segmentation. The methods proposed in these works do not accurately solve the adversarial segmentation problem and, therefore, are overoptimistic in terms of size of the perturbations required to fool models. Here, we propose a white-box attack for these models based on a proximal splitting to produce adversarial perturbations with much smaller $\ell_1$, $\ell_2$, or $\ell_\infty$ norms. Our attack can handle large numbers of constraints within a nonconvex minimization framework via an Augmented Lagrangian approach, coupled with adaptive constraint scaling and masking strategies. We demonstrate that our attack significantly outperforms previously proposed ones, as well as classification attacks that we adapted for segmentation, providing a first comprehensive benchmark for this dense task. Our results push current limits concerning robustness evaluations in segmentation tasks.
    DeepRecon: Joint 2D Cardiac Segmentation and 3D Volume Reconstruction via A Structure-Specific Generative Method. (arXiv:2206.07163v1 [cs.CV])
    Joint 2D cardiac segmentation and 3D volume reconstruction are fundamental to building statistical cardiac anatomy models and understanding functional mechanisms from motion patterns. However, due to the low through-plane resolution of cine MR and high inter-subject variance, accurately segmenting cardiac images and reconstructing the 3D volume are challenging. In this study, we propose an end-to-end latent-space-based framework, DeepRecon, that generates multiple clinically essential outcomes, including accurate image segmentation, synthetic high-resolution 3D image, and 3D reconstructed volume. Our method identifies the optimal latent representation of the cine image that contains accurate semantic information for cardiac structures. In particular, our model jointly generates synthetic images with accurate semantic information and segmentation of the cardiac structures using the optimal latent representation. We further explore downstream applications of 3D shape reconstruction and 4D motion pattern adaptation by the different latent-space manipulation strategies. The simultaneously generated high-resolution images present a high interpretable value to assess the cardiac shape and motion. Experimental results demonstrate the effectiveness of our approach on multiple fronts including 2D segmentation, 3D reconstruction, and downstream 4D motion pattern adaptation.
    Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt. (arXiv:2206.07137v1 [cs.LG])
    Training on web-scale data can take months. But much computation and time is wasted on redundant and noisy points that are already learnt or not learnable. To accelerate training, we introduce Reducible Holdout Loss Selection (RHO-LOSS), a simple but principled technique which selects approximately those points for training that most reduce the model's generalization loss. As a result, RHO-LOSS mitigates the weaknesses of existing data selection methods: techniques from the optimization literature typically select 'hard' (e.g. high loss) points, but such points are often noisy (not learnable) or less task-relevant. Conversely, curriculum learning prioritizes 'easy' points, but such points need not be trained on once learned. In contrast, RHO-LOSS selects points that are learnable, worth learning, and not yet learnt. RHO-LOSS trains in far fewer steps than prior art, improves accuracy, and speeds up training on a wide range of datasets, hyperparameters, and architectures (MLPs, CNNs, and BERT). On the large web-scraped image dataset Clothing-1M, RHO-LOSS trains in 18x fewer steps and reaches 2% higher final accuracy than uniform data shuffling.
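    A minimal sketch of how such a selection rule might look is given below. It assumes that each candidate point is scored by its current training loss minus the loss that a separate model, trained on holdout data, assigns to it, and that the top-scoring points in a batch are kept. This scoring formula is our reading of the name "reducible holdout loss" and is not spelled out in the abstract; the loss values are made up for illustration.

```python
from typing import List


def select_top_k(train_losses: List[float],
                 irreducible_losses: List[float],
                 k: int) -> List[int]:
    """Return indices of the k points with the largest (assumed) reducible loss."""
    # Assumed score: current loss minus the loss of a holdout-trained model.
    scores = [t - h for t, h in zip(train_losses, irreducible_losses)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical per-point losses for a batch of four candidates:
#   point 0: high loss under both models (likely noisy, not learnable)
#   point 1: low loss under both models (already learnt)
#   points 2, 3: high current loss, low holdout-model loss (learnable, not yet learnt)
train_losses = [2.5, 0.1, 2.4, 1.5]
irreducible_losses = [2.3, 0.1, 0.4, 0.5]
chosen = select_top_k(train_losses, irreducible_losses, k=2)
print(chosen)
```

    Under this assumed score, selecting by raw loss alone would have picked the noisy point 0 first, whereas subtracting the holdout-model loss pushes it down the ranking.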
    PDE-Based Optimal Strategy for Unconstrained Online Learning. (arXiv:2201.07877v2 [cs.LG] UPDATED)
    Unconstrained Online Linear Optimization (OLO) is a practical problem setting to study the training of machine learning models. Existing works proposed a number of potential-based algorithms, but in general the design of these potential functions relies heavily on guessing. To streamline this workflow, we present a framework that generates new potential functions by solving a Partial Differential Equation (PDE). Specifically, when losses are 1-Lipschitz, our framework produces a novel algorithm with anytime regret bound $C\sqrt{T}+||u||\sqrt{2T}[\sqrt{\log(1+||u||/C)}+2]$, where $C$ is a user-specified constant and $u$ is any comparator unknown and unbounded a priori. Such a bound attains an optimal loss-regret trade-off without the impractical doubling trick. Moreover, a matching lower bound shows that the leading order term, including the constant multiplier $\sqrt{2}$, is tight. To our knowledge, the proposed algorithm is the first to achieve such optimalities.
    Clustered Scheduling and Communication Pipelining For Efficient Resource Management Of Wireless Federated Learning. (arXiv:2206.07631v1 [cs.LG])
    This paper proposes using communication pipelining to enhance the wireless spectrum utilization efficiency and convergence speed of federated learning in mobile edge computing applications. Due to limited wireless sub-channels, a subset of the total clients is scheduled in each iteration of federated learning algorithms. On the other hand, the scheduled clients wait for the slowest client to finish its computation. We propose to first cluster the clients based on the time they need per iteration to compute the local gradients of the federated learning model. Then, we schedule a mixture of clients from all clusters to send their local updates in a pipelined manner. In this way, instead of just waiting for the slower clients to finish their computation, more clients can participate in each iteration. While the time duration of a single iteration does not change, the proposed method can significantly reduce the number of required iterations to achieve a target accuracy. We provide a generic formulation for optimal client clustering under different settings, and we analytically derive an efficient algorithm for obtaining the optimal solution. We also provide numerical results to demonstrate the gains of the proposed method for different datasets and deep learning architectures.
    BaIT: Barometer for Information Trustworthiness. (arXiv:2206.07535v1 [cs.LG])
    This paper presents a new approach to the FNC-1 fake news classification task, which involves employing pre-trained encoder models from similar NLP tasks, namely sentence similarity and natural language inference. Two neural network architectures using this approach are proposed. Methods in data augmentation are explored as a means of tackling class imbalance in the dataset, employing common pre-existing methods and proposing a method for sample generation in the under-represented class using a novel sentence negation algorithm. Comparable overall performance with existing baselines is achieved, while significantly increasing accuracy on an under-represented but nonetheless important class for FNC-1.
    Diffusion Transport Alignment. (arXiv:2206.07305v1 [stat.ML])
    The integration of multimodal data presents a challenge in cases when the study of a given phenomenon by different instruments or conditions generates distinct but related domains. Many existing data integration methods assume a known one-to-one correspondence between domains of the entire dataset, which may be unrealistic. Furthermore, existing manifold alignment methods are not suited for cases where the data contains domain-specific regions, i.e., there is not a counterpart for a certain portion of the data in the other domain. We propose Diffusion Transport Alignment (DTA), a semi-supervised manifold alignment method that exploits prior correspondence knowledge between only a few points to align the domains. By building a diffusion process, DTA finds a transportation plan between data measured from two heterogeneous domains with different feature spaces, which by assumption, share a similar geometrical structure coming from the same underlying data generating process. DTA can also compute a partial alignment in a data-driven fashion, resulting in accurate alignments when some data are measured in only one domain. We empirically demonstrate that DTA outperforms other methods in aligning multimodal data in this semi-supervised setting. We also empirically show that the alignment obtained by DTA can improve the performance of machine learning tasks, such as domain adaptation, inter-domain feature mapping, and exploratory data analysis, while outperforming competing methods.
    Preliminary study on the impact of EEG density on TMS-EEG classification in Alzheimer's disease. (arXiv:2206.07492v1 [eess.SP])
    Transcranial magnetic stimulation co-registered with electroencephalography (TMS-EEG) has previously proven a helpful tool in the study of Alzheimer's disease (AD). In this work, we investigate the use of TMS-evoked EEG responses to classify AD patients from healthy controls (HC). By using a dataset containing 17 AD and 17 HC, we extract various time domain features from individual TMS responses and average them over a low, medium and high density EEG electrode set. Within a leave-one-subject-out validation scenario, the best classification performance for AD vs. HC was obtained using a high-density electrode with a Random Forest classifier. The accuracy, sensitivity and specificity were 92.7%, 96.58% and 88.2% respectively.
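    The leave-one-subject-out scheme mentioned above can be sketched as follows: all responses from one subject are held out for testing while the model is trained on everyone else, so that no subject contributes to both sides of a split. The subject labels and the helper name `leave_one_subject_out` are hypothetical, and the classifier itself is left out.

```python
from typing import Iterator, List, Tuple


def leave_one_subject_out(
    subject_ids: List[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Hypothetical subject label for each recorded response (one entry per sample).
subjects = ["AD01", "AD01", "HC01", "HC01", "HC01", "AD02"]
folds = list(leave_one_subject_out(subjects))
for held_out, train, test in folds:
    print(held_out, train, test)
```

    Splitting by subject rather than by individual response avoids the optimistic bias that arises when responses from the same person appear in both training and test data.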
    Defending Observation Attacks in Deep Reinforcement Learning via Detection and Denoising. (arXiv:2206.07188v1 [cs.LG])
    Neural network policies trained using Deep Reinforcement Learning (DRL) are well-known to be susceptible to adversarial attacks. In this paper, we consider attacks manifesting as perturbations in the observation space managed by the external environment. These attacks have been shown to downgrade policy performance significantly. We focus our attention on well-trained deterministic and stochastic neural network policies in the context of continuous control benchmarks subject to four well-studied observation space adversarial attacks. To defend against these attacks, we propose a novel defense strategy using a detect-and-denoise schema. Unlike previous adversarial training approaches that sample data in adversarial scenarios, our solution does not require sampling data in an environment under attack, thereby greatly reducing risk during training. Detailed experimental results show that our technique is comparable with state-of-the-art adversarial training approaches.
    Self-Supervision on Images and Text Reduces Reliance on Visual Shortcut Features. (arXiv:2206.07155v1 [cs.LG])
    Deep learning models trained in a fully supervised manner have been shown to rely on so-called "shortcut" features. Shortcut features are inputs that are associated with the outcome of interest in the training data, but are either no longer associated or not present in testing or deployment settings. Here we provide experiments that show recent self-supervised models trained on images and text provide more robust image representations and reduce the model's reliance on visual shortcut features on a realistic medical imaging example. Additionally, we find that these self-supervised models "forget" shortcut features more quickly than fully supervised ones when fine-tuned on labeled data. Though not a complete solution, our experiments provide compelling evidence that self-supervised models trained on images and text provide some resilience to visual shortcut features.
    Lattice Convolutional Networks for Learning Ground States of Quantum Many-Body Systems. (arXiv:2206.07370v1 [quant-ph])
    Deep learning methods have been shown to be effective in representing ground-state wave functions of quantum many-body systems. Existing methods use convolutional neural networks (CNNs) for square lattices due to their image-like structures. For non-square lattices, existing methods use graph neural networks (GNNs), in which structure information is not precisely captured, thereby requiring additional hand-crafted sublattice encoding. In this work, we propose lattice convolutions in which a set of proposed operations are used to convert non-square lattices into grid-like augmented lattices on which regular convolution can be applied. Based on the proposed lattice convolutions, we design lattice convolutional networks (LCN) that use self-gating and attention mechanisms. Experimental results show that our method achieves performance on par with or better than existing methods on the spin-1/2 $J_1$-$J_2$ Heisenberg model over the square, honeycomb, triangular, and kagome lattices, without using hand-crafted encoding.
    Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction. (arXiv:2206.07085v1 [cs.LG])
    Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to help with optimization difficulties in very deep nets, but they clearly also help generalization, even in not-so-deep nets. Motivated by the long-held belief that flatter minima lead to better generalization, this paper gives mathematical analysis and supporting experiments suggesting that normalization (together with accompanying weight-decay) encourages GD to reduce the sharpness of loss surface. Here "sharpness" is carefully defined given that the loss is scale-invariant, a known consequence of normalization. Specifically, for a fairly broad class of neural nets with normalization, our theory explains how GD with a finite learning rate enters the so-called Edge of Stability (EoS) regime, and characterizes the trajectory of GD in this regime via a continuous sharpness-reduction flow.
    Lazy Queries Can Reduce Variance in Zeroth-order Optimization. (arXiv:2206.07126v1 [cs.LG])
    A major challenge of applying zeroth-order (ZO) methods is the high query complexity, especially when queries are costly. We propose a novel gradient estimation technique for ZO methods based on adaptive lazy queries that we term LAZO. Unlike the classic one-point or two-point gradient estimation methods, LAZO develops two alternative ways to check the usefulness of old queries from previous iterations, and then adaptively reuses them to construct low-variance gradient estimates. We rigorously establish that through judiciously reusing the old queries, LAZO can reduce the variance of stochastic gradient estimates so that it not only saves queries per iteration but also achieves the regret bound for the symmetric two-point method. We evaluate the numerical performance of LAZO, and demonstrate the low-variance property and the performance gain of LAZO in both regret and query complexity relative to several existing ZO methods. The idea of LAZO is general, and can be applied to other variants of ZO methods.
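    For context, the classic symmetric two-point estimator that the abstract uses as its reference point can be sketched as follows. This is a minimal illustration of the baseline only, not of LAZO's lazy query reuse; the function names, the smoothing radius `delta`, and the choice of a uniform direction on the unit sphere are assumptions made for the example.

```python
import numpy as np


def two_point_grad(f, x, delta=1e-3, rng=None):
    """Symmetric two-point zeroth-order gradient estimate of f at x.

    Uses two function queries along a random unit direction u
    (illustrative baseline; not the LAZO estimator itself).
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(x.shape)
    u /= np.linalg.norm(u)  # uniform direction on the unit sphere
    d = x.size
    return d * (f(x + delta * u) - f(x - delta * u)) / (2.0 * delta) * u


def averaged_estimate(f, x, n_samples, seed=0):
    """Average many two-point estimates to expose their mean."""
    rng = np.random.default_rng(seed)
    total = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        total += two_point_grad(f, x, rng=rng)
    return total / n_samples
```

    Each call spends two queries, and a single estimate is noisy; averaging many of them recovers the gradient for a linear objective, which is the variance/query trade-off that query-reuse schemes aim to improve.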
    On Enforcing Better Conditioned Meta-Learning for Rapid Few-Shot Adaptation. (arXiv:2206.07260v1 [cs.LG])
    Inspired by the concept of preconditioning, we propose a novel method to increase adaptation speed for gradient-based meta-learning methods without incurring extra parameters. We demonstrate that recasting the optimization problem to a non-linear least-squares formulation provides a principled way to actively enforce a $\textit{well-conditioned}$ parameter space for meta-learning models based on the concepts of the condition number and local curvature. Our comprehensive evaluations show that the proposed method significantly outperforms its unconstrained counterpart especially during initial adaptation steps, while achieving comparable or better overall results on several few-shot classification tasks -- creating the possibility of dynamically choosing the number of adaptation steps at inference time.
    Improving Solar Flare Prediction by Time Series Outlier Detection. (arXiv:2206.07197v1 [cs.LG])
    Solar flares not only pose risks to outer space technologies and astronauts' well-being, but also cause disruptions on Earth to the high-tech, interconnected infrastructure our lives highly depend on. While a number of machine-learning methods have been proposed to improve flare prediction, none of them, to the best of our knowledge, have investigated the impact of outliers on the reliability and performance of those models. In this study, we investigate the impact of outliers in a multivariate time series benchmark dataset, namely SWAN-SF, on flare prediction models, and test our hypothesis that there exist outliers in SWAN-SF, the removal of which enhances the performance of the prediction models on unseen datasets. We employ Isolation Forest to detect the outliers among the weaker flare instances. Several experiments are carried out using a large range of contamination rates, which determine the percentage of present outliers. We assess the quality of each dataset in terms of its actual contamination using TimeSeriesSVC. In our best finding, we achieve a 279% increase in True Skill Statistic and a 68% increase in Heidke Skill Score. The results show that overall a significant improvement in flare prediction can be achieved if outliers are detected and removed properly.
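    The two reported metrics can be computed from a binary confusion matrix. The sketch below uses the definitions commonly seen in the flare-forecasting literature (TSS as recall minus false-alarm rate, and the two-class Heidke Skill Score); that the paper uses exactly these variants is an assumption, and the function name is illustrative.

```python
def skill_scores(tp, fn, fp, tn):
    """Return (TSS, HSS) for a binary confusion matrix.

    tp/fn/fp/tn are counts of true positives, false negatives,
    false positives and true negatives (assumed definitions).
    """
    # True Skill Statistic: recall minus false-alarm rate.
    tss = tp / (tp + fn) - fp / (fp + tn)
    # Heidke Skill Score: improvement over random-chance agreement.
    hss = 2.0 * (tp * tn - fn * fp) / (
        (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    )
    return tss, hss
```

    Both scores equal 1 for a perfect forecast and 0 for a forecast with no skill, and TSS is insensitive to the class-imbalance ratio, which is one reason it is popular for rare events such as strong flares.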
    GraphFM: Improving Large-Scale GNN Training via Feature Momentum. (arXiv:2206.07161v1 [cs.LG])
    Training of graph neural networks (GNNs) for large-scale node classification is challenging. A key difficulty lies in obtaining accurate hidden node representations while avoiding the neighborhood explosion problem. Here, we propose a new technique, named feature momentum (FM), that uses a momentum step to incorporate historical embeddings when updating feature representations. We develop two specific algorithms, known as GraphFM-IB and GraphFM-OB, that consider in-batch and out-of-batch data, respectively. GraphFM-IB applies FM to in-batch sampled data, while GraphFM-OB applies FM to out-of-batch data that are in the 1-hop neighborhood of in-batch data. We provide a rigorous convergence analysis for GraphFM-IB and theoretical insight into GraphFM-OB regarding the estimation error of feature embeddings. Empirically, we observe that GraphFM-IB can effectively alleviate the neighborhood explosion problem of existing methods. In addition, GraphFM-OB achieves promising performance on multiple large-scale graph datasets.
    TeKo: Text-Rich Graph Neural Networks with External Knowledge. (arXiv:2206.07253v1 [cs.SI])
    Graph Neural Networks (GNNs) have gained great popularity in tackling various analytical tasks on graph-structured data (i.e., networks). Typical GNNs and their variants follow a message-passing manner that obtains network representations by the feature propagation process along network topology, which however ignore the rich textual semantics (e.g., local word-sequence) that exist in many real-world networks. Existing methods for text-rich networks integrate textual semantics by mainly utilizing internal information such as topics or phrases/words, which often suffer from an inability to comprehensively mine the text semantics, limiting the reciprocal guidance between network structure and text semantics. To address these problems, we propose a novel text-rich graph neural network with external knowledge (TeKo), in order to take full advantage of both structural and textual information within text-rich networks. Specifically, we first present a flexible heterogeneous semantic network that incorporates high-quality entities and interactions among documents and entities. We then introduce two types of external knowledge, that is, structured triplets and unstructured entity description, to gain a deeper insight into textual semantics. We further design a reciprocal convolutional mechanism for the constructed heterogeneous semantic network, enabling network structure and textual semantics to collaboratively enhance each other and learn high-level network representations. Extensive experimental results on four public text-rich networks as well as a large-scale e-commerce searching dataset illustrate the superior performance of TeKo over state-of-the-art baselines.
    Fast and Reliable Evaluation of Adversarial Robustness with Minimum-Margin Attack. (arXiv:2206.07314v1 [cs.LG])
    AutoAttack (AA) has been the most reliable method to evaluate adversarial robustness when considerable computational resources are available. However, its high computational cost (e.g., 100 times more than that of the projected gradient descent attack) makes AA infeasible for practitioners with limited computational resources, and also hinders applications of AA in adversarial training (AT). In this paper, we propose a novel method, the minimum-margin (MM) attack, to quickly and reliably evaluate adversarial robustness. Compared with AA, our method achieves comparable performance but only costs 3% of the computational time in extensive experiments. The reliability of our method lies in that we evaluate the quality of adversarial examples using the margin between two targets that can precisely identify the most adversarial example. The computational efficiency of our method lies in an effective Sequential TArget Ranking Selection (STARS) method, ensuring that the cost of the MM attack is independent of the number of classes. The MM attack opens a new way for evaluating adversarial robustness and provides a feasible and reliable way to generate high-quality adversarial examples in AT.
    To Aggregate or Not? Learning with Separate Noisy Labels. (arXiv:2206.07181v1 [cs.LG])
    Raw training data often comes with separate noisy labels collected from multiple imperfect annotators (e.g., via crowdsourcing). Typically one would first aggregate the separate noisy labels into one and apply standard training methods. The literature has also extensively studied effective aggregation approaches. This paper revisits this choice and aims to provide an answer to the question of whether one should aggregate separate noisy labels into single ones or use them separately as given. We theoretically analyze the performance of both approaches under the empirical risk minimization framework for a number of popular loss functions, including the ones designed specifically for the problem of learning with noisy labels. Our theorems conclude that label separation is preferred over label aggregation when the noise rates are high, or the number of labelers/annotations is insufficient. Extensive empirical results validate our conclusion.
    Category-Agnostic 6D Pose Estimation with Conditional Neural Processes. (arXiv:2206.07162v1 [cs.CV])
    We present a novel meta-learning approach for 6D pose estimation on unknown objects. In contrast to "instance-level" pose estimation methods, our algorithm learns object representation in a category-agnostic way, which endows it with strong generalization capabilities within and across object categories. Specifically, we employ a conditional neural process-based meta-learning approach to train an encoder to capture texture and geometry of an object in a latent representation, based on very few RGB-D images and ground-truth keypoints. The latent representation is then used by a simultaneously meta-trained decoder to predict the 6D pose of the object in new images. To evaluate our algorithm, experiments are conducted on our new fully-annotated synthetic datasets generated from Multiple Categories in Multiple Scenes (MCMS). Experimental results demonstrate that our model performs well on unseen objects with various shapes and appearances.
    Codec at SemEval-2022 Task 5: Multi-Modal Multi-Transformer Misogynous Meme Classification Framework. (arXiv:2206.07190v1 [cs.CL])
    In this paper we describe our work towards building a generic framework for both multi-modal embedding and multi-label binary classification tasks, while participating in task 5 (Multimedia Automatic Misogyny Identification) of SemEval 2022 competition. Since pretraining deep models from scratch is a resource and data hungry task, our approach is based on three main strategies. We combine different state-of-the-art architectures to capture a wide spectrum of semantic signals from the multi-modal input. We employ a multi-task learning scheme to be able to use multiple datasets from the same knowledge domain to help increase the model's performance. We also use multiple objectives to regularize and fine tune different system components.
    Adaptive Threshold Sampling. (arXiv:1708.04970v2 [stat.ML] UPDATED)
    Sampling is a fundamental problem in computer science and statistics. However, for a given task and stream, it is often not possible to choose good sampling probabilities in advance. We derive a general framework for adaptively changing the sampling probabilities via a collection of thresholds. In general, adaptive sampling procedures introduce dependence amongst the sampled points, making it difficult to compute expectations and ensure estimators are unbiased or consistent. Our framework addresses this issue and further shows when adaptive thresholds can be treated as if they were fixed thresholds that sample items independently. This makes our adaptive sampling schemes simple to apply as there is no need to create custom estimators for the sampling method. Using our framework, we derive new samplers that can address a broad range of new and existing problems including sampling with memory rather than sample size budgets, stratified samples, multiple objectives, distinct counting, and sliding windows. In particular, we design a sampling procedure for the top-K problem where, unlike in the heavy-hitter problem, the sketch size and sampling probabilities are adaptively chosen.
    Near-Exact Recovery for Tomographic Inverse Problems via Deep Learning. (arXiv:2206.07050v1 [eess.IV])
    This work is concerned with the following fundamental question in scientific machine learning: Can deep-learning-based methods solve noise-free inverse problems to near-perfect accuracy? Positive evidence is provided for the first time, focusing on a prototypical computed tomography (CT) setup. We demonstrate that an iterative end-to-end network scheme enables reconstructions close to numerical precision, comparable to classical compressed sensing strategies. Our results build on our winning submission to the recent AAPM DL-Sparse-View CT Challenge. Its goal was to identify the state-of-the-art in solving the sparse-view CT inverse problem with data-driven techniques. A specific difficulty of the challenge setup was that the precise forward model remained unknown to the participants. Therefore, a key feature of our approach was to initially estimate the unknown fanbeam geometry in a data-driven calibration step. Apart from an in-depth analysis of our methodology, we also demonstrate its state-of-the-art performance on the open-access real-world dataset LoDoPaB CT.
    Open-Ended Knowledge Tracing. (arXiv:2203.03716v2 [cs.CY] UPDATED)
    Knowledge tracing refers to the problem of estimating each student's knowledge component/skill mastery level from their past responses to questions in educational applications. One direct benefit knowledge tracing methods provide is the ability to predict each student's performance on the future questions. However, one key limitation of most existing knowledge tracing methods is that they treat student responses to questions as binary-valued, i.e., whether the responses are correct or incorrect. Response correctness analysis/prediction is easy to navigate but loses important information, especially for open-ended questions: the exact student responses can potentially provide much more information about their knowledge states than only response correctness. In this paper, we present our first exploration into open-ended knowledge tracing, i.e., the analysis and prediction of students' open-ended responses to questions in the knowledge tracing setup. We first lay out a generic framework for open-ended knowledge tracing before detailing its application to the domain of computer science education with programming questions. We define a series of evaluation metrics in this domain and conduct a series of quantitative and qualitative experiments to test the boundaries of open-ended knowledge tracing methods on a real-world student code dataset.
    On Numerical Integration in Neural Ordinary Differential Equations. (arXiv:2206.07335v1 [cs.LG])
    The combination of ordinary differential equations and neural networks, i.e., neural ordinary differential equations (Neural ODE), has been widely studied from various angles. However, deciphering the numerical integration in Neural ODE is still an open challenge, as many studies have demonstrated that numerical integration significantly affects the performance of the model. In this paper, we propose the inverse modified differential equations (IMDE) to clarify the influence of numerical integration on training Neural ODE models. IMDE is determined by the learning task and the employed ODE solver. It is shown that training a Neural ODE model actually returns a close approximation of the IMDE, rather than the true ODE. With the help of IMDE, we deduce that (i) the discrepancy between the learned model and the true ODE is bounded by the sum of discretization error and learning loss; (ii) Neural ODEs using non-symplectic numerical integration theoretically fail to learn conservation laws. Several experiments are performed to numerically verify our theoretical analysis.
    Training Discrete Deep Generative Models via Gapped Straight-Through Estimator. (arXiv:2206.07235v1 [cs.LG])
    While deep generative models have succeeded in image processing, natural language processing, and reinforcement learning, training that involves discrete random variables remains challenging due to the high variance of its gradient estimation process. Monte Carlo is a common solution used in most variance reduction approaches. However, this involves time-consuming resampling and multiple function evaluations. We propose a Gapped Straight-Through (GST) estimator to reduce the variance without incurring resampling overhead. This estimator is inspired by the essential properties of Straight-Through Gumbel-Softmax. We determine these properties and show via an ablation study that they are essential. Experiments demonstrate that the proposed GST estimator enjoys better performance compared to strong baselines on two discrete deep generative modeling tasks, MNIST-VAE and ListOps.
  • Open

    Wide Bayesian neural networks have a simple weight posterior: theory and accelerated sampling. (arXiv:2206.07673v1 [stat.ML])
    We introduce repriorisation, a data-dependent reparameterisation which transforms a Bayesian neural network (BNN) posterior to a distribution whose KL divergence to the BNN prior vanishes as layer widths grow. The repriorisation map acts directly on parameters, and its analytic simplicity complements the known neural network Gaussian process (NNGP) behaviour of wide BNNs in function space. Exploiting the repriorisation, we develop a Markov chain Monte Carlo (MCMC) posterior sampling algorithm which mixes faster the wider the BNN. This contrasts with the typically poor performance of MCMC in high dimensions. We observe up to 50x higher effective sample size relative to no reparametrisation for both fully-connected and residual networks. Improvements are achieved at all widths, with the margin between reparametrised and standard BNNs growing with layer width.
    Calibrating Agent-based Models to Microdata with Graph Neural Networks. (arXiv:2206.07570v1 [cs.MA])
    Calibrating agent-based models (ABMs) to data is among the most fundamental requirements to ensure the model fulfils its desired purpose. In recent years, simulation-based inference methods have emerged as powerful tools for performing this task when the model likelihood function is intractable, as is often the case for ABMs. In some real-world use cases of ABMs, both the observed data and the ABM output consist of the agents' states and their interactions over time. In such cases, there is a tension between the desire to make full use of the rich information content of such granular data on the one hand, and the need to reduce the dimensionality of the data to prevent difficulties associated with high-dimensional learning tasks on the other. A possible resolution is to construct lower-dimensional time-series through the use of summary statistics describing the macrostate of the system at each time point. However, a poor choice of summary statistics can result in an unacceptable loss of information from the original dataset, dramatically reducing the quality of the resulting calibration. In this work, we instead propose to learn parameter posteriors associated with granular microdata directly using temporal graph neural networks. We will demonstrate that such an approach offers highly compelling inductive biases for Bayesian inference using the raw ABM microstates as output.
    Non-Vacuous Generalisation Bounds for Shallow Neural Networks. (arXiv:2202.01627v3 [cs.LG] UPDATED)
    We focus on a specific class of shallow neural networks with a single hidden layer, namely those with $L_2$-normalised data and either a sigmoid-shaped Gaussian error function ("erf") activation or a Gaussian Error Linear Unit (GELU) activation. For these networks, we derive new generalisation bounds through the PAC-Bayesian theory; unlike most existing such bounds they apply to neural networks with deterministic rather than randomised parameters. Our bounds are empirically non-vacuous when the network is trained with vanilla stochastic gradient descent on MNIST and Fashion-MNIST.
    RieszNet and ForestRiesz: Automatic Debiased Machine Learning with Neural Nets and Random Forests. (arXiv:2110.03031v3 [cs.LG] UPDATED)
    Many causal and policy effects of interest are defined by linear functionals of high-dimensional or non-parametric regression functions. $\sqrt{n}$-consistent and asymptotically normal estimation of the object of interest requires debiasing to reduce the effects of regularization and/or model selection on the object of interest. Debiasing is typically achieved by adding a correction term to the plug-in estimator of the functional, which leads to properties such as semi-parametric efficiency, double robustness, and Neyman orthogonality. We implement an automatic debiasing procedure based on automatically learning the Riesz representation of the linear functional using Neural Nets and Random Forests. Our method only relies on black-box evaluation oracle access to the linear functional and does not require knowledge of its analytic form. We propose a multitasking Neural Net debiasing method with stochastic gradient descent minimization of a combined Riesz representer and regression loss, while sharing representation layers for the two functions. We also propose a Random Forest method which learns a locally linear representation of the Riesz function. Even though our method applies to arbitrary functionals, we experimentally find that it performs well compared to the state of art neural net based algorithm of Shi et al. (2019) for the case of the average treatment effect functional. We also evaluate our method on the problem of estimating average marginal effects with continuous treatments, using semi-synthetic data of gasoline price changes on gasoline demand.
    Born-Infeld (BI) for AI: Energy-Conserving Descent (ECD) for Optimization. (arXiv:2201.11137v2 [cs.LG] UPDATED)
    We introduce a novel framework for optimization based on energy-conserving Hamiltonian dynamics in a strongly mixing (chaotic) regime and establish its key properties analytically and numerically. The prototype is a discretization of Born-Infeld dynamics, with a squared relativistic speed limit depending on the objective function. This class of frictionless, energy-conserving optimizers proceeds unobstructed until slowing naturally near the minimal loss, which dominates the phase space volume of the system. Building from studies of chaotic systems such as dynamical billiards, we formulate a specific algorithm with good performance on machine learning and PDE-solving tasks, including generalization. It cannot stop at a high local minimum, an advantage in non-convex loss functions, and proceeds faster than GD+momentum in shallow valleys.
    QONNX: Representing Arbitrary-Precision Quantized Neural Networks. (arXiv:2206.07527v1 [cs.LG])
    We present extensions to the Open Neural Network Exchange (ONNX) intermediate representation format to represent arbitrary-precision quantized neural networks. We first introduce support for low precision quantization in existing ONNX-based quantization formats by leveraging integer clipping, resulting in two new backward-compatible variants: the quantized operator format with clipping and quantize-clip-dequantize (QCDQ) format. We then introduce a novel higher-level ONNX format called quantized ONNX (QONNX) that introduces three new operators -- Quant, BipolarQuant, and Trunc -- in order to represent uniform quantization. By keeping the QONNX IR high-level and flexible, we enable targeting a wider variety of platforms. We also present utilities for working with QONNX, as well as examples of its usage in the FINN and hls4ml toolchains. Finally, we introduce the QONNX model zoo to share low-precision quantized neural networks.
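    The quantize-clip-dequantize idea can be illustrated numerically: values are quantized into an 8-bit integer container, clipped to the narrower range of a lower-precision integer, and then dequantized. The sketch below is a NumPy stand-in under assumed conventions (uniform affine quantization, rounding via `np.round`); it does not use ONNX or the QONNX utilities, and the function name and parameters are hypothetical.

```python
import numpy as np


def qcdq(x, scale, zero_point=0, bits=4, signed=True):
    """Emulate quantize -> clip -> dequantize for a sub-8-bit integer.

    Illustrative only: an 8-bit container is assumed, and the clip
    step narrows it to the range of a `bits`-wide integer.
    """
    # Quantize into the 8-bit container.
    container = (-128, 127) if signed else (0, 255)
    q = np.clip(np.round(x / scale) + zero_point, *container)
    # Clip to the low-precision integer range.
    if signed:
        lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    else:
        lo, hi = 0, 2 ** bits - 1
    q = np.clip(q, lo, hi)
    # Dequantize back to floating point.
    return (q - zero_point) * scale
```

    With `scale=0.25` and `bits=4`, inputs saturate at 1.75 and -2.0, i.e. the signed 4-bit range [-8, 7] times the scale, even though the container itself is 8 bits wide.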
    Solving Stochastic Optimization with Expectation Constraints Efficiently by a Stochastic Augmented Lagrangian-Type Algorithm. (arXiv:2106.11577v3 [math.OC] UPDATED)
    This paper considers the problem of minimizing a convex expectation function with a set of inequality convex expectation constraints. We present a computable stochastic approximation type algorithm, namely the stochastic linearized proximal method of multipliers, to solve this convex stochastic optimization problem. This algorithm can be roughly viewed as a hybrid of stochastic approximation and the traditional proximal method of multipliers. Under mild conditions, we show that this algorithm exhibits $O(K^{-1/2})$ expected convergence rates for both objective reduction and constraint violation if parameters in the algorithm are properly chosen, where $K$ denotes the number of iterations. Moreover, we show that, with high probability, the algorithm has $O(\log(K)K^{-1/2})$ constraint violation bound and $O(\log^{3/2}(K)K^{-1/2})$ objective bound. Some preliminary numerical results demonstrate the performance of the proposed algorithm.
    MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields. (arXiv:2206.07697v1 [stat.ML])
    Creating fast and accurate force fields is a long-standing challenge in computational chemistry and materials science. Recently, several equivariant message passing neural networks (MPNNs) have been shown to outperform models built using other approaches in terms of accuracy. However, most MPNNs suffer from high computational cost and poor scalability. We propose that these limitations arise because MPNNs only pass two-body messages leading to a direct relationship between the number of layers and the expressivity of the network. In this work, we introduce MACE, a new equivariant MPNN model that uses higher body order messages. In particular, we show that using four-body messages reduces the required number of message passing iterations to just \emph{two}, resulting in a fast and highly parallelizable model, reaching or exceeding state-of-the-art accuracy on the rMD17, 3BPA, and AcAc benchmark tasks. We also demonstrate that using higher order messages leads to an improved steepness of the learning curves.
    Double Robustness for Complier Parameters and a Semiparametric Test for Complier Characteristics. (arXiv:1909.05244v6 [stat.ML] UPDATED)
    We study low dimensional complier parameters that are identified using a binary instrumental variable $Z$, which is valid conditional on a possibly high dimensional vector of covariates $X$. We characterize the doubly robust moment function for the entire class of complier parameters defined by Abadie (2003) by combining two classic formulations: the Wald formula and the $\kappa$ weight. In particular, we reinterpret the $\kappa$ weight as the Riesz representer to the Wald formula, which appears to be a new insight. The main result includes new cases such as average complier characteristics. We use the main result to propose a hypothesis test, free of functional form restrictions, to evaluate (i) whether two different instruments induce compliers with the same observable characteristics on average, and (ii) whether compliers have observable characteristics that are the same as the full population on average. By developing this hypothesis test, we equip empirical researchers with a new robustness check.
    Robust and Sparse Estimation of Linear Regression Coefficients with Heavy-tailed Noises and Covariates. (arXiv:2206.07594v1 [stat.ML])
    Robust and sparse estimation of linear regression coefficients is investigated. The situation addressed by the present paper is that covariates and noises are sampled from heavy-tailed distributions, and the covariates and noises are contaminated by malicious outliers. Our estimator can be computed efficiently. Further, our estimation error bound is sharp.
    Bayesian Learning of Parameterised Quantum Circuits. (arXiv:2206.07559v1 [quant-ph])
    Currently available quantum computers suffer from constraints including hardware noise and a limited number of qubits. As such, variational quantum algorithms that utilise a classical optimiser in order to train a parameterised quantum circuit have drawn significant attention for near-term practical applications of quantum technology. In this work, we take a probabilistic point of view and reformulate the classical optimisation as an approximation of a Bayesian posterior. The posterior is induced by combining the cost function to be minimised with a prior distribution over the parameters of the quantum circuit. We describe a dimension reduction strategy based on a maximum a posteriori point estimate with a Laplace prior. Experiments on the Quantinuum H1-2 computer show that the resulting circuits are faster to execute and less noisy than the circuits trained without the dimension reduction strategy. We subsequently describe a posterior sampling strategy based on stochastic gradient Langevin dynamics. Numerical simulations on three different problems show that the strategy is capable of generating samples from the full posterior and avoiding local optima.
    Rethinking Initialization of the Sinkhorn Algorithm. (arXiv:2206.07630v1 [stat.ML])
    Computing an optimal transport (OT) coupling between distributions plays an increasingly important role in machine learning. While OT problems can be solved as linear programs, adding an entropic smoothing term is known to result in solvers that are faster and more robust to outliers, differentiable and easier to parallelize. The Sinkhorn fixed point algorithm is the cornerstone of these approaches, and, as a result, multiple attempts have been made to shorten its runtime using, for instance, annealing, momentum or acceleration. The premise of this paper is that \textit{initialization} of the Sinkhorn algorithm has received comparatively little attention, possibly due to two preconceptions: as the regularized OT problem is convex, it may not be worth crafting a tailored initialization as \textit{any} is guaranteed to work; secondly, because the Sinkhorn algorithm is often differentiated in end-to-end pipelines, data-dependent initializations could potentially bias gradient estimates obtained by unrolling iterations. We challenge this conventional wisdom and show that carefully chosen initializations can result in dramatic speed-ups, and will not bias gradients which are computed with implicit differentiation. We detail how initializations can be recovered from closed-form or approximate OT solutions, using known results in the 1D or Gaussian settings. We show empirically that these initializations can be used off-the-shelf, with little to no tuning, and result in consistent speed-ups for a variety of OT problems.
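    To make the role of initialization concrete, here is a minimal NumPy sketch of the Sinkhorn fixed-point iteration that accepts an initial dual potential. The warm start used in the usage note (reusing the potential from an earlier solve) is only a stand-in chosen for illustration; the closed-form 1D and Gaussian initializers described in the abstract are not reproduced here, and all names and default values are assumptions.

```python
import numpy as np


def sinkhorn(a, b, C, eps=0.1, f_init=None, tol=1e-9, max_iter=10_000):
    """Entropic OT between histograms a and b with cost matrix C.

    f_init is an optional initial dual potential for the first marginal
    (zeros by default). Returns (coupling, dual potential, iterations).
    """
    K = np.exp(-C / eps)  # Gibbs kernel
    f = np.zeros_like(a, dtype=float) if f_init is None else f_init
    u = np.exp(f / eps)
    for it in range(1, max_iter + 1):
        v = b / (K.T @ u)  # match the second marginal
        u = a / (K @ v)    # match the first marginal
        P = u[:, None] * K * v[None, :]
        # Rows are exact after the u-update; check the columns.
        if np.abs(P.sum(axis=0) - b).sum() < tol:
            break
    return P, eps * np.log(u), it
```

    Calling `sinkhorn` once with the default zero potential and then again with `f_init` set to the returned potential shows the effect: the second call starts at the fixed point and stops almost immediately, which is the behaviour a good data-dependent initializer tries to approximate without a prior solve.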
    Adversarial robust weighted Huber regression. (arXiv:2102.11120v3 [math.ST] UPDATED)
    We consider robust estimation of linear regression coefficients. In this note, we focus on the case where the covariates are sampled from an $L$-subGaussian distribution with unknown covariance, the noises are sampled from a distribution with a bounded absolute moment, and both covariates and noises may be contaminated by an adversary. We derive an estimation error bound that depends on the stable rank and the condition number of the covariance matrix of the covariates, for an estimator with polynomial computational complexity.
    Diffusion Models for Video Prediction and Infilling. (arXiv:2206.07696v1 [cs.CV])
    To predict and anticipate future outcomes or reason about missing information in a sequence is a key ability for agents to be able to make intelligent decisions. This requires strong temporally coherent generative capabilities. Diffusion models have shown huge success in several generative tasks lately, but have not been extensively explored in the video domain. We present Random-Mask Video Diffusion (RaMViD), which extends image diffusion models to videos using 3D convolutions, and introduces a new conditioning technique during training. By varying the mask we condition on, the model is able to perform video prediction, infilling and upsampling. Since we do not use concatenation to condition on a mask, as done in most conditionally trained diffusion models, we are able to decrease the memory footprint. We evaluated the model on two benchmark datasets for video prediction and one for video generation on which we achieved competitive results. On Kinetics-600 we achieved state-of-the-art for video prediction.
    On the fast convergence of minibatch heavy ball momentum. (arXiv:2206.07553v1 [cs.LG])
    Simple stochastic momentum methods are widely used in machine learning optimization, but their good practical performance is at odds with an absence of theoretical guarantees of acceleration in the literature. In this work, we aim to close the gap between theory and practice by showing that stochastic heavy ball momentum, which can be interpreted as a randomized Kaczmarz algorithm with momentum, retains the fast linear rate of (deterministic) heavy ball momentum on quadratic optimization problems, at least when minibatching with a sufficiently large batch size is used. The analysis relies on carefully decomposing the momentum transition matrix, and using new spectral norm concentration bounds for products of independent random matrices. We provide numerical experiments to demonstrate that our bounds are reasonably sharp.
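The update analyzed in this abstract can be sketched on a small consistent least-squares problem. The following is an illustrative sketch only: the function name, the uniform sampling without replacement, and the step and momentum values are assumptions, not the paper's setup or its recommended parameters.

```python
import random


def minibatch_heavy_ball(rows, rhs, batch_size, step, momentum, num_iters, rng):
    """Minimise (1 / 2n) * sum_i (rows[i] . x - rhs[i])**2 with heavy ball momentum.

    Each iteration averages the gradient over a random minibatch of rows, then adds
    the momentum term momentum * (x_k - x_{k-1}).
    """
    n, d = len(rows), len(rows[0])
    x = [0.0] * d
    x_prev = list(x)
    for _ in range(num_iters):
        batch = rng.sample(range(n), batch_size)  # sampled without replacement
        grad = [0.0] * d
        for i in batch:
            residual = sum(a * xi for a, xi in zip(rows[i], x)) - rhs[i]
            for k in range(d):
                grad[k] += residual * rows[i][k] / batch_size
        x_next = [x[k] - step * grad[k] + momentum * (x[k] - x_prev[k])
                  for k in range(d)]
        x_prev, x = x, x_next
    return x
```

With `batch_size` equal to the number of rows the method reduces to deterministic heavy ball; smaller batches give the stochastic variant whose rate the paper studies.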
    Blind Estimation of a Doubly Selective OFDM Channel: A Deep Learning Algorithm and Theory. (arXiv:2206.07483v1 [eess.SP])
    We provide a new-generation solution to the fundamental, long-standing problem of doubly selective fading channel estimation for orthogonal frequency division multiplexing (OFDM) systems. For systems based on OFDM, we propose a deep learning (DL)-based blind doubly selective channel estimator. Unlike the corresponding state-of-the-art estimators, this estimator requires no pilot symbols, even during the estimation of a deep fading doubly selective channel. We also provide the first-of-its-kind theory on the testing mean squared error (MSE) performance of our investigated blind OFDM channel estimator based on over-parameterized ReLU FNNs.
    Neural Network Kalman filtering for 3D object tracking from linear array ultrasound data. (arXiv:2111.09631v3 [stat.AP] UPDATED)
    Many interventional surgical procedures rely on medical imaging to visualise and track instruments. Such imaging methods not only need to be real-time capable, but also provide accurate and robust positional information. In ultrasound applications, typically only two-dimensional data from a linear array are available, and as such obtaining accurate positional estimation in three dimensions is non-trivial. In this work, we first train a neural network, using realistic synthetic training data, to estimate the out-of-plane offset of an object with the associated axial aberration in the reconstructed ultrasound image. The obtained estimate is then combined with a Kalman filtering approach that utilises positioning estimates obtained in previous time-frames to improve localisation robustness and reduce the impact of measurement noise. The accuracy of the proposed method is evaluated using simulations, and its practical applicability is demonstrated on experimental data obtained using a novel optical ultrasound imaging setup. Accurate and robust positional information is provided in real-time. Axial and lateral coordinates for out-of-plane objects are estimated with a mean error of 0.1mm for simulated data and a mean error of 0.2mm for experimental data. Three-dimensional localisation is most accurate for elevational distances larger than 1mm, with a maximum distance of 6mm considered for a 25mm aperture.
    Heterogeneous Distributed Lag Models to Estimate Personalized Effects of Maternal Exposures to Air Pollution. (arXiv:2109.13763v2 [stat.ME] UPDATED)
    Children's health studies support an association between maternal environmental exposures and children's birth outcomes. A common goal is to identify critical windows of susceptibility--periods during gestation with increased association between maternal exposures and a future outcome. The timing of the critical windows and magnitude of the associations are likely heterogeneous across different levels of individual, family, and neighborhood characteristics. Using an administrative Colorado birth cohort we estimate the individualized relationship between weekly exposures to fine particulate matter (PM2.5) during gestation and birth weight. To achieve this goal, we propose a statistical learning method combining distributed lag models and Bayesian additive regression trees to estimate critical windows at the individual level and identify characteristics that induce heterogeneity from a high-dimensional set of potential modifying factors. We find evidence of heterogeneity in the PM2.5-birth weight relationship, with some mother-child dyads showing a 3 times larger decrease in birth weight for an IQR increase in exposure (5.9 to 8.5 $\mu g/m^3$ PM2.5) compared to the population average. Specifically, we find increased susceptibility for non-Hispanic mothers who are either younger, have higher body mass index or lower educational attainment. Our case study is the first precision health study of critical windows.
    Nystr\"om Kernel Mean Embeddings. (arXiv:2201.13055v2 [stat.ML] UPDATED)
    Kernel mean embeddings are a powerful tool to represent probability distributions over arbitrary spaces as single points in a Hilbert space. Yet, the cost of computing and storing such embeddings prohibits their direct use in large-scale settings. We propose an efficient approximation procedure based on the Nystr\"om method, which exploits a small random subset of the dataset. Our main result is an upper bound on the approximation error of this procedure. It yields sufficient conditions on the subsample size to obtain the standard $n^{-1/2}$ rate while reducing computational costs. We discuss applications of this result for the approximation of the maximum mean discrepancy and quadrature rules, and illustrate our theoretical findings with numerical experiments.
    Probabilistic Spatial Transformer Networks. (arXiv:2004.03637v2 [cs.LG] UPDATED)
    Spatial Transformer Networks (STNs) estimate image transformations that can improve downstream tasks by `zooming in' on relevant regions in an image. However, STNs are hard to train and sensitive to mis-predictions of transformations. To circumvent these limitations, we propose a probabilistic extension that estimates a stochastic transformation rather than a deterministic one. Marginalizing transformations allows us to consider each image at multiple poses, which makes the localization task easier and the training more robust. As an additional benefit, the stochastic transformations act as a localized, learned data augmentation that improves the downstream tasks. We show across standard imaging benchmarks and on a challenging real-world dataset that these two properties lead to improved classification performance, robustness and model calibration. We further demonstrate that the approach generalizes to non-visual domains by improving model performance on time-series data.
    Clustering acoustic emission data streams with sequentially appearing clusters using mixture models. (arXiv:2108.11211v3 [stat.ML] UPDATED)
    The interpretation of unlabeled acoustic emission (AE) data classically relies on general-purpose clustering methods. While several external criteria have been used in the past to select the hyperparameters of those algorithms, few studies have paid attention to the development of dedicated objective functions in clustering methods able to cope with the specificities of AE data. We investigate how to explicitly represent clusters onsets in mixture models in general, and in Gaussian Mixture Models (GMM) in particular. By modifying the internal criterion of such models, we propose the first clustering method able to provide, through parameters estimated by an expectation-maximization procedure, information about when clusters occur (onsets), how they grow (kinetics) and their level of activation through time. This new objective function accommodates continuous timestamps of AE signals and, thus, their order of occurrence. The method, called GMMSEQ, is experimentally validated to characterize the loosening phenomenon in bolted structure under vibrations. A comparison with three standard clustering methods on raw streaming data from five experimental campaigns shows that GMMSEQ not only provides useful qualitative information about the timeline of clusters, but also shows better performance in terms of cluster characterization. In view of developing an open acoustic emission initiative and according to the FAIR principles, the datasets and the codes are made available to reproduce the research of this paper.
    BRIDGE: Byzantine-resilient Decentralized Gradient Descent. (arXiv:1908.08098v3 [stat.ML] UPDATED)
    Machine learning has begun to play a central role in many applications. A multitude of these applications typically also involve datasets that are distributed across multiple computing devices/machines due to either design constraints (e.g., multiagent systems) or computational/privacy reasons (e.g., learning on smartphone data). Such applications often require the learning tasks to be carried out in a decentralized fashion, in which there is no central server that is directly connected to all nodes. In real-world decentralized settings, nodes are prone to undetected failures due to malfunctioning equipment, cyberattacks, etc., which are likely to crash non-robust learning algorithms. The focus of this paper is on robustification of decentralized learning in the presence of nodes that have undergone Byzantine failures. The Byzantine failure model allows faulty nodes to arbitrarily deviate from their intended behaviors, thereby ensuring designs of the most robust of algorithms. But the study of Byzantine resilience within decentralized learning, in contrast to distributed learning, is still in its infancy. In particular, existing Byzantine-resilient decentralized learning methods either do not scale well to large-scale machine learning models, or they lack statistical convergence guarantees that help characterize their generalization errors. In this paper, a scalable, Byzantine-resilient decentralized machine learning framework termed Byzantine-resilient decentralized gradient descent (BRIDGE) is introduced. Algorithmic and statistical convergence guarantees for one variant of BRIDGE are also provided in the paper for both strongly convex problems and a class of nonconvex problems. In addition, large-scale decentralized learning experiments are used to establish that the BRIDGE framework is scalable and it delivers competitive results for Byzantine-resilient convex and nonconvex learning.
    Model-based RL with Optimistic Posterior Sampling: Structural Conditions and Sample Complexity. (arXiv:2206.07659v1 [cs.LG])
    We propose a general framework to design posterior sampling methods for model-based RL. We show that the proposed algorithms can be analyzed by reducing regret to Hellinger distance based conditional probability estimation. We further show that optimistic posterior sampling can control this Hellinger distance, when we measure model error via data likelihood. This technique allows us to design and analyze unified posterior sampling algorithms with state-of-the-art sample complexity guarantees for many model-based RL settings. We illustrate our general result in many special cases, demonstrating the versatility of our framework.
    Nonstationary Temporal Matrix Factorization for Multivariate Time Series Forecasting. (arXiv:2203.10651v2 [cs.LG] UPDATED)
    Modern time series datasets are often high-dimensional, incomplete/sparse, and nonstationary. These properties hinder the development of scalable and efficient solutions for time series forecasting and analysis. To address these challenges, we propose a Nonstationary Temporal Matrix Factorization (NoTMF) model, in which matrix factorization is used to reconstruct the whole time series matrix and vector autoregressive (VAR) process is imposed on a properly differenced copy of the temporal factor matrix. This approach not only preserves the low-rank property of the data but also offers consistent temporal dynamics. The learning process of NoTMF involves the optimization of two factor matrices and a collection of VAR coefficient matrices. To efficiently solve the optimization problem, we derive an alternating minimization framework, in which subproblems are solved using conjugate gradient and least squares methods. In particular, the use of conjugate gradient method offers an efficient routine and allows us to apply NoTMF on large-scale problems. Through extensive experiments on Uber movement speed dataset, we demonstrate the superior accuracy and effectiveness of NoTMF over other baseline models. Our results also confirm the importance of addressing the nonstationarity of real-world time series data such as spatiotemporal traffic flow/speed.
    Offline Reinforcement Learning Under Value and Density-Ratio Realizability: The Power of Gaps. (arXiv:2203.13935v3 [cs.LG] UPDATED)
    We consider a challenging theoretical problem in offline reinforcement learning (RL): obtaining sample-efficiency guarantees with a dataset lacking sufficient coverage, under only realizability-type assumptions for the function approximators. While the existing theory has addressed learning under realizability and under non-exploratory data separately, no work has been able to address both simultaneously (except for a concurrent work which we compare in detail). Under an additional gap assumption, we provide guarantees to a simple pessimistic algorithm based on a version space formed by marginalized importance sampling (MIS), and the guarantee only requires the data to cover the optimal policy and the function classes to realize the optimal value and density-ratio functions. While similar gap assumptions have been used in other areas of RL theory, our work is the first to identify the utility and the novel mechanism of gap assumptions in offline RL with weak function approximation.
    Online Variational Filtering and Parameter Learning. (arXiv:2110.13549v2 [stat.ML] UPDATED)
    We present a variational method for online state estimation and parameter learning in state-space models (SSMs), a ubiquitous class of latent variable models for sequential data. As per standard batch variational techniques, we use stochastic gradients to simultaneously optimize a lower bound on the log evidence with respect to both model parameters and a variational approximation of the states' posterior distribution. However, unlike existing approaches, our method is able to operate in an entirely online manner, such that historic observations do not require revisitation after being incorporated and the cost of updates at each time step remains constant, despite the growing dimensionality of the joint posterior distribution of the states. This is achieved by utilizing backward decompositions of this joint posterior distribution and of its variational approximation, combined with Bellman-type recursions for the evidence lower bound and its gradients. We demonstrate the performance of this methodology across several examples, including high-dimensional SSMs and sequential Variational Auto-Encoders.  ( 2 min )
    Finite-Sample Guarantees for High-Dimensional DML. (arXiv:2206.07386v1 [econ.EM])
    Debiased machine learning (DML) offers an attractive way to estimate treatment effects in observational settings, where identification of causal parameters requires a conditional independence or unconfoundedness assumption, since it allows to control flexibly for a potentially very large number of covariates. This paper gives novel finite-sample guarantees for joint inference on high-dimensional DML, bounding how far the finite-sample distribution of the estimator is from its asymptotic Gaussian approximation. These guarantees are useful to applied researchers, as they are informative about how far off the coverage of joint confidence bands can be from the nominal level. There are many settings where high-dimensional causal parameters may be of interest, such as the ATE of many treatment profiles, or the ATE of a treatment on many outcomes. We also cover infinite-dimensional parameters, such as impacts on the entire marginal distribution of potential outcomes. The finite-sample guarantees in this paper complement the existing results on consistency and asymptotic normality of DML estimators, which are either asymptotic or treat only the one-dimensional case.  ( 2 min )
    A Random Matrix Perspective on Random Tensors. (arXiv:2108.00774v2 [stat.ML] UPDATED)
    Tensor models play an increasingly prominent role in many fields, notably in machine learning. In several applications, such as community detection, topic modeling and Gaussian mixture learning, one must estimate a low-rank signal from a noisy tensor. Hence, understanding the fundamental limits of estimators of that signal inevitably calls for the study of random tensors. Substantial progress has been recently achieved on this subject in the large-dimensional limit. Yet, some of the most significant among these results--in particular, a precise characterization of the abrupt phase transition (with respect to signal-to-noise ratio) that governs the performance of the maximum likelihood (ML) estimator of a symmetric rank-one model with Gaussian noise--were derived based on mean-field spin glass theory, which is not easily accessible to non-experts. In this work, we develop a sharply distinct and more elementary approach, relying on standard but powerful tools brought by years of advances in random matrix theory. The key idea is to study the spectra of random matrices arising from contractions of a given random tensor. We show how this gives access to spectral properties of the random tensor itself. For the aforementioned rank-one model, our technique yields a hitherto unknown fixed-point equation whose solution precisely matches the asymptotic performance of the ML estimator above the phase transition threshold in the third-order case. A numerical verification provides evidence that the same holds for orders 4 and 5, leading us to conjecture that, for any order, our fixed-point equation is equivalent to the known characterization of the ML estimation performance that had been obtained by relying on spin glasses. Moreover, our approach sheds light on certain properties of the ML problem landscape in large dimensions and can be extended to other models, such as asymmetric and non-Gaussian.  ( 3 min )
    The Dual PC Algorithm for Structure Learning. (arXiv:2112.09036v3 [stat.ML] UPDATED)
    Learning the graphical structure of Bayesian networks is key to describing data generating mechanisms in many complex applications but poses considerable computational challenges. Observational data can only identify the equivalence class of the directed acyclic graph underlying a Bayesian network model, and a variety of methods exist to tackle the problem. Under certain assumptions, the popular PC algorithm can consistently recover the correct equivalence class by reverse-engineering the conditional independence (CI) relationships holding in the variable distribution. Here, we propose the dual PC algorithm, a novel scheme to carry out the CI tests within the PC algorithm by leveraging the inverse relationship between covariance and precision matrices. By exploiting block matrix inversions we can simultaneously perform tests on partial correlations of complementary (or dual) conditioning sets. The multiple CI tests of the dual PC algorithm proceed by first considering marginal and full-order CI relationships and progressively moving to central-order ones. Simulation studies show that the dual PC algorithm outperforms the classic PC algorithm both in terms of run time and in recovering the underlying network structure, even in the presence of deviations from Gaussianity.  ( 2 min )
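The covariance/precision duality that the abstract relies on can be illustrated with a short sketch: marginal (zero-order) correlations are read from the covariance matrix, while full-order partial correlations are read from its inverse. This is not the dual PC algorithm itself; the helper names and the small Gauss-Jordan inverter are assumptions made to keep the example self-contained.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def marginal_correlation(cov, i, j):
    """Correlation of variables i and j with an empty conditioning set."""
    return cov[i][j] / math.sqrt(cov[i][i] * cov[j][j])


def full_order_partial_correlation(cov, i, j):
    """Correlation of i and j given all remaining variables, via the precision matrix."""
    prec = invert(cov)
    return -prec[i][j] / math.sqrt(prec[i][i] * prec[j][j])
```

For a three-variable chain, the two end variables are marginally correlated but have zero partial correlation given the middle one, which is the kind of CI relationship the tests are meant to detect.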
    Bayesian Federated Learning via Predictive Distribution Distillation. (arXiv:2206.07562v1 [cs.LG])
    For most existing federated learning algorithms, each round consists of minimizing a loss function at each client to learn an optimal model at the client, followed by aggregating these client models at the server. Point estimation of the model parameters at the clients does not take into account the uncertainty in the models estimated at each client. In many situations, however, especially in limited data settings, it is beneficial to take into account the uncertainty in the client models for more accurate and robust predictions. Uncertainty also provides useful information for other important tasks, such as active learning and out-of-distribution (OOD) detection. We present a framework for Bayesian federated learning where each client infers the posterior predictive distribution using its training data and present various ways to aggregate these client-specific predictive distributions at the server. Since communicating and aggregating predictive distributions can be challenging and expensive, our approach is based on distilling each client's predictive distribution into a single deep neural network. This enables us to leverage advances in standard federated learning to Bayesian federated learning as well. Unlike some recent works that have tried to estimate model uncertainty of each client, our work also does not make any restrictive assumptions, such as the form of the client's posterior distribution. We evaluate our approach on classification in federated setting, as well as active learning and OOD detection in federated settings, on which our approach outperforms various existing federated learning baselines.  ( 2 min )
    Sparse Subspace Clustering in Diverse Multiplex Network Model. (arXiv:2206.07602v1 [stat.ML])
    The paper considers the DIverse MultiPLEx (DIMPLE) network model, introduced in Pensky and Wang (2021), where all layers of the network have the same collection of nodes and are equipped with the Stochastic Block Models. In addition, all layers can be partitioned into groups with the same community structures, although the layers in the same group may have different matrices of block connection probabilities. The DIMPLE model generalizes a multitude of papers that study multilayer networks with the same community structures in all layers, as well as the Mixture Multilayer Stochastic Block Model (MMLSBM), where the layers in the same group have identical matrices of block connection probabilities. While Pensky and Wang (2021) applied spectral clustering to the proxy of the adjacency tensor, the present paper uses Sparse Subspace Clustering (SSC) for identifying groups of layers with identical community structures. Under mild conditions, the latter leads to strongly consistent between-layer clustering. In addition, SSC can handle much larger networks than the methodology of Pensky and Wang (2021), and is well suited to parallel computing.  ( 2 min )
    Adaptation to the Range in $K$-Armed Bandits. (arXiv:2006.03378v3 [math.ST] UPDATED)
    We consider stochastic bandit problems with $K$ arms, each associated with a bounded distribution supported on the range $[m,M]$. We do not assume that the range $[m,M]$ is known and show that there is a cost for learning this range. Indeed, a new trade-off between distribution-dependent and distribution-free regret bounds arises, which prevents simultaneously achieving the typical $\ln T$ and $\sqrt{T}$ bounds. For instance, a $\sqrt{T}$ distribution-free regret bound may only be achieved if the distribution-dependent regret bounds are at least of order $\sqrt{T}$. We exhibit a strategy achieving the rates for regret indicated by the new trade-off.  ( 2 min )
    GNNRank: Learning Global Rankings from Pairwise Comparisons via Directed Graph Neural Networks. (arXiv:2202.00211v2 [cs.LG] UPDATED)
    Recovering global rankings from pairwise comparisons has wide applications from time synchronization to sports team ranking. Pairwise comparisons corresponding to matches in a competition can be construed as edges in a directed graph (digraph), whose nodes represent e.g. competitors with an unknown rank. In this paper, we introduce neural networks into the ranking recovery problem by proposing the so-called GNNRank, a trainable GNN-based framework with digraph embedding. Moreover, new objectives are devised to encode ranking upsets/violations. The framework involves a ranking score estimation approach, and adds an inductive bias by unfolding the Fiedler vector computation of the graph constructed from a learnable similarity matrix. Experimental results on extensive data sets show that our methods attain competitive and often superior performance against baselines, as well as showing promising transfer ability. Codes and preprocessed data are at: \url{https://github.com/SherylHYX/GNNRank}.  ( 2 min )
    The Mean-Squared Error of Double Q-Learning. (arXiv:2007.05034v3 [cs.LG] UPDATED)
    In this paper, we establish a theoretical comparison between the asymptotic mean-squared error of Double Q-learning and Q-learning. Our result builds upon an analysis for linear stochastic approximation based on Lyapunov equations and applies to both tabular setting and with linear function approximation, provided that the optimal policy is unique and the algorithms converge. We show that the asymptotic mean-squared error of Double Q-learning is exactly equal to that of Q-learning if Double Q-learning uses twice the learning rate of Q-learning and outputs the average of its two estimators. We also present some practical implications of this theoretical observation using simulations.  ( 2 min )
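As a reminder of the algorithm being compared, here is a minimal tabular sketch of one Double Q-learning update and of the averaged output mentioned in the abstract. The function names, the list-of-lists table layout, and the coin-flip choice of which table to update are assumptions for illustration, not details taken from the paper.

```python
import random


def double_q_update(q_a, q_b, state, action, reward, next_state, lr, gamma, rng):
    """Apply one tabular Double Q-learning update in place.

    One table, chosen at random, selects the greedy next action; the other table
    evaluates that action to form the bootstrap target.
    """
    if rng.random() < 0.5:
        updated, evaluator = q_a, q_b
    else:
        updated, evaluator = q_b, q_a
    next_values = updated[next_state]
    greedy_action = max(range(len(next_values)), key=lambda a: next_values[a])
    target = reward + gamma * evaluator[next_state][greedy_action]
    updated[state][action] += lr * (target - updated[state][action])


def averaged_estimate(q_a, q_b):
    """Return the entrywise average of the two estimators."""
    return [[0.5 * (u + v) for u, v in zip(row_a, row_b)]
            for row_a, row_b in zip(q_a, q_b)]
```

In the setting of the abstract, `lr` for Double Q-learning would be twice the learning rate used by plain Q-learning, and `averaged_estimate` would be the reported output.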
    Online Contextual Decision-Making with a Smart Predict-then-Optimize Method. (arXiv:2206.07316v1 [cs.LG])
    We study an online contextual decision-making problem with resource constraints. At each time period, the decision-maker first predicts a reward vector and resource consumption matrix based on a given context vector and then solves a downstream optimization problem to make a decision. The final goal of the decision-maker is to maximize the summation of the reward and the utility from resource consumption, while satisfying the resource constraints. We propose an algorithm that mixes a prediction step based on the "Smart Predict-then-Optimize (SPO)" method with a dual update step based on mirror descent. We prove regret bounds and demonstrate that the overall convergence rate of our method depends on the $\mathcal{O}(T^{-1/2})$ convergence of online mirror descent as well as risk bounds of the surrogate loss function used to learn the prediction model. Our algorithm and regret bounds apply to a general convex feasible region for the resource constraints, including both hard and soft resource constraint cases, and they apply to a wide class of prediction models in contrast to the traditional settings of linear contextual models or finite policy spaces. We also conduct numerical experiments to empirically demonstrate the strength of our proposed SPO-type methods, as compared to traditional prediction-error-only methods, on multi-dimensional knapsack and longest path instances.  ( 2 min )
    Epistemic Deep Learning. (arXiv:2206.07609v1 [cs.LG])
    The belief function approach to uncertainty quantification as proposed in the Dempster-Shafer theory of evidence is established upon the general mathematical models for set-valued observations, called random sets. Set-valued predictions are the most natural representations of uncertainty in machine learning. In this paper, we introduce a concept called epistemic deep learning based on the random-set interpretation of belief functions to model epistemic learning in deep neural networks. We propose a novel random-set convolutional neural network for classification that produces scores for sets of classes by learning set-valued ground truth representations. We evaluate different formulations of entropy and distance measures for belief functions as viable loss functions for these random-set networks. We also discuss methods for evaluating the quality of epistemic predictions and the performance of epistemic random-set neural networks. We demonstrate through experiments that the epistemic approach produces better performance results when compared to traditional approaches of estimating uncertainty.  ( 2 min )
    Deep Network Approximation in Terms of Intrinsic Parameters. (arXiv:2111.07964v2 [cs.LG] UPDATED)
    One of the arguments to explain the success of deep learning is the powerful approximation capacity of deep neural networks. Such capacity is generally accompanied by the explosive growth of the number of parameters, which, in turn, leads to high computational costs. It is of great interest to ask whether we can achieve successful deep learning with a small number of learnable parameters adapting to the target function. From an approximation perspective, this paper shows that the number of parameters that need to be learned can be significantly smaller than people typically expect. First, we theoretically design ReLU networks with a few learnable parameters to achieve an attractive approximation. We prove by construction that, for any Lipschitz continuous function $f$ on $[0,1]^d$ with a Lipschitz constant $\lambda>0$, a ReLU network with $n+2$ intrinsic parameters (those depending on $f$) can approximate $f$ with an exponentially small error $5\lambda \sqrt{d}\,2^{-n}$. Such a result is generalized to generic continuous functions. Furthermore, we show that the idea of learning a small number of parameters to achieve a good approximation can be numerically observed. We conduct several experiments to verify that training a small part of parameters can also achieve good results for classification problems if other parameters are pre-specified or pre-trained from a related problem.  ( 2 min )
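The stated error bound $5\lambda \sqrt{d}\,2^{-n}$ is easy to evaluate numerically. The sketch below only does that arithmetic; the function names are assumptions, and it does not construct the ReLU networks described in the paper.

```python
import math


def approximation_error_bound(lipschitz, dim, n):
    """Evaluate 5 * lambda * sqrt(d) * 2**(-n) from the abstract."""
    return 5.0 * lipschitz * math.sqrt(dim) * 2.0 ** (-n)


def intrinsic_parameters_needed(lipschitz, dim, target_error):
    """Smallest n + 2 such that the bound is at most target_error."""
    if target_error <= 0:
        raise ValueError("target_error must be positive")
    n = 0
    while approximation_error_bound(lipschitz, dim, n) > target_error:
        n += 1
    return n + 2  # the abstract counts n + 2 intrinsic parameters
```

Because the bound halves with every increment of `n`, the number of intrinsic parameters grows only logarithmically in the inverse of the target error.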
    Adaptive Threshold Sampling. (arXiv:1708.04970v2 [stat.ML] UPDATED)
    Sampling is a fundamental problem in computer science and statistics. However, for a given task and stream, it is often not possible to choose good sampling probabilities in advance. We derive a general framework for adaptively changing the sampling probabilities via a collection of thresholds. In general, adaptive sampling procedures introduce dependence amongst the sampled points, making it difficult to compute expectations and ensure estimators are unbiased or consistent. Our framework addresses this issue and further shows when adaptive thresholds can be treated as if they were fixed thresholds which sample items independently. This makes our adaptive sampling schemes simple to apply as there is no need to create custom estimators for the sampling method. Using our framework, we derive new samplers that can address a broad range of new and existing problems including sampling with memory rather than sample size budgets, stratified samples, multiple objectives, distinct counting, and sliding windows. In particular, we design a sampling procedure for the top-K problem where, unlike in the heavy-hitter problem, the sketch size and sampling probabilities are adaptively chosen.  ( 2 min )
    Local Identifiability of Deep ReLU Neural Networks: the Theory. (arXiv:2206.07424v1 [math.ST])
    Is a sample rich enough to determine, at least locally, the parameters of a neural network? To answer this question, we introduce a new local parameterization of a given deep ReLU neural network by fixing the values of some of its weights. This allows us to define local lifting operators whose inverses are charts of a smooth manifold of a high dimensional space. The function implemented by the deep ReLU neural network composes the local lifting with a linear operator which depends on the sample. We derive from this convenient representation a geometrical necessary and sufficient condition of local identifiability. Looking at tangent spaces, the geometrical condition provides: 1/ a sharp and testable necessary condition of identifiability and 2/ a sharp and testable sufficient condition of local identifiability. The validity of the conditions can be tested numerically using backpropagation and matrix rank computations.  ( 2 min )
    Statistical and Computational Phase Transitions in Group Testing. (arXiv:2206.07640v1 [stat.ML])
    We study the group testing problem where the goal is to identify a set of k infected individuals carrying a rare disease within a population of size n, based on the outcomes of pooled tests which return positive whenever there is at least one infected individual in the tested group. We consider two different simple random procedures for assigning individuals to tests: the constant-column design and Bernoulli design. Our first set of results concerns the fundamental statistical limits. For the constant-column design, we give a new information-theoretic lower bound which implies that the proportion of correctly identifiable infected individuals undergoes a sharp "all-or-nothing" phase transition when the number of tests crosses a particular threshold. For the Bernoulli design, we determine the precise number of tests required to solve the associated detection problem (where the goal is to distinguish between a group testing instance and pure noise), improving both the upper and lower bounds of Truong, Aldridge, and Scarlett (2020). For both group testing models, we also study the power of computationally efficient (polynomial-time) inference procedures. We determine the precise number of tests required for the class of low-degree polynomial algorithms to solve the detection problem. This provides evidence for an inherent computational-statistical gap in both the detection and recovery problems at small sparsity levels. Notably, our evidence is contrary to that of Iliopoulos and Zadik (2021), who predicted the absence of a computational-statistical gap in the Bernoulli design.  ( 2 min )
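The testing model described above can be simulated in a few lines. The sketch below draws a Bernoulli design and decodes it with COMP, a standard simple decoder that clears everyone who appears in a negative test. COMP is used here only for illustration and is not the low-degree analysis the paper studies. The population size, test count, and inclusion probability are arbitrary assumptions.

```python
import random


def bernoulli_design(n, num_tests, p, rng):
    """Each individual joins each test independently with probability p."""
    return [[i for i in range(n) if rng.random() < p] for _ in range(num_tests)]


def run_tests(pools, infected):
    """A pooled test is positive iff it contains at least one infected individual."""
    return [any(i in infected for i in pool) for pool in pools]


def comp_decode(n, pools, outcomes):
    """COMP: everyone who appears in a negative test is cleared; the rest are flagged."""
    cleared = set()
    for pool, positive in zip(pools, outcomes):
        if not positive:
            cleared.update(pool)
    return set(range(n)) - cleared


if __name__ == "__main__":
    rng = random.Random(0)
    n, k = 200, 5
    infected = set(rng.sample(range(n), k))
    pools = bernoulli_design(n, num_tests=60, p=0.1, rng=rng)
    estimate = comp_decode(n, pools, run_tests(pools, infected))
    print("infected:", sorted(infected))
    print("flagged: ", sorted(estimate))
```

COMP never misses an infected individual, since an infected individual cannot appear in a negative test. Its errors are all false positives, and they shrink as the number of tests grows.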
    Characteristic kernels on Hilbert spaces, Banach spaces, and on sets of measures. (arXiv:2206.07588v1 [stat.ML])
    We present new classes of positive definite kernels on non-standard spaces that are integrally strictly positive definite or characteristic. In particular, we discuss radial kernels on separable Hilbert spaces, and introduce broad classes of kernels on Banach spaces and on metric spaces of strong negative type. The general results are used to give explicit classes of kernels on separable $L^p$ spaces and on sets of measures.  ( 2 min )
    Multi-Objective Hyperparameter Optimization -- An Overview. (arXiv:2206.07438v1 [cs.LG])
    Hyperparameter optimization constitutes a large part of typical modern machine learning workflows. This arises from the fact that machine learning methods and corresponding preprocessing steps often only yield optimal performance when hyperparameters are properly tuned. But in many applications, we are not interested in optimizing ML pipelines solely for predictive accuracy; additional metrics or constraints must be considered when determining an optimal configuration, resulting in a multi-objective optimization problem. This is often neglected in practice, due to a lack of knowledge and readily available software implementations for multi-objective hyperparameter optimization. In this work, we introduce the reader to the basics of multi-objective hyperparameter optimization and motivate its usefulness in applied ML. Furthermore, we provide an extensive survey of existing optimization strategies, both from the domain of evolutionary algorithms and Bayesian optimization. We illustrate the utility of MOO in several specific ML applications, considering objectives such as operating conditions, prediction time, sparseness, fairness, interpretability and robustness.  ( 2 min )
    Diffusion Transport Alignment. (arXiv:2206.07305v1 [stat.ML])
    The integration of multimodal data presents a challenge in cases when the study of a given phenomenon by different instruments or conditions generates distinct but related domains. Many existing data integration methods assume a known one-to-one correspondence between domains of the entire dataset, which may be unrealistic. Furthermore, existing manifold alignment methods are not suited for cases where the data contains domain-specific regions, i.e., there is not a counterpart for a certain portion of the data in the other domain. We propose Diffusion Transport Alignment (DTA), a semi-supervised manifold alignment method that exploits prior correspondence knowledge between only a few points to align the domains. By building a diffusion process, DTA finds a transportation plan between data measured from two heterogeneous domains with different feature spaces, which, by assumption, share a similar geometrical structure coming from the same underlying data generating process. DTA can also compute a partial alignment in a data-driven fashion, resulting in accurate alignments when some data are measured in only one domain. We empirically demonstrate that DTA outperforms other methods in aligning multimodal data in this semi-supervised setting. We also empirically show that the alignment obtained by DTA can improve the performance of machine learning tasks, such as domain adaptation, inter-domain feature mapping, and exploratory data analysis, while outperforming competing methods.  ( 2 min )
    Noise Covariance Estimation in Multi-Task High-dimensional Linear Models. (arXiv:2206.07256v1 [math.ST])
    This paper studies the multi-task high-dimensional linear regression models where the noise among different tasks is correlated, in the moderately high dimensional regime where sample size $n$ and dimension $p$ are of the same order. Our goal is to estimate the covariance matrix of the noise random vectors, or equivalently the correlation of the noise variables on any pair of two tasks. Treating the regression coefficients as a nuisance parameter, we leverage the multi-task elastic-net and multi-task lasso estimators to estimate the nuisance. By precisely understanding the bias of the squared residual matrix and by correcting this bias, we develop a novel estimator of the noise covariance that converges in Frobenius norm at the rate $n^{-1/2}$ when the covariates are Gaussian. This novel estimator is efficiently computable. Under suitable conditions, the proposed estimator of the noise covariance attains the same rate of convergence as the "oracle" estimator that knows in advance the regression coefficients of the multi-task model. The Frobenius error bounds obtained in this paper also illustrate the advantage of this new estimator compared to a method-of-moments estimator that does not attempt to estimate the nuisance. As a byproduct of our techniques, we obtain an estimate of the generalization error of the multi-task elastic-net and multi-task lasso estimators. Extensive simulation studies are carried out to illustrate the numerical performance of the proposed method.  ( 2 min )
    Query-Adaptive Predictive Inference with Partial Labels. (arXiv:2206.07236v1 [stat.ML])
    The cost and scarcity of fully supervised labels in statistical machine learning encourage using partially labeled data for model validation as a cheaper and more accessible alternative. Effectively collecting and leveraging weakly supervised data for large-space structured prediction tasks thus becomes an important part of an end-to-end learning system. We propose a new computationally-friendly methodology to construct predictive sets using only partially labeled data on top of black-box predictive models. To do so, we introduce "probe" functions as a way to describe weakly supervised instances and define a false discovery proportion-type loss, both of which seamlessly adapt to partial supervision and structured prediction -- ranking, matching, segmentation, multilabel or multiclass classification. Our experiments highlight the validity of our predictive set construction as well as the attractiveness of a more flexible user-dependent loss framework.  ( 2 min )
    Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions. (arXiv:2206.07252v1 [stat.ML])
    Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems. While the empirical success of SGD is often attributed to its computational efficiency and favorable generalization behavior, neither effect is well understood and disentangling them remains an open problem. Even in the simple setting of convex quadratic problems, worst-case analyses give an asymptotic convergence rate for SGD that is no better than full-batch gradient descent (GD), and the purported implicit regularization effects of SGD lack a precise explanation. In this work, we study the dynamics of multi-pass SGD on high-dimensional convex quadratics and establish an asymptotic equivalence to a stochastic differential equation, which we call homogenized stochastic gradient descent (HSGD), whose solutions we characterize explicitly in terms of a Volterra integral equation. These results yield precise formulas for the learning and risk trajectories, which reveal a mechanism of implicit conditioning that explains the efficiency of SGD relative to GD. We also prove that the noise from SGD negatively impacts generalization performance, ruling out the possibility of any type of implicit regularization in this context. Finally, we show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD (bootstrap risk).  ( 2 min )
    CARD: Classification and Regression Diffusion Models. (arXiv:2206.07275v1 [stat.ML])
    Learning the distribution of a continuous or categorical response variable $\boldsymbol y$ given its covariates $\boldsymbol x$ is a fundamental problem in statistics and machine learning. Deep neural network-based supervised learning algorithms have made great progress in predicting the mean of $\boldsymbol y$ given $\boldsymbol x$, but they are often criticized for their inability to accurately capture the uncertainty of their predictions. In this paper, we introduce classification and regression diffusion (CARD) models, which combine a denoising diffusion-based conditional generative model and a pre-trained conditional mean estimator, to accurately predict the distribution of $\boldsymbol y$ given $\boldsymbol x$. We demonstrate the outstanding ability of CARD in conditional distribution prediction with both toy examples and real-world datasets. The experimental results show that CARD in general outperforms state-of-the-art methods, including Bayesian neural network-based ones that are designed for uncertainty estimation, especially when the conditional distribution of $\boldsymbol y$ given $\boldsymbol x$ is multi-modal.  ( 2 min )
    Unbiased Estimation using the Underdamped Langevin Dynamics. (arXiv:2206.07202v1 [stat.CO])
    In this work we consider the unbiased estimation of expectations w.r.t. probability measures that have non-negative Lebesgue density, and which are known point-wise up to a normalizing constant. We focus upon developing an unbiased method via the underdamped Langevin dynamics, which has proven to be popular of late due to applications in statistics and machine learning. Specifically in continuous-time, the dynamics can be constructed to admit the probability of interest as a stationary measure. We develop a novel scheme based upon doubly randomized estimation, which requires access only to time-discretized versions of the dynamics, the versions that are used in practical algorithms. We prove, under standard assumptions, that our estimator is of finite variance and either has finite expected cost, or has finite cost with a high probability. To illustrate our theoretical findings we provide numerical experiments that verify our theory, which include challenging examples from Bayesian statistics and statistical physics.  ( 2 min )
    Learning the Structure of Large Networked Systems Obeying Conservation Laws. (arXiv:2206.07083v1 [stat.ML])
    Many networked systems such as electric networks, the brain, and social networks of opinion dynamics are known to obey conservation laws. Examples of this phenomenon include the Kirchhoff laws in electric networks and opinion consensus in social networks. Conservation laws in networked systems may be modeled as balance equations of the form $X = B^{*} Y$, where the sparsity pattern of $B^{*}$ captures the connectivity of the network, and $Y, X \in \mathbb{R}^p$ are vectors of "potentials" and "injected flows" at the nodes respectively. The node potentials $Y$ cause flows across edges and the flows $X$ injected at the nodes are extraneous to the network dynamics. In several practical systems, the network structure is often unknown and needs to be estimated from data. Towards this, one has access to samples of the node potentials $Y$, but only the statistics of the node injections $X$. Motivated by this important problem, we study the estimation of the sparsity structure of the matrix $B^{*}$ from $n$ samples of $Y$ under the assumption that the node injections $X$ follow a Gaussian distribution with a known covariance $\Sigma_X$. We propose a new $\ell_{1}$-regularized maximum likelihood estimator for this problem in the high-dimensional regime where the size of the network $p$ is larger than sample size $n$. We show that this optimization problem is convex in the objective and admits a unique solution. Under a new mutual incoherence condition, we establish sufficient conditions on the triple $(n,p,d)$ for which exact sparsity recovery of $B^{*}$ is possible with high probability; $d$ is the degree of the graph. We also establish guarantees for the recovery of $B^{*}$ in the element-wise maximum, Frobenius, and operator norms. Finally, we complement these theoretical results with experimental validation of the performance of the proposed estimator on synthetic and real-world data.  ( 3 min )
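The likelihood behind such an estimator can be written out under a few extra assumptions that the abstract does not state: $B$ is invertible, $X$ has zero mean, and $\lambda_n$ is a hypothetical regularization weight. The paper's exact objective (constraint set, which entries are penalized) may differ from this sketch.

```latex
% Assumptions (not from the abstract): B invertible, X ~ N(0, \Sigma_X),
% S = (1/n) \sum_{i=1}^n Y_i Y_i^\top is the sample second-moment matrix of the potentials.
\begin{align*}
Y = B^{-1} X
  &\;\Longrightarrow\; Y \sim \mathcal{N}\!\left(0,\; B^{-1} \Sigma_X B^{-\top}\right), \\
-\frac{1}{n}\sum_{i=1}^{n} \log p(Y_i; B)
  &= \frac{1}{2}\operatorname{tr}\!\left(S\, B^{\top} \Sigma_X^{-1} B\right)
     - \log\lvert\det B\rvert + \text{const}, \\
\widehat{B}
  &\in \arg\min_{B}\; \frac{1}{2}\operatorname{tr}\!\left(S\, B^{\top} \Sigma_X^{-1} B\right)
     - \log\lvert\det B\rvert + \lambda_n \lVert B \rVert_{1}.
\end{align*}
```

The second line follows from the change of variables $p_Y(y) = p_X(By)\,\lvert\det B\rvert$. The $\ell_1$ penalty in the third line is what encourages a sparse estimate of the connectivity pattern.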
    Loss Functions for Classification using Structured Entropy. (arXiv:2206.07122v1 [stat.ML])
    Cross-entropy loss is the standard metric used to train classification models in deep learning and gradient boosting. It is well-known that this loss function fails to account for similarities between the different values of the target. We propose a generalization of entropy called {\em structured entropy} which uses a random partition to incorporate the structure of the target variable in a manner which retains many theoretical properties of standard entropy. We show that a structured cross-entropy loss yields better results on several classification problems where the target variable has an a priori known structure. The approach is simple, flexible, easily computable, and does not rely on a hierarchically defined notion of structure.  ( 2 min )
    Stability of image reconstruction algorithms. (arXiv:2206.07128v1 [math.OC])
    Robustness and stability of image reconstruction algorithms have recently come under scrutiny. Their importance to medical imaging cannot be overstated. We review the known results for the topical variational regularization strategies ($\ell_2$ and $\ell_1$ regularization), and present new stability results for $\ell_p$ regularized linear inverse problems for $p\in(1,\infty)$. Our results generalize well to the respective $L_p(\Omega)$ function spaces.  ( 2 min )
    Brownian Noise Reduction: Maximizing Privacy Subject to Accuracy Constraints. (arXiv:2206.07234v1 [cs.LG])
    There is a disconnect between how researchers and practitioners handle privacy-utility tradeoffs. Researchers primarily operate from a privacy first perspective, setting strict privacy requirements and minimizing risk subject to these constraints. Practitioners often desire an accuracy first perspective, possibly satisfied with the greatest privacy they can get subject to obtaining sufficiently small error. Ligett et al. have introduced a "noise reduction" algorithm to address the latter perspective. The authors show that by adding correlated Laplace noise and progressively reducing it on demand, it is possible to produce a sequence of increasingly accurate estimates of a private parameter while only paying a privacy cost for the least noisy iterate released. In this work, we generalize noise reduction to the setting of Gaussian noise, introducing the Brownian mechanism. The Brownian mechanism works by first adding Gaussian noise of high variance corresponding to the final point of a simulated Brownian motion. Then, at the practitioner's discretion, noise is gradually decreased by tracing back along the Brownian path to an earlier time. Our mechanism is more naturally applicable to the common setting of bounded $\ell_2$-sensitivity, empirically outperforms existing work on common statistical tasks, and provides customizable control of privacy loss over the entire interaction with the practitioner. We complement our Brownian mechanism with ReducedAboveThreshold, a generalization of the classical AboveThreshold algorithm that provides adaptive privacy guarantees. Overall, our results demonstrate that one can meet utility constraints while still maintaining strong levels of privacy.  ( 2 min )
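The release order described above can be illustrated with a simulated path. The sketch below samples Brownian motion at a few increasing times and then reveals the noisy value starting from the largest time and walking back. It does no privacy accounting, and the function names and time grid are assumptions made for illustration.

```python
import math
import random


def brownian_path(times, rng):
    """Sample B_t at strictly increasing positive times via Gaussian increments."""
    path, value, prev = [], 0.0, 0.0
    for t in times:
        # The increment over [prev, t] is N(0, t - prev).
        value += rng.gauss(0.0, math.sqrt(t - prev))
        path.append(value)
        prev = t
    return path


def noise_reduction_releases(true_value, times, rng):
    """Release true_value + B_t, starting at the largest t and tracing back."""
    path = brownian_path(times, rng)
    return [(t, true_value + b) for t, b in reversed(list(zip(times, path)))]


if __name__ == "__main__":
    for t, estimate in noise_reduction_releases(10.0, [0.25, 1.0, 4.0], random.Random(0)):
        print(f"noise variance {t}: released estimate {estimate:.3f}")
```

Since $B_t$ has variance $t$, each later release carries less noise than the one before it, and all releases come from a single correlated path instead of fresh independent draws.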
    Benefits of Additive Noise in Composing Classes with Bounded Capacity. (arXiv:2206.07199v1 [stat.ML])
    We observe that given two (compatible) classes of functions $\mathcal{F}$ and $\mathcal{H}$ with small capacity as measured by their uniform covering numbers, the capacity of the composition class $\mathcal{H} \circ \mathcal{F}$ can become prohibitively large or even unbounded. We then show that adding a small amount of Gaussian noise to the output of $\mathcal{F}$ before composing it with $\mathcal{H}$ can effectively control the capacity of $\mathcal{H} \circ \mathcal{F}$, offering a general recipe for modular design. To prove our results, we define new notions of uniform covering number of random functions with respect to the total variation and Wasserstein distances. We instantiate our results for the case of multi-layer sigmoid neural networks. Preliminary empirical results on the MNIST dataset indicate that the amount of noise required to improve over existing uniform bounds can be numerically negligible (i.e., element-wise i.i.d. Gaussian noise with standard deviation $10^{-240}$). The source code is available at https://github.com/fathollahpour/composition_noise.  ( 2 min )
    Lazy Queries Can Reduce Variance in Zeroth-order Optimization. (arXiv:2206.07126v1 [cs.LG])
    A major challenge of applying zeroth-order (ZO) methods is the high query complexity, especially when queries are costly. We propose a novel gradient estimation technique for ZO methods based on adaptive lazy queries that we term as LAZO. Different from the classic one-point or two-point gradient estimation methods, LAZO develops two alternative ways to check the usefulness of old queries from previous iterations, and then adaptively reuses them to construct the low-variance gradient estimates. We rigorously establish that through judiciously reusing the old queries, LAZO can reduce the variance of stochastic gradient estimates so that it not only saves queries per iteration but also achieves the regret bound for the symmetric two-point method. We evaluate the numerical performance of LAZO, and demonstrate the low-variance property and the performance gain of LAZO in both regret and query complexity relative to several existing ZO methods. The idea of LAZO is general, and can be applied to other variants of ZO methods.  ( 2 min )
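For reference, the classic symmetric two-point estimator that the abstract uses as a baseline can be sketched as follows. This is not LAZO's lazy-query rule, which the abstract does not specify. The function names are illustrative assumptions.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Uniform direction on the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def two_point_gradient(f, x, delta, u):
    """Symmetric estimate d * (f(x + delta*u) - f(x - delta*u)) / (2*delta) * u."""
    d = len(x)
    plus = f([xi + delta * ui for xi, ui in zip(x, u)])
    minus = f([xi - delta * ui for xi, ui in zip(x, u)])
    scale = d * (plus - minus) / (2.0 * delta)
    return [scale * ui for ui in u]


if __name__ == "__main__":
    rng = random.Random(0)
    f = lambda z: sum(zi * zi for zi in z)  # gradient at x is 2 * x
    x = [1.0, -2.0, 0.5]
    trials = [two_point_gradient(f, x, 1e-3, random_unit_vector(3, rng)) for _ in range(5000)]
    print([sum(g[i] for g in trials) / len(trials) for i in range(3)])
```

Each estimate costs two function queries, which is the per-iteration cost that reusing old queries aims to reduce.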

  • Open

    Gym like frameworks for combinatorial optimization on Graphs?
    I was wondering if anyone knows of a Gym-like framework for combinatorial optimization with reinforcement learning that deals with max-cut, the travelling salesperson problem, and other interesting problems on graphs. I have found one framework here https://github.com/wz26/OpenGraphGym, but it does not have a Gym interface, which makes it difficult for me to use standard RL libraries like RayRL or Stable Baselines. submitted by /u/obsoletelearner [link] [comments]  ( 1 min )
    Measuring coordination in MARL
    I'm working on some research which uses coordinated MARL methods to enable collaboration between two agents controlling two tasks in a manufacturing environment. Currently I'm measuring performance of MARL methods by system-level reward, which makes sense, but I have no means of explaining or measuring how well the agents are coordinating with one another. I was wondering if anyone had any ideas for how to measure coordination? I was thinking some sort of correlation between principal components of the agents' models or correlation between KPIs of the two tasks in my environment. Any thoughts? submitted by /u/StandingBuffalo [link] [comments]  ( 1 min )
    how can i define my observation space for an array of float data type?
    I am trying to solve a problem using RL. For my observation space I have a dataset of 38*11k. For each episode my agent should receive 1 row of that dataset and make an action based on that single array of observations ... I'm not exactly sure how I can define my observation space in a Gym-style environment. submitted by /u/Affectionate_Worth43 [link] [comments]  ( 1 min )
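One way to frame the setup in this question is a one-step episode whose observation is a flat vector of floats. The sketch below uses only the standard library and mimics the Gym reset/step shape. It assumes the dataset is 11k rows of 38 features, and the class name and the reward function are hypothetical placeholders. With Gym itself, such an observation is typically declared as a `Box` space, as noted in the comment.

```python
import random


class RowEnv:
    """Gym-style sketch: each episode shows one dataset row as the observation."""

    def __init__(self, rows, seed=0):
        self.rows = rows
        # Every observation is a flat float vector with this shape.
        # With Gym this would typically be declared as
        #   gym.spaces.Box(low=-np.inf, high=np.inf, shape=(n_features,), dtype=np.float32)
        self.observation_shape = (len(rows[0]),)
        self._rng = random.Random(seed)
        self._current = None

    def reset(self):
        """Start an episode by drawing one row."""
        self._current = [float(v) for v in self._rng.choice(self.rows)]
        return self._current

    def step(self, action):
        """One decision per episode, so the episode ends immediately."""
        reward = self.reward(self._current, action)
        return self._current, reward, True, {}

    def reward(self, observation, action):
        """Placeholder reward (hypothetical); replace with the task's real signal."""
        return 0.0


if __name__ == "__main__":
    data = [[random.random() for _ in range(38)] for _ in range(100)]
    env = RowEnv(data)
    obs = env.reset()
    print(env.observation_shape, env.step(action=0)[1:])
```

Because each episode lasts a single step, this is closer to a contextual bandit than to a full sequential RL problem, which may simplify the choice of algorithm.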
    Reward is decreasing for my task.
    submitted by /u/ElvishChampion [link] [comments]  ( 2 min )
    Transformers in RL
    I'm looking into applying Transformers to my RL problem (Minecraft) and was curious about existing libraries. The few that I've found are made for text or aren't extensible to libraries I'm already using (stable baselines). At this point, I'll just make my own implementation but before I start, I'd love to know if an implementation already exists. submitted by /u/realbrokenlantern [link] [comments]  ( 1 min )
    PPO neural network output (final layer) for Hybrid Control (continuous + discrete actions) in Unity ML-Agents
    Hi all, I have a question regarding the final layer for the actor neural network in PPO in the context of hybrid actions (continuous + discrete) built in Unity ML-Agents. I wanted to know if it follows the same logic as SAC in Unity ML-Agents (https://arxiv.org/pdf/1912.11077.pdf), where we have both a softmax (for discrete actions) and the moments of the Gaussian distribution (for continuous control). I am asking this since I recently found another paper, where for Hybrid-PPO, 2 independent actor neural networks (one for discrete and the other for continuous action) are used (https://www.ijcai.org/proceedings/2019/0316.pdf), sharing only the first few layers to encode the state information. However, I don't know which method was implemented in Unity ML-Agents. I have posted this question in the Unity forum as well (https://forum.unity.com/threads/hybrid-control-discrete-continuous-actions.1102336/) Many thanks!! submitted by /u/Lower-Statistician94 [link] [comments]  ( 1 min )
    How to tune hyperparameters in A2C-ppo?
    I'm currently working with A2C. The model was able to learn OpenAI Pong; I ran this as a sanity check that I haven't introduced any bugs. Now I'm trying to make the model play Breakout, but after 10M steps the model still has not made any significant progress. I'm using the baseline hyperparameters, which can be found here https://github.com/openai/baselines/blob/master/baselines/a2c/a2c.py, except that my buffer size has ranged from 512 to 4096. I've noticed that entropy decreases extremely slowly for buffer sizes in that range. So my questions are: how do I make entropy decrease, and how do I increase rewards per buffer? I've tried decreasing the entropy coefficient to almost zero, but it still behaves very strangely. Average entropy when the entropy coef is zero. Cases when the coef is between 1e-04-1e02 look similar. https://preview.redd.it/useqj6mris591.png?width=580&format=png&auto=webp&s=162420b03daf9b3b65b6ed9ba56bb38ce55f50b6 https://preview.redd.it/kzlk78c9ns591.png?width=1216&format=png&auto=webp&s=87ed8ce6df8c88e0d6b55b62bfec4455506f692c submitted by /u/SigmaEpsilonDelta [link] [comments]  ( 1 min )
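To read an entropy curve like the one described, it helps to know its scale. The sketch below computes the entropy of a softmax policy and shows where the entropy coefficient enters a common form of the A2C loss. The coefficient defaults are illustrative assumptions, and the loss is a generic textbook form, not a copy of the linked implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; 0 * log(0) is treated as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def a2c_loss(policy_loss, value_loss, probs, value_coef=0.5, entropy_coef=0.01):
    """Common A2C objective: the entropy bonus is subtracted from the loss."""
    return policy_loss + value_coef * value_loss - entropy_coef * entropy(probs)


if __name__ == "__main__":
    uniform = softmax([0.0, 0.0, 0.0, 0.0])
    peaked = softmax([5.0, 0.0, 0.0, 0.0])
    print("uniform policy entropy:", entropy(uniform))
    print("peaked policy entropy: ", entropy(peaked))
```

For a policy over 4 actions the entropy cannot exceed ln 4 ≈ 1.386 nats, so a curve that stays near that value means the policy is still close to uniform, whatever the coefficient.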
    The Road to General AI
    submitted by /u/Anm_Vanilla_20 [link] [comments]
    Hardest task in OpenAI gym?
    Hello. From what I know, it seems to be that the hardest task in Atari 2600 for RL algorithms is "Montezuma's revenge" game. What about the whole AI gym environments collection? What's the hardest task among all of them? submitted by /u/KushnarevaL [link] [comments]  ( 1 min )
  • Open

    How Mantium achieves low-latency GPT-J inference with DeepSpeed on Amazon SageMaker
    Mantium is a global cloud platform provider for building AI applications and managing them at scale. Mantium’s end-to-end development platform enables enterprises and businesses of all sizes to build AI applications and automation faster and easier than what has been traditionally possible. With Mantium, technical and non-technical teams can prototype, develop, test, and deploy AI […]  ( 8 min )
    Prepare data faster with PySpark and Altair code snippets in Amazon SageMaker Data Wrangler
    Amazon SageMaker Data Wrangler is a purpose-built data aggregation and preparation tool for machine learning (ML). It allows you to use a visual interface to access data and perform exploratory data analysis (EDA) and feature engineering. The EDA feature comes with built-in data analysis capabilities for charts (such as scatter plot or histogram) and time-saving […]  ( 6 min )
    Extract insights from SAP ERP with no-code ML solutions with Amazon AppFlow and Amazon SageMaker Canvas
    Customers in industries like consumer packaged goods, manufacturing, and retail are always looking for ways to empower their operational processes by enriching them with insights and analytics generated from data. Tasks like sales forecasting directly affect operations such as raw material planning, procurement, manufacturing, distribution, and inbound/outbound logistics, and it can have many levels of […]  ( 9 min )
    Customize pronunciations using Amazon Polly
    Amazon Polly breathes life into text by converting it into lifelike speech. This empowers developers and businesses to create applications that can converse in real time, thereby offering an enhanced interactive experience. Text-to-speech (TTS) in Amazon Polly supports a variety of languages and locales, which enables you to perform TTS conversion according to your preferences. […]  ( 7 min )
    Demystifying machine learning at the edge through real use cases
    Edge is a term that refers to a location, far from the cloud or a big data center, where you have a computer device (edge device) capable of running (edge) applications. Edge computing is the act of running workloads on these edge devices. Machine learning at the edge (ML@Edge) is a concept that brings the […]  ( 13 min )
    Text summarization with Amazon SageMaker and Hugging Face
    In this post, we show you how to implement one of the most downloaded Hugging Face pre-trained models used for text summarization, DistilBART-CNN-12-6, within a Jupyter notebook using Amazon SageMaker and the SageMaker Hugging Face Inference Toolkit. Based on the steps shown in this post, you can try summarizing text from the WikiText-2 dataset managed […]  ( 9 min )
    Take your intelligent search experience to the next level with Amazon Kendra hierarchical facets
    Unstructured data continues to grow in many organizations, making it a challenge for users to get the information they need. Amazon Kendra is a highly accurate, intelligent search service powered by machine learning (ML). Amazon Kendra uses deep learning and reading comprehension to deliver precise answers, and returns a list of ranked documents that match […]  ( 10 min )
  • Open

    Why did i do this
    submitted by /u/Aip0 [link] [comments]
    Combining Ebsynth and Disco Diffusion
    Made a video a week or 2 ago; here is a tutorial on how I did it with Ebsynth and a few other programs. Saves lots of time! https://www.youtube.com/watch?v=Cs2ILRo16-0 submitted by /u/prfitofthesngularity [link] [comments]  ( 1 min )
    Finished tutorial
    I made a video a while back using this method with great results, just finished a tutorial on combining Ebsynth with Disco Diffusion https://www.youtube.com/watch?v=Cs2ILRo16-0 submitted by /u/prfitofthesngularity [link] [comments]
    The Concept of Existence
    submitted by /u/GreatGearAmidAPizza [link] [comments]
    A Google Bot just went Self Aware ? : Google engineer [Blake lemoine] In...
    submitted by /u/EnvironmentalMap5 [link] [comments]
    This AI model tries to re-create the mind of Ruth Bader Ginsburg
    submitted by /u/yamaboobi1 [link] [comments]
    Weekly China AI News: Megvii Chief Scientist, ResNet Creator Dies; Baidu’s EV Arm Unveils a Self-Driving Concept Car; Alibaba Introduces CIPU to Power Data Centers
    submitted by /u/trcytony [link] [comments]
    An “interview” with a chatbot is not evidence of its sentience
    submitted by /u/mm_maybe [link] [comments]  ( 1 min )
    Generate Synthetic Time-series Data with Open-source Tools - KDnuggets
    submitted by /u/Repeat-or [link] [comments]
    Joe Biden and Donald Trump boxing
    submitted by /u/Paulwhite20 [link] [comments]  ( 1 min )
    Off-putting, unpleasant, and accurate.
    submitted by /u/TheAmnesiacKid [link] [comments]  ( 1 min )
    Join us for the OpenAI GPT-3 Deep Learning Labs Hackathon!
    We are waiting for all of you, AI enthusiasts with coding experience and without, on the 24th - 26th of June to help you turn your ground-breaking ideas into reality! https://lablab.ai/event/gpt3-online https://preview.redd.it/9ybdri4k5t591.png?width=1600&format=png&auto=webp&s=d41c842c3725e2b2e9710f1aa8e8072ed62df6bc submitted by /u/zakrzzz [link] [comments]
    Best way to recognize (mask) a partially visible credit card?
    Hi everyone, What is the best way to recognize (generate a mask for) a partially visible credit card in a picture? For example this link: https://imgur.com/a/G4VjNoc I tried to do this with a Mask RCNN. The results were not good enough. The trained model had problems with credit cards with patterns it had not seen before. It even had problems with credit cards it had seen before. What do you guys think would be the best way to recognize partially visible credit cards in a picture? Thanks a lot. submitted by /u/RangerHere [link] [comments]  ( 1 min )
    AI Dream 53 - VR Headset Stereoscoptic 3D by AI
    submitted by /u/LordPewPew777 [link] [comments]
    Possible to Auto generate a multiple choice question and 4 answer options?
    Hello, wondering if it is possible to automatically create a series of multiple-choice questions from data (say wiki api) and offer 4 answers (3 being incorrect, one being correct)...? The challenge is also to make sure that 3 of 4 options have relevance to the question too. Thank you! submitted by /u/abhimitra [link] [comments]  ( 1 min )
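One simple approach to the question above is to draw the three wrong options from answers to other facts of the same kind, which keeps every option relevant to the question. The sketch below assumes a small hand-written fact table standing in for data fetched from an API. The table, the function name, and the question template are illustrative assumptions.

```python
import random

# Hypothetical fact table: subject -> answer, all of the same kind (capitals).
CAPITALS = {
    "France": "Paris",
    "Japan": "Tokyo",
    "Kenya": "Nairobi",
    "Peru": "Lima",
    "Norway": "Oslo",
}


def make_question(facts, subject, rng, num_options=4):
    """Build one question; distractors are answers to other facts of the same kind."""
    correct = facts[subject]
    distractors = sorted({a for s, a in facts.items() if s != subject and a != correct})
    if len(distractors) < num_options - 1:
        raise ValueError("not enough distinct distractors")
    options = rng.sample(distractors, num_options - 1) + [correct]
    rng.shuffle(options)
    return {
        "question": f"What is the capital of {subject}?",
        "options": options,
        "answer_index": options.index(correct),
    }


if __name__ == "__main__":
    print(make_question(CAPITALS, "France", random.Random(0)))
```

Harder questions would need distractors that are closer to the correct answer (for example, cities in the same region), which this sketch does not attempt.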
    Thinking about what beneficial AGI could look like
    submitted by /u/HumanSeeing [link] [comments]  ( 2 min )
    A MAGICAL WATERFALLS ESCAPADE
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 1 min )
    New Text To Video AI | Breakthrough Living Skin For Robots | Metamemory Lets AI Think Like Humans | Machine Learning Helps Astronomers Develop New Theory
    submitted by /u/getrich_or_diemining [link] [comments]
    Glass painting.
    submitted by /u/cookingandcraft [link] [comments]
    Automated Computer Vision Pipelines | Sum up {Webinar}
    Hi folks, the last epoch of the webinar series on Automated CV Pipelines is here. The previous 5 epochs covered a number of different ways to scale annotation projects and automate the processes. This session will sum up the key points from the previous sessions. Check it out if you are interested! I left the hyperlink in the first sentence. submitted by /u/WeekendClassic [link] [comments]  ( 1 min )
    Tribes: Human 5 - AI Generated
    submitted by /u/Babylon_6 [link] [comments]
    the self-perception of Dalle Mini
    submitted by /u/LokiBrot9452 [link] [comments]  ( 1 min )
  • Open

    [D] How We Built OpenAI's GSM8K Dataset of 8,500 Math Problems
    We recently created a dataset of 8,500 Grade School Math problems in collaboration with OpenAI’s Reinforcement Learning team. The goal: to train language models like GPT-3 to solve natural language math problems and measure their reasoning ability. Read the post by Karl Cobbe, Vineet Kosaraju, and John Schulman on OpenAI's blog! The dataset has also been adopted by many other research labs, including Google in their PaLM and Chain of Thought papers. Dataset creation is a critical piece of AI, but it’s surprisingly underappreciated: ask most researchers, and they’ll have never inspected their datasets themselves! But how can you trust what you’re building when your inputs are junk? This is a real problem: for example, over 30% of Google’s GoEmotions dataset of Reddit comments is mislabeled… We wrote a blog post diving into the details of how we created this dataset. Would love to hear others' opinions: how would you approach building a dataset like this? What other math datasets would be useful? Full blog post here. submitted by /u/BB4evaTB12 [link] [comments]  ( 1 min )
    [D] Upscaling Very Low-Resolution Image
    Hey guys and gals, my girlfriend's mother passed away last week. She only has a low-resolution picture of her. This picture is literally 5 KB: Mother of my girlfriend, original picture, 5 KB, 128 × 168 I tried at least ten websites to upscale it. The result looks horrible: Upscaled version, 2.8 MB, 8192 × 10752 After trying my hand at the current state-of-the-art AI, I believe there MUST be something better on the market that I've just not found yet. After seeing DALL-E 2 in action, it absolutely must be possible to upscale this picture, so we can hang her picture in our living room in decent quality. Any help would be greatly appreciated. submitted by /u/Patrick_K_Wenk [link] [comments]  ( 2 min )
    [P]: mmap_ninja: Speedup your training dramatically by using memory-mapped files for your dataset
    Repo link: https://github.com/hristo-vrigazov/mmap.ninja Images Colab notebook: https://colab.research.google.com/drive/1-WMtVyfxx2aUMeV7vlG48Ia27-5cxnrS?usp=sharing Texts Colab notebook: https://colab.research.google.com/drive/18bEwylFwx4owMpb-RAkJZS_9JrrUcFd7?usp=sharing Hello everyone, I wrote a small, but very useful library for my personal projects and decided to share it with the world. It deals with filesystem I/O during machine learning training. A large portion of the time spent training (especially if GPU is available) is spent on reading/writing images from the disk (or text for that matter). For example, take the COCO 2017 validation dataset of images (I just had this one available on my machine, nothing special about it). If you can't load it all into memory at once (whic…  ( 5 min )
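    A minimal sketch of the general idea behind memory-mapped datasets, using only Python's standard `mmap` and `struct` modules. It does not use or reproduce the mmap_ninja API, which is not shown in this excerpt. The record layout and the helper names `write_samples` and `read_sample` are hypothetical, chosen for illustration: samples are written once as fixed-size records, then single samples are decoded by byte offset without reading the whole file into memory.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-size sample: four little-endian float32 values.
RECORD = struct.Struct("<4f")


def write_samples(path, samples):
    """Write each sample as one fixed-size binary record."""
    with open(path, "wb") as f:
        for sample in samples:
            f.write(RECORD.pack(*sample))


def read_sample(mm, index):
    """Decode one record straight from the memory map by byte offset."""
    return RECORD.unpack_from(mm, index * RECORD.size)


path = os.path.join(tempfile.mkdtemp(), "dataset.bin")
write_samples(path, [(i, i + 0.5, i * 2.0, -float(i)) for i in range(1000)])

with open(path, "rb") as f:
    # Length 0 maps the whole file; the OS pages data in on demand.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    sample_42 = read_sample(mm, 42)
    n_records = len(mm) // RECORD.size
    mm.close()
```

    Fixed-size records keep the offset arithmetic trivial; variable-length samples (images of different sizes, texts) would need an additional index of offsets.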
    [R] Binarized Neural Networks in Non-Classification Tasks
    So I'm looking at implementing a binarized (-1,+1 neurons, probably just sign function for gradient descent) version of my variational auto-encoder (VAE) for a signal de-noising task. Every BNN I can find is a classification task. Before I spend a day figuring out how to redo my system, can anyone confirm this will actually yield any results? The plan is standard floating point at the input and output layers, with binarized layers in between. submitted by /u/SearchAtlantis [link] [comments]  ( 1 min )
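    For readers unfamiliar with how a sign activation can be trained at all, here is a minimal sketch of sign binarization paired with a straight-through estimator. The straight-through estimator with hard-tanh clipping is a common choice in the BNN literature and is an assumption here, not necessarily what the poster plans; the function names are hypothetical. It says nothing about whether the approach works for a de-noising VAE.

```python
def binarize(x):
    """Forward pass: deterministic sign binarization to {-1, +1}."""
    return 1.0 if x >= 0 else -1.0


def ste_grad(x, upstream_grad):
    """Backward pass: straight-through estimator.

    sign() has zero gradient almost everywhere, so the upstream gradient is
    passed through unchanged where |x| <= 1 and zeroed elsewhere.
    """
    return upstream_grad if abs(x) <= 1.0 else 0.0


pre_activations = [-2.0, -0.3, 0.0, 0.7, 1.5]
upstream = [0.1, 0.2, 0.3, 0.4, 0.5]

binary = [binarize(x) for x in pre_activations]
grads = [ste_grad(x, g) for x, g in zip(pre_activations, upstream)]
```

    In this sketch the forward pass sees only -1 and +1, while the backward pass still delivers a usable gradient to the real-valued pre-activations inside the clipping range.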
    [D] I have a MLM encoder-only transformer to pretrain, I want to use it for text generation.
    The attention mechanism used doesn't allow for causal masking (it's one of those efficient architectures). So how do I train the MLM to be able to generate text zero-shot? Should I train it on 1 token, 2 tokens, ..., N tokens or something similar (something like artificially non-teacher-forced autoregressive training)? I haven't started the pre-training yet. What other things should I keep in mind while training? (I have trained simpler NNs; this will be my first larger training run.) submitted by /u/OddSandwich969 [link] [comments]  ( 1 min )
    [D] Why do a lot of researchers like to submit the paper just at the deadline of the conferences?
    For many machine learning conferences, I know a lot of researchers like to submit the paper a few hours or even a few minutes just ahead of the deadline. Why do they like to do this? Does it benefit the acceptance ratio? submitted by /u/fllubo [link] [comments]  ( 3 min )
    [D] How do static weights in LLMs generate such dynamic behavior?
    This is a bit of a complex topic, so I feel a simple discussion post may not be the best place to fully flesh it out - but in the context of few-shot learning in LLMs (Large Language Models), we observe static/unchanged/un-updated weights being able to infer patterns and sometimes even learn complex tasks. I was wondering why forward passing works so well - by all means, we should've been updating weights with new information, but forward passing seems to work pretty well as it is. So what's your opinion on this? My hypothesis was that it implicitly models a differential equation, very much like diffusion models. That "learning" (or meta-learning) process is thus adaptive because the fitted function is itself adaptive in nature, which gives the flexibility we observe today. I know few-shot learning and such phenomena in general are somewhat fuzzy and unexplored territory, but I would love to know what you guys think about this, and about resources which have explored the same :) submitted by /u/Competitive-Rub-1958 [link] [comments]  ( 1 min )
    [D] When to use Boosted Trees? Are they useless?
    Hey everyone, I was looking at an old project for binary classification and saw that the model used is a variation of Boosted Trees. Given that we are in the deep learning era, I started questioning this choice. Coming from a deep learning and convolutional background, I am confused and probably wrong to think that Boosted Trees are outdated, so I am asking for suggestions and feedback. Can people with experience in this topic tell me whether Boosted Trees are still useful, or whether there are better options to use instead? What other options can I consider, and why might Boosted Trees be the best choice for specific projects? In general, any feedback is much appreciated. Thanks submitted by /u/seyeeet [link] [comments]  ( 3 min )
    [D] Layman query: Is on-site power consumption an indicator of an AI lab's compute resources?
    Assumption: If monthly expenses were a reasonable indicator of who the "leading AI labs" are, could their monthly electricity bill also be a useful indicator? Or do these labs rely on off-site computing services/resources? If not, please explain for non-specialists in the audience. Thanks, appreciate it. submitted by /u/Half-of [link] [comments]  ( 1 min )
    [D] Robust and Efficient Medical Imaging with Self-Supervision by Google Brain
    https://arxiv.org/pdf/2205.09723.pdf They propose a new hyper initialization plan, combining large scale non-medical data pretraining with task relevant self-supervised pretraining. They obtain the same accuracy as specialized models in out-of-distribution settings using 3-100x less data and show 11.5% relative improvement for in-distribution test sets. This is a big deal in medical applications, because labeled data is incredibly difficult or expensive to get and we need years to obtain high quality data. submitted by /u/margilly_ai [link] [comments]  ( 1 min )
    [R] Faster R-CNN anchor boxes and loss calculation
    Hello, I just finished reading the Faster R-CNN paper. What I understood about the anchor boxes is that they are pre-chosen boxes that serve as references. But when calculating the loss, I figured out that only the W_a and H_a of the anchor box coordinates contribute, according to equations (1) and (2) on page 5 of the paper, given that L_reg is a smooth L1 loss. So where am I mistaken? submitted by /u/Meddhouib10 [link] [comments]  ( 1 min )
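    A small sketch of the box parameterization the question refers to, t_x = (x - x_a) / w_a, t_y = (y - y_a) / h_a, t_w = log(w / w_a), t_h = log(h / h_a), together with a smooth L1 loss. The function names and the example boxes are hypothetical. For illustration the loss here encodes a predicted box, whereas in the detector the network outputs the offsets directly.

```python
import math


def encode_box(box, anchor):
    """Regression targets of a box relative to an anchor.

    Boxes are (center_x, center_y, width, height).
    """
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def smooth_l1(d):
    """Smooth L1 loss on a single residual."""
    d = abs(d)
    return 0.5 * d * d if d < 1.0 else d - 0.5


def reg_loss(pred_box, gt_box, anchor):
    """Sum of smooth L1 losses between encoded prediction and encoded target."""
    t = encode_box(pred_box, anchor)
    t_star = encode_box(gt_box, anchor)
    return sum(smooth_l1(a - b) for a, b in zip(t, t_star))


anchor = (50.0, 50.0, 20.0, 40.0)
gt = (60.0, 70.0, 20.0, 40.0)
pred = (55.0, 65.0, 22.0, 38.0)

targets = encode_box(gt, anchor)
loss_same = reg_loss(gt, gt, anchor)
loss_a = reg_loss(pred, gt, anchor)
loss_shifted = reg_loss(pred, gt, (80.0, 10.0, 20.0, 40.0))  # same size, moved center
```

    When both boxes are encoded against the same anchor, the difference t_x - t*_x reduces to (x - x*) / w_a, so the anchor center cancels algebraically; that is likely the observation behind the question. The center still matters in practice, because the network predicts offsets relative to a particular anchor, and x_a and y_a are needed both to build the targets and to decode a prediction back into image coordinates.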
    Poincare Embeddings: Embedding your data in low dimensions [P]
    I have been doing further research on ways to better create embeddings of the data we have and I came across Poincaré Embeddings for Learning Hierarchical Representations (https://arxiv.org/abs/1705.08039). This is a type of hyperbolic embedding that is great for hierarchical data and is made for datasets where we have positive pair examples, which essentially means our dataset contains datapoints that we know we want to be close to each other in the embedding space. For example, if it was a dataset containing types of mammals, then you would want a Labrador and a Bulldog to be close to each other. The algorithm is pretty clever as it finds the hierarchy in the data itself, without any extra input from the user. Another cool thing is that your embeddings can be low dimensional and still have very low distortion. This means shorter training times and less compute needed. There are also a few examples of implementations of it, including one I made myself which I think is quite user friendly, so you can play around with it too and embed your own data for any projects you’re working on. Also, it’s definitely worth giving the paper a brief read as it’s interesting. I plan on making quite a few more implementations of hyperbolic and geometric ML algorithms, so let me know in the comments if there’s anything you’d like to see, like a Transformer, more embedding algorithms, a Graph Neural Network, etc. Implementation in the HyperLib library with an example: https://github.com/nalexai/hyperlib/blob/main/examples/wordnet_embedding.py I made a blog post to go through it in more detail: https://medium.com/p/9d7b14f22847/ submitted by /u/platinumposter [link] [comments]  ( 2 min )
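    To make the "low dimensional, low distortion" point concrete, here is a short sketch of the distance function used in the Poincaré ball model, d(u, v) = arcosh(1 + 2 ||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2))). It is a standalone illustration in plain Python, not code from HyperLib; the function name and sample points are hypothetical.

```python
import math


def poincare_distance(u, v):
    """Distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


origin = (0.0, 0.0)
near_center = (0.1, 0.0)
near_edge_a = (0.95, 0.0)
near_edge_b = (0.0, 0.95)

d_center = poincare_distance(origin, near_center)   # short hop near the origin
d_edge = poincare_distance(near_edge_a, near_edge_b)  # long path between boundary points
```

    Distances grow rapidly toward the boundary of the ball, so general concepts can sit near the origin while their many descendants spread out near the edge, which is why a tree-like hierarchy fits in few dimensions.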
    [D] Any ideas for an NLP classifier for which I don't have any ground truth?
    I'm currently developing a classifier that can classify a text into emission reduction methods, including the use of renewable energy sources such as hydro, wind, solar, and biofuels, as well as other methods such as increasing energy efficiency, carbon capture, supplier engagement, etc. I've worked with a Sustainable Development Goals classifier, but there I already had some tags (https://github.com/osdg-ai/osdg-tool/blob/pre-release/osdg/core/sdg/data_files/OSDG-kw-mapping.json), so I was able to perform word matching. In this case, I don't have any kind of data. Is anyone aware of a pre-trained model, dataset, or anything of that sort that could help me build a model? I'm thinking of a named entity recognition model, but I still don't have any keywords that would help me in classifying. An examp…  ( 2 min )
    [D] How to combine imaging data with categorical data for 3dconvnet
    I have put together a 3dconvnet for classification of head ct scans in Python using tensorflow. Mid 60% accuracy is best I have been able to achieve using imaging alone. The outcome variables are binary. I want to add patient demographic variables, and other variables of interest (some categorical some continuous) to the test/training data (images) to improve the accuracy. What is the best way to combine that data prior to training? At what point is that data best concatenated? Thanks! submitted by /u/doktoroso [link] [comments]  ( 1 min )
    [D] When to use SMOTE when dealing with rare events classification?
    I'm reading that SMOTE is a common technique for the classification of imbalanced data. What could be the downsides of SMOTE and when is it useful? submitted by /u/buenavista62 [link] [comments]  ( 2 min )
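    For context on the question, a minimal sketch of the core SMOTE step: each synthetic point is an interpolation between a minority sample and one of its k nearest minority neighbours. This is a simplified pure-Python illustration, not the imbalanced-learn implementation; the function name, parameters, and data are hypothetical.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
```

    The sketch also hints at a downside: synthetic points always lie on segments between existing minority samples, so they add no information outside that region and can amplify noisy or mislabeled minority points.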
  • Open

    coursera Neural Networks and Deep Learning
    Has anyone managed to pass the week 4 assignment of this course? I have tried multiple times but failed to pass it. submitted by /u/Annual-Ad4911 [link] [comments]
    AI News | Text To Video AI 'CogVideo' | Breakthrough Living Skin For Robots | Metamemory Lets AI Think Like Humans
    submitted by /u/tohelpyou88 [link] [comments]
  • Open

    All-In-One Financial Services? Vietnam’s MoMo Has a Super-App for That
    For younger generations, paper bills, loan forms and even cash might as well be in a museum. Smartphones in hand, their financial services largely take place online. The financial-technology companies that serve them are in a race to develop AI that can make sense of the vast amount of data the companies collect — both Read article > The post All-In-One Financial Services? Vietnam’s MoMo Has a Super-App for That appeared first on NVIDIA Blog.  ( 4 min )
  • Open

    Google AI Becomes Sentient: What Does This Mean For Our Future?
    Google AI, or artificial intelligence, is a field of computer science that deals with the creation of intelligent machines. AI applications…  ( 3 min )
  • Open

    Continued fractions as matrix products
    Let p_n / q_n be the nth convergent of a continued fraction: [equation not preserved] Then [equation not preserved] Source: Julian Havil. The Irrationals. p. 212. Continued fractions as matrix products first appeared on John D. Cook.
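    The equations of this post did not survive in the feed text. As a hedged illustration, the sketch below uses the standard identity that, for a continued fraction [a_0; a_1, ..., a_n], the product of the matrices [[a_k, 1], [1, 0]] for k = 0..n equals [[p_n, p_{n-1}], [q_n, q_{n-1}]]. The form given in the post or in Havil's book may be written differently, and the helper names are hypothetical.

```python
def matmul2(A, B):
    """Product of two 2x2 integer matrices given as nested tuples."""
    return (
        (A[0][0] * B[0][0] + A[0][1] * B[1][0], A[0][0] * B[0][1] + A[0][1] * B[1][1]),
        (A[1][0] * B[0][0] + A[1][1] * B[1][0], A[1][0] * B[0][1] + A[1][1] * B[1][1]),
    )


def convergent_matrix(coeffs):
    """Multiply the matrices [[a_k, 1], [1, 0]] for k = 0..n, left to right."""
    M = ((1, 0), (0, 1))
    for a in coeffs:
        M = matmul2(M, ((a, 1), (1, 0)))
    return M


# First coefficients of the continued fraction of pi: [3; 7, 15, 1]
M = convergent_matrix([3, 7, 15, 1])
p_n, q_n = M[0][0], M[1][0]        # latest convergent p_n / q_n
p_prev, q_prev = M[0][1], M[1][1]  # previous convergent
det = p_n * q_prev - p_prev * q_n  # product of four determinants, each -1
```

    Because every factor has determinant -1, the determinant of the product alternates in sign, which recovers the classical relation p_n q_{n-1} - p_{n-1} q_n = (-1)^(n+1).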
  • Open

    7 Software Development Challenges & How To Tackle Them
    Software development is not as easy as we perceive. Building a startup or a product from scratch is a time-consuming and complicated process. You need to develop a valuable and unique idea that the users would prefer. You must develop a meaningful product and sell it successfully to users or other businesses. Throughout the process,… The post 7 Software Development Challenges & How To Tackle Them appeared first on Data Science Central.  ( 5 min )
  • Open

    Semi-Supervised Imitation Learning of Team Policies from Suboptimal Demonstrations. (arXiv:2205.02959v5 [cs.AI] UPDATED)
    We present Bayesian Team Imitation Learner (BTIL), an imitation learning algorithm to model the behavior of teams performing sequential tasks in Markovian domains. In contrast to existing multi-agent imitation learning techniques, BTIL explicitly models and infers the time-varying mental states of team members, thereby enabling learning of decentralized team policies from demonstrations of suboptimal teamwork. Further, to allow for sample- and label-efficient policy learning from small datasets, BTIL employs a Bayesian perspective and is capable of learning from semi-supervised demonstrations. We demonstrate and benchmark the performance of BTIL on synthetic multi-agent tasks as well as a novel dataset of human-agent teamwork. Our experiments show that BTIL can successfully learn team policies from demonstrations despite the influence of team members' (time-varying and potentially misaligned) mental states on their behavior.  ( 2 min )
    Self-critiquing models for assisting human evaluators. (arXiv:2206.05802v2 [cs.CL] UPDATED)
    We fine-tune large language models to write natural language critiques (natural language critical comments) using behavioral cloning. On a topic-based summarization task, critiques written by our models help humans find flaws in summaries that they would have otherwise missed. Our models help find naturally occurring flaws in both model and human written summaries, and intentional flaws in summaries written by humans to be deliberately misleading. We study scaling properties of critiquing with both topic-based summarization and synthetic tasks. Larger models write more helpful critiques, and on most tasks, are better at self-critiquing, despite having harder-to-critique outputs. Larger models can also integrate their own self-critiques as feedback, refining their own summaries into better ones. Finally, we motivate and introduce a framework for comparing critiquing ability to generation and discrimination ability. Our measurements suggest that even large models may still have relevant knowledge they cannot or do not articulate as critiques. These results are a proof of concept for using AI-assisted human feedback to scale the supervision of machine learning systems to tasks that are difficult for humans to evaluate directly. We release our training datasets, as well as samples from our critique assistance experiments.  ( 2 min )
    Equivariant Quantum Graph Circuits. (arXiv:2112.05261v3 [cs.LG] UPDATED)
    We investigate quantum circuits for graph representation learning, and propose equivariant quantum graph circuits (EQGCs), as a class of parameterized quantum circuits with strong relational inductive bias for learning over graph-structured data. Conceptually, EQGCs serve as a unifying framework for quantum graph representation learning, allowing us to define several interesting subclasses which subsume existing proposals. In terms of the representation power, we prove that the studied subclasses of EQGCs are universal approximators for functions over the bounded graph domain. This theoretical perspective on quantum graph machine learning methods opens many directions for further work, and could lead to models with capabilities beyond those of classical approaches. We empirically verify the expressive power of EQGCs through a dedicated experiment on synthetic data, and additionally observe that the performance of EQGCs scales well with the depth of the model and does not suffer from barren plateau issues.  ( 2 min )
    Autoregressive Quantile Flows for Predictive Uncertainty Estimation. (arXiv:2112.04643v2 [cs.LG] UPDATED)
    Numerous applications of machine learning involve representing probability distributions over high-dimensional data. We propose autoregressive quantile flows, a flexible class of normalizing flow models trained using a novel objective based on proper scoring rules. Our objective does not require calculating computationally expensive determinants of Jacobians during training and supports new types of neural architectures, such as neural autoregressive flows from which it is easy to sample. We leverage these models in quantile flow regression, an approach that parameterizes predictive conditional distributions with flows, resulting in improved probabilistic predictions on tasks such as time series forecasting and object detection. Our novel objective functions and neural flow parameterizations also yield improvements on popular generation and density estimation tasks, and represent a step beyond maximum likelihood learning of flows.  ( 2 min )
    Online Learning to Transport via the Minimal Selection Principle. (arXiv:2202.04732v2 [cs.LG] UPDATED)
    Motivated by robust dynamic resource allocation in operations research, we study the Online Learning to Transport (OLT) problem where the decision variable is a probability measure, an infinite-dimensional object. We draw connections between online learning, optimal transport, and partial differential equations through an insight called the minimal selection principle, originally studied in the Wasserstein gradient flow setting by Ambrosio et al. (2005). This allows us to extend the standard online learning framework to the infinite-dimensional setting seamlessly. Based on our framework, we derive a novel method called the minimal selection or exploration (MSoE) algorithm to solve OLT problems using mean-field approximation and discretization techniques. In the displacement convex setting, the main theoretical message underpinning our approach is that minimizing transport cost over time (via the minimal selection principle) ensures optimal cumulative regret upper bounds. On the algorithmic side, our MSoE algorithm applies beyond the displacement convex setting, making the mathematical theory of optimal transport practically relevant to non-convex settings common in dynamic resource allocation.  ( 2 min )
    Benign Overfitting in Two-layer Convolutional Neural Networks. (arXiv:2202.06526v3 [cs.LG] UPDATED)
    Modern neural networks often have great expressive power and can be trained to overfit the training data, while still achieving a good test performance. This phenomenon is referred to as "benign overfitting". Recently, a line of work has emerged studying "benign overfitting" from the theoretical perspective. However, these works are limited to linear models or kernel/random feature models, and there is still a lack of theoretical understanding about when and how benign overfitting occurs in neural networks. In this paper, we study the benign overfitting phenomenon in training a two-layer convolutional neural network (CNN). We show that when the signal-to-noise ratio satisfies a certain condition, a two-layer CNN trained by gradient descent can achieve arbitrarily small training and test loss. On the other hand, when this condition does not hold, overfitting becomes harmful and the obtained CNN can only achieve a constant level test loss. These together demonstrate a sharp phase transition between benign overfitting and harmful overfitting, driven by the signal-to-noise ratio. To the best of our knowledge, this is the first work that precisely characterizes the conditions under which benign overfitting can occur in training convolutional neural networks.  ( 2 min )
    Context-Aware Sparse Deep Coordination Graphs. (arXiv:2106.02886v3 [cs.LG] UPDATED)
    Learning sparse coordination graphs adaptive to the coordination dynamics among agents is a long-standing problem in cooperative multi-agent learning. This paper studies this problem and proposes a novel method using the variance of payoff functions to construct context-aware sparse coordination topologies. We theoretically consolidate our method by proving that the smaller the variance of payoff functions is, the less likely action selection will change after removing the corresponding edge. Moreover, we propose to learn action representations to effectively reduce the influence of payoff functions' estimation errors on graph construction. To empirically evaluate our method, we present the Multi-Agent COordination (MACO) benchmark by collecting classic coordination problems in the literature, increasing their difficulty, and classifying them into different types. We carry out a case study and experiments on the MACO and StarCraft II micromanagement benchmark to demonstrate the dynamics of sparse graph learning, the influence of graph sparseness, and the learning performance of our method. (The MACO benchmark and codes are publicly available at https://github.com/TonghanWang/CASEC-MACO-benchmark.)
    Distribution Compression in Near-linear Time. (arXiv:2111.07941v4 [stat.ML] UPDATED)
    In distribution compression, one aims to accurately summarize a probability distribution $\mathbb{P}$ using a small number of representative points. Near-optimal thinning procedures achieve this goal by sampling $n$ points from a Markov chain and identifying $\sqrt{n}$ points with $\widetilde{\mathcal{O}}(1/\sqrt{n})$ discrepancy to $\mathbb{P}$. Unfortunately, these algorithms suffer from quadratic or super-quadratic runtime in the sample size $n$. To address this deficiency, we introduce Compress++, a simple meta-procedure for speeding up any thinning algorithm while suffering at most a factor of $4$ in error. When combined with the quadratic-time kernel halving and kernel thinning algorithms of Dwivedi and Mackey (2021), Compress++ delivers $\sqrt{n}$ points with $\mathcal{O}(\sqrt{\log n/n})$ integration error and better-than-Monte-Carlo maximum mean discrepancy in $\mathcal{O}(n \log^3 n)$ time and $\mathcal{O}( \sqrt{n} \log^2 n )$ space. Moreover, Compress++ enjoys the same near-linear runtime given any quadratic-time input and reduces the runtime of super-quadratic algorithms by a square-root factor. In our benchmarks with high-dimensional Monte Carlo samples and Markov chains targeting challenging differential equation posteriors, Compress++ matches or nearly matches the accuracy of its input algorithm in orders of magnitude less time.
    A Multi-Agent Reinforcement Learning Framework for Off-Policy Evaluation in Two-sided Markets. (arXiv:2202.10574v2 [stat.ML] UPDATED)
    Two-sided markets such as ride-sharing companies often involve a group of subjects who are making sequential decisions across time and/or location. With the rapid development of smart phones and internet of things, they have substantially transformed the transportation landscape of human beings. In this paper we consider large-scale fleet management in ride-sharing companies that involve multiple units in different areas receiving sequences of products (or treatments) over time. Major technical challenges, such as policy evaluation, arise in those studies because (i) spatial and temporal proximities induce interference between locations and times; and (ii) the large number of locations results in the curse of dimensionality. To address both challenges simultaneously, we introduce a multi-agent reinforcement learning (MARL) framework for carrying out policy evaluation in these studies. We propose novel estimators for mean outcomes under different products that are consistent despite the high-dimensionality of state-action space. The proposed estimator works favorably in simulation experiments. We further illustrate our method using a real dataset obtained from a two-sided marketplace company to evaluate the effects of applying different subsidizing policies. A Python implementation of our proposed method is available at https://github.com/RunzheStat/CausalMARL.
    Variational Diffusion Models. (arXiv:2107.00630v4 [cs.LG] UPDATED)
    Diffusion-based generative models have demonstrated a capacity for perceptually impressive synthesis, but can they also be great likelihood-based models? We answer this in the affirmative, and introduce a family of diffusion-based generative models that obtain state-of-the-art likelihoods on standard image density estimation benchmarks. Unlike other diffusion-based models, our method allows for efficient optimization of the noise schedule jointly with the rest of the model. We show that the variational lower bound (VLB) simplifies to a remarkably short expression in terms of the signal-to-noise ratio of the diffused data, thereby improving our theoretical understanding of this model class. Using this insight, we prove an equivalence between several models proposed in the literature. In addition, we show that the continuous-time VLB is invariant to the noise schedule, except for the signal-to-noise ratio at its endpoints. This enables us to learn a noise schedule that minimizes the variance of the resulting VLB estimator, leading to faster optimization. Combining these advances with architectural improvements, we obtain state-of-the-art likelihoods on image density estimation benchmarks, outperforming autoregressive models that have dominated these benchmarks for many years, with often significantly faster optimization. In addition, we show how to use the model as part of a bits-back compression scheme, and demonstrate lossless compression rates close to the theoretical optimum. Code is available at https://github.com/google-research/vdm .
    Permutation Search of Tensor Network Structures via Local Sampling. (arXiv:2206.06597v1 [cs.LG])
    Recent works put much effort into tensor network structure search (TN-SS), aiming to select suitable tensor network (TN) structures, involving the TN-ranks, formats, and so on, for the decomposition or learning tasks. In this paper, we consider a practical variant of TN-SS, dubbed TN permutation search (TN-PS), in which we search for good mappings from tensor modes onto TN vertices (core tensors) for compact TN representations. We conduct a theoretical investigation of TN-PS and propose a practically-efficient algorithm to resolve the problem. Theoretically, we prove the counting and metric properties of search spaces of TN-PS, analyzing for the first time the impact of TN structures on these unique properties. Numerically, we propose a novel meta-heuristic algorithm, in which the searching is done by randomly sampling in a neighborhood established in our theory, and then recurrently updating the neighborhood until convergence. Numerical results demonstrate that the new algorithm can reduce the required model size of TNs in extensive benchmarks, implying the improvement in the expressive power of TNs. Furthermore, the computational cost for the new algorithm is significantly less than that in~\cite{li2020evolutionary}.
    Distillation of RL Policies with Formal Guarantees via Variational Abstraction of Markov Decision Processes (Technical Report). (arXiv:2112.09655v2 [cs.LG] UPDATED)
    We consider the challenge of policy simplification and verification in the context of policies learned through reinforcement learning (RL) in continuous environments. In well-behaved settings, RL algorithms have convergence guarantees in the limit. While these guarantees are valuable, they are insufficient for safety-critical applications. Furthermore, they are lost when applying advanced techniques such as deep-RL. To recover guarantees when applying advanced RL algorithms to more complex environments with (i) reachability, (ii) safety-constrained reachability, or (iii) discounted-reward objectives, we build upon the DeepMDP framework introduced by Gelada et al. to derive new bisimulation bounds between the unknown environment and a learned discrete latent model of it. Our bisimulation bounds enable the application of formal methods for Markov decision processes. Finally, we show how one can use a policy obtained via state-of-the-art RL to efficiently train a variational autoencoder that yields a discrete latent model with provably approximately correct bisimulation guarantees. Additionally, we obtain a distilled version of the policy for the latent model.
    Syntax-Guided Program Reduction for Understanding Neural Code Intelligence Models. (arXiv:2205.14374v2 [cs.SE] UPDATED)
    Neural code intelligence (CI) models are opaque black-boxes and offer little insight on the features they use in making predictions. This opacity may lead to distrust in their prediction and hamper their wider adoption in safety-critical applications. Recently, input program reduction techniques have been proposed to identify key features in the input programs to improve the transparency of CI models. However, this approach is syntax-unaware and does not consider the grammar of the programming language. In this paper, we apply a syntax-guided program reduction technique that considers the grammar of the input programs during reduction. Our experiments on multiple models across different types of input programs show that the syntax-guided program reduction technique is faster and provides smaller sets of key tokens in reduced programs. We also show that the key tokens could be used in generating adversarial examples for up to 65% of the input programs.
    On the proliferation of support vectors in high dimensions. (arXiv:2009.10670v2 [math.ST] UPDATED)
    The support vector machine (SVM) is a well-established classification method whose name refers to the particular training examples, called support vectors, that determine the maximum margin separating hyperplane. The SVM classifier is known to enjoy good generalization properties when the number of support vectors is small compared to the number of training examples. However, recent research has shown that in sufficiently high-dimensional linear classification problems, the SVM can generalize well despite a proliferation of support vectors where all training examples are support vectors. In this paper, we identify new deterministic equivalences for this phenomenon of support vector proliferation, and use them to (1) substantially broaden the conditions under which the phenomenon occurs in high-dimensional settings, and (2) prove a nearly matching converse result.
    Efficient Human-in-the-loop System for Guiding DNNs Attention. (arXiv:2206.05981v2 [cs.CV] UPDATED)
    Attention guidance is an approach to addressing dataset bias in deep learning, where the model relies on incorrect features to make decisions. Focusing on image classification tasks, we propose an efficient human-in-the-loop system to interactively direct the attention of classifiers to the regions specified by users, thereby reducing the influence of co-occurrence bias and improving the transferability and interpretability of a DNN. Previous approaches for attention guidance require the preparation of pixel-level annotations and are not designed as interactive systems. We present a new interactive method to allow users to annotate images with simple clicks, and study a novel active learning strategy to significantly reduce the number of annotations. We conducted both a numerical evaluation and a user study to evaluate the proposed system on multiple datasets. Compared to the existing non-active-learning approach which usually relies on huge amounts of polygon-based segmentation masks to fine-tune or train the DNNs, our system can save lots of labor and money and obtain a fine-tuned network that works better even when the dataset is biased. The experiment results indicate that the proposed system is efficient, reasonable, and reliable.
    Learning Behavior Representations Through Multi-Timescale Bootstrapping. (arXiv:2206.07041v1 [cs.LG])
    Natural behavior consists of dynamics that are unpredictable, can switch suddenly, and unfold over many different timescales. While some success has been found in building representations of behavior under constrained or simplified task-based conditions, many of these models cannot be applied to free and naturalistic settings because they assume a single scale of temporal dynamics. In this work, we introduce Bootstrap Across Multiple Scales (BAMS), a multi-scale representation learning model for behavior: we combine a pooling module that aggregates features extracted over encoders with different temporal receptive fields, and design a set of latent objectives to bootstrap the representations in each respective space to encourage disentanglement across different timescales. We first apply our method on a dataset of quadrupeds navigating in different terrain types, and show that our model captures the temporal complexity of behavior. We then apply our method to the MABe 2022 Multi-agent behavior challenge, where our model ranks 3rd overall and 1st on two subtasks, and show the importance of incorporating multi-timescales when analyzing behavior.
    Automated SSIM Regression for Detection and Quantification of Motion Artefacts in Brain MR Images. (arXiv:2206.06725v1 [eess.IV])
    Motion artefacts in magnetic resonance brain images are a crucial issue. The assessment of MR image quality is fundamental before proceeding with the clinical diagnosis. If motion artefacts prevent a correct delineation of the structures and substructures of the brain, lesions, tumours and so on, the patient needs to be re-scanned. Otherwise, neuro-radiologists could report an inaccurate or incorrect diagnosis. The first step right after scanning a patient is the "image quality assessment", which decides whether the acquired images are diagnostically acceptable. Here we propose an automated image quality assessment based on structural similarity index (SSIM) regression through a residual neural network, with the option of also classifying images into different groups by subdividing the SSIM range. This method predicts SSIM values of an input image in the absence of a reference ground truth image. The networks were able to detect motion artefacts, and the best performance for the regression and classification tasks was always achieved with ResNet-18 with contrast augmentation. The mean and standard deviation of the residuals' distribution were $\mu=-0.0009$ and $\sigma=0.0139$, respectively. For the classification task with 3, 5 and 10 classes, the best accuracies were 97%, 95% and 89%, respectively. The obtained results show that the proposed method could be a tool for supporting neuro-radiologists and radiographers in evaluating image quality before the diagnosis.
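    For orientation, the quantity being regressed above is the SSIM between an image and its reference. The sketch below is a simplified, single-window (global) SSIM on flat lists of intensities, not the paper's network and not the usual sliding-window variant. The function name `global_ssim` and the defaults `k1=0.01`, `k2=0.03` (the commonly used SSIM constants) are assumptions made for illustration.

```python
def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two equally sized intensity sequences.

    Illustrative only: practical SSIM averages this statistic over
    local windows instead of computing it once over the whole image.
    """
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    c1 = (k1 * data_range) ** 2  # stabilises the luminance term
    c2 = (k2 * data_range) ** 2  # stabilises the contrast/structure term
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) * (a - mu_x) for a in x) / n
    var_y = sum((b - mu_y) * (b - mu_y) for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2)
    return numerator / denominator


reference = [0.1, 0.5, 0.9, 0.3]
degraded = [0.2, 0.4, 0.6, 0.5]
score_same = global_ssim(reference, reference)      # identical images
score_degraded = global_ssim(reference, degraded)   # strictly lower
```

    A reference-free predictor such as the one described in the abstract would be trained to output this kind of score from the degraded image alone.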
    Precise expressions for random projections: Low-rank approximation and randomized Newton. (arXiv:2006.10653v3 [cs.LG] UPDATED)
    It is often desirable to reduce the dimensionality of a large dataset by projecting it onto a low-dimensional subspace. Matrix sketching has emerged as a powerful technique for performing such dimensionality reduction very efficiently. Even though there is an extensive literature on the worst-case performance of sketching, existing guarantees are typically very different from what is observed in practice. We exploit recent developments in the spectral analysis of random matrices to develop novel techniques that provide provably accurate expressions for the expected value of random projection matrices obtained via sketching. These expressions can be used to characterize the performance of dimensionality reduction in a variety of common machine learning tasks, ranging from low-rank approximation to iterative stochastic optimization. Our results apply to several popular sketching methods, including Gaussian and Rademacher sketches, and they enable precise analysis of these methods in terms of spectral properties of the data. Empirical results show that the expressions we derive reflect the practical performance of these sketching methods, down to lower-order effects and even constant factors.
    CoCa: Contrastive Captioners are Image-Text Foundation Models. (arXiv:2205.01917v2 [cs.CV] UPDATED)
    Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder.
    Low-Rank Hankel Tensor Completion for Traffic Speed Estimation. (arXiv:2105.11335v2 [cs.LG] UPDATED)
    This paper studies the traffic state estimation (TSE) problem using sparse observations from mobile sensors. Most existing TSE methods either rely on well-defined physical traffic flow models or require large amounts of simulation data as input to train machine learning models. Different from previous studies, we propose a purely data-driven and model-free solution in this paper. We consider the TSE as a spatiotemporal matrix completion/interpolation problem, and apply spatiotemporal delay embedding to transform the original incomplete matrix into a fourth-order Hankel structured tensor. By imposing a low-rank assumption on this tensor structure, we can approximate and characterize both global and local spatiotemporal patterns in a data-driven manner. We use the truncated nuclear norm of a balanced spatiotemporal unfolding -- in which each column represents the vectorization of a small patch in the original matrix -- to approximate the tensor rank. An efficient solution algorithm based on the Alternating Direction Method of Multipliers (ADMM) is developed for model learning. The proposed framework only involves two hyperparameters, spatial and temporal window lengths, which are easy to set given the degree of data sparsity. We conduct numerical experiments on real-world high-resolution trajectory data, and our results demonstrate the effectiveness and superiority of the proposed model in some challenging scenarios.
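    As a one-dimensional illustration of the delay embedding mentioned above, the sketch below turns a sequence into a Hankel matrix whose anti-diagonals are constant. The paper applies the embedding along both space and time to obtain a fourth-order tensor; the helper name `hankel_embed` and the 1-D restriction are assumptions made for brevity.

```python
def hankel_embed(series, window):
    """Delay-embed a 1-D sequence into a Hankel matrix with `window` rows.

    Row i holds the series shifted by i positions, so entry (i, j)
    equals series[i + j].
    """
    n = len(series)
    if not 1 <= window <= n:
        raise ValueError("window must be between 1 and len(series)")
    cols = n - window + 1
    return [[series[i + j] for j in range(cols)] for i in range(window)]


speeds = [1, 2, 3, 4, 5]
embedded = hankel_embed(speeds, window=3)
# embedded == [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```

    Missing observations in the original sequence reappear in several entries of the embedded matrix, which is what lets a low-rank assumption on the embedding constrain them.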
    A Functional Information Perspective on Model Interpretation. (arXiv:2206.05700v2 [cs.LG] UPDATED)
    Contemporary predictive models are hard to interpret as their deep nets exploit numerous complex relations between input elements. This work suggests a theoretical framework for model interpretability by measuring the contribution of relevant features to the functional entropy of the network with respect to the input. We rely on the log-Sobolev inequality that bounds the functional entropy by the functional Fisher information with respect to the covariance of the data. This provides a principled way to measure the amount of information contribution of a subset of features to the decision function. Through extensive experiments, we show that our method surpasses existing interpretability sampling-based methods on various data signals such as image, text, and audio.
    Dynamic Relevance Learning for Few-Shot Object Detection. (arXiv:2108.02235v2 [cs.CV] UPDATED)
    Expensive bounding-box annotations have limited the development of the object detection task. Thus, it is necessary to focus on the more challenging task of few-shot object detection, which requires the detector to recognize objects of novel classes with only a few training samples. Many popular existing methods, such as the Meta R-CNN series, adopt a training scheme similar to meta-learning and have achieved promising performance. However, support data is only used as class attention to guide the detection of query images each time; the relevance of the support samples to each other remains unexploited. Moreover, many recent works treat the support data and query images as independent branches without considering the relationship between them. To address this issue, we propose a dynamic relevance learning model, which utilizes the relationship between all support images and Regions of Interest (RoI) on the query images to construct a dynamic graph convolutional network (GCN). By adjusting the prediction distribution of the base detector using the output of this GCN, the proposed model serves as a hard auxiliary classification task, which guides the detector to improve the class representation implicitly. Comprehensive experiments have been conducted on the Pascal VOC and MS-COCO datasets. The proposed model achieves the best overall performance, which shows its effectiveness in learning more generalized features. Our code is available at https://github.com/liuweijie19980216/DRL-for-FSOD.
    Scaling ResNets in the Large-depth Regime. (arXiv:2206.06929v1 [cs.LG])
    Deep ResNets are recognized for achieving state-of-the-art results in complex machine learning tasks. However, the remarkable performance of these architectures relies on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients, particularly as the depth $L$ increases. No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$. We show in a probabilistic setting that with standard i.i.d. initializations, the only non-trivial dynamics is for $\alpha_L = 1/\sqrt{L}$ (other choices lead either to explosion or to identity mapping). This scaling factor corresponds in the continuous-time limit to a neural stochastic differential equation, contrary to a widespread interpretation that deep ResNets are discretizations of neural ordinary differential equations. By contrast, in the latter regime, stability is obtained with specific correlated initializations and $\alpha_L = 1/L$. Our analysis suggests a strong interplay between scaling and regularity of the weights as a function of the layer index. Finally, in a series of experiments, we exhibit a continuous range of regimes driven by these two parameters, which jointly impact performance before and after training.
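    The three regimes named above can be checked with a back-of-the-envelope calculation under a strong simplifying assumption that is not taken from the paper: linear residual layers $h_{k+1} = h_k + \alpha_L W_k h_k$ with i.i.d. zero-mean weights of variance $1/d$, independent of $h_k$. In that case the expected squared norm grows by a factor $(1 + \alpha_L^2)$ per layer, so the total growth after $L$ layers is $(1 + \alpha_L^2)^L$. The function name below is hypothetical.

```python
import math


def expected_norm_growth(depth, alpha):
    """Expected growth of ||h||^2 across `depth` linear residual layers.

    Assumes h_{k+1} = h_k + alpha * W_k h_k with i.i.d. zero-mean
    weights of variance 1/d (a simplification used for illustration).
    """
    return (1.0 + alpha ** 2) ** depth


L = 1000
growth_unscaled = expected_norm_growth(L, 1.0)                # 2**L: explosion
growth_sqrt = expected_norm_growth(L, 1.0 / math.sqrt(L))     # close to e
growth_inverse = expected_norm_growth(L, 1.0 / L)             # close to 1
```

    In this toy model only the $1/\sqrt{L}$ scaling gives a growth factor that is neither exploding nor trivially equal to one as $L$ grows, which matches the qualitative statement in the abstract.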
    Resource Allocation for Compression-aided Federated Learning with High Distortion Rate. (arXiv:2206.06976v1 [cs.IT])
    Recently, a considerable amount of work has been done to tackle the communication burden in federated learning (FL) (e.g., model quantization, data sparsification, and model compression). However, existing methods that boost the communication efficiency in FL incur a considerable trade-off between communication efficiency and global convergence rate. We formulate an optimization problem for compression-aided FL, which captures the relationship between the distortion rate, number of participating IoT devices, and convergence rate. The objective is to minimize the total transmission time for FL convergence. Because the problem is non-convex, we propose to decompose it into sub-problems. Based on the property of an FL model, we first determine the number of IoT devices participating in the FL process. Then, the communication between IoT devices and the server is optimized by efficiently allocating wireless resources based on a coalition game. Our theoretical analysis shows that, by actively controlling the number of participating IoT devices, we can avoid the training divergence of compression-aided FL while maintaining the communication efficiency.
    COVIDHunter: COVID-19 pandemic wave prediction and mitigation via seasonality-aware modeling. (arXiv:2206.06692v1 [q-bio.QM])
    Early detection and isolation of COVID-19 patients are essential for successful implementation of mitigation strategies and eventually curbing the disease spread. With a limited number of daily COVID-19 tests performed in every country, simulating the COVID-19 spread along with the potential effect of each mitigation strategy currently remains one of the most effective ways of managing the healthcare system and guiding policy-makers. We introduce COVIDHunter, a flexible and accurate COVID-19 outbreak simulation model that evaluates the current mitigation measures that are applied to a region, predicts COVID-19 statistics (the daily number of cases, hospitalizations, and deaths), and provides suggestions on what strength the upcoming mitigation measure should be. The key idea of COVIDHunter is to quantify the spread of COVID-19 in a geographical region by simulating the average number of new infections caused by an infected person considering the effect of external factors, such as environmental conditions (e.g., climate, temperature, humidity), different variants of concern, vaccination rate, and mitigation measures. Using Switzerland as a case study, COVIDHunter estimates that we are experiencing a deadly new wave that will peak on 26 January 2022, which is very similar in numbers to the wave we had in February 2020. Policy-makers have only one choice, which is to increase the strength of the currently applied mitigation measures for 30 days. Unlike existing models, the COVIDHunter model accurately monitors and predicts the daily number of cases, hospitalizations, and deaths due to COVID-19. Our model is flexible to configure and simple to modify for modeling different scenarios under different environmental conditions and mitigation measures. We release the source code of the COVIDHunter implementation at https://github.com/CMU-SAFARI/COVIDHunter.
    Hierarchical Primitive Composition: Simultaneous Activation of Skills with Inconsistent Action Dimensions in Multiple Hierarchies. (arXiv:2110.01833v4 [cs.LG] UPDATED)
    Deep reinforcement learning has shown its effectiveness in various applications, providing a promising direction for solving tasks with high complexity. However, naively applying classical RL for learning a complex long-horizon task with a single control policy is inefficient. Thus, policy modularization tackles this problem by learning a set of modules that are mapped to primitives and properly orchestrating them. In this study, we further expand the discussion by incorporating simultaneous activation of the skills and structuring them into multiple hierarchies in a recursive fashion. Moreover, we sought to devise an algorithm that can properly orchestrate the skills with different action spaces via multiplicative Gaussian distributions, which highly increases the reusability. By exploiting the modularity, interpretability can also be achieved by observing the modules that are used in the new task if each of the skills is known. We demonstrate how the proposed scheme can be employed in practice by solving a pick and place task with a 6 DoF manipulator, and examine the effects of each property from ablation studies.
    Recommender Transformers with Behavior Pathways. (arXiv:2206.06804v1 [cs.IR])
    Sequential recommendation requires the recommender to capture the evolving behavior characteristics from logged user behavior data for accurate recommendations. However, user behavior sequences are viewed as a script with multiple ongoing threads intertwined. We find that only a small set of pivotal behaviors evolve into the user's future actions. As a result, the future behavior of the user is hard to predict. We refer to this characteristic of each user's sequential behaviors as the Behavior Pathway. Different users have their unique behavior pathways. Among existing sequential models, transformers have shown great capacity in capturing global-dependent characteristics. However, these models mainly provide a dense distribution over all previous behaviors using the self-attention mechanism, making the final predictions overwhelmed by the trivial behaviors not adjusted to each user. In this paper, we build the Recommender Transformer (RETR) with a novel Pathway Attention mechanism. RETR can dynamically plan the behavior pathway specified for each user, and sparingly activate the network through this behavior pathway to effectively capture evolving patterns useful for recommendation. The key design is a learned binary route to prevent the behavior pathway from being overwhelmed by trivial behaviors. We empirically verify the effectiveness of RETR on seven real-world datasets, where it yields state-of-the-art performance.
    Eigencurve: Optimal Learning Rate Schedule for SGD on Quadratic Objectives with Skewed Hessian Spectrums. (arXiv:2110.14109v3 [cs.LG] UPDATED)
    Learning rate schedulers have been widely adopted in training deep neural networks. Despite their practical importance, there is a discrepancy between their practice and their theoretical analysis. For instance, it is not known what schedules of SGD achieve best convergence, even for simple problems such as optimizing quadratic objectives. In this paper, we propose Eigencurve, the first family of learning rate schedules that can achieve minimax optimal convergence rates (up to a constant) for SGD on quadratic objectives when the eigenvalue distribution of the underlying Hessian matrix is skewed. The condition is quite common in practice. Experimental results show that Eigencurve can significantly outperform step decay in image classification tasks on CIFAR-10, especially when the number of epochs is small. Moreover, the theory inspires two simple learning rate schedulers for practical applications that can approximate Eigencurve. For some problems, the optimal shape of the proposed schedulers resembles that of cosine decay, which sheds light on the success of cosine decay in such situations. For other situations, the proposed schedulers are superior to cosine decay.
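    For readers unfamiliar with the two baselines the abstract compares against, here is a minimal sketch of step decay and cosine decay. It does not implement Eigencurve itself, whose exact form is not given in the abstract. The function names and the defaults (`drop=0.1`, `every=30`) are illustrative assumptions.

```python
import math


def step_decay(step, lr_max, drop=0.1, every=30):
    """Multiply the learning rate by `drop` once every `every` steps."""
    return lr_max * drop ** (step // every)


def cosine_decay(step, total_steps, lr_max, lr_min=0.0):
    """Anneal from lr_max to lr_min along half a cosine period."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Compare the two schedules at a few points of a 100-step run.
schedule = [(s, step_decay(s, 0.1), cosine_decay(s, 100, 0.1)) for s in (0, 50, 100)]
```

    Step decay is piecewise constant, while cosine decay changes smoothly and spends more of the run at intermediate learning rates.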
    Learning Optimal Fair Classification Trees. (arXiv:2201.09932v2 [cs.LG] UPDATED)
    The increasing use of machine learning in high-stakes domains -- where people's livelihoods are impacted -- creates an urgent need for interpretable and fair algorithms. In these settings it is also critical for such algorithms to be accurate. With these needs in mind, we propose a mixed integer optimization (MIO) framework for learning optimal classification trees of fixed depth that can be conveniently augmented with arbitrary domain specific fairness constraints. We benchmark our method against the state-of-the-art approach for building fair trees on popular datasets; given a fixed discrimination threshold, our approach improves out-of-sample (OOS) accuracy by 2.3 percentage points on average and obtains a higher OOS accuracy on 88.9% of the experiments. We also incorporate various algorithmic fairness notions into our method, showcasing its versatile modeling power that allows decision makers to fine-tune the trade-off between accuracy and fairness.
    The Kidneys Are Not All Normal: Investigating the Speckle Distributions of Transplanted Kidneys. (arXiv:2206.06654v1 [eess.IV])
    Modelling ultrasound speckle has generated considerable interest for its ability to characterize tissue properties. As speckle is dependent on the underlying tissue architecture, modelling it may aid in tasks like segmentation or disease detection. However, for the transplanted kidney where ultrasound is commonly used to investigate dysfunction, it is currently unknown which statistical distribution best characterises such speckle. This is especially true for the regions of the transplanted kidney: the cortex, the medulla and the central echogenic complex. Furthermore, it is unclear how these distributions vary by patient variables such as age, sex, body mass index, primary disease, or donor type. These traits may influence speckle modelling given their influence on kidney anatomy. We are the first to investigate these two aims. N=821 kidney transplant recipient B-mode images were automatically segmented into the cortex, medulla, and central echogenic complex using a neural network. Seven distinct probability distributions were fitted to each region. The Rayleigh and Nakagami distributions had model parameters that differed significantly between the three regions (p <= 0.05). While both had excellent goodness of fit, the Nakagami had higher Kullback-Leibler divergence. Recipient age correlated weakly with scale in the cortex (Omega: rho = 0.11, p = 0.004), while body mass index correlated weakly with shape in the medulla (m: rho = 0.08, p = 0.04). Neither sex, primary disease, nor donor type demonstrated any correlation. Based on our findings, we propose that the Nakagami distribution be used to characterize transplanted kidneys regionally, independent of disease etiology and most patient characteristics.
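    The Nakagami parameters mentioned above, shape m and scale Omega, can be estimated from envelope samples by the standard method of moments: Omega = E[x^2] and m = E[x^2]^2 / Var(x^2). The abstract does not say which fitting procedure the authors used, so this estimator and the name `nakagami_moment_fit` are assumptions for illustration.

```python
def nakagami_moment_fit(samples):
    """Method-of-moments estimate of the Nakagami shape m and scale Omega."""
    if not samples:
        raise ValueError("need at least one sample")
    squared = [x * x for x in samples]
    omega = sum(squared) / len(squared)  # scale: mean of squared amplitudes
    var_sq = sum((s - omega) ** 2 for s in squared) / len(squared)
    if var_sq == 0:
        raise ValueError("samples have zero variance; m is undefined")
    m = omega ** 2 / var_sq              # shape: inverse normalised variance
    return m, omega


# Toy amplitudes, chosen so the arithmetic is easy to follow by hand:
# squares are 1, 4, 9, so Omega = 14/3 and Var = 98/9, giving m = 2.
m_hat, omega_hat = nakagami_moment_fit([1.0, 2.0, 3.0])
```

    Fitting the estimator separately to pixels from each segmented region yields one (m, Omega) pair per region, which is the kind of regional comparison the abstract describes.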
    Deep Variational Implicit Processes. (arXiv:2206.06720v1 [stat.ML])
    Implicit processes (IPs) are a generalization of Gaussian processes (GPs). IPs may lack a closed-form expression but are easy to sample from. Examples include, among others, Bayesian neural networks or neural samplers. IPs can be used as priors over functions, resulting in flexible models with well-calibrated prediction uncertainty estimates. Methods based on IPs usually carry out function-space approximate inference, which overcomes some of the difficulties of parameter-space approximate inference. Nevertheless, the approximations employed often limit the expressiveness of the final model, resulting, e.g., in a Gaussian predictive distribution, which can be restrictive. We propose here a multi-layer generalization of IPs called the Deep Variational Implicit process (DVIP). This generalization is similar to that of deep GPs over GPs, but it is more flexible due to the use of IPs as the prior distribution over the latent functions. We describe a scalable variational inference algorithm for training DVIP and show that it outperforms previous IP-based methods and also deep GPs. We support these claims via extensive regression and classification experiments. We also evaluate DVIP on large datasets with up to several million data instances to illustrate its good scalability and performance.
    Integral Probability Metric based Regularization for Optimal Transport. (arXiv:2011.05001v4 [cs.LG] UPDATED)
    Recently it has been shown that Maximum Mean Discrepancy (MMD) based regularization for optimal transport (OT), unlike the popular Kullback Leibler (KL) based regularization, leads to a dimension-free bound on the sample complexity of estimation. On the other hand, interesting classes of metrics like the Generalized Wasserstein (GW) metrics and the Gaussian-Hellinger-Kantorovich (GHK) metrics are defined using Total Variation and KL based regularizations, respectively. It is, however, an open question if appropriate metrics could be defined using the sample-efficient MMD regularization. In this work, we not only bridge this gap, but further consider a generic family of regularizers based on Integral Probability Metrics (IPMs), which include MMD as a special case. We present novel IPM regularized $p$-Wasserstein style OT formulations and prove that they indeed induce metrics over measures. While some of these novel metrics can be interpreted as infimal convolutions of IPMs, interestingly, others turn out to be the IPM-analogues of GW and GHK metrics. Finally, we present finite sample-based formulations for estimating the squared-MMD regularized metric and the corresponding barycenter. We empirically study other desirable properties of the proposed metrics and show their applicability in various machine learning applications.
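    As background for the squared-MMD regularizer discussed above, here is a minimal sketch of the biased (V-statistic) estimate of squared MMD between two 1-D samples under a Gaussian kernel. It illustrates the discrepancy only, not the paper's regularized OT formulation. The function names and the default bandwidth are assumptions.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


near = [0.0, 0.1, 0.2]
far = [5.0, 5.1, 5.2]
same_distance = mmd_squared(near, near)  # zero up to rounding
far_distance = mmd_squared(near, far)    # clearly positive
```

    Because the estimate only needs pairwise kernel evaluations, its cost depends on the sample sizes rather than on the dimension of the data.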
    Explainable AI for High Energy Physics. (arXiv:2206.06632v1 [hep-ex])
    Neural networks are ubiquitous in high energy physics research. However, these highly nonlinear parameterized functions are treated as black boxes, whose inner workings for conveying information and building the desired input-output relationship are often intractable. Explainable AI (xAI) methods can be useful in determining a neural model's relationship with data, making it interpretable by establishing a quantitative and tractable relationship between the input and the model's output. In this letter of interest, we explore the potential of using xAI methods in the context of problems in high energy physics.
    Generalized Classification of Satellite Image Time Series with Thermal Positional Encoding. (arXiv:2203.09175v2 [cs.CV] UPDATED)
    Large-scale crop type classification is a task at the core of remote sensing efforts with applications of both economic and ecological importance. Current state-of-the-art deep learning methods are based on self-attention and use satellite image time series (SITS) to discriminate crop types based on their unique growth patterns. However, existing methods generalize poorly to regions not seen during training mainly due to not being robust to temporal shifts of the growing season caused by variations in climate. To this end, we propose Thermal Positional Encoding (TPE) for attention-based crop classifiers. Unlike previous positional encoding based on calendar time (e.g. day-of-year), TPE is based on thermal time, which is obtained by accumulating daily average temperatures over the growing season. Since crop growth is directly related to thermal time, but not calendar time, TPE addresses the temporal shifts between different regions to improve generalization. We propose multiple TPE strategies, including learnable methods, to further improve results compared to the common fixed positional encodings. We demonstrate our approach on a crop classification task across four different European regions, where we obtain state-of-the-art generalization results.
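    The idea of replacing calendar time with thermal time can be sketched in a few lines. Thermal time is computed here as a running sum of daily mean temperatures, and the result is fed to a standard fixed sinusoidal encoding. The `base_temp` threshold, the clipping of negative contributions, the encoding dimension, and both function names are assumptions for illustration; the paper also proposes learnable variants that are not shown.

```python
import math
from itertools import accumulate


def thermal_time(daily_mean_temps, base_temp=0.0):
    """Accumulate daily mean temperatures above `base_temp` over the season."""
    return list(accumulate(max(t - base_temp, 0.0) for t in daily_mean_temps))


def sinusoidal_encoding(position, dim=8, max_period=10000.0):
    """Fixed sin/cos positional encoding of a scalar position."""
    encoding = []
    for i in range(0, dim, 2):
        freq = 1.0 / (max_period ** (i / dim))
        encoding.append(math.sin(position * freq))
        encoding.append(math.cos(position * freq))
    return encoding[:dim]


temps = [10.0, 12.0, 8.0]          # hypothetical daily means in degrees C
positions = thermal_time(temps)    # [10.0, 22.0, 30.0]
encodings = [sinusoidal_encoding(p) for p in positions]
```

    Two regions with shifted growing seasons would receive different day-of-year positions for the same crop stage but similar thermal-time positions, which is the motivation given in the abstract.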
    Risk and optimal policies in bandit experiments. (arXiv:2112.06363v6 [econ.EM] UPDATED)
    We provide a decision theoretic analysis of bandit experiments. Working within the framework of diffusion asymptotics, we define suitable notions of asymptotic Bayes and minimax risk for these experiments. For normally distributed rewards, the minimal Bayes risk can be characterized as the solution to a nonlinear second-order partial differential equation (PDE). Using a limit of experiments approach, we show that this PDE characterization also holds asymptotically under both parametric and non-parametric distribution of the rewards. The approach further describes the state variables it is asymptotically sufficient to restrict attention to, and therefore suggests a practical strategy for dimension reduction. The upshot is that we can approximate the dynamic programming problem defining the bandit experiment with a PDE which can be efficiently solved using sparse matrix routines. We derive the optimal Bayes and minimax policies from the numerical solutions to these PDEs. The proposed policies substantially dominate existing methods such as Thompson sampling. The framework can be generalized to allow for time discounting and pure exploration motives.
    Highly Efficient Structural Learning of Sparse Staged Trees. (arXiv:2206.06970v1 [stat.ML])
    Several structural learning algorithms for staged tree models, an asymmetric extension of Bayesian networks, have been defined. However, they do not scale efficiently as the number of variables considered increases. Here we introduce the first scalable structural learning algorithm for staged trees, which searches over a space of models where only a small number of dependencies can be imposed. A simulation study as well as a real-world application illustrate our routines and the practical use of such data-learned staged trees.
    On Convergence of Federated Averaging Langevin Dynamics. (arXiv:2112.05120v2 [stat.ML] UPDATED)
    We propose a federated averaging Langevin algorithm (FA-LD) for uncertainty quantification and mean predictions with distributed clients. In particular, we generalize beyond normal posterior distributions and consider a general class of models. We develop theoretical guarantees for FA-LD for strongly log-concave distributions with non-i.i.d. data and study how the injected noise and the stochastic-gradient noise, the heterogeneity of data, and the varying learning rates affect the convergence. Such an analysis sheds light on the optimal choice of local updates to minimize communication costs. Important to our approach is that the communication efficiency does not deteriorate with the injected noise in the Langevin algorithms. In addition, we examine in our FA-LD algorithm both independent and correlated noise used over different clients. We observe pairwise trade-offs among communication, accuracy, and data privacy. As local devices may become inactive in federated networks, we also show convergence results based on different averaging schemes where only partial device updates are available. In such a case, we discover an additional bias that does not decay to zero.
    On the Symmetries of Deep Learning Models and their Internal Representations. (arXiv:2205.14258v2 [cs.LG] UPDATED)
    Symmetry has been a fundamental tool in the exploration of a broad range of complex systems. In machine learning, symmetry has been explored in both models and data. In this paper we seek to connect the symmetries arising from the architecture of a family of models with the symmetries of that family's internal representation of data. We do this by calculating a set of fundamental symmetry groups, which we call the intertwiner groups of the model. Each of these arises from a particular nonlinear layer of the model and different nonlinearities result in different symmetry groups. These groups change the weights of a model in such a way that the underlying function that the model represents remains constant but the internal representations of data inside the model may change. We connect intertwiner groups to a model's internal representations of data through a range of experiments that probe similarities between hidden states across models with the same architecture. Our work suggests that the symmetries of a network are propagated into the symmetries in that network's representation of data, providing us with a better understanding of how architecture affects the learning and prediction process. Finally, we speculate that for ReLU networks, the intertwiner groups may provide a justification for the common practice of concentrating model interpretability exploration on the activation basis in hidden layers rather than arbitrary linear combinations thereof.
    Zeroth-Order Topological Insights into Iterative Magnitude Pruning. (arXiv:2206.06563v1 [cs.LG])
    Modern-day neural networks are famously large, yet also highly redundant and compressible; there exist numerous pruning strategies in the deep learning literature that yield over 90% sparser sub-networks of fully-trained, dense architectures while still maintaining their original accuracies. Amongst these many methods though -- thanks to its conceptual simplicity, ease of implementation, and efficacy -- Iterative Magnitude Pruning (IMP) dominates in practice and is the de facto baseline to beat in the pruning community. However, theoretical explanations as to why a simplistic method such as IMP works at all are few and limited. In this work, we leverage the notion of persistent homology to gain insights into the workings of IMP and show that it inherently encourages retention of those weights which preserve topological information in a trained network. Subsequently, we also provide bounds on how much different networks can be pruned while perfectly preserving their zeroth order topological features, and present a modified version of IMP to do the same.
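    As a rough illustration of the prune-by-magnitude loop that IMP refers to, the sketch below removes a fixed fraction of the smallest surviving weights in each round. It is a generic textbook-style version, not the modified IMP proposed in the paper; the function names, the flat weight list, and the optional `retrain` callback are assumptions made for illustration.

```python
def magnitude_prune_step(weights, mask, fraction):
    """Prune `fraction` of the still-unpruned weights with the smallest |w|."""
    alive = [i for i, kept in enumerate(mask) if kept]
    n_prune = int(len(alive) * fraction)
    # Sort surviving indices by ascending magnitude; the first n_prune are dropped.
    alive.sort(key=lambda i: abs(weights[i]))
    new_mask = list(mask)
    for i in alive[:n_prune]:
        new_mask[i] = False
    return new_mask


def iterative_magnitude_pruning(weights, rounds, fraction, retrain=None):
    """Alternate (optional) retraining and magnitude pruning for `rounds` rounds.

    `retrain` is a hypothetical callback (weights, mask) -> weights standing in
    for the training step between pruning rounds.
    """
    mask = [True] * len(weights)
    for _ in range(rounds):
        if retrain is not None:
            weights = retrain(weights, mask)
        mask = magnitude_prune_step(weights, mask, fraction)
    return mask
```

    With eight weights, two rounds, and a fraction of 0.5, the loop keeps the two largest-magnitude weights.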
    Federated Optimization Algorithms with Random Reshuffling and Gradient Compression. (arXiv:2206.07021v1 [cs.LG])
    Gradient compression is a popular technique for improving communication complexity of stochastic first-order methods in distributed training of machine learning models. However, the existing works consider only with-replacement sampling of stochastic gradients. In contrast, it is well-known in practice and recently confirmed in theory that stochastic methods based on without-replacement sampling, e.g., Random Reshuffling (RR) method, perform better than ones that sample the gradients with-replacement. In this work, we close this gap in the literature and provide the first analysis of methods with gradient compression and without-replacement sampling. We first develop a distributed variant of random reshuffling with gradient compression (Q-RR), and show how to reduce the variance coming from gradient quantization through the use of control iterates. Next, to have a better fit to Federated Learning applications, we incorporate local computation and propose a variant of Q-RR called Q-NASTYA. Q-NASTYA uses local gradient steps and different local and global stepsizes. Next, we show how to reduce compression variance in this setting as well. Finally, we prove the convergence results for the proposed methods and outline several settings in which they improve upon existing algorithms.
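    The two ingredients contrasted in this abstract can be sketched in a few lines: index sampling with and without replacement, and an unbiased gradient compressor. The Rand-k sparsifier below is a stand-in chosen for simplicity; it is an assumption for illustration and not necessarily the compressor analysed in the paper, and none of the function names come from the authors' code.

```python
import random


def with_replacement_indices(n, rng):
    """One epoch of n draws sampled independently (indices may repeat)."""
    return [rng.randrange(n) for _ in range(n)]


def random_reshuffling_indices(n, rng):
    """One epoch of Random Reshuffling: a permutation, each index used once."""
    idx = list(range(n))
    rng.shuffle(idx)
    return idx


def rand_k_compress(grad, k, rng):
    """Rand-k sparsifier: keep k random coordinates and scale them by d/k.

    Each coordinate survives with probability k/d, so the scaling makes the
    compressed vector an unbiased estimate of `grad`.
    """
    d = len(grad)
    keep = set(rng.sample(range(d), k))
    scale = d / k
    return [g * scale if i in keep else 0.0 for i, g in enumerate(grad)]
```

    In a distributed loop, each client would walk through its own permutation and send `rand_k_compress(gradient, k, rng)` instead of the full gradient.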
    Severe Damage Recovery in Evolving Soft Robots through Differentiable Programming. (arXiv:2206.06674v1 [cs.NE])
    Biological systems are very robust to morphological damage, but artificial systems (robots) are currently not. In this paper we present a system based on neural cellular automata, in which locomoting robots are evolved and then given the ability to regenerate their morphology from damage through gradient-based training. Our approach thus combines the benefits of evolution to discover a wide range of different robot morphologies, with the efficiency of supervised training for robustness through differentiable update rules. The resulting neural cellular automata are able to grow virtual robots capable of regaining more than 80% of their functionality, even after severe types of morphological damage.
    A Low-Cost Robot Science Kit for Education with Symbolic Regression for Hypothesis Discovery and Validation. (arXiv:2204.04187v3 [cond-mat.mtrl-sci] UPDATED)
    The next generation of physical science involves robot scientists - autonomous physical science systems capable of experimental design, execution, and analysis in a closed loop. Such systems have shown real-world success for scientific exploration and discovery, including the first discovery of a best-in-class material. To build and use these systems, the next generation workforce requires expertise in diverse areas including ML, control systems, measurement science, materials synthesis, and decision theory, among others. However, education is lagging. Educators need a low-cost, easy-to-use platform to teach the required skills. Industry can also use such a platform for developing and evaluating autonomous physical science methodologies. We present the next generation in science education, a kit for building a low-cost autonomous scientist. The kit was used during two courses at the University of Maryland to teach undergraduate and graduate students autonomous physical science. We discuss its use in the courses and its greater capability to teach the dual tasks of autonomous model exploration, optimization, and determination, with an example of autonomous experimental "discovery" of the Henderson-Hasselbalch equation.
    Astock: A New Dataset and Automated Stock Trading based on Stock-specific News Analyzing Model. (arXiv:2206.06606v1 [cs.CL])
    Natural Language Processing (NLP) demonstrates great potential to support financial decision-making by analyzing text from social media or news outlets. In this work, we build a platform to study NLP-aided stock auto-trading algorithms systematically. In contrast to previous work, our platform is characterized by three features: (1) We provide financial news for each specific stock. (2) We provide various stock factors for each stock. (3) We evaluate performance using more financially relevant metrics. Such a design allows us to develop and evaluate NLP-aided stock auto-trading algorithms in a more realistic setting. In addition to designing an evaluation platform and collecting a dataset, we also make a technical contribution by proposing a system that automatically learns a good feature representation from various input information. The key to our algorithm is a method called Semantic Role Labeling Pooling (SRLP), which leverages Semantic Role Labeling (SRL) to create a compact representation of each news paragraph. Based on SRLP, we further incorporate other stock factors to make the final prediction. In addition, we propose a self-supervised learning strategy based on SRLP to enhance the out-of-distribution generalization performance of our system. Through our experimental study, we show that the proposed method achieves better performance, outperforming all the baselines in annualized rate of return as well as the maximum drawdown of the CSI300 and XIN9 indices in real trading. Our Astock dataset and code are available at https://github.com/JinanZou/Astock.
    On the Role of Channel Capacity in Learning Gaussian Mixture Models. (arXiv:2202.07707v2 [cs.IT] UPDATED)
    This paper studies the sample complexity of learning the $k$ unknown centers of a balanced Gaussian mixture model (GMM) in $\mathbb{R}^d$ with spherical covariance matrix $\sigma^2\mathbf{I}$. In particular, we are interested in the following question: what is the maximal noise level $\sigma^2$, for which the sample complexity is essentially the same as when estimating the centers from labeled measurements? To that end, we restrict attention to a Bayesian formulation of the problem, where the centers are uniformly distributed on the sphere $\sqrt{d}\mathcal{S}^{d-1}$. Our main results characterize the exact noise threshold $\sigma^2$ below which the GMM learning problem, in the large system limit $d,k\to\infty$, is as easy as learning from labeled observations, and above which it is substantially harder. The threshold occurs at $\frac{\log k}{d} = \frac12\log\left( 1+\frac{1}{\sigma^2} \right)$, which is the capacity of the additive white Gaussian noise (AWGN) channel. Thinking of the set of $k$ centers as a code, this noise threshold can be interpreted as the largest noise level for which the error probability of the code over the AWGN channel is small. Previous works on the GMM learning problem have identified the minimum distance between the centers as a key parameter in determining the statistical difficulty of learning the corresponding GMM. While our results are only proved for GMMs whose centers are uniformly distributed over the sphere, they hint that perhaps it is the decoding error probability associated with the center constellation as a channel code that determines the statistical difficulty of learning the corresponding GMM, rather than just the minimum distance.
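    The threshold condition above can be solved for the noise level directly: from $\frac{\log k}{d} = \frac12\log(1+\frac{1}{\sigma^2})$ it follows that $k^{2/d} = 1 + 1/\sigma^2$, hence $\sigma^2 = 1/(k^{2/d}-1)$. The helper below is a small sketch of that rearrangement; the function name is hypothetical and natural logarithms are assumed on both sides.

```python
import math


def awgn_noise_threshold(k, d):
    """Noise level sigma^2 at which log(k)/d equals the AWGN capacity.

    Solves log(k)/d = 0.5 * log(1 + 1/sigma^2) for sigma^2, which gives
    sigma^2 = 1 / (k**(2/d) - 1). Requires k > 1 and d >= 1.
    """
    rate = math.log(k) / d
    # expm1(2 * rate) == k**(2/d) - 1, computed stably for small rates.
    return 1.0 / math.expm1(2.0 * rate)
```

    For example, with k = 2 centers in dimension d = 2 the formula gives k^{2/d} = 2 and therefore a threshold of sigma^2 = 1.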
    Exploring Representation of Horn Clauses using GNNs. (arXiv:2206.06986v1 [cs.AI])
    Learning program semantics from raw source code is challenging due to the complexity of real-world programming language syntax and due to the difficulty of reconstructing long-distance relational information implicitly represented in programs using identifiers. Addressing the first point, we consider Constrained Horn Clauses (CHCs) as a standard representation of program verification problems, providing a simple and programming language-independent syntax. For the second challenge, we explore graph representations of CHCs, and propose a new Relational Hypergraph Neural Network (R-HyGNN) architecture to learn program features. We introduce two different graph representations of CHCs. One is called constraint graph (CG), and emphasizes syntactic information of CHCs by translating the symbols and their relations in CHCs as typed nodes and binary edges, respectively, and constructing the constraints as abstract syntax trees. The second one is called control- and data-flow hypergraph (CDHG), and emphasizes semantic information of CHCs by representing the control and data flow through ternary hyperedges. We then propose a new GNN architecture, R-HyGNN, extending Relational Graph Convolutional Networks, to handle hypergraphs. To evaluate the ability of R-HyGNN to extract semantic information from programs, we use R-HyGNNs to train models on the two graph representations, and on five proxy tasks with increasing difficulty, using benchmarks from CHC-COMP 2021 as training data. The most difficult proxy task requires the model to predict the occurrence of clauses in counter-examples, which subsumes satisfiability of CHCs. CDHG achieves 90.59% accuracy in this task. Furthermore, R-HyGNN has perfect predictions on one of the graphs consisting of more than 290 clauses. Overall, our experiments indicate that R-HyGNN can capture intricate program features for guiding verification problems.
    ReCo: Retrieve and Co-segment for Zero-shot Transfer. (arXiv:2206.07045v1 [cs.CV])
    Semantic segmentation has a broad range of applications, but its real-world impact has been significantly limited by the prohibitive annotation costs necessary to enable deployment. Segmentation methods that forgo supervision can side-step these costs, but exhibit the inconvenient requirement to provide labelled examples from the target distribution to assign concept names to predictions. An alternative line of work in language-image pre-training has recently demonstrated the potential to produce models that can both assign names across large vocabularies of concepts and enable zero-shot transfer for classification, but these models do not demonstrate commensurate segmentation abilities. In this work, we strive to achieve a synthesis of these two approaches that combines their strengths. We leverage the retrieval abilities of one such language-image pre-trained model, CLIP, to dynamically curate training sets from unlabelled images for arbitrary collections of concept names, and leverage the robust correspondences offered by modern image representations to co-segment entities among the resulting collections. The synthetic segment collections are then employed to construct a segmentation model (without requiring pixel labels) whose knowledge of concepts is inherited from the scalable pre-training process of CLIP. We demonstrate that our approach, termed Retrieve and Co-segment (ReCo), performs favourably compared to unsupervised segmentation approaches while inheriting the convenience of nameable predictions and zero-shot transfer. We also demonstrate ReCo's ability to generate specialist segmenters for extremely rare objects.
    Learning Best Combination for Efficient N:M Sparsity. (arXiv:2206.06662v1 [cs.LG])
    By forcing at most N out of M consecutive weights to be non-zero, the recent N:M network sparsity has received increasing attention for its two attractive advantages: 1) promising performance at high sparsity, and 2) significant speedups on NVIDIA A100 GPUs. Recent studies require an expensive pre-training phase or a heavy dense-gradient computation. In this paper, we show that N:M learning can be naturally characterized as a combinatorial problem which searches for the best combination candidate within a finite collection. Motivated by this characteristic, we solve N:M sparsity in an efficient divide-and-conquer manner. First, we divide the weight vector into $C_{\text{M}}^{\text{N}}$ combination subsets of a fixed size N. Then, we conquer the combinatorial problem by assigning each combination a learnable score that is jointly optimized with its associated weights. We prove that the introduced scoring mechanism can model the relative importance between combination subsets well. By gradually removing low-scored subsets, N:M fine-grained sparsity can be efficiently optimized during the normal training phase. Comprehensive experiments demonstrate that our learning best combination (LBC) method performs consistently better than off-the-shelf N:M sparsity methods across various networks. Our code is released at https://github.com/zyxxmu/LBC.
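    To make the N:M constraint concrete, the sketch below builds a mask that keeps the N largest-magnitude weights in every group of M consecutive weights. This is the plain magnitude baseline, not the learned-score LBC procedure described in the abstract; the function names are assumptions for illustration. The second helper counts the candidate combinations per group, i.e. the binomial coefficient written as $C_{\text{M}}^{\text{N}}$ above.

```python
import math


def nm_magnitude_mask(weights, n, m):
    """Boolean mask keeping the n largest |w| in each group of m consecutive weights."""
    if len(weights) % m != 0:
        raise ValueError("len(weights) must be a multiple of m")
    mask = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Positions within the group, ordered by descending magnitude.
        order = sorted(range(m), key=lambda j: abs(group[j]), reverse=True)
        keep = set(order[:n])
        mask.extend(j in keep for j in range(m))
    return mask


def combinations_per_group(n, m):
    """Number of ways to choose which n of the m positions stay non-zero."""
    return math.comb(m, n)
```

    For the common 2:4 pattern there are 6 candidate combinations per group, so a score-based search such as the one in the abstract ranks 6 candidates for each group of four weights.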
    Towards Alternative Techniques for Improving Adversarial Robustness: Analysis of Adversarial Training at a Spectrum of Perturbations. (arXiv:2206.06496v1 [cs.LG])
    Adversarial training (AT) and its variants have spearheaded progress in improving neural network robustness to adversarial perturbations and common corruptions in the last few years. Algorithm design of AT and its variants is focused on training models at a specified perturbation strength $\epsilon$ and only using the feedback from the performance of that $\epsilon$-robust model to improve the algorithm. In this work, we focus on models trained on a spectrum of $\epsilon$ values. We analyze three perspectives: model performance, intermediate feature precision and convolution filter sensitivity. In each, we identify alternative improvements to AT that otherwise wouldn't have been apparent at a single $\epsilon$. Specifically, we find that for a PGD attack at some strength $\delta$, there is an AT model at some slightly larger strength $\epsilon$, but no greater, that generalizes best to it. Hence, we propose overdesigning for robustness where we suggest training models at an $\epsilon$ just above $\delta$. Second, we observe (across various $\epsilon$ values) that robustness is highly sensitive to the precision of intermediate features and particularly those after the first and second layer. Thus, we propose adding a simple quantization to defenses that improves accuracy on seen and unseen adaptive attacks. Third, we analyze convolution filters of each layer of models at increasing $\epsilon$ and notice that those of the first and second layer may be solely responsible for amplifying input perturbations. We present our findings and demonstrate our techniques through experiments with ResNet and WideResNet models on the CIFAR-10 and CIFAR-10-C datasets.
    SpecNet2: Orthogonalization-free spectral embedding by neural networks. (arXiv:2206.06644v1 [stat.ML])
    Spectral methods which represent data points by eigenvectors of kernel matrices or graph Laplacian matrices have been a primary tool in unsupervised data analysis. In many application scenarios, parametrizing the spectral embedding by a neural network that can be trained over batches of data samples gives a promising way to achieve automatic out-of-sample extension as well as computational scalability. Such an approach was taken in the original paper of SpectralNet (Shaham et al. 2018), which we call SpecNet1. The current paper introduces a new neural network approach, named SpecNet2, to compute spectral embedding which optimizes an equivalent objective of the eigen-problem and removes the orthogonalization layer in SpecNet1. SpecNet2 also allows separating the sampling of rows and columns of the graph affinity matrix by tracking the neighbors of each data point through the gradient formula. Theoretically, we show that any local minimizer of the new orthogonalization-free objective reveals the leading eigenvectors. Furthermore, global convergence for this new orthogonalization-free objective using a batch-based gradient descent method is proved. Numerical experiments demonstrate the improved performance and computational efficiency of SpecNet2 on simulated data and image datasets.
    Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward. (arXiv:2206.06426v1 [cs.LG])
    The remarkable success of reinforcement learning (RL) heavily relies on observing the reward of every visited state-action pair. In many real-world applications, however, an agent can observe only a score that represents the quality of the whole trajectory, which is referred to as the trajectory-wise reward. In such a situation, it is difficult for standard RL methods to make good use of the trajectory-wise reward, and large bias and variance errors can be incurred in policy evaluation. In this work, we propose a novel offline RL algorithm, called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED), which decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward. To ensure the value functions constructed by PARTED are always pessimistic with respect to the optimal ones, we design a new penalty term to offset the uncertainty of the proxy reward. For general episodic MDPs with large state space, we show that PARTED with overparameterized neural network function approximation achieves an $\tilde{\mathcal{O}}(D_{\text{eff}}H^2/\sqrt{N})$ suboptimality, where $H$ is the length of episode, $N$ is the total number of samples, and $D_{\text{eff}}$ is the effective dimension of the neural tangent kernel matrix. To further illustrate the result, we show that PARTED achieves an $\tilde{\mathcal{O}}(dH^3/\sqrt{N})$ suboptimality with linear MDPs, where $d$ is the feature dimension, which matches the bound for neural network function approximation when $D_{\text{eff}}=dH$. To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise reward.
    Explain yourself! Effects of Explanations in Human-Robot Interaction. (arXiv:2204.04501v2 [cs.RO] UPDATED)
    Recent developments in explainable artificial intelligence promise the potential to transform human-robot interaction: Explanations of robot decisions could affect user perceptions, justify their reliability, and increase trust. However, the effects on human perceptions of robots that explain their decisions have not been studied thoroughly. To analyze the effect of explainable robots, we conduct a study in which two simulated robots play a competitive board game. While one robot explains its moves, the other robot only announces them. Providing explanations for its actions was not sufficient to change the perceived competence, intelligence, likeability or safety ratings of the robot. However, the results show that the robot that explains its moves is perceived as more lively and human-like. This study demonstrates the need for and potential of explainable human-robot interaction and the wider assessment of its effects as a novel research direction.
    Look, Radiate, and Learn: Self-supervised Localisation via Radio-Visual Correspondence. (arXiv:2206.06424v1 [cs.LG])
    Next generation cellular networks will implement radio sensing functions alongside customary communications, thereby enabling unprecedented worldwide sensing coverage outdoors. Deep learning has revolutionised computer vision but has had limited application to radio perception tasks, in part due to lack of systematic datasets and benchmarks dedicated to the study of the performance and promise of radio sensing. To address this gap, we present MaxRay: a synthetic radio-visual dataset and benchmark that facilitate precise target localisation in radio. We further propose to learn to localise targets in radio without supervision by extracting self-coordinates from radio-visual correspondence. We use such self-supervised coordinates to train a radio localiser network. We characterise our performance against a number of state-of-the-art baselines. Our results indicate that accurate radio target localisation can be automatically learned from paired radio-visual data without labels, which is highly relevant to empirical data. This opens the door for vast data scalability and may prove key to realising the promise of robust radio sensing atop a unified perception-communication cellular infrastructure. Dataset will be hosted on IEEE DataPort.
    Adversarial Robustness via Fisher-Rao Regularization. (arXiv:2106.06685v3 [cs.LG] UPDATED)
    Adversarial robustness has become a topic of growing interest in machine learning since it was observed that neural networks tend to be brittle. We propose an information-geometric formulation of adversarial defense and introduce FIRE, a new Fisher-Rao regularization for the categorical cross-entropy loss, which is based on the geodesic distance between the softmax outputs corresponding to natural and perturbed input features. Based on the information-geometric properties of the class of softmax distributions, we derive an explicit characterization of the Fisher-Rao Distance (FRD) for the binary and multiclass cases, and draw some interesting properties as well as connections with standard regularization metrics. Furthermore, for a simple linear and Gaussian model, we show that all Pareto-optimal points in the accuracy-robustness region can be reached by FIRE while other state-of-the-art methods fail. Empirically, we evaluate the performance of various classifiers trained with the proposed loss on standard datasets, showing a simultaneous improvement of up to 1% in clean and robust performance while reducing the training time by 20% over the best-performing methods.
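    For categorical distributions, the Fisher-Rao geodesic distance has a well-known closed form, $2\arccos\left(\sum_i \sqrt{p_i q_i}\right)$. The sketch below evaluates it on softmax outputs, as a minimal illustration of the quantity the abstract regularizes. It assumes this standard closed form and does not reproduce the paper's exact loss, weighting, or training loop; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two categorical distributions.

    Uses the closed form 2 * arccos(sum_i sqrt(p_i * q_i)); the sum is clamped
    to [0, 1] to guard against floating-point overshoot.
    """
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    bc = min(1.0, max(0.0, bc))
    return 2.0 * math.acos(bc)
```

    A regularizer in this spirit would add `fisher_rao_distance(softmax(clean_logits), softmax(perturbed_logits))`, scaled by a hypothetical weight, to the cross-entropy loss. The distance is zero for identical outputs and reaches its maximum of pi for distributions with disjoint support.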
    AuxMix: Semi-Supervised Learning with Unconstrained Unlabeled Data. (arXiv:2206.06959v1 [cs.CV])
    Semi-supervised learning (SSL) has seen great strides when labeled data is scarce but unlabeled data is abundant. Critically, most recent work assumes that such unlabeled data is drawn from the same distribution as the labeled data. In this work, we show that state-of-the-art SSL algorithms suffer a degradation in performance in the presence of unlabeled auxiliary data that does not necessarily possess the same class distribution as the labeled set. We term this problem Auxiliary-SSL and propose AuxMix, an algorithm that leverages self-supervised learning tasks to learn generic features in order to mask auxiliary data that are not semantically similar to the labeled set. We also propose to regularize learning by maximizing the predicted entropy for dissimilar auxiliary samples. We show an improvement of 5% over existing baselines on a ResNet-50 model when trained on the CIFAR10 dataset with 4k labeled samples and all unlabeled data drawn from the Tiny-ImageNet dataset. We report competitive results on several datasets and conduct ablation studies.
    Realistic Actor-Critic: A Framework for Balance Between Value Overestimation and Underestimation. (arXiv:2110.09712v4 [cs.LG] UPDATED)
    This paper proposes a reinforcement learning framework to enhance the exploration-exploitation trade-off by learning a range of policies concerning various confidence bounds. Underestimated values provide stable updates but suffer from inefficient exploration behaviors. On the other hand, overestimated values can help the agent escape local optima, but they might cause over-exploration of low-value areas and an accumulation of function approximation errors. Algorithms have been proposed to mitigate the above contradiction. However, we lack an understanding of how the value bias impacts performance and a method for efficient exploration that keeps the value away from catastrophic accumulation of overestimation bias. In this paper, we 1) highlight that both under- and overestimation bias can improve learning efficiency, and that this is a particular form of the exploration-exploitation dilemma; 2) propose a unified framework called Realistic Actor-Critic (RAC), which employs Universal Value Function Approximators (UVFA) to simultaneously learn policies with different value confidence bounds with the same neural network, each with a different under-overestimation trade-off. This allows us to perform directed exploration without over-exploration using the upper bounds while still avoiding overestimation using the lower bounds. In addition, we propose a variant of the soft Bellman backup, called punished Bellman backup, which provides fine-grained estimation bias control to train policies efficiently. Through carefully designed experiments, we empirically verify that RAC achieves 10x sample efficiency and a 25% performance improvement compared to Soft Actor-Critic on the most challenging Humanoid environment. All the source code is available at https://github.com/ihuhuhu/RAC.
    A Consistent and Efficient Evaluation Strategy for Attribution Methods. (arXiv:2202.00449v2 [cs.CV] UPDATED)
    With a variety of local feature attribution methods being proposed in recent years, follow-up work suggested several evaluation strategies. To assess the attribution quality across different attribution techniques, the most popular among these evaluation strategies in the image domain use pixel perturbations. However, recent advances discovered that different evaluation strategies produce conflicting rankings of attribution methods and can be prohibitively expensive to compute. In this work, we present an information-theoretic analysis of evaluation strategies based on pixel perturbations. Our findings reveal that the results are strongly affected by information leakage through the shape of the removed pixels as opposed to their actual values. Using our theoretical insights, we propose a novel evaluation framework termed Remove and Debias (ROAD) which offers two contributions: First, it mitigates the impact of the confounders, which entails higher consistency among evaluation strategies. Second, ROAD does not require the computationally expensive retraining step and saves up to 99% in computational costs compared to the state-of-the-art. We release our source code at https://github.com/tleemann/road_evaluation.
    Manifold Alignment-Based Multi-Fidelity Reduced-Order Modeling Applied to Structural Analysis. (arXiv:2206.06920v1 [cs.LG])
    This work presents the application of a recently developed parametric, non-intrusive, and multi-fidelity reduced-order modeling method on high-dimensional displacement and stress fields arising from the structural analysis of geometries that differ in the size of discretization and structural topology. The proposed approach leverages manifold alignment to fuse inconsistent field outputs from high- and low-fidelity simulations by individually projecting their solutions onto a common subspace. The effectiveness of the method is demonstrated on two multi-fidelity scenarios involving the structural analysis of a benchmark wing geometry. Results show that outputs from structural simulations using incompatible grids, or related yet different topologies, are easily combined into a single predictive model, thus eliminating the need for additional pre-processing of the data. The new multi-fidelity reduced-order model achieves a higher predictive accuracy at a lower computational cost when compared to a single-fidelity model.
    Multimodal Learning with Transformers: A Survey. (arXiv:2206.06488v1 [cs.CV])
    Transformer is a promising neural network learner, and has achieved great success in various machine learning tasks. Thanks to the recent prevalence of multimodal applications and big data, Transformer-based multimodal learning has become a hot topic in AI research. This paper presents a comprehensive survey of Transformer techniques oriented at multimodal data. The main contents of this survey include: (1) a background of multimodal learning, Transformer ecosystem, and the multimodal big data era, (2) a theoretical review of Vanilla Transformer, Vision Transformer, and multimodal Transformers, from a geometrically topological perspective, (3) a review of multimodal Transformer applications, via two important paradigms, i.e., for multimodal pretraining and for specific multimodal tasks, (4) a summary of the common challenges and designs shared by the multimodal Transformer models and applications, and (5) a discussion of open problems and potential research directions for the community.
    A Local Optima Network Analysis of the Feedforward Neural Architecture Space. (arXiv:2206.06903v1 [cs.NE])
    This study investigates the use of local optima network (LON) analysis, a derivative of the fitness landscape of candidate solutions, to characterise and visualise the neural architecture space. The search space of feedforward neural network architectures with up to three layers, each with up to 10 neurons, is fully enumerated by evaluating trained model performance on a selection of data sets. Extracted LONs, while heterogeneous across data sets, all exhibit simple global structures, with single global funnels in all cases but one. These results yield early indication that LONs may provide a viable paradigm by which to analyse and optimise neural architectures.
    Probabilistic Conformal Prediction Using Conditional Random Samples. (arXiv:2206.06584v1 [stat.ML])
    This paper proposes probabilistic conformal prediction (PCP), a predictive inference algorithm that estimates a target variable by a discontinuous predictive set. Given inputs, PCP constructs the predictive set based on random samples from an estimated generative model. It is efficient and compatible with either explicit or implicit conditional generative models. Theoretically, we show that PCP guarantees correct marginal coverage with finite samples. Empirically, we study PCP on a variety of simulated and real datasets. Compared to existing methods for conformal inference, PCP provides sharper predictive sets.
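    One plausible reading of the sample-based construction is sketched below for a scalar target: the nonconformity score of a calibration point is its distance to the nearest of K model samples, the radius is the usual split-conformal quantile of those scores, and the predictive set is a union of intervals around fresh samples, which is why it can be discontinuous. This is an assumption-laden sketch rather than the authors' implementation; the function names and the scalar setting are chosen for illustration.

```python
import math


def pcp_radius(cal_samples, cal_targets, alpha):
    """Split-conformal radius from calibration data.

    cal_samples[i] holds K draws from the (hypothetical) conditional model at
    calibration input i; cal_targets[i] is the observed target there.
    """
    scores = sorted(
        min(abs(s - y) for s in samples)
        for samples, y in zip(cal_samples, cal_targets)
    )
    n = len(scores)
    # Finite-sample rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this level
    return scores[rank - 1]


def pcp_contains(y, samples, radius):
    """True if y lies within `radius` of at least one sample (union of intervals)."""
    return any(abs(s - y) <= radius for s in samples)
```

    At test time, one would draw K new samples at the query input and report every y for which `pcp_contains(y, samples, radius)` holds. For a bimodal conditional distribution this yields two separate intervals instead of one wide one.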
    Extracting Expert's Goals by What-if Interpretable Modeling. (arXiv:2110.15165v3 [cs.LG] UPDATED)
    Although reinforcement learning (RL) has had tremendous success in many fields, applying RL to real-world settings such as healthcare is challenging when the reward is hard to specify and no exploration is allowed. In this work, we focus on recovering clinicians' rewards in treating patients. We incorporate what-if reasoning to explain the clinician's treatments based on their potential future outcomes. We use generalized additive models (GAMs) - a class of accurate, interpretable models - to recover the reward. In both simulation and a real-world hospital dataset, we show our model outperforms baselines. Finally, our model's explanations match several clinical guidelines when treating patients, whereas we found that the commonly used linear model often contradicts them.
    Tailored max-out networks for learning convex PWQ functions. (arXiv:2206.06826v1 [eess.SY])
    Convex piecewise quadratic (PWQ) functions frequently appear in control and elsewhere. For instance, it is well-known that the optimal value function (OVF) as well as Q-functions for linear MPC are convex PWQ functions. Now, in learning-based control, these functions are often represented with the help of artificial neural networks (NN). In this context, a recurring question is how to choose the topology of the NN in terms of depth, width, and activations in order to enable efficient learning. An elegant answer to that question could be a topology that, in principle, allows an exact description of the function to be learned. Such solutions are already available for related problems. In fact, suitable topologies are known for piecewise affine (PWA) functions that can, for example, reflect the optimal control law in linear MPC. Following this direction, we show in this paper that convex PWQ functions can be exactly described by max-out-NN with only one hidden layer and two neurons.
    RoSGAS: Adaptive Social Bot Detection with Reinforced Self-Supervised GNN Architecture Search. (arXiv:2206.06757v1 [cs.SI])
    Social bots are automated accounts on social networks that attempt to behave like humans. While Graph Neural Networks (GNNs) have been massively applied to the field of social bot detection, a huge amount of domain expertise and prior knowledge is heavily engaged in the state-of-the-art approaches to design a dedicated neural network architecture for a specific classification task. Involving oversized nodes and network layers in the model design, however, usually causes the over-smoothing problem and the lack of embedding discrimination. In this paper, we propose RoSGAS, a novel Reinforced and Self-supervised GNN Architecture Search framework to adaptively pinpoint the most suitable multi-hop neighborhood and the number of layers in the GNN architecture. More specifically, we consider the social bot detection problem as a user-centric subgraph embedding and classification task. We exploit a heterogeneous information network to present the user connectivity by leveraging account metadata, relationships, behavioral features and content features. RoSGAS uses a multi-agent deep reinforcement learning (RL) mechanism for navigating the search of optimal neighborhood and network layers to learn individually the subgraph embedding for each target user. A nearest neighbor mechanism is developed for accelerating the RL training process, and RoSGAS can learn more discriminative subgraph embedding with the aid of self-supervised learning. Experiments on 5 Twitter datasets show that RoSGAS outperforms the state-of-the-art approaches in terms of accuracy, training efficiency and stability, and has better generalization when handling unseen samples.
    DeepTPI: Test Point Insertion with Deep Reinforcement Learning. (arXiv:2206.06975v1 [cs.AI])
    Test point insertion (TPI) is a widely used technique for testability enhancement, especially for logic built-in self-test (LBIST) due to its relatively low fault coverage. In this paper, we propose a novel TPI approach based on deep reinforcement learning (DRL), named DeepTPI. Unlike previous learning-based solutions that formulate the TPI task as a supervised-learning problem, we train a novel DRL agent, instantiated as the combination of a graph neural network (GNN) and a Deep Q-Learning network (DQN), to maximize the test coverage improvement. Specifically, we model circuits as directed graphs and design a graph-based value network to estimate the action values for inserting different test points. The policy of the DRL agent is defined as selecting the action with the maximum value. Moreover, we apply the general node embeddings from a pre-trained model to enhance node features, and propose a dedicated testability-aware attention mechanism for the value network. Experimental results on circuits with various scales show that DeepTPI significantly improves test coverage compared to the commercial DFT tool. The code of this work is available at https://github.com/cure-lab/DeepTPI.
    Overparametrized linear dimensionality reductions: From projection pursuit to two-layer neural networks. (arXiv:2206.06526v1 [stat.ML])
    Given a cloud of $n$ data points in $\mathbb{R}^d$, consider all projections onto $m$-dimensional subspaces of $\mathbb{R}^d$ and, for each such projection, the empirical distribution of the projected points. What does this collection of probability distributions look like when $n,d$ grow large? We consider this question under the null model in which the points are i.i.d. standard Gaussian vectors, focusing on the asymptotic regime in which $n,d\to\infty$, with $n/d\to\alpha\in (0,\infty)$, while $m$ is fixed. Denoting by $\mathscr{F}_{m, \alpha}$ the set of probability distributions in $\mathbb{R}^m$ that arise as low-dimensional projections in this limit, we establish new inner and outer bounds on $\mathscr{F}_{m, \alpha}$. In particular, we characterize the Wasserstein radius of $\mathscr{F}_{m,\alpha}$ up to logarithmic factors, and determine it exactly for $m=1$. We also prove sharp bounds in terms of Kullback-Leibler divergence and R\'{e}nyi information dimension. The previous question has application to unsupervised learning methods, such as projection pursuit and independent component analysis. We introduce a version of the same problem that is relevant for supervised learning, and prove a sharp Wasserstein radius bound. As an application, we establish an upper bound on the interpolation threshold of two-layer neural networks with $m$ hidden neurons.
    Revisiting the Shape-Bias of Deep Learning for Dermoscopic Skin Lesion Classification. (arXiv:2206.06466v1 [cs.CV])
    It is generally believed that the human visual system is biased towards the recognition of shapes rather than textures. This assumption has led to a growing body of work aiming to align deep models' decision-making processes with the fundamental properties of human vision. The reliance on shape features is primarily expected to improve the robustness of these models under covariate shift. In this paper, we revisit the significance of shape-biases for the classification of skin lesion images. Our analysis shows that different skin lesion datasets exhibit varying biases towards individual image features. Interestingly, despite deep feature extractors being inclined towards learning entangled features for skin lesion classification, individual features can still be decoded from this entangled representation. This indicates that these features are still represented in the learnt embedding spaces of the models, but not used for classification. In addition, the spectral analysis of different datasets shows that in contrast to common visual recognition, dermoscopic skin lesion classification, by nature, is reliant on complex feature combinations beyond shape-bias. As a natural consequence, shifting away from the prevalent desire of shape-biasing models can even improve skin lesion classifiers in some cases.
    Object Scene Representation Transformer. (arXiv:2206.06922v1 [cs.CV])
    A compositional understanding of the world in terms of objects and their geometry in 3D space is considered a cornerstone of human cognition. Facilitating the learning of such a representation in neural networks holds promise for substantially improving labeled data efficiency. As a key step in this direction, we make progress on the problem of learning 3D-consistent decompositions of complex scenes into individual objects in an unsupervised fashion. We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis. OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods. At the same time, it is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder. We believe this work will not only accelerate future architecture exploration and scaling efforts, but it will also serve as a useful tool for both object-centric as well as neural scene representation learning communities.
    Near-Optimal Randomized Exploration for Tabular Markov Decision Processes. (arXiv:2102.09703v4 [cs.LG] UPDATED)
    We study algorithms using randomized value functions for exploration in reinforcement learning. This type of algorithm enjoys appealing empirical performance. We show that when we use 1) a single random seed in each episode, and 2) a Bernstein-type magnitude of noise, we obtain a worst-case $\widetilde{O}\left(H\sqrt{SAT}\right)$ regret bound for episodic time-inhomogeneous Markov Decision Processes, where $S$ is the size of state space, $A$ is the size of action space, $H$ is the planning horizon and $T$ is the number of interactions. This bound polynomially improves all existing bounds for algorithms based on randomized value functions, and for the first time, matches the $\Omega\left(H\sqrt{SAT}\right)$ lower bound up to logarithmic factors. Our result highlights that randomized exploration can be near-optimal, which was previously achieved only by optimistic algorithms. To achieve the desired result, we develop 1) a new clipping operation to ensure both the probability of being optimistic and the probability of being pessimistic are lower bounded by a constant, and 2) a new recursive formula for the absolute value of estimation errors to analyze the regret.
    Temporal Multimodal Multivariate Learning. (arXiv:2206.06878v1 [cs.LG])
    We introduce temporal multimodal multivariate learning, a new family of decision making models that can indirectly learn and transfer online information from simultaneous observations of a probability distribution with more than one peak or more than one outcome variable from one time stage to another. We approximate the posterior by sequentially removing additional uncertainties across different variables and time, based on data-physics driven correlation, to address a broader class of challenging time-dependent decision-making problems under uncertainty. Extensive experiments on real-world datasets (i.e., urban traffic data and hurricane ensemble forecasting data) demonstrate the superior performance of the proposed targeted decision-making over the state-of-the-art baseline prediction methods across various settings.
    ABCinML: Anticipatory Bias Correction in Machine Learning Applications. (arXiv:2206.06960v1 [cs.LG])
    The idealization of a static machine-learned model, trained once and deployed forever, is not practical. As input distributions change over time, not only will the model lose accuracy, but any constraints to reduce bias against a protected class may also fail to work as intended. Thus, researchers have begun to explore ways to maintain algorithmic fairness over time. One line of work focuses on dynamic learning, which retrains after each batch, and the other on robust learning, which tries to make algorithms robust against all possible future changes. Dynamic learning seeks to reduce biases soon after they have occurred and robust learning often yields (overly) conservative models. We propose an anticipatory dynamic learning approach for correcting the algorithm to mitigate bias before it occurs. Specifically, we make use of anticipations regarding the relative distributions of population subgroups (e.g., relative ratios of male and female applicants) in the next cycle to identify the right parameters for an importance weighting fairness approach. Results from experiments over multiple real-world datasets suggest that this approach has promise for anticipatory bias correction.
    FreeKD: Free-direction Knowledge Distillation for Graph Neural Networks. (arXiv:2206.06561v1 [cs.LG])
    Knowledge distillation (KD) has demonstrated its effectiveness to boost the performance of graph neural networks (GNNs), where its goal is to distill knowledge from a deeper teacher GNN into a shallower student GNN. However, it is actually difficult to train a satisfactory teacher GNN due to the well-known over-parametrized and over-smoothing issues, leading to invalid knowledge transfer in practical applications. In this paper, we propose the first Free-direction Knowledge Distillation framework via Reinforcement learning for GNNs, called FreeKD, which no longer requires a deeper well-optimized teacher GNN. The core idea of our work is to collaboratively build two shallower GNNs in an effort to exchange knowledge between them via reinforcement learning in a hierarchical way. As we observe that one typical GNN model often has better and worse performances at different nodes during training, we devise a dynamic and free-direction knowledge transfer strategy that consists of two levels of actions: 1) node-level action determines the directions of knowledge transfer between the corresponding nodes of two networks; and then 2) structure-level action determines which of the local structures generated by the node-level actions are to be propagated. In essence, our FreeKD is a general and principled framework which can be naturally compatible with GNNs of different architectures. Extensive experiments on five benchmark datasets demonstrate that our FreeKD outperforms two base GNNs by a large margin, and shows its efficacy to various GNNs. More surprisingly, our FreeKD has comparable or even better performance than traditional KD algorithms that distill knowledge from a deeper and stronger teacher GNN.
    Physics-driven Deep Learning for PET/MRI. (arXiv:2206.06788v1 [eess.IV])
    In this paper, we review physics- and data-driven reconstruction techniques for simultaneous positron emission tomography (PET) / magnetic resonance imaging (MRI) systems, which have significant advantages for clinical imaging of cancer, neurological disorders, and heart disease. These reconstruction approaches utilize priors, either structural or statistical, together with a physics-based description of the PET system response. However, due to the nested representation of the forward problem, direct PET/MRI reconstruction is a nonlinear problem. We elucidate how a multi-faceted approach accommodates hybrid data- and physics-driven machine learning for reconstruction of 3D PET/MRI, summarizing important deep learning developments made in the last 5 years to address attenuation correction, scattering, low photon counts, and data consistency. We also describe how applications of these multi-modality approaches extend beyond PET/MRI to improving accuracy in radiation therapy planning. We conclude by discussing opportunities for extending the current state-of-the-art following the latest trends in physics- and deep learning-based computational imaging and next-generation detector hardware.
    How are policy gradient methods affected by the limits of control? (arXiv:2206.06863v1 [math.OC])
    We study stochastic policy gradient methods from the perspective of control-theoretic limitations. Our main result is that ill-conditioned linear systems in the sense of Doyle inevitably lead to noisy gradient estimates. We also give an example of a class of stable systems in which policy gradient methods suffer from the curse of dimensionality. Our results apply to both state feedback and partially observed systems.
    Monitoring Urban Forests from Auto-Generated Segmentation Maps. (arXiv:2206.06948v1 [cs.CV])
    We present and evaluate a weakly-supervised methodology to quantify the spatio-temporal distribution of urban forests based on remotely sensed data with close-to-zero human interaction. Successfully training machine learning models for semantic segmentation typically depends on the availability of high-quality labels. We evaluate the benefit of high-resolution, three-dimensional point cloud data (LiDAR) as source of noisy labels in order to train models for the localization of trees in orthophotos. As proof of concept we sense Hurricane Sandy's impact on urban forests in Coney Island, New York City (NYC) and reference it to less impacted urban space in Brooklyn, NYC.
    Acceleration of cerebral blood flow and arterial transit time maps estimation from multiple post-labeling delay arterial spin-labeled MRI via deep learning. (arXiv:2206.06372v1 [eess.IV])
    Purpose: Arterial spin labeling (ASL) perfusion imaging indicates direct and absolute measurement of cerebral blood flow (CBF). Arterial transit time (ATT) is a related physiological parameter reflecting the duration for the labeled spins to reach the brain region of interest. Multiple post-labeling delays (PLDs) can provide robust measures of both CBF and ATT, allowing for optimization of regional CBF modeling based on ATT. The prolonged acquisition time can potentially reduce the quality and accuracy of the CBF and ATT estimation. We proposed a novel network to significantly reduce the number of PLDs with higher signal-to-noise ratio (SNR). Method: CBF and ATT estimations were performed for one PLD and two PLDs separately. Each model was trained independently to learn the nonlinear transformation from perfusion weighted image (PWI) to CBF and ATT images. Results: Both one-PLD and two-PLD models outperformed the conventional method visually on CBF and two-PLD model showed more accurate structure on ATT estimation. The proposed method significantly reduces the number of PLDs from 6 to 2 on ATT and even to single PLD on CBF without sacrificing the SNR. Conclusion: It is feasible to generate CBF and ATT maps with reduced PLDs using deep learning with high quality.
    Disentangled Federated Learning for Tackling Attributes Skew via Invariant Aggregation and Diversity Transferring. (arXiv:2206.06818v1 [cs.LG])
    Attributes skew hinders the current federated learning (FL) frameworks from consistent optimization directions among the clients, which inevitably leads to performance reduction and unstable convergence. The core problems are that: 1) Domain-specific attributes, which are non-causal and only locally valid, are inadvertently mixed into global aggregation. 2) The one-stage optimizations of entangled attributes cannot simultaneously satisfy two conflicting objectives, i.e., generalization and personalization. To cope with these, we propose disentangled federated learning (DFL) to disentangle the domain-specific and cross-invariant attributes into two complementary branches, which are trained independently by the proposed alternating local-global optimization. Importantly, convergence analysis proves that the FL system can converge stably even if incomplete client models participate in the global aggregation, which greatly expands the application scope of FL. Extensive experiments verify that DFL facilitates FL with higher performance, better interpretability, and faster convergence rate, compared with SOTA FL methods on both manually synthesized and realistic attributes skew datasets.
    PhML-DyR: A Physics-Informed ML framework for Dynamic Reconfiguration in Power Systems. (arXiv:2206.06789v1 [eess.SY])
    A transformation of the US electricity sector is underway with aggressive targets to achieve 100% carbon pollution-free electricity by 2035. To achieve this objective while maintaining a safe and reliable power grid, new operating paradigms are needed, of computationally fast and accurate decision making in a dynamic and uncertain environment. We propose a novel physics-informed machine learning framework for the decision of dynamic grid reconfiguration (PhML-DyR), a key task in power systems. Dynamic reconfiguration (DyR) is a process by which switch-states are dynamically set so as to lead to an optimal grid topology that minimizes line losses. To address the underlying computational complexities of NP-hardness due to the mixed nature of the decision variables, we propose the use of physics-informed ML (PhML) which integrates both operating constraints and topological and connectivity constraints into a neural network framework. Our PhML approach learns to simultaneously optimize grid topology and generator dispatch to meet loads, increase efficiency, and remain within safe operating limits. We demonstrate the effectiveness of PhML-DyR on a canonical grid, showing a reduction in electricity loss by 23%, and improved voltage profiles. We also show a reduction in constraint violations by an order of magnitude as well as in training time using PhML-DyR.
    Explainable Mixed Data Representation and Lossless Visualization Toolkit for Knowledge Discovery. (arXiv:2206.06476v1 [cs.LG])
    Developing Machine Learning (ML) algorithms for heterogeneous/mixed data is a longstanding problem. Many ML algorithms are not applicable to mixed data, which include numeric and non-numeric data, text, graphs and so on to generate interpretable models. Another longstanding problem is developing algorithms for lossless visualization of multidimensional mixed data. Further progress in ML heavily depends on the success of interpretable ML algorithms for mixed data and lossless interpretable visualization of multidimensional data. The latter allows developing interpretable ML models using visual knowledge discovery by end-users, who can bring valuable domain knowledge which is absent in the training data. The challenges for mixed data include: (1) generating numeric coding schemes for non-numeric attributes for numeric ML algorithms to provide accurate and interpretable ML models, (2) generating methods for lossless visualization of n-D non-numeric data and visual rule discovery in these visualizations. This paper presents a classification of mixed data types, analyzes their importance for ML, and presents the developed experimental toolkit to deal with mixed data. It combines the Data Types Editor, VisCanvas data visualization and rule discovery system which is available on GitHub.
    Downlink Power Allocation in Massive MIMO via Deep Learning: Adversarial Attacks and Training. (arXiv:2206.06592v1 [cs.LG])
    The successful emergence of deep learning (DL) in wireless system applications has raised concerns about new security-related challenges. One such security challenge is adversarial attacks. Although there has been much work demonstrating the susceptibility of DL-based classification tasks to adversarial attacks, regression-based problems in the context of a wireless system have not been studied so far from an attack perspective. The aim of this paper is twofold: (i) we consider a regression problem in a wireless setting and show that adversarial attacks can break the DL-based approach and (ii) we analyze the effectiveness of adversarial training as a defensive technique in adversarial settings and show that the robustness of DL-based wireless systems against attacks improves significantly. Specifically, the wireless application considered in this paper is the DL-based power allocation in the downlink of a multicell massive multi-input-multi-output system, where the goal of the attack is to yield an infeasible solution by the DL model. We extend the gradient-based adversarial attacks: fast gradient sign method (FGSM), momentum iterative FGSM, and projected gradient descent method to analyze the susceptibility of the considered wireless application with and without adversarial training. We analyze the deep neural network (DNN) models' performance against these attacks, where the adversarial perturbations are crafted using both the white-box and black-box attacks.
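    As a rough illustration of the basic FGSM step named in the abstract, the sketch below perturbs the input of a toy linear regressor in the direction of the sign of the loss gradient. It is not the paper's extended attack: the linear model, the squared-error objective, and all names and numbers are assumptions chosen so the gradient can be written by hand.

```python
import numpy as np


def predict(w, b, x):
    """Toy linear regressor standing in for a trained DNN (assumption)."""
    return float(np.dot(w, x) + b)


def squared_error(w, b, x, y):
    return (predict(w, b, x) - y) ** 2


def fgsm_perturb(w, b, x, y, eps):
    """One FGSM step: move each input coordinate by eps along sign(dL/dx)."""
    residual = predict(w, b, x) - y
    grad_x = 2.0 * residual * w  # analytic gradient of the squared error
    return x + eps * np.sign(grad_x)


# Made-up model parameters and a single input/target pair.
w = np.array([0.5, -1.0, 2.0])
b = 0.1
x = np.array([1.0, 2.0, 3.0])
y = 4.0

eps = 0.05
x_adv = fgsm_perturb(w, b, x, y, eps)
loss_clean = squared_error(w, b, x, y)
loss_adv = squared_error(w, b, x_adv, y)
```

    The perturbation stays inside an L-infinity ball of radius `eps` around the clean input, yet the regression error grows, which is the kind of susceptibility the abstract studies for regression rather than classification.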
    Exponential Error Convergence in Data Classification with Optimized Random Features: Acceleration by Quantum Machine Learning. (arXiv:2106.09028v2 [quant-ph] UPDATED)
    Classification is a common task in machine learning. Random features (RFs) stand as a central technique for scalable learning algorithms based on kernel methods, and more recently proposed optimized random features, sampled depending on the model and the data distribution, can significantly reduce and provably minimize the required number of features. However, existing research on classification using optimized RFs has suffered from computational hardness in sampling each optimized RF; moreover, it has failed to achieve the exponentially fast error-convergence speed that other state-of-the-art kernel methods can achieve under a low-noise condition. To overcome these slowdowns, we here construct a classification algorithm with optimized RFs accelerated by means of quantum machine learning (QML) and study its runtime to clarify overall advantage. We prove that our algorithm can achieve the exponential error convergence under the low-noise condition even with optimized RFs; at the same time, our algorithm can exploit the advantage of the significant reduction of the number of features without the computational hardness owing to QML. These results discover a promising application of QML to acceleration of the leading kernel-based classification algorithm without ruining its wide applicability and the exponential error-convergence speed.
    A Survey on Uncertainty Reasoning and Quantification for Decision Making: Belief Theory Meets Deep Learning. (arXiv:2206.05675v2 [cs.AI] UPDATED)
    An in-depth understanding of uncertainty is the first step to making effective decisions under uncertainty. Deep/machine learning (ML/DL) has been hugely leveraged to solve complex problems involved with processing high-dimensional data. However, reasoning and quantifying different types of uncertainties to achieve effective decision-making have been much less explored in ML/DL than in other Artificial Intelligence (AI) domains. In particular, belief/evidence theories have been studied in KRR since the 1960s to reason and measure uncertainties to enhance decision-making effectiveness. We found that only a few studies have leveraged the mature uncertainty research in belief/evidence theories in ML/DL to tackle complex problems under different types of uncertainty. In this survey paper, we discuss several popular belief theories and their core ideas dealing with uncertainty causes and types and quantifying them, along with the discussions of their applicability in ML/DL. In addition, we discuss three main approaches that leverage belief theories in Deep Neural Networks (DNNs), including Evidential DNNs, Fuzzy DNNs, and Rough DNNs, in terms of their uncertainty causes, types, and quantification methods along with their applicability in diverse problem domains. Based on our in-depth survey, we discuss insights, lessons learned, limitations of the current state-of-the-art bridging belief theories and ML/DL, and finally, future research directions.
    Latent Diffusion Energy-Based Model for Interpretable Text Modeling. (arXiv:2206.05895v2 [cs.LG] UPDATED)
    Latent space Energy-Based Models (EBMs), also known as energy-based priors, have drawn growing interest in generative modeling. Fueled by their flexibility in formulation and the strong modeling power of the latent space, recent works built upon them have made interesting attempts aiming at the interpretability of text modeling. However, latent space EBMs also inherit some flaws from EBMs in data space; the degenerate MCMC sampling quality in practice can lead to poor generation quality and instability in training, especially on data with complex latent structures. Inspired by the recent efforts that leverage diffusion recovery likelihood learning as a cure for the sampling issue, we introduce a novel symbiosis between the diffusion models and latent space EBMs in a variational learning framework, coined as the latent diffusion energy-based model. We develop a geometric clustering-based regularization jointly with the information bottleneck to further improve the quality of the learned latent space. Experiments on several challenging tasks demonstrate the superior performance of our model on interpretable text modeling over strong counterparts.
    Grad-GradaGrad? A Non-Monotone Adaptive Stochastic Gradient Method. (arXiv:2206.06900v1 [cs.LG])
    The classical AdaGrad method adapts the learning rate by dividing by the square root of a sum of squared gradients. Because this sum on the denominator is increasing, the method can only decrease step sizes over time, and requires a learning rate scaling hyper-parameter to be carefully tuned. To overcome this restriction, we introduce GradaGrad, a method in the same family that naturally grows or shrinks the learning rate based on a different accumulation in the denominator, one that can both increase and decrease. We show that it obeys a similar convergence rate as AdaGrad and demonstrate its non-monotone adaptation capability with experiments.
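    The monotone behaviour of classical AdaGrad described above can be seen in a few lines. The sketch below implements the standard per-coordinate AdaGrad update on a toy quadratic and records the effective step size at each iteration; the objective, hyper-parameters, and function name are illustrative assumptions, and GradaGrad itself is not implemented here because the abstract does not give its accumulation rule.

```python
import numpy as np


def adagrad(grad_fn, x0, lr=0.5, steps=200, eps=1e-8):
    """Classical AdaGrad; returns the final iterate and per-step step sizes."""
    x = np.asarray(x0, dtype=float).copy()
    accum = np.zeros_like(x)  # running sum of squared gradients
    step_sizes = []
    for _ in range(steps):
        g = grad_fn(x)
        accum += g * g  # this sum can only grow
        effective_lr = lr / (np.sqrt(accum) + eps)
        x -= effective_lr * g
        step_sizes.append(effective_lr.copy())
    return x, np.array(step_sizes)


# Toy objective f(x) = ||x||^2 with gradient 2x (an assumption for the demo).
x_start = np.array([3.0, -2.0])
x_final, step_sizes = adagrad(lambda x: 2.0 * x, x_start)

# Because the denominator never shrinks, the step size never increases.
never_increases = bool(np.all(np.diff(step_sizes, axis=0) <= 0))
```

    The only free knob is the scale `lr`, which is why it has to be tuned carefully: once the accumulated sum is large, the method has no way to raise the step size again.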
    Invariant Structure Learning for Better Generalization and Causal Explainability. (arXiv:2206.06469v1 [cs.LG])
    Learning the causal structure behind data is invaluable for improving generalization and obtaining high-quality explanations. We propose a novel framework, Invariant Structure Learning (ISL), that is designed to improve causal structure discovery by utilizing generalization as an indication. ISL splits the data into different environments, and learns a structure that is invariant to the target across different environments by imposing a consistency constraint. An aggregation mechanism then selects the optimal classifier based on a graph structure that reflects the causal mechanisms in the data more accurately compared to the structures learnt from individual environments. Furthermore, we extend ISL to a self-supervised learning setting where accurate causal structure discovery does not rely on any labels. This self-supervised ISL utilizes invariant causality proposals by iteratively setting different nodes as targets. On synthetic and real-world datasets, we demonstrate that ISL accurately discovers the causal structure, outperforms alternative methods, and yields superior generalization for datasets with significant distribution shifts.
    Causal Discovery for Fairness. (arXiv:2206.06685v1 [cs.AI])
    It is crucial to consider the social and ethical consequences of AI and ML based decisions for the safe and acceptable use of these emerging technologies. Fairness, in particular, guarantees that the ML decisions do not result in discrimination against individuals or minorities. Reliably identifying and measuring fairness/discrimination is better achieved using causality, which considers the causal relation, beyond mere association, between the sensitive attribute (e.g. gender, race, religion, etc.) and the decision (e.g. job hiring, loan granting, etc.). The big impediment to the use of causality to address fairness, however, is the unavailability of the causal model (typically represented as a causal graph). Existing causal approaches to fairness in the literature do not address this problem and assume that the causal model is available. In this paper, we do not make such an assumption and we review the major algorithms to discover causal relations from observable data. This study focuses on causal discovery and its impact on fairness. In particular, we show how different causal discovery approaches may result in different causal models and, most importantly, how even slight differences between causal models can have significant impact on fairness/discrimination conclusions. These results are consolidated by empirical analysis using synthetic and standard fairness benchmark datasets. The main goal of this study is to highlight the importance of the causal discovery step to appropriately address fairness using causality.
    Competing Bandits: The Perils of Exploration Under Competition. (arXiv:2007.10144v6 [cs.GT] UPDATED)
    Most online platforms strive to learn from interactions with users, and many engage in exploration: making potentially suboptimal choices for the sake of acquiring new information. We study the interplay between exploration and competition: how such platforms balance the exploration for learning and the competition for users. Here users play three distinct roles: they are customers that generate revenue, they are sources of data for learning, and they are self-interested agents which choose among the competing platforms. We consider a stylized duopoly model in which two firms face the same multi-armed bandit problem. Users arrive one by one and choose between the two firms, so that each firm makes progress on its bandit problem only if it is chosen. Through a mix of theoretical results and numerical simulations, we study whether and to what extent competition incentivizes the adoption of better bandit algorithms, and whether it leads to welfare increases for users. We find that stark competition induces firms to commit to a "greedy" bandit algorithm that leads to low welfare. However, weakening competition by providing firms with some "free" users incentivizes better exploration strategies and increases welfare. We investigate two channels for weakening the competition: relaxing the rationality of users and giving one firm a first-mover advantage. Our findings are closely related to the "competition vs. innovation" relationship, and elucidate the first-mover advantage in the digital economy.
    Exploring speaker enrolment for few-shot personalisation in emotional vocalisation prediction. (arXiv:2206.06680v1 [cs.SD])
    In this work, we explore a novel few-shot personalisation architecture for emotional vocalisation prediction. The core contribution is an `enrolment' encoder which utilises two unlabelled samples of the target speaker to adjust the output of the emotion encoder; the adjustment is based on dot-product attention, thus effectively functioning as a form of `soft' feature selection. The emotion and enrolment encoders are based on two standard audio architectures: CNN14 and CNN10. The two encoders are further guided to forget or learn auxiliary emotion and/or speaker information. Our best approach achieves a CCC of $.650$ on the ExVo Few-Shot dev set, a $2.5\%$ increase over our baseline CNN14 CCC of $.634$.
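    The CCC figures quoted above refer to the concordance correlation coefficient. As a minimal sketch (not the authors' evaluation code), the standard definition by Lin can be computed as follows; the function names `concordance_cc` and `relative_gain` are hypothetical, and the last helper simply checks the quoted relative improvement from .634 to .650.

```python
from statistics import fmean


def concordance_cc(pred, gold):
    """Lin's concordance correlation coefficient for two equal-length sequences.

    Undefined (division by zero) when both inputs are constant with equal means.
    """
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_p, mean_g = fmean(pred), fmean(gold)
    # Population (biased) variances and covariance.
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_g = fmean([(g - mean_g) ** 2 for g in gold])
    cov = fmean([(p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)])
    # Unlike Pearson correlation, CCC also penalises a shift in the means.
    return 2 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)


def relative_gain(new, old):
    """Relative improvement of `new` over `old` as a fraction."""
    return (new - old) / old


perfect = concordance_cc([1, 2, 3], [1, 2, 3])
shifted = concordance_cc([2, 3, 4], [1, 2, 3])
gain_percent = 100 * relative_gain(0.650, 0.634)
```

    A constant offset between predictions and labels lowers the CCC even though the Pearson correlation stays at 1, which is why the metric is popular for continuous emotion prediction.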
    Dynamic stability of power grids -- new datasets for Graph Neural Networks. (arXiv:2206.06369v1 [cs.LG])
    One of the key challenges for the success of the energy transition towards renewable energies is the analysis of the dynamic stability of power grids. However, dynamic solutions are intractable and exceedingly expensive for large grids. Graph Neural Networks (GNNs) are a promising method to reduce the computational effort of predicting dynamic stability of power grids; however, datasets of appropriate complexity and size do not yet exist. We introduce two new datasets of synthetically generated power grids. For each grid, the dynamic stability has been estimated using Monte-Carlo simulations. The datasets have 10 times more grids than previously published ones. To evaluate the potential for real-world applications, we demonstrate the successful prediction on a Texan power grid model. The performance can be improved to surprisingly high levels by training more complex models on more data. Furthermore, the investigated grids have different sizes, enabling the application of out-of-distribution evaluation and transfer learning from a small to a large domain. We invite the community to improve our benchmark models and thus aid the energy transition with better tools.
    Large Batch Experience Replay. (arXiv:2110.01528v2 [cs.LG] UPDATED)
    Several algorithms have been proposed to sample the replay buffer of deep Reinforcement Learning (RL) agents non-uniformly to speed up learning, but very few theoretical foundations of these sampling schemes have been provided. Among others, Prioritized Experience Replay appears as a hyperparameter-sensitive heuristic, even though it can provide good performance. In this work, we cast the replay buffer sampling problem as an importance sampling one for estimating the gradient. This allows deriving the theoretically optimal sampling distribution, yielding the best theoretical convergence speed. Elaborating on the knowledge of the ideal sampling scheme, we exhibit new theoretical foundations of Prioritized Experience Replay. The optimal sampling distribution being intractable, we make several approximations providing good results in practice and introduce, among others, LaBER (Large Batch Experience Replay), an easy-to-code and efficient method for sampling the replay buffer. LaBER, which can be combined with Deep Q-Networks, distributional RL agents or actor-critic methods, yields improved performance over a diverse range of Atari games and PyBullet environments, compared to the base agent it is implemented on and to other prioritization schemes.
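    The importance-sampling view mentioned in this abstract can be illustrated with a generic sketch. This is not LaBER itself: it only shows that reweighting each sample by 1 / (N * p_i) keeps a non-uniformly sampled estimate unbiased with respect to the uniform average. The values and probabilities below are hypothetical stand-ins for per-transition gradient terms and a prioritized sampling distribution.

```python
def importance_weights(probs):
    """Weights 1 / (N * p_i) that correct for drawing index i with probability p_i."""
    n = len(probs)
    return [1.0 / (n * p) for p in probs]


def expected_weighted_estimate(values, probs):
    """Exact expectation of w_i * x_i when index i is drawn from `probs`."""
    weights = importance_weights(probs)
    return sum(p * w * x for p, w, x in zip(probs, weights, values))


# Hypothetical per-sample quantities (stand-ins for gradient terms).
values = [1.0, 4.0, 7.0, 10.0]
# Hypothetical non-uniform sampling distribution over the four samples.
probs = [0.1, 0.2, 0.3, 0.4]

uniform_mean = sum(values) / len(values)
weighted_mean = expected_weighted_estimate(values, probs)
```

    The choice of `probs` does not change the expectation, only the variance of the estimate, which is the quantity a sampling scheme can try to reduce.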
    The Modality Focusing Hypothesis: On the Blink of Multimodal Knowledge Distillation. (arXiv:2206.06487v1 [cs.CV])
    Multimodal knowledge distillation (KD) extends traditional knowledge distillation to the area of multimodal learning. One common practice is to adopt a well-performing multimodal network as the teacher in the hope that it can transfer its full knowledge to a unimodal student for performance improvement. In this paper, we investigate the efficacy of multimodal KD. We begin by providing two failure cases of it and demonstrate that KD is not a universal cure in multimodal knowledge transfer. We present the modality Venn diagram to understand modality relationships and the modality focusing hypothesis revealing the decisive factor in the efficacy of multimodal KD. Experimental results on 6 multimodal datasets help justify our hypothesis, diagnose failure cases, and point to directions for improving distillation performance.
    CorticalFlow$^{++}$: Boosting Cortical Surface Reconstruction Accuracy, Regularity, and Interoperability. (arXiv:2206.06598v1 [eess.IV])
    The problem of Cortical Surface Reconstruction from magnetic resonance imaging has been traditionally addressed using lengthy pipelines of image processing techniques like FreeSurfer, CAT, or CIVET. These frameworks require very long runtimes deemed infeasible for real-time applications and impractical for large-scale studies. Recently, supervised deep learning approaches have been introduced to speed up this task, cutting down the reconstruction time from hours to seconds. Using the state-of-the-art CorticalFlow model as a blueprint, this paper proposes three modifications to improve its accuracy and interoperability with existing surface analysis tools, while not sacrificing its fast inference time and low GPU memory consumption. First, we employ a more accurate ODE solver to reduce the diffeomorphic mapping approximation error. Second, we devise a routine to produce smoother template meshes, avoiding mesh artifacts caused by sharp edges in CorticalFlow's convex-hull based template. Last, we recast pial surface prediction as the deformation of the predicted white surface, leading to a one-to-one mapping between white and pial surface vertices. This mapping is essential to many existing surface analysis tools for cortical morphometry. We name the resulting method CorticalFlow$^{++}$. Using large-scale datasets, we demonstrate the proposed changes provide more geometric accuracy and surface regularity while keeping the reconstruction time and GPU memory requirements almost unchanged.
    Variance Reduction for Policy-Gradient Methods via Empirical Variance Minimization. (arXiv:2206.06827v1 [cs.LG])
    Policy-gradient methods in Reinforcement Learning (RL) are very general and widely applied in practice, but their performance suffers from the high variance of the gradient estimate. Several procedures have been proposed to reduce it, including actor-critic (AC) and advantage actor-critic (A2C) methods. Recently these approaches have gained a new perspective due to the introduction of Deep RL: both new control variates (CV) and new sub-sampling procedures became available in the setting of complex models like neural networks. The vital part of CV-based methods is the goal functional for training the CV; the most popular one is the least-squares criterion of A2C. Despite its practical success, this criterion is not the only one possible. In this paper, we investigate for the first time the performance of the one called Empirical Variance (EV). We observe in the experiments that the EV criterion not only performs no worse than A2C but can sometimes be considerably better. Apart from that, we also prove some theoretical guarantees of the actual variance reduction under very general assumptions and show that the A2C least-squares goal functional is an upper bound for the EV goal. Our experiments indicate that in terms of variance reduction EV-based methods are much better than A2C and allow stronger variance reduction.
    Confidence Score for Source-Free Unsupervised Domain Adaptation. (arXiv:2206.06640v1 [cs.CV])
    Source-free unsupervised domain adaptation (SFUDA) aims to obtain high performance in the unlabeled target domain using the pre-trained source model, not the source data. Existing SFUDA methods assign the same importance to all target samples, which is vulnerable to incorrect pseudo-labels. To differentiate between sample importance, in this study, we propose a novel sample-wise confidence score, the Joint Model-Data Structure (JMDS) score for SFUDA. Unlike existing confidence scores that use only one of the source or target domain knowledge, the JMDS score uses both. We then propose a Confidence score Weighting Adaptation using the JMDS (CoWA-JMDS) framework for SFUDA. CoWA-JMDS consists of the JMDS scores as sample weights and weight Mixup, our proposed variant of Mixup. Weight Mixup encourages the model to make more use of the target domain knowledge. The experimental results show that the JMDS score outperforms the existing confidence scores. Moreover, CoWA-JMDS achieves state-of-the-art performance on various SFUDA scenarios: closed, open, and partial-set scenarios.
    Edge Graph Neural Networks for Massive MIMO Detection. (arXiv:2206.06979v1 [cs.IT])
    Massive Multiple-Input Multiple-Output (MIMO) detection is an important problem in modern wireless communication systems. While traditional Belief Propagation (BP) detectors perform poorly on loopy graphs, the recent Graph Neural Networks (GNNs)-based method can overcome the drawbacks of BP and achieve superior performance. Nevertheless, direct use of GNN ignores the importance of edge attributes and suffers from high computation overhead using a fully connected graph structure. In this paper, we propose an efficient GNN-inspired algorithm, called the Edge Graph Neural Network (EGNN), to detect MIMO signals. We first compute graph edge weights through channel correlation and then leverage the obtained weights as a metric to evaluate the importance of neighbors of each node. Moreover, we design an adaptive Edge Drop (ED) scheme to sparsify the graph such that computational cost can be significantly reduced. Experimental results demonstrate that our proposed EGNN achieves better or comparable performance to popular MIMO detection methods for different modulation schemes and requires the least detection time among GNN-based approaches.
    Neural interval-censored Cox regression with feature selection. (arXiv:2206.06885v1 [stat.ML])
    The classical Cox model emerged in 1972, promoting breakthroughs in how patient prognosis is quantified using time-to-event analysis in biomedicine. One of the most useful characteristics of the model for practitioners is the interpretability of the variables in the analysis. However, this comes at the price of introducing strong assumptions concerning the functional form of the regression model. To bridge this gap, this paper aims to exploit the explainability advantages of the classical Cox model in the setting of interval-censoring using a new Lasso neural network that simultaneously selects the most relevant variables while quantifying non-linear relations between predictors and survival times. The gain of the new method is illustrated empirically in an extensive simulation study with examples that involve linear and non-linear ground dependencies. We also demonstrate the performance of our strategy in the analysis of physiological, clinical and accelerometer data from the NHANES 2003-2006 waves to predict the effect of physical activity on the survival of patients. Our method outperforms the prior results in the literature that use the traditional Cox model.
    General-purpose, long-context autoregressive modeling with Perceiver AR. (arXiv:2202.07765v2 [cs.LG] UPDATED)
    Real-world data is high-dimensional: a book, image, or musical performance can easily contain hundreds of thousands of elements even after compression. However, the most commonly used autoregressive models, Transformers, are prohibitively expensive to scale to the number of inputs and layers needed to capture this long-range structure. We develop Perceiver AR, an autoregressive, modality-agnostic architecture which uses cross-attention to map long-range inputs to a small number of latents while also maintaining end-to-end causal masking. Perceiver AR can directly attend to over a hundred thousand tokens, enabling practical long-context density estimation without the need for hand-crafted sparsity patterns or memory mechanisms. When trained on images or music, Perceiver AR generates outputs with clear long-term coherence and structure. Our architecture also obtains state-of-the-art likelihood on long-sequence benchmarks, including 64 x 64 ImageNet images and PG-19 books.
    Learning Enhanced Representations for Tabular Data via Neighborhood Propagation. (arXiv:2206.06587v1 [cs.LG])
    Prediction over tabular data is an essential and fundamental problem in many important downstream tasks. However, existing methods either take a data instance of the table independently as input or do not fully utilize the multi-row features and labels to directly change and enhance the target data representations. In this paper, we propose to 1) construct a hypergraph from relevant data instance retrieval to model the cross-row and cross-column patterns of those instances, and 2) perform message Propagation to Enhance the target data instance representation for Tabular prediction tasks. Specifically, our specially-designed message propagation step benefits from 1) fusion of label and features during propagation, and 2) locality-aware high-order feature interactions. Experiments on two important tabular data prediction tasks validate the superiority of the proposed PET model against other baselines. Additionally, we demonstrate the effectiveness of the model components and the feature enhancement ability of PET via various ablation studies and visualizations. The code is included in https://github.com/KounianhuaDu/PET.
    Deep Isolation Forest for Anomaly Detection. (arXiv:2206.06602v1 [cs.LG])
    Isolation forest (iForest) has been emerging as arguably the most popular anomaly detector in recent years. It iteratively performs axis-parallel data space partition in a tree structure to isolate deviated data objects from the other data, with the isolation difficulty of the objects defined as anomaly scores. iForest shows effective performance across popular dataset benchmarks, but its axis-parallel-based linear data partition is ineffective in handling hard anomalies in high-dimensional/non-linear-separable data space, and even worse, it leads to a notorious algorithmic bias that assigns unexpectedly large anomaly scores to artefact regions. There have been several extensions of iForest, but they still focus on linear data partition, failing to effectively isolate those hard anomalies. This paper introduces a novel extension of iForest, deep isolation forest. Our method offers a comprehensive isolation method that can arbitrarily partition the data at any random direction and angle on subspaces of any size, effectively avoiding the algorithmic bias in the linear partition. Further, it requires only randomly initialised neural networks (i.e., no optimisation is required in our method) to ensure the freedom of the partition. In doing so, desired randomness and diversity in both random network-based representations and random partition-based isolation can be fully leveraged to significantly enhance the isolation ensemble-based anomaly detection. Also, our approach offers a data-type-agnostic anomaly detection solution. It is versatile to detect anomalies in different types of data by simply plugging in corresponding randomly initialised neural networks in the feature mapping. Extensive empirical results on a large collection of real-world datasets show that our model achieves substantial improvement over state-of-the-art isolation-based and non-isolation-based anomaly detection models.
    The Open Kidney Ultrasound Data Set. (arXiv:2206.06657v1 [eess.IV])
    Ultrasound is widely used because of its low cost and its non-ionizing, non-invasive characteristics, and it has established itself as a cornerstone radiological examination. Research on ultrasound applications has also expanded, especially with image analysis with machine learning. However, ultrasound data are frequently restricted to closed data sets, with only a few openly available. Despite being a frequently examined organ, the kidney lacks a publicly available ultrasonography data set. The proposed Open Kidney Ultrasound Data Set is the first publicly available set of kidney B-mode ultrasound data that includes annotations for multi-class semantic segmentation. It is based on data retrospectively collected in a 5-year period from over 500 patients with a mean age of 53.2 +/- 14.7 years, body mass index of 27.0 +/- 5.4 kg/m2, and most common primary diseases being diabetes mellitus, IgA nephropathy, and hypertension. There are labels for the view and fine-grained manual annotations from two expert sonographers. Notably, this data includes native and transplanted kidneys. Initial benchmarking measurements are performed, demonstrating a state-of-the-art algorithm achieving a Dice-Sørensen coefficient of 0.74 for the kidney capsule. This data set is a high-quality data set, including two sets of expert annotations, with a larger breadth of images than previously available. By increasing access to kidney ultrasound data, future researchers may be able to create novel image analysis techniques for tissue characterization, disease detection, and prognostication.
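    The Dice score reported in this abstract measures the overlap between a predicted and a reference segmentation. A minimal sketch of the standard definition, 2|A ∩ B| / (|A| + |B|), follows; representing masks as sets of pixel coordinates and returning 1.0 for two empty masks are assumptions made here for illustration, not details from the data set's benchmark.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice overlap between two masks given as collections of pixel coordinates."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        # Convention chosen for this sketch: two empty masks agree perfectly.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical 3-pixel prediction and 3-pixel annotation sharing 2 pixels.
predicted = {(0, 0), (0, 1), (1, 1)}
annotated = {(0, 1), (1, 1), (2, 2)}
score = dice_coefficient(predicted, annotated)
```

    With two shared pixels out of three in each mask, the score is 2 * 2 / 6, or about 0.67.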
    Learning Markov Games with Adversarial Opponents: Efficient Algorithms and Fundamental Limits. (arXiv:2203.06803v4 [cs.LG] UPDATED)
    An ideal strategy in zero-sum games should not only grant the player an average reward no less than the value of Nash equilibrium, but also exploit the (adaptive) opponents when they are suboptimal. While most existing works in Markov games focus exclusively on the former objective, it remains open whether we can achieve both objectives simultaneously. To address this problem, this work studies no-regret learning in Markov games with adversarial opponents when competing against the best fixed policy in hindsight. Along this direction, we present a new complete set of positive and negative results: When the policies of the opponents are revealed at the end of each episode, we propose new efficient algorithms achieving $\sqrt{K}$-regret bounds when either (1) the baseline policy class is small or (2) the opponent's policy class is small. This is complemented with an exponential lower bound when neither condition is true. When the policies of the opponents are not revealed, we prove a statistical hardness result even in the most favorable scenario when both of the above conditions are true. Our hardness result is much stronger than the existing hardness results which either only involve computational hardness, or require further restrictions on the algorithms.
    Mapping fNIRS to fMRI with Neural Data Augmentation and Machine Learning Models. (arXiv:2206.06486v1 [q-bio.NC])
    Advances in neuroimaging techniques have provided us with novel insights into understanding how the human mind works. Functional magnetic resonance imaging (fMRI) is the most popular and widely used neuroimaging technique, and there is growing interest in fMRI-based markers of individual differences. However, its utility is often limited due to its high cost and the difficulty of acquiring data from specific populations, including children and infants. Surrogate markers, or neural correlates of fMRI markers, would have important practical implications, but we have few stand-alone predictors for the fMRI markers. Here, using machine learning (ML) models and data augmentation, we predicted well-validated fMRI markers of human cognition from multivariate patterns of functional near-infrared spectroscopy (fNIRS), a portable and relatively inexpensive optical neuroimaging technique. We recruited 50 human participants who performed two cognitive tasks (stop signal task and probabilistic reversal learning task), while neural activation was measured with either fNIRS or fMRI at each of two visits. Using ML models and data augmentation, we could predict the well-established fMRI markers of response inhibition or prediction error signals from 48-channel fNIRS activation in the prefrontal cortex. These results suggest that fNIRS might offer a surrogate marker of fMRI activation, which would broaden our understanding of various populations, including infants.
    Darknet Traffic Classification and Adversarial Attacks. (arXiv:2206.06371v1 [cs.LG])
    The anonymous nature of darknets is commonly exploited for illegal activities. Previous research has employed machine learning and deep learning techniques to automate the detection of darknet traffic in an attempt to block these criminal activities. This research aims to improve darknet traffic detection by assessing Support Vector Machines (SVM), Random Forest (RF), Convolutional Neural Networks (CNN), and Auxiliary-Classifier Generative Adversarial Networks (AC-GAN) for classification of such traffic and the underlying application types. We find that our RF model outperforms the state-of-the-art machine learning techniques used in prior work with the CIC-Darknet2020 dataset. To evaluate the robustness of our RF classifier, we obfuscate select application type classes to simulate realistic adversarial attack scenarios. We demonstrate that our best-performing classifier can be defeated by such attacks, and we consider ways to deal with such adversarial attacks.
    Bandwidth Enables Generalization in Quantum Kernel Models. (arXiv:2206.06686v1 [quant-ph])
    Quantum computers are known to provide speedups over classical state-of-the-art machine learning methods in some specialized settings. For example, quantum kernel methods have been shown to provide an exponential speedup on a learning version of the discrete logarithm problem. Understanding the generalization of quantum models is essential to realizing similar speedups on problems of practical interest. Recent results demonstrate that generalization is hindered by the exponential size of the quantum feature space. Although these results suggest that quantum models cannot generalize when the number of qubits is large, in this paper we show that these results rely on overly restrictive assumptions. We consider a wider class of models by varying a hyperparameter that we call quantum kernel bandwidth. We analyze the large-qubit limit and provide explicit formulas for the generalization of a quantum model that can be solved in closed form. Specifically, we show that changing the value of the bandwidth can take a model from provably not being able to generalize to any target function to good generalization for well-aligned targets. Our analysis shows how the bandwidth controls the spectrum of the kernel integral operator and thereby the inductive bias of the model. We demonstrate empirically that our theory correctly predicts how varying the bandwidth affects generalization of quantum models on challenging datasets, including those far outside our theoretical assumptions. We discuss the implications of our results for quantum advantage in machine learning.
    Learning towards Synchronous Network Memorizability and Generalizability for Continual Segmentation across Multiple Sites. (arXiv:2206.06813v1 [eess.IV])
    In clinical practice, a segmentation network is often required to continually learn on a sequential data stream from multiple sites rather than a consolidated set, due to the storage cost and privacy restriction. However, during the continual learning process, existing methods are usually restricted in either network memorizability on previous sites or generalizability on unseen sites. This paper aims to tackle the challenging problem of Synchronous Memorizability and Generalizability (SMG) and to simultaneously improve performance on both previous and unseen sites, with a novel SMG-learning framework. First, we propose a Synchronous Gradient Alignment (SGA) objective, which not only promotes the network memorizability by enforcing coordinated optimization for a small exemplar set from previous sites (called replay buffer), but also enhances the generalizability by facilitating site-invariance under simulated domain shift. Second, to simplify the optimization of the SGA objective, we design a Dual-Meta algorithm that approximates the SGA objective as dual meta-objectives for optimization without expensive computation overhead. Third, for efficient rehearsal, we configure the replay buffer comprehensively considering additional inter-site diversity to reduce redundancy. Experiments on prostate MRI data sequentially acquired from six institutes demonstrate that our method can simultaneously achieve higher memorizability and generalizability over state-of-the-art methods. Code is available at https://github.com/jingyzhang/SMG-Learning.
    Robust Reinforcement Learning with Distributional Risk-averse formulation. (arXiv:2206.06841v1 [cs.LG])
    Robust Reinforcement Learning tries to make predictions more robust to changes in the dynamics or rewards of the system. This problem is particularly important when the dynamics and rewards of the environment are estimated from the data. In this paper, we approximate Robust Reinforcement Learning constrained with a $\Phi$-divergence using an approximate Risk-Averse formulation. We show that the classical Reinforcement Learning formulation can be robustified using standard deviation penalization of the objective. Two algorithms based on Distributional Reinforcement Learning, one for discrete and one for continuous action spaces, are proposed and tested in a classical Gym environment to demonstrate the robustness of the algorithms.
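    The idea of standard deviation penalization can be shown on sampled returns. The sketch below scores a set of returns as mean minus `alpha` times standard deviation; the function name, the penalty weight `alpha`, and the sample returns are all hypothetical, and the paper's own algorithms (which build on distributional RL) are not reproduced here.

```python
from statistics import fmean, pstdev


def risk_averse_score(returns, alpha):
    """Mean return penalised by `alpha` times the population standard deviation."""
    return fmean(returns) - alpha * pstdev(returns)


# Hypothetical episode returns for two policies.
steady = [10.0, 10.0, 10.0, 10.0]   # lower mean, no spread
risky = [0.0, 0.0, 20.0, 28.0]      # higher mean, large spread

risk_neutral = (risk_averse_score(steady, 0.0), risk_averse_score(risky, 0.0))
risk_averse = (risk_averse_score(steady, 0.5), risk_averse_score(risky, 0.5))
```

    With `alpha = 0` the risky policy scores higher on its mean alone, while with `alpha = 0.5` the penalty reverses the ranking in favour of the steady policy.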
    Deep Reinforcement Learning for Exact Combinatorial Optimization: Learning to Branch. (arXiv:2206.06965v1 [cs.LG])
    Branch-and-bound is a systematic enumerative method for combinatorial optimization, where the performance highly relies on the variable selection strategy. State-of-the-art handcrafted heuristic strategies suffer from relatively slow inference time for each selection, while the current machine learning methods require a significant amount of labeled data. We propose a new approach for solving the data labeling and inference latency issues in combinatorial optimization based on the use of the reinforcement learning (RL) paradigm. We use imitation learning to bootstrap an RL agent and then use Proximal Policy Optimization (PPO) to further explore global optimal actions. Then, a value network is used to run Monte-Carlo tree search (MCTS) to enhance the policy network. We evaluate the performance of our method on four different categories of combinatorial optimization problems and show that our approach performs strongly compared to the state-of-the-art machine learning and heuristics based methods.
    DeepEmotex: Classifying Emotion in Text Messages using Deep Transfer Learning. (arXiv:2206.06775v1 [cs.IR])
    Transfer learning has been widely used in natural language processing through deep pretrained language models, such as Bidirectional Encoder Representations from Transformers and Universal Sentence Encoder. Despite the great success, language models get overfitted when applied to small datasets and are prone to forgetting when fine-tuned with a classifier. To remedy this problem of forgetting in transferring deep pretrained language models from one domain to another domain, existing efforts explore fine-tuning methods to forget less. We propose DeepEmotex, an effective sequential transfer learning method to detect emotion in text. To avoid the forgetting problem, the fine-tuning step is instrumented by a large amount of emotion-labeled data collected from Twitter. We conduct an experimental study using both curated Twitter data sets and benchmark data sets. DeepEmotex models achieve over 91% accuracy for multi-class emotion classification on the test dataset. We evaluate the performance of the fine-tuned DeepEmotex models in classifying emotion in EmoInt and Stimulus benchmark datasets. The models correctly classify emotion in 73% of the instances in the benchmark datasets. The proposed DeepEmotex-BERT model outperforms the Bi-LSTM result on the benchmark datasets by 23%. We also study the effect of the size of the fine-tuning dataset on the accuracy of our models. Our evaluation results show that fine-tuning with a large set of emotion-labeled data improves both the robustness and effectiveness of the resulting target task model.
    Architectural patterns for handling runtime uncertainty of data-driven models in safety-critical perception. (arXiv:2206.06838v1 [cs.SE])
    Data-driven models (DDM) based on machine learning and other AI techniques play an important role in the perception of increasingly autonomous systems. Due to the merely implicit definition of their behavior mainly based on the data used for training, DDM outputs are subject to uncertainty. This poses a challenge with respect to the realization of safety-critical perception tasks by means of DDMs. A promising approach to tackling this challenge is to estimate the uncertainty in the current situation during operation and adapt the system behavior accordingly. In previous work, we focused on runtime estimation of uncertainty and discussed approaches for handling uncertainty estimations. In this paper, we present additional architectural patterns for handling uncertainty. Furthermore, we evaluate the four patterns qualitatively and quantitatively with respect to safety and performance gains. For the quantitative evaluation, we consider a distance controller for vehicle platooning where performance gains are measured by considering how much the distance can be reduced in different operational situations. We conclude that the consideration of context information of the driving situation makes it possible to accept more or less uncertainty depending on the inherent risk of the situation, which results in performance gains.
    Conformal Off-Policy Prediction. (arXiv:2206.06711v1 [stat.ML])
    Off-policy evaluation is critical in a number of applications where new policies need to be evaluated offline before online deployment. Most existing methods focus on the expected return, define the target parameter through averaging and provide a point estimator only. In this paper, we develop a novel procedure to produce reliable interval estimators for a target policy's return starting from any initial state. Our proposal accounts for the variability of the return around its expectation, focuses on the individual effect and offers valid uncertainty quantification. Our main idea lies in designing a pseudo policy that generates subsamples as if they were sampled from the target policy so that existing conformal prediction algorithms are applicable to prediction interval construction. Our methods are justified by theories, synthetic data and real data from short-video platforms.
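    For readers unfamiliar with the conformal prediction algorithms this abstract builds on, here is a minimal sketch of generic split conformal prediction with absolute-residual scores. It assumes exchangeable calibration data and does not implement the paper's pseudo-policy subsampling; the function names and the calibration residuals are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """The ceil((n + 1) * (1 - alpha))-th smallest score, or infinity if out of range."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this coverage level.
        return math.inf
    return sorted(scores)[k - 1]


def prediction_interval(point_prediction, calibration_residuals, alpha):
    """Symmetric interval around a point prediction with target coverage 1 - alpha."""
    q = conformal_quantile([abs(r) for r in calibration_residuals], alpha)
    return point_prediction - q, point_prediction + q


# Hypothetical residuals (observed return minus predicted return) on held-out data.
residuals = [0.05, -0.1, 0.15, -0.2, 0.25, -0.3, 0.35, -0.4, 0.45, -0.5, 0.55]
lower, upper = prediction_interval(2.0, residuals, alpha=0.25)
```

    With 11 calibration residuals and `alpha = 0.25`, the interval half-width is the 9th smallest absolute residual.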
    FETILDA: An Effective Framework For Fin-tuned Embeddings For Long Financial Text Documents. (arXiv:2206.06952v1 [cs.CL])
    Unstructured data, especially text, continues to grow rapidly in various domains. In particular, in the financial sphere, there is a wealth of accumulated unstructured financial data, such as the textual disclosure documents that companies submit on a regular basis to regulatory agencies, such as the Securities and Exchange Commission (SEC). These documents are typically very long and tend to contain valuable soft information about a company's performance. It is therefore of great interest to learn predictive models from these long textual documents, especially for forecasting numerical key performance indicators (KPIs). Whereas there has been great progress in pre-trained language models (LMs) that learn from tremendously large corpora of textual data, they still struggle in terms of effective representations for long documents. Our work fills this critical need, namely how to develop better models to extract useful information from long textual documents and learn effective features that can leverage the soft financial and risk information for text regression (prediction) tasks. In this paper, we propose and implement a deep learning framework that splits long documents into chunks and utilizes pre-trained LMs to process and aggregate the chunks into vector representations, followed by self-attention to extract valuable document-level features. We evaluate our model on a collection of 10-K public disclosure reports from US banks, and another dataset of reports submitted by US companies. Overall, our framework outperforms strong baseline methods for textual modeling as well as a baseline regression model using only numerical data. Our work provides better insights into how utilizing pre-trained domain-specific and fine-tuned long-input LMs in representing long documents can improve the quality of representation of textual data, and therefore, help in improving predictive analyses.
    Universally Expressive Communication in Multi-Agent Reinforcement Learning. (arXiv:2206.06758v1 [cs.MA])
    Allowing agents to share information through communication is crucial for solving complex tasks in multi-agent reinforcement learning. In this work, we consider the question of whether a given communication protocol can express an arbitrary policy. By observing that many existing protocols can be viewed as instances of graph neural networks (GNNs), we demonstrate the equivalence of joint action selection to node labelling. With standard GNN approaches provably limited in their expressive capacity, we draw from existing GNN literature and consider augmenting agent observations with: (1) unique agent IDs and (2) random noise. We provide a theoretical analysis as to how these approaches yield universally expressive communication, and also prove them capable of targeting arbitrary sets of actions for identical agents. Empirically, these augmentations are found to improve performance on tasks where expressive communication is required, whilst, in general, the optimal communication protocol is found to be task-dependent.
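    The two augmentations named in the abstract above (unique agent IDs and random noise appended to each agent's observation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the one-hot ID encoding, the Gaussian noise, and the names `augment_observations` and `noise_dim` are assumptions made here.

```python
import numpy as np

def augment_observations(obs, rng, noise_dim=2):
    """Return two augmented copies of `obs` (shape: n_agents x obs_dim).

    Assumed encodings (hypothetical, chosen for illustration):
      - unique IDs are one-hot vectors, one per agent;
      - random noise is drawn i.i.d. from a standard normal.
    """
    n_agents = obs.shape[0]
    ids = np.eye(n_agents)                           # one-hot agent identifiers
    noise = rng.normal(size=(n_agents, noise_dim))   # fresh noise per agent
    with_ids = np.concatenate([obs, ids], axis=1)
    with_noise = np.concatenate([obs, noise], axis=1)
    return with_ids, with_noise

rng = np.random.default_rng(0)
# Three agents that all receive the same observation.
obs = np.tile(np.array([[0.5, -1.0, 2.0, 0.0]]), (3, 1))
with_ids, with_noise = augment_observations(obs, rng)
```

    With identical raw observations, the rows of `obs` are indistinguishable, whereas the rows of `with_ids` differ, which is what lets a message-passing model treat otherwise identical agents differently.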
    Task Transfer and Domain Adaptation for Zero-Shot Question Answering. (arXiv:2206.06705v1 [cs.CL])
    Pretrained language models have shown success in various areas of natural language processing, including reading comprehension tasks. However, when applying machine learning methods to new domains, labeled data may not always be available. To address this, we use supervised pretraining on source-domain data to reduce sample complexity on domain-specific downstream tasks. We evaluate zero-shot performance on domain-specific reading comprehension tasks by combining task transfer with domain adaptation to fine-tune a pretrained model with no labelled data from the target task. Our approach outperforms Domain-Adaptive Pretraining on downstream domain-specific reading comprehension tasks in 3 out of 4 domains.
    CNN-based Classification Framework for Tissues of Lung with Additional Information. (arXiv:2206.06701v1 [eess.IV])
    Interstitial lung diseases are a large group of heterogeneous diseases characterized by different degrees of alveolitis and pulmonary fibrosis. Accurately diagnosing these diseases has significant guiding value for formulating treatment plans. Although previous work has produced impressive results in classifying interstitial lung diseases, there is still room for improving the accuracy of these techniques, mainly to enhance automated decision-making. In order to improve the classification precision, our study proposes a convolutional neural network-based framework with additional information. Firstly, ILD images are enriched with their medical information by re-scaling the original image in Hounsfield Units. Secondly, a modified CNN model is used to produce a vector of classification probability for each tissue. Thirdly, location information of the input image, consisting of the occurrence frequencies of different diseases in the CT scans on certain locations, is used to calculate a location weight vector. Finally, the Hadamard product between two vectors is used to produce a decision vector for the prediction. Compared to the state-of-the-art methods, the results using a publicly available ILD database show the potential of predicting these diseases using different additional information.
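    The final step described above, combining a classification probability vector with a location weight vector through a Hadamard (element-wise) product, can be sketched as follows. The numbers are made up for illustration, and the renormalisation to a sum of 1 is an assumption added here, not something the abstract states.

```python
import numpy as np

def decision_vector(class_probs, location_weights):
    """Element-wise (Hadamard) product of two equally sized vectors.

    The result is renormalised to sum to 1; this renormalisation is an
    assumption of this sketch and does not change the argmax.
    """
    class_probs = np.asarray(class_probs, dtype=float)
    location_weights = np.asarray(location_weights, dtype=float)
    if class_probs.shape != location_weights.shape:
        raise ValueError("both vectors must have the same shape")
    combined = class_probs * location_weights  # Hadamard product
    total = combined.sum()
    if total == 0.0:
        return class_probs  # no location support at all: keep CNN output
    return combined / total

# Hypothetical values for three tissue classes.
probs = [0.5, 0.3, 0.2]      # CNN classification probabilities
weights = [0.1, 0.6, 0.3]    # location weights from occurrence frequencies
d = decision_vector(probs, weights)
predicted = int(np.argmax(d))
```

    In this toy case the CNN alone would pick class 0, while the location weights shift the decision to class 1.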
    Fast Model Editing at Scale. (arXiv:2110.11309v2 [cs.LG] UPDATED)
    While large pre-trained models have enabled impressive results on a variety of downstream tasks, the largest existing models still make errors, and even accurate predictions may become outdated over time. Because detecting all such failures at training time is impossible, enabling both developers and end users of such models to correct inaccurate outputs while leaving the model otherwise intact is desirable. However, the distributed, black-box nature of the representations learned by large neural networks makes producing such targeted edits difficult. If presented with only a single problematic input and new desired output, fine-tuning approaches tend to overfit; other editing algorithms are either computationally infeasible or simply ineffective when applied to very large models. To enable easy post-hoc editing at scale, we propose Model Editor Networks using Gradient Decomposition (MEND), a collection of small auxiliary editing networks that use a single desired input-output pair to make fast, local edits to a pre-trained model's behavior. MEND learns to transform the gradient obtained by standard fine-tuning, using a low-rank decomposition of the gradient to make the parameterization of this transformation tractable. MEND can be trained on a single GPU in less than a day even for 10 billion+ parameter models; once trained MEND enables rapid application of new edits to the pre-trained model. Our experiments with T5, GPT, BERT, and BART models show that MEND is the only approach to model editing that effectively edits the behavior of models with more than 10 billion parameters. Code and data available at https://sites.google.com/view/mend-editing.
    Assessing Privacy Leakage in Synthetic 3-D PET Imaging using Transversal GAN. (arXiv:2206.06448v1 [eess.IV])
    Training computer-vision related algorithms on medical images for disease diagnosis or image segmentation is difficult in large part due to privacy concerns. For this reason, generative image models are highly sought after to facilitate data sharing. However, 3-D generative models are understudied, and investigation of their privacy leakage is needed. We introduce our 3-D generative model, Transversal GAN (TrGAN), using head & neck PET images which are conditioned on tumour masks as a case study. We define quantitative measures of image fidelity, utility and privacy for our model. These metrics are evaluated in the course of training to identify ideal fidelity, utility and privacy trade-offs and establish the relationships between these parameters. We show that the discriminator of the TrGAN is vulnerable to attack, and that an attacker can identify which samples were used in training with almost perfect accuracy (AUC = 0.99). We also show that an attacker with access to only the generator cannot reliably classify whether a sample had been used for training (AUC = 0.51). This suggests that TrGAN generators, but not discriminators, may be used for sharing synthetic 3-D PET data with minimal privacy risk while maintaining good utility and fidelity.
    Compositional Mixture Representations for Vision and Text. (arXiv:2206.06404v1 [cs.CV])
    Learning a common representation space between vision and language allows deep networks to relate objects in the image to the corresponding semantic meaning. We present a model that learns a shared Gaussian mixture representation imposing the compositionality of the text onto the visual domain without having explicit location supervision. By combining the spatial transformer with a representation learning approach we learn to split images into separately encoded patches to associate visual and textual representations in an interpretable manner. On variations of MNIST and CIFAR10, our model is able to perform weakly supervised object detection and demonstrates its ability to extrapolate to unseen combinations of objects.
    LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks. (arXiv:2206.06565v1 [cs.LG])
    Fine-tuning pretrained language models (LMs) without making any architectural changes has become a norm for learning various language downstream tasks. However, for non-language downstream tasks, a common practice is to employ task-specific designs for input, output layers, and loss functions. For instance, it is possible to fine-tune an LM into an MNIST classifier by replacing the word embedding layer with an image patch embedding layer, the word token output layer with a 10-way output layer, and the word prediction loss with a 10-way classification loss, respectively. A natural question arises: can LM fine-tuning solve non-language downstream tasks without changing the model architecture or loss function? To answer this, we propose Language-Interfaced Fine-Tuning (LIFT) and study its efficacy and limitations by conducting an extensive empirical study on a suite of non-language classification and regression tasks. LIFT does not make any changes to the model architecture or loss function, and it solely relies on the natural language interface, enabling "no-code machine learning with LMs." We find that LIFT performs relatively well across a wide range of low-dimensional classification and regression tasks, matching the performances of the best baselines in many cases, especially for the classification tasks. We report the experimental results on the fundamental properties of LIFT, including its inductive bias, sample efficiency, ability to extrapolate, robustness to outliers and label noise, and generalization. We also analyze a few properties/techniques specific to LIFT, e.g., context-aware learning via appropriate prompting, quantification of predictive uncertainty, and two-stage fine-tuning. Our code is available at https://github.com/UW-Madison-Lee-Lab/LanguageInterfacedFineTuning.
    Quantitative performance evaluation of Bayesian neural networks. (arXiv:2206.06779v1 [cs.LG])
    Due to the growing adoption of deep neural networks in many fields of science and engineering, modeling and estimating their uncertainties has become of primary importance. Various approaches have been investigated including Bayesian neural networks, ensembles, deterministic approximations, amongst others. Despite the growing literature about uncertainty quantification in deep learning, the quality of the uncertainty estimates remains an open question. In this work, we attempt to assess the performance of several algorithms on sampling and regression tasks by evaluating the quality of the confidence regions and how well the generated samples are representative of the unknown target distribution. Towards this end, several sampling and regression tasks are considered, and the selected algorithms are compared in terms of coverage probabilities, kernelized Stein discrepancies, and maximum mean discrepancies.
    Adversarial Vulnerability of Randomized Ensembles. (arXiv:2206.06737v1 [cs.LG])
    Despite the tremendous success of deep neural networks across various tasks, their vulnerability to imperceptible adversarial perturbations has hindered their deployment in the real world. Recently, works on randomized ensembles have empirically demonstrated significant improvements in adversarial robustness over standard adversarially trained (AT) models with minimal computational overhead, making them a promising solution for safety-critical resource-constrained applications. However, this impressive performance raises the question: Are these robustness gains provided by randomized ensembles real? In this work we address this question both theoretically and empirically. We first establish theoretically that commonly employed robustness evaluation methods such as adaptive PGD provide a false sense of security in this setting. Subsequently, we propose a theoretically-sound and efficient adversarial attack algorithm (ARC) capable of compromising random ensembles even in cases where adaptive PGD fails to do so. We conduct comprehensive experiments across a variety of network architectures, training schemes, datasets, and norms to support our claims, and empirically establish that randomized ensembles are in fact more vulnerable to $\ell_p$-bounded adversarial perturbations than even standard AT models. Our code can be found at https://github.com/hsndbk4/ARC.
    Supervised Dictionary Learning with Auxiliary Covariates. (arXiv:2206.06774v1 [stat.ML])
    Supervised dictionary learning (SDL) is a classical machine learning method that simultaneously seeks feature extraction and classification tasks, which are not necessarily a priori aligned objectives. The goal of SDL is to learn a class-discriminative dictionary, which is a set of latent feature vectors that can well-explain both the features as well as labels of observed data. In this paper, we provide a systematic study of SDL, including the theory, algorithm, and applications of SDL. First, we provide a novel framework that `lifts' SDL as a convex problem in a combined factor space and propose a low-rank projected gradient descent algorithm that converges exponentially to the global minimizer of the objective. We also formulate generative models of SDL and provide global estimation guarantees of the true parameters depending on the hyperparameter regime. Second, viewed as a nonconvex constrained optimization problem, we provide an efficient block coordinate descent algorithm for SDL that is guaranteed to find an $\varepsilon$-stationary point of the objective in $O(\varepsilon^{-1}(\log \varepsilon^{-1})^{2})$ iterations. For the corresponding generative model, we establish a novel non-asymptotic local consistency result for constrained and regularized maximum likelihood estimation problems, which may be of independent interest. Third, we apply SDL for imbalanced document classification by supervised topic modeling and also for pneumonia detection from chest X-ray images. We also provide simulation studies to demonstrate that SDL becomes more effective when there is a discrepancy between the best reconstructive and the best discriminative dictionaries.
    Image-based Treatment Effect Heterogeneity. (arXiv:2206.06417v1 [cs.LG])
    Randomized controlled trials (RCTs) are considered the gold standard for estimating the effects of interventions. Recent work has studied effect heterogeneity in RCTs by conditioning estimates on tabular variables such as age and ethnicity. However, such variables are often only observed near the time of the experiment and may fail to capture historical or geographical reasons for effect variation. When experiment units are associated with a particular location, satellite imagery can provide such historical and geographical information, yet there is no method which incorporates it for describing effect heterogeneity. In this paper, we develop such a method which estimates, using a deep probabilistic modeling framework, the clusters of images having the same distribution over treatment effects. We compare the proposed methods against alternatives in simulation and in an application to estimating the effects of an anti-poverty intervention in Uganda. A causal regularization penalty is introduced to ensure reliability of the cluster model in recovering Average Treatment Effects (ATEs). Finally, we discuss feasibility, limitations, and the applicability of these methods to other domains, such as medicine and climate science, where image information is prevalent. We make code for all modeling strategies publicly available in an open-source software package.
    GraphMLP: A Graph MLP-Like Architecture for 3D Human Pose Estimation. (arXiv:2206.06420v1 [cs.CV])
    Modern multi-layer perceptron (MLP) models have shown competitive results in learning visual representations without self-attention. However, existing MLP models are not good at capturing local details and lack prior knowledge of human configurations, which limits their modeling power for skeletal representation learning. To address these issues, we propose a simple yet effective graph-reinforced MLP-Like architecture, named GraphMLP, that combines MLPs and graph convolutional networks (GCNs) in a global-local-graphical unified architecture for 3D human pose estimation. GraphMLP incorporates the graph structure of human bodies into an MLP model to meet the domain-specific demand while also allowing for both local and global spatial interactions. Extensive experiments show that the proposed GraphMLP achieves state-of-the-art performance on two datasets, i.e., Human3.6M and MPI-INF-3DHP. Our source code and pretrained models will be publicly available.
    On the Convergence of the Shapley Value in Parametric Bayesian Learning Games. (arXiv:2205.07428v2 [cs.LG] UPDATED)
    Measuring contributions is a classical problem in cooperative game theory where the Shapley value is the most well-known solution concept. In this paper, we establish the convergence property of the Shapley value in parametric Bayesian learning games where players perform a Bayesian inference using their combined data, and the posterior-prior KL divergence is used as the characteristic function. We show that for any two players, under some regularity conditions, their difference in Shapley value converges in probability to the difference in Shapley value of a limiting game whose characteristic function is proportional to the log-determinant of the joint Fisher information. As an application, we present an online collaborative learning framework that is asymptotically Shapley-fair. Our result enables this to be achieved without any costly computations of posterior-prior KL divergences. Only a consistent estimator of the Fisher information is needed. The effectiveness of our framework is demonstrated with experiments using real-world data.
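    To make the notion of a Shapley value with a log-determinant characteristic function concrete, here is a minimal sketch that computes exact Shapley values for a three-player toy game. It is not the paper's framework: the Fisher information matrices are made up, and the characteristic function `logdet_value` (half the log-determinant of the identity plus the coalition's summed information, so that the empty coalition has value 0) is an assumption chosen for illustration.

```python
from itertools import combinations
from math import factorial
import numpy as np

def shapley_values(n_players, value):
    """Exact Shapley values via the subset formula (exponential in n_players)."""
    phi = np.zeros(n_players)
    for i in range(n_players):
        others = [p for p in range(n_players) if p != i]
        for k in range(len(others) + 1):
            # Weight of a coalition of size k that player i joins.
            weight = factorial(k) * factorial(n_players - k - 1) / factorial(n_players)
            for subset in combinations(others, k):
                coalition = frozenset(subset)
                phi[i] += weight * (value(coalition | {i}) - value(coalition))
    return phi

# Hypothetical per-player Fisher information matrices (2 parameters).
fisher = [np.diag([1.0, 0.0]), np.diag([0.0, 2.0]), np.diag([1.0, 2.0])]

def logdet_value(coalition):
    """Assumed characteristic function: 0.5 * log det(I + sum of Fisher matrices)."""
    total = np.eye(2)
    for p in coalition:
        total = total + fisher[p]
    _, logdet = np.linalg.slogdet(total)
    return 0.5 * logdet

phi = shapley_values(3, logdet_value)
```

    The values sum to the worth of the grand coalition (the efficiency property), and the player whose data is informative about both parameters receives the largest share.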
    Visual Radial Basis Q-Network. (arXiv:2206.06712v1 [cs.CV])
    While reinforcement learning (RL) from raw images has been largely investigated in the last decade, existing approaches still suffer from a number of constraints. The high input dimension is often handled using either expert knowledge to extract handcrafted features or environment encoding through convolutional networks. Both solutions require numerous parameters to be optimized. In contrast, we propose a generic method to extract sparse features from raw images with few trainable parameters. We achieved this using a Radial Basis Function Network (RBFN) directly on raw images. We evaluate the performance of the proposed approach for visual extraction in Q-learning tasks in the Vizdoom environment. Then, we compare our results with two Deep Q-Networks, one trained directly on images and another one trained on features extracted by a pretrained auto-encoder. We show that the proposed approach provides similar or, in some cases, even better performances with fewer trainable parameters while being conceptually simpler.
    Estimating Causal Effects Under Image Confounding Bias with an Application to Poverty in Africa. (arXiv:2206.06410v1 [cs.LG])
    Observational studies of causal effects require adjustment for confounding factors. In the tabular setting, where these factors are well-defined, separate random variables, the effect of confounding is well understood. However, in public policy, ecology, and in medicine, decisions are often made in non-tabular settings, informed by patterns or objects detected in images (e.g., maps, satellite or tomography imagery). Using such imagery for causal inference presents an opportunity because objects in the image may be related to the treatment and outcome of interest. In these cases, we rely on the images to adjust for confounding but observed data do not directly label the existence of the important objects. Motivated by real-world applications, we formalize this challenge, how it can be handled, and what conditions are sufficient to identify and estimate causal effects. We analyze finite-sample performance using simulation experiments, estimating effects using a propensity adjustment algorithm that employs a machine learning model to estimate the image confounding. Our experiments also examine sensitivity to misspecification of the image pattern mechanism. Finally, we use our methodology to estimate the effects of policy interventions on poverty in African communities from satellite imagery.
    SS-GNN: A Simple-Structured Graph Neural Network for Affinity Prediction. (arXiv:2206.07015v1 [q-bio.BM])
    Efficient and effective drug-target binding affinity (DTBA) prediction is a challenging task due to the limited computational resources in practical applications and is a crucial basis for drug screening. Inspired by the good representation ability of graph neural networks (GNNs), we propose a simple-structured GNN model named SS-GNN to accurately predict DTBA. By constructing a single undirected graph based on a distance threshold to represent protein-ligand interactions, the scale of the graph data is greatly reduced. Moreover, ignoring covalent bonds in the protein further reduces the computational cost of the model. The GNN-MLP module takes the latent feature extraction of atoms and edges in the graph as two mutually independent processes. We also develop an edge-based atom-pair feature aggregation method to represent complex interactions and a graph pooling-based method to predict the binding affinity of the complex. We achieve state-of-the-art prediction performance using a simple model (with only 0.6M parameters) without introducing complicated geometric feature descriptions. SS-GNN achieves Pearson's Rp=0.853 on the PDBbind v2016 core set, outperforming state-of-the-art GNN-based methods by 5.2%. Moreover, the simplified model structure and concise data processing procedure improve the prediction efficiency of the model. For a typical protein-ligand complex, affinity prediction takes only 0.2 ms. All codes are freely accessible at https://github.com/xianyuco/SS-GNN.
    When adversarial attacks become interpretable counterfactual explanations. (arXiv:2206.06854v1 [cs.AI])
    We argue that, when learning a 1-Lipschitz neural network with the dual loss of an optimal transportation problem, the gradient of the model is both the direction of the transportation plan and the direction to the closest adversarial attack. Traveling along the gradient to the decision boundary is no longer an adversarial attack but becomes a counterfactual explanation, explicitly transporting from one class to the other. Through extensive experiments on XAI metrics, we find that the simple saliency map method, applied on such networks, becomes a reliable explanation, and outperforms the state-of-the-art explanation approaches on unconstrained models. The proposed networks were already known to be certifiably robust, and we prove that they are also explainable with a fast and simple method.
    On Image Segmentation With Noisy Labels: Characterization and Volume Properties of the Optimal Solutions to Accuracy and Dice. (arXiv:2206.06484v1 [cs.CV])
    We study two of the most popular performance metrics in medical image segmentation, Accuracy and Dice, when the target labels are noisy. For both metrics, several statements related to characterization and volume properties of the set of optimal segmentations are proved, and associated experiments are provided. Our main insights are: (i) the volume of the solutions to both metrics may deviate significantly from the expected volume of the target, (ii) the volume of a solution to Accuracy is always less than or equal to the volume of a solution to Dice and (iii) the optimal solutions to both of these metrics coincide when the set of feasible segmentations is constrained to the set of segmentations with the volume equal to the expected volume of the target.
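    Insight (ii) above can be checked by brute force on a tiny example. The sketch below assumes a toy setting that the abstract does not spell out: four pixels whose noisy labels are independent Bernoulli variables with made-up foreground probabilities, optimality meaning maximal expected metric, and Dice defined as 1 when both masks are empty.

```python
from itertools import product
import numpy as np

def accuracy(seg, target):
    return float(np.mean(seg == target))

def dice(seg, target):
    denom = seg.sum() + target.sum()
    if denom == 0:
        return 1.0  # convention assumed here for two empty masks
    return 2.0 * float(np.sum(seg & target)) / float(denom)

def expected_metric(metric, seg, probs):
    """Exact expectation over all 2^n label realisations (independent pixels)."""
    total = 0.0
    for bits in product([0, 1], repeat=len(probs)):
        target = np.array(bits, dtype=int)
        weight = float(np.prod(np.where(target == 1, probs, 1.0 - probs)))
        total += weight * metric(seg, target)
    return total

def best_segmentation(metric, probs):
    """Enumerate every binary mask and return one maximising the expected metric."""
    candidates = [np.array(b, dtype=int) for b in product([0, 1], repeat=len(probs))]
    scores = [expected_metric(metric, seg, probs) for seg in candidates]
    return candidates[int(np.argmax(scores))]

# Hypothetical per-pixel foreground probabilities; expected target volume is 2.0.
probs = np.array([0.9, 0.45, 0.45, 0.2])
acc_opt = best_segmentation(accuracy, probs)
dice_opt = best_segmentation(dice, probs)
```

    Comparing `acc_opt.sum()` with `dice_opt.sum()` and with `probs.sum()` shows how the volumes of the two optimal masks relate to each other and to the expected target volume in this toy case.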
    Energy Flows: Towards Determinant-Free Training of Normalizing Flows. (arXiv:2206.06672v1 [cs.LG])
    Normalizing flows are a popular approach for constructing probabilistic and generative models. However, maximum likelihood training of flows is challenging due to the need to calculate computationally expensive determinants of Jacobians. This paper takes steps towards addressing this challenge by introducing an approach for determinant-free training of flows inspired by two-sample testing. Central to our framework is the energy objective, a multidimensional extension of proper scoring rules that admits efficient estimators based on random projections and that outperforms a range of alternative two-sample objectives that can be derived in our framework. Crucially, the energy objective and its alternatives do not require calculating determinants and therefore support general flow architectures that are not well-suited to maximum likelihood training (e.g., densely connected networks). We empirically demonstrate that energy flows achieve competitive generative modeling performance while maintaining fast generation and posterior inference.
    Physics-Informed Transfer Learning Strategy to Accelerate Unsteady Fluid Flow Simulations. (arXiv:2206.06817v1 [physics.flu-dyn])
    Since the derivation of the Navier-Stokes equations, it has become possible to numerically solve real world viscous flow problems (computational fluid dynamics (CFD)). However, despite the rapid advancements in the performance of central processing units (CPUs), the computational cost of simulating transient flows with extremely small time/grid scale physics is still unrealistic. In recent years, machine learning (ML) technology has received significant attention across industries, and this big wave has propagated various interests in the fluid dynamics community. Recent ML CFD studies have revealed that completely suppressing the increase in error with the increase in interval between the training and prediction times in data driven methods is unrealistic. The development of a practical CFD acceleration methodology that applies ML is a remaining issue. Therefore, the objectives of this study were to develop a realistic ML strategy based on physics-informed transfer learning and to validate the accuracy and acceleration performance of this strategy using an unsteady CFD dataset. This strategy can determine the timing of transfer learning while monitoring the residuals of the governing equations in a cross coupling computation framework. Consequently, our hypothesis that continuous fluid flow time series prediction is feasible was validated, as the intermediate CFD simulations periodically not only reduce the increased residuals but also update the network parameters. Notably, the cross coupling strategy with a grid based network model does not compromise the simulation accuracy for computational acceleration. The simulation was accelerated by 1.8 times in the laminar counterflow CFD dataset condition including the parameter updating time. Open source CFD software OpenFOAM and open-source ML software TensorFlow were used in this feasibility study.
    Missingness Bias in Model Debugging. (arXiv:2204.08945v2 [cs.CV] UPDATED)
    Missingness, or the absence of features from an input, is a concept fundamental to many model debugging tools. However, in computer vision, pixels cannot simply be removed from an image. One thus tends to resort to heuristics such as blacking out pixels, which may in turn introduce bias into the debugging process. We study such biases and, in particular, show how transformer-based architectures can enable a more natural implementation of missingness, which side-steps these issues and improves the reliability of model debugging in practice. Our code is available at https://github.com/madrylab/missingness
    Generalizable Method for Face Anti-Spoofing with Semi-Supervised Learning. (arXiv:2206.06510v1 [cs.CV])
    Face anti-spoofing has drawn a lot of attention due to the high security requirements in biometric authentication systems. Bringing face biometrics to commercial hardware became mostly dependent on developing reliable methods for detecting fake login sessions without specialized sensors. Current CNN-based methods perform well on the domains they were trained for, but often show poor generalization on previously unseen datasets. In this paper we describe a method for utilizing unsupervised pretraining for improving performance across multiple datasets without any adaptation, introduce the Entry Antispoofing Dataset for supervised fine-tuning, and propose a multi-class auxiliary classification layer for augmenting the binary classification task of detecting spoofing attempts with explicit interpretable signals. We demonstrate the efficiency of our model by achieving state-of-the-art results on cross-dataset testing on MSU-MFSD, Replay-Attack, and OULU-NPU datasets.
    Adversarially Robust Multi-Armed Bandit Algorithm with Variance-Dependent Regret Bounds. (arXiv:2206.06810v1 [cs.LG])
    This paper considers the multi-armed bandit (MAB) problem and provides a new best-of-both-worlds (BOBW) algorithm that works nearly optimally in both stochastic and adversarial settings. In stochastic settings, some existing BOBW algorithms achieve tight gap-dependent regret bounds of $O(\sum_{i: \Delta_i>0} \frac{\log T}{\Delta_i})$ for suboptimality gap $\Delta_i$ of arm $i$ and time horizon $T$. As Audibert et al. [2007] have shown, however, the performance can be improved in stochastic environments with low-variance arms. In fact, they have provided a stochastic MAB algorithm with gap-variance-dependent regret bounds of $O(\sum_{i: \Delta_i>0} (\frac{\sigma_i^2}{\Delta_i} + 1) \log T )$ for loss variance $\sigma_i^2$ of arm $i$. In this paper, we propose the first BOBW algorithm with gap-variance-dependent bounds, showing that the variance information can be used even in the possibly adversarial environment. Further, the leading constant factor in our gap-variance dependent bound is only (almost) twice the value for the lower bound. Additionally, the proposed algorithm enjoys multiple data-dependent regret bounds in adversarial settings and works well in stochastic settings with adversarial corruptions. The proposed algorithm is based on the follow-the-regularized-leader method and employs adaptive learning rates that depend on the empirical prediction error of the loss, which leads to gap-variance-dependent regret bounds reflecting the variance of the arms.
    End-to-end Kernel Learning via Generative Random Fourier Features. (arXiv:2009.04614v3 [cs.LG] UPDATED)
    Random Fourier features (RFFs) provide a promising way for kernel learning in a spectral case. Current RFFs-based kernel learning methods usually work in a two-stage way. In the first-stage process, learning the optimal feature map is often formulated as a target alignment problem, which aims to align the learned kernel with the pre-defined target kernel (usually the ideal kernel). In the second-stage process, a linear learner is conducted with respect to the mapped random features. Nevertheless, the pre-defined kernel in target alignment is not necessarily optimal for the generalization of the linear learner. Instead, in this paper, we consider a one-stage process that incorporates the kernel learning and linear learner into a unifying framework. To be specific, a generative network via RFFs is devised to implicitly learn the kernel, followed by a linear classifier parameterized as a fully-connected layer. Then the generative network and the classifier are jointly trained by solving the empirical risk minimization (ERM) problem to reach a one-stage solution. This end-to-end scheme naturally allows deeper features, in correspondence to a multi-layer structure, and shows superior generalization performance over the classical two-stage, RFFs-based methods in real-world classification tasks. Moreover, inspired by the randomized resampling mechanism of the proposed method, its enhanced adversarial robustness is investigated and experimentally verified.
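    For readers unfamiliar with the building block, the sketch below shows classical random Fourier features with fixed (not learned) frequencies approximating a Gaussian kernel $k(x, y) = \exp(-\gamma \|x - y\|^2)$. It is not the generative, end-to-end method of the paper; the function names, `gamma`, and the feature count are choices made here for illustration.

```python
import numpy as np

def rff_features(X, n_features, gamma, rng):
    """Map X (n x d) to random Fourier features of the Gaussian kernel.

    Frequencies are sampled from N(0, 2*gamma*I), the spectral density of
    exp(-gamma * ||x - y||^2); phases are uniform on [0, 2*pi).
    """
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def rbf_kernel(X, gamma):
    """Exact Gaussian kernel matrix, used only to check the approximation."""
    sq = np.sum(X ** 2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
Z = rff_features(X, n_features=5000, gamma=0.5, rng=rng)

# Inner products of the features approximate the kernel matrix.
max_err = float(np.max(np.abs(Z @ Z.T - rbf_kernel(X, gamma=0.5))))
```

    A linear model trained on `Z` then behaves approximately like a kernel machine with the Gaussian kernel, which is the second stage of the two-stage pipeline the abstract describes.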
    Predicting waves in fluids with deep neural network. (arXiv:2201.06628v4 [physics.flu-dyn] UPDATED)
    In this paper, we present a deep learning technique for data-driven predictions of wave propagation in a fluid medium. The technique relies on an attention-based convolutional recurrent autoencoder network (AB-CRAN). To construct a low-dimensional representation of wave propagation data, we employ a denoising-based convolutional autoencoder. The AB-CRAN architecture with attention-based long short-term memory cells forms our deep neural network model for the time marching of the low-dimensional features. We assess the proposed AB-CRAN framework against the standard recurrent neural network for the low-dimensional learning of wave propagation. To demonstrate the effectiveness of the AB-CRAN model, we consider three benchmark problems, namely, one-dimensional linear convection, the nonlinear viscous Burgers equation, and the two-dimensional Saint-Venant shallow water system. Using the spatial-temporal datasets from the benchmark problems, our novel AB-CRAN architecture accurately captures the wave amplitude and preserves the wave characteristics of the solution for long time horizons. The attention-based sequence-to-sequence network increases the time-horizon of prediction compared to the standard recurrent neural network with long short-term memory cells. The denoising autoencoder further reduces the mean squared error of prediction and improves the generalization capability in the parameter space.
    Fiberwise dimensionality reduction of topologically complex data with vector bundles. (arXiv:2206.06513v1 [cs.CG])
    Datasets with non-trivial large scale topology can be hard to embed in low-dimensional Euclidean space with existing dimensionality reduction algorithms. We propose to model topologically complex datasets using vector bundles, in such a way that the base space accounts for the large scale topology, while the fibers account for the local geometry. This allows one to reduce the dimensionality of the fibers, while preserving the large scale topology. We formalize this point of view, and, as an application, we describe an algorithm which takes as input a dataset together with an initial representation of it in Euclidean space, assumed to recover part of its large scale topology, and outputs a new representation that integrates local representations, obtained through local linear dimensionality reduction, along the initial global representation. We demonstrate this algorithm on examples coming from dynamical systems and chemistry. In these examples, our algorithm is able to learn topologically faithful embeddings of the data in lower target dimension than various well known metric-based dimensionality reduction algorithms.
    Stein Variational Goal Generation For Reinforcement Learning in Hard Exploration Problems. (arXiv:2206.06719v1 [cs.LG])
    Multi-goal Reinforcement Learning has recently attracted a large amount of research interest. By allowing experience to be shared between related training tasks, this setting favors generalization for new tasks at test time, whenever some smoothness exists in the considered representation space of goals. However, in settings with discontinuities in state or goal spaces (e.g. walls in a maze), a majority of goals are difficult to reach, due to the sparsity of rewards in the absence of expert knowledge. This implies hard exploration, for which some curriculum of goals must be discovered, to help agents learn by adapting training tasks to their current capabilities. Building on recent automatic curriculum learning techniques for goal-conditioned policies, we propose a novel approach: Stein Variational Goal Generation (SVGG), which seeks to preferentially sample new goals in the zone of proximal development of the agent, by leveraging a learned model of its abilities, and a goal distribution modeled as particles in the exploration space. Our approach relies on Stein Variational Gradient Descent to dynamically attract the goal sampling distribution to areas of appropriate difficulty. We demonstrate the performance of the approach, in terms of success coverage in the goal space, compared to recent state-of-the-art RL methods for hard exploration problems.
    Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product Search. (arXiv:2206.06588v1 [cs.IR])
    Improving the quality of search results can significantly enhance user experience and engagement with search engines. In spite of several recent advancements in the fields of machine learning and data mining, correctly classifying items for a particular user search query has been a long-standing challenge, which still has considerable room for improvement. This paper introduces the "Shopping Queries Dataset", a large dataset of difficult Amazon search queries and results, publicly released with the aim of fostering research in improving the quality of search results. The dataset contains around 130 thousand unique queries and 2.6 million manually labeled (query,product) relevance judgements. The dataset is multilingual with queries in English, Japanese, and Spanish. The Shopping Queries Dataset is being used in one of the KDDCup'22 challenges. In this paper, we describe the dataset and present three evaluation tasks along with baseline results: (i) ranking the results list, (ii) classifying product results into relevance categories, and (iii) identifying substitute products for a given query. We anticipate that this data will become the gold standard for future research in the topic of product search.
    SmartGD: A Self-Challenging Generative Adversarial Network for Graph Drawing. (arXiv:2206.06434v1 [cs.LG])
    A multitude of studies have been conducted on graph drawing, but many existing methods only focus on optimizing particular aesthetic aspects of graph layout. Given a graph, generating a good layout that satisfies certain human aesthetic preference remains a challenging task, especially if such preference cannot be expressed as a differentiable objective function. In this paper, we propose a student-teacher GAN-based graph drawing framework, SmartGD, which learns to draw graphs just as humans learn to perform tasks. The student network in SmartGD learns graph drawing by imitating good layout examples, while the teacher network in SmartGD is responsible for providing ratings regarding the goodness of the generated layouts. When there is a lack of concrete aesthetic criteria to specify what constitutes a good layout, the student network can learn from the good layout examples. On the other hand, when the goodness of a layout can be assessed by quantitative criteria (even if not differentiable), the student network can use it as a concrete goal to optimize the target aesthetics. To accomplish the goal, we propose a novel variant of GAN, self-challenging GAN, to learn the optimal layout distribution with respect to any aesthetic criterion, whether the criterion is differentiable or not. The proposed graph drawing framework can not only draw graphs in a similar style as the good layout examples but also optimize the graph layouts according to any given aesthetic criteria when available. Once the model is trained, it can be used to visualize arbitrary graphs according to the style of the example layouts or the chosen aesthetic criteria. The comprehensive experimental studies show that SmartGD outperforms 12 benchmark methods according to the commonly agreed metrics.
    Robust Distillation for Worst-class Performance. (arXiv:2206.06479v1 [cs.LG])
    Knowledge distillation has proven to be an effective technique in improving the performance of a student model using predictions from a teacher model. However, recent work has shown that gains in average efficiency are not uniform across subgroups in the data, and in particular can often come at the cost of accuracy on rare subgroups and classes. To preserve strong performance across classes that may follow a long-tailed distribution, we develop distillation techniques that are tailored to improve the student's worst-class performance. Specifically, we introduce robust optimization objectives in different combinations for the teacher and student, and further allow for training with any tradeoff between the overall accuracy and the robust worst-class objective. We show empirically that our robust distillation techniques not only achieve better worst-class performance, but also lead to Pareto improvement in the tradeoff between overall performance and worst-class performance compared to other baseline methods. Theoretically, we provide insights into what makes a good teacher when the goal is to train a robust student.
    Reinforcement Learning from Partial Observation: Linear Function Approximation with Provable Sample Efficiency. (arXiv:2204.09787v2 [cs.LG] UPDATED)
    We study reinforcement learning for partially observed Markov decision processes (POMDPs) with infinite observation and state spaces, which remains less investigated theoretically. To this end, we make the first attempt at bridging partial observability and function approximation for a class of POMDPs with a linear structure. In detail, we propose a reinforcement learning algorithm (Optimistic Exploration via Adversarial Integral Equation or OP-TENET) that attains an $\epsilon$-optimal policy within $O(1/\epsilon^2)$ episodes. In particular, the sample complexity scales polynomially in the intrinsic dimension of the linear structure and is independent of the size of the observation and state spaces. The sample efficiency of OP-TENET is enabled by a sequence of ingredients: (i) a Bellman operator with finite memory, which represents the value function in a recursive manner, (ii) the identification and estimation of such an operator via an adversarial integral equation, which features a smoothed discriminator tailored to the linear structure, and (iii) the exploration of the observation and state spaces via optimism, which is based on quantifying the uncertainty in the adversarial integral equation.
    Transformers are Meta-Reinforcement Learners. (arXiv:2206.06614v1 [cs.LG])
    The transformer architecture and its variants have achieved remarkable success across many machine learning tasks in recent years. This success is intrinsically related to the capability of handling long sequences and the presence of context-dependent weights from the attention mechanism. We argue that these capabilities suit the central role of a Meta-Reinforcement Learning algorithm. Indeed, a meta-RL agent needs to infer the task from a sequence of trajectories. Furthermore, it requires a fast adaptation strategy to adapt its policy for a new task -- which can be achieved using the self-attention mechanism. In this work, we present TrMRL (Transformers for Meta-Reinforcement Learning), a meta-RL agent that mimics the memory reinstatement mechanism using the transformer architecture. It associates the recent past of working memories to build an episodic memory recursively through the transformer layers. We show that the self-attention computes a consensus representation that minimizes the Bayes Risk at each layer and provides meaningful features to compute the best actions. We conducted experiments in high-dimensional continuous control environments for locomotion and dexterous manipulation. Results show that TrMRL presents comparable or superior asymptotic performance, sample efficiency, and out-of-distribution generalization compared to the baselines in these environments.
    AMEIR: Automatic Behavior Modeling, Interaction Exploration and MLP Investigation in the Recommender System. (arXiv:2006.05933v2 [cs.LG] UPDATED)
    Recently, deep learning models have been widely adopted in industrial recommender systems and have boosted recommendation quality. Despite this remarkable success, the design of task-aware recommender systems usually requires manual feature engineering and architecture engineering from domain experts. To relieve those human efforts, we explore the potential of neural architecture search (NAS) and introduce AMEIR for Automatic behavior Modeling, interaction Exploration and multi-layer perceptron (MLP) Investigation in the Recommender system. The core contributions of AMEIR are the three-stage search space and the tailored three-step searching pipeline. Specifically, AMEIR divides the complete recommendation models into three stages of behavior modeling, interaction exploration, and MLP aggregation, and introduces a novel search space containing three tailored subspaces that cover most of the existing methods and thus allow for searching better models. To find the ideal architecture efficiently and effectively, AMEIR realizes the one-shot random search in recommendation progressively on the three stages and assembles the search results as the final outcome. Further analysis reveals that AMEIR's search space could cover most of the representative recommendation models, which demonstrates the universality of our design. The extensive experiments over various scenarios reveal that AMEIR outperforms competitive baselines of elaborate manual design and leading algorithmically complex NAS methods with lower model complexity and comparable time cost, indicating efficacy, efficiency and robustness of the proposed method.
    On the Finite-Time Performance of the Knowledge Gradient Algorithm. (arXiv:2206.06847v1 [stat.ML])
    The knowledge gradient (KG) algorithm is a popular and effective algorithm for the best arm identification (BAI) problem. Due to the complex calculation of KG, theoretical analysis of this algorithm is difficult, and existing results are mostly about its asymptotic performance, e.g., consistency, asymptotic sample allocation, etc. In this research, we present new theoretical results about the finite-time performance of the KG algorithm. Under independent and normally distributed rewards, we derive lower bounds and upper bounds for the probability of error and simple regret of the algorithm. With these bounds, existing asymptotic results become simple corollaries. We also show the performance of the algorithm for the multi-armed bandit (MAB) problem. These developments not only extend the existing analysis of the KG algorithm, but can also be used to analyze other improvement-based algorithms. Last, we use numerical experiments to further demonstrate the finite-time behavior of the KG algorithm.
    Symbolic Regression in Materials Science: Discovering Interatomic Potentials from Data. (arXiv:2206.06422v1 [cond-mat.mtrl-sci])
    Particle-based modeling of materials at atomic scale plays an important role in the development of new materials and understanding of their properties. The accuracy of particle simulations is determined by interatomic potentials, which allow the potential energy of an atomic system to be calculated as a function of atomic coordinates and potentially other properties. First-principles-based ab initio potentials can reach arbitrary levels of accuracy; however, their applicability is limited by their high computational cost. Machine learning (ML) has recently emerged as an effective way to offset the high computational costs of ab initio atomic potentials by replacing expensive models with highly efficient surrogates trained on electronic structure data. Among a plethora of current methods, symbolic regression (SR) is gaining traction as a powerful "white-box" approach for discovering functional forms of interatomic potentials. This contribution discusses the role of symbolic regression in Materials Science (MS) and offers a comprehensive overview of current methodological challenges and state-of-the-art results. A genetic programming-based approach for modeling atomic potentials from raw data (consisting of snapshots of atomic positions and associated potential energy) is presented and empirically validated on ab initio electronic structure data.
    Provably Efficient Model-Free Algorithm for MDPs with Peak Constraints. (arXiv:2003.05555v6 [math.OC] UPDATED)
    In the optimization of dynamic systems, the variables typically have constraints. Such problems can be modeled as a Constrained Markov Decision Process (CMDP). This paper considers the peak Constrained Markov Decision Process (PCMDP), where the agent chooses the policy to maximize total reward in the finite horizon as well as satisfy constraints at each epoch with probability 1. We propose a model-free algorithm that converts the PCMDP problem to an unconstrained problem, to which a Q-learning based approach is applied. We define the concept of probably approximately correct (PAC) for the proposed PCMDP problem. The proposed algorithm is proved to achieve an $(\epsilon,p)$-PAC policy when the number of episodes satisfies $K\geq\Omega(\frac{I^2H^6SA\ell}{\epsilon^2})$, where $S$ and $A$ are the number of states and actions, respectively, $H$ is the number of epochs per episode, $I$ is the number of constraint functions, and $\ell=\log(\frac{SAT}{p})$. We note that this is the first result on PAC kind of analysis for PCMDP with peak constraints, where the transition dynamics are not known a priori. We demonstrate the proposed algorithm on an energy harvesting problem and a single machine scheduling problem, where it performs close to the theoretical upper bound of the studied optimization problem.
    Adversarial Audio Synthesis with Complex-valued Polynomial Networks. (arXiv:2206.06811v1 [eess.AS])
    Time-frequency (TF) representations in audio synthesis have been increasingly modeled with real-valued networks. However, overlooking the complex-valued nature of TF representations can result in suboptimal performance and require additional modules (e.g., for modeling the phase). To this end, we introduce complex-valued polynomial networks, called APOLLO, that integrate such complex-valued representations in a natural way. Concretely, APOLLO captures high-order correlations of the input elements using high-order tensors as scaling parameters. By leveraging standard tensor decompositions, we derive different architectures and enable modeling richer correlations. We outline such architectures and showcase their performance in audio generation across four benchmarks. As a highlight, APOLLO results in a $17.5\%$ improvement over adversarial methods and $8.2\%$ over the state-of-the-art diffusion models on the SC09 dataset in audio generation. Our models can encourage the systematic design of other efficient architectures on the complex field.
    Continual-Learning-as-a-Service (CLaaS): On-Demand Efficient Adaptation of Predictive Models. (arXiv:2206.06957v1 [cs.LG])
    Predictive machine learning models nowadays are often updated in a stateless and expensive way. The two main future trends for companies that want to build machine learning-based applications and systems are real-time inference and continual updating. Unfortunately, both trends require a mature infrastructure that is hard and costly to realize on-premise. This paper defines a novel software service and model delivery infrastructure termed Continual Learning-as-a-Service (CLaaS) to address these issues. Specifically, it embraces continual machine learning and continuous integration techniques. It provides support for model updating and validation tools for data scientists without an on-premise solution and in an efficient, stateful and easy-to-use manner. Finally, this CL model service is easy to encapsulate in any machine learning infrastructure or cloud system. This paper presents the design and implementation of a CLaaS instantiation, called LiquidBrain, evaluated in two real-world scenarios. The former is a robotic object recognition setting using the CORe50 dataset while the latter is a named category and attribute prediction using the DeepFashion-C dataset in the fashion domain. Our preliminary results suggest the usability and efficiency of the Continual Learning model services and the effectiveness of the solution in addressing real-world use-cases regardless of where the computation happens in the Edge-Cloud continuum.
    Two-terminal source coding with common sum reconstruction. (arXiv:2206.06973v1 [cs.IT])
    We present the problem of two-terminal source coding with Common Sum Reconstruction (CSR). Consider two terminals, each with access to one of two correlated sources. Both terminals want to reconstruct the sum of the two sources under some average distortion constraint, and the reconstructions at the two terminals must be identical with high probability. In this paper, we develop inner and outer bounds to the achievable rate distortion region of the CSR problem for a doubly symmetric binary source. We employ existing achievability results for Steinberg's common reconstruction and Wyner-Ziv's source coding with side information problems, and an achievability result for the lossy version of Korner-Marton's modulo-two sum computation problem.
    SoTeacher: A Student-oriented Teacher Network Training Framework for Knowledge Distillation. (arXiv:2206.06661v1 [cs.LG])
    How to train an ideal teacher for knowledge distillation is still an open problem. It has been widely observed that a teacher minimizing the empirical risk does not necessarily yield the best performing student, suggesting a fundamental discrepancy between the common practice in teacher network training and the distillation objective. To fill this gap, we propose a novel student-oriented teacher network training framework SoTeacher, inspired by recent findings that student performance hinges on the teacher's capability to approximate the true label distribution of training samples. We theoretically established that (1) the empirical risk minimizer with proper scoring rules as loss function can provably approximate the true label distribution of training data if the hypothesis function is locally Lipschitz continuous around training samples; and (2) when data augmentation is employed for training, an additional constraint is required that the minimizer has to produce consistent predictions across augmented views of the same training input. In light of our theory, SoTeacher renovates the empirical risk minimization by incorporating Lipschitz regularization and consistency regularization. It is worth mentioning that SoTeacher is applicable to almost all teacher-student architecture pairs, requires no prior knowledge of the student during the teacher's training, and induces almost no computation overhead. Experiments on two benchmark datasets confirm that SoTeacher can improve student performance significantly and consistently across various knowledge distillation algorithms and teacher-student pairs.
    Physics Informed Neural Fields for Smoke Reconstruction with Sparse Data. (arXiv:2206.06577v1 [cs.GR])
    High-fidelity reconstruction of fluids from sparse multiview RGB videos remains a formidable challenge due to the complexity of the underlying physics as well as complex occlusion and lighting in captures. Existing solutions either assume knowledge of obstacles and lighting, or only focus on simple fluid scenes without obstacles or complex lighting, and thus are unsuitable for real-world scenes with unknown lighting or arbitrary obstacles. We present the first method to reconstruct dynamic fluid by leveraging the governing physics (i.e., Navier-Stokes equations) in an end-to-end optimization from sparse videos without taking lighting conditions, geometry information, or boundary conditions as input. We provide a continuous spatio-temporal scene representation using neural networks as the ansatz of density and velocity solution functions for fluids as well as the radiance field for static objects. With a hybrid architecture that separates static and dynamic contents, fluid interactions with static obstacles are reconstructed for the first time without additional geometry input or human labeling. By augmenting time-varying neural radiance fields with physics-informed deep learning, our method benefits from the supervision of images and physical priors. To achieve robust optimization from sparse views, we introduced a layer-by-layer growing strategy to progressively increase the network capacity. Using progressively growing models with a new regularization term, we manage to disentangle density-color ambiguity in radiance fields without overfitting. A pretrained density-to-velocity fluid model is additionally leveraged as the data prior to avoid suboptimal velocity which underestimates vorticity but trivially fulfills physical equations. Our method exhibits high-quality results with relaxed constraints and strong flexibility on a representative set of synthetic and real flow captures.
    Matching Pursuit Based Scheduling for Over-the-Air Federated Learning. (arXiv:2206.06679v1 [cs.IT])
    This paper develops a class of low-complexity device scheduling algorithms for over-the-air federated learning via the method of matching pursuit. The proposed scheme tracks closely the close-to-optimal performance achieved by difference-of-convex programming, and outperforms significantly the well-known benchmark algorithms based on convex relaxation. Compared to the state-of-the-art, the proposed scheme poses a drastically lower computational load on the system: For $K$ devices and $N$ antennas at the parameter server, the benchmark complexity scales with $\left(N^2+K\right)^3 + N^6$ while the complexity of the proposed scheme scales with $K^p N^q$ for some $0 < p,q \leq 2$. The efficiency of the proposed scheme is confirmed via numerical experiments on the CIFAR-10 dataset.
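    The scaling claim above can be illustrated with a short sketch that evaluates only the stated scaling terms, with all constants ignored. The function names and the example values of $K$, $N$, $p$, and $q$ are illustrative assumptions. The abstract only says that $0 < p,q \leq 2$, so the sketch uses the upper end of that range.

```python
def benchmark_scaling(num_devices, num_antennas):
    """Scaling term quoted for the benchmark: (N^2 + K)^3 + N^6."""
    return (num_antennas ** 2 + num_devices) ** 3 + num_antennas ** 6


def proposed_scaling(num_devices, num_antennas, p=2.0, q=2.0):
    """Scaling term quoted for the proposed scheme: K^p * N^q with 0 < p, q <= 2."""
    return num_devices ** p * num_antennas ** q


# Hypothetical system sizes, chosen only for illustration.
for K, N in [(20, 4), (100, 16), (500, 64)]:
    ratio = benchmark_scaling(K, N) / proposed_scaling(K, N)
    print(f"K={K:4d} N={N:3d} benchmark/proposed scaling ratio: {ratio:.1f}")
```

    Because constants are dropped, the printed ratios indicate growth trends rather than measured run times.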
    Generalizing experimental findings: identification beyond adjustments. (arXiv:2206.06699v1 [stat.ME])
    We aim to generalize the results of a randomized controlled trial (RCT) to a target population with the help of some observational data. This is a problem of causal effect identification with multiple data sources. Challenges arise when the RCT is conducted in a context that differs from the target population. Earlier research has focused on cases where the estimates from the RCT can be adjusted by observational data in order to remove the selection bias and other domain specific differences. We consider examples where the experimental findings cannot be generalized by an adjustment and show that the generalization may still be possible by other identification strategies that can be derived by applying do-calculus. The obtained identifying functionals for these examples contain trapdoor variables of a new type. The value of a trapdoor variable needs to be fixed in the estimation and the choice of the value may have a major effect on the bias and accuracy of estimates, which is also seen in simulations. The presented results expand the scope of settings where the generalization of experimental findings is doable.
    On Finite-Sample Identifiability of Contrastive Learning-Based Nonlinear Independent Component Analysis. (arXiv:2206.06593v1 [cs.LG])
    Nonlinear independent component analysis (nICA) aims at recovering statistically independent latent components that are mixed by unknown nonlinear functions. Central to nICA is the identifiability of the latent components, which had been elusive until very recently. Specifically, Hyvärinen et al. have shown that the nonlinearly mixed latent components are identifiable (up to often inconsequential ambiguities) under a generalized contrastive learning (GCL) formulation, given that the latent components are independent conditioned on a certain auxiliary variable. The GCL-based identifiability of nICA is elegant, and establishes interesting connections between nICA and popular unsupervised/self-supervised learning paradigms in representation learning, causal learning, and factor disentanglement. However, existing identifiability analyses of nICA all build upon an unlimited sample assumption and the use of ideal universal function learners -- which creates a non-negligible gap between theory and practice. Closing the gap is a nontrivial challenge, as there is a lack of established "textbook" routine for finite sample analysis of such unsupervised problems. This work puts forth a finite-sample identifiability analysis of GCL-based nICA. Our analytical framework judiciously combines the properties of the GCL loss function, statistical generalization analysis, and numerical differentiation. Our framework also takes the learning function's approximation error into consideration, and reveals an intuitive trade-off between the complexity and expressiveness of the employed function learner. Numerical experiments are used to validate the theorems.
    Counting Markov Equivalent Directed Acyclic Graphs Consistent with Background Knowledge. (arXiv:2206.06744v1 [cs.DS])
    A polynomial-time exact algorithm for counting the number of directed acyclic graphs in a Markov equivalence class was recently given by Wienöbst, Bannach, and Liśkiewicz (AAAI 2021). In this paper, we consider the more general problem of counting the number of directed acyclic graphs in a Markov equivalence class when the directions of some of the edges are also fixed (this setting arises, for example, when interventional data is partially available). This problem has been shown in earlier work to be complexity-theoretically hard. In contrast, we show that the problem is nevertheless tractable in an interesting class of instances, by establishing that it is "fixed-parameter tractable". In particular, our counting algorithm runs in time that is bounded by a polynomial in the size of the graph, where the degree of the polynomial does not depend upon the number of additional edges provided as input.
    Does a Technique for Building Multimodal Representation Matter? -- Comparative Analysis. (arXiv:2206.06367v1 [cs.LG])
    Creating a meaningful representation by fusing single modalities (e.g., text, images, or audio) is the core concept of multimodal learning. Although several techniques for building multimodal representations have been proven successful, they have not been compared yet. Therefore it has been ambiguous which technique can be expected to yield the best results in a given scenario and what factors should be considered while choosing such a technique. This paper explores the most common techniques for building multimodal data representations -- the late fusion, the early fusion, and the sketch, and compares them in classification tasks. Experiments are conducted on three datasets: Amazon Reviews, MovieLens25M, and MovieLens1M. In general, our results confirm that multimodal representations are able to boost the performance of unimodal models from 0.919 to 0.969 of accuracy on Amazon Reviews and 0.907 to 0.918 of AUC on MovieLens25M. However, experiments on both MovieLens datasets indicate the importance of the meaningful input data to the given task. In this article, we show that the choice of the technique for building multimodal representation is crucial to obtain the highest possible model performance, which comes with the proper combination of modalities. Such choice relies on: the influence that each modality has on the analyzed machine learning (ML) problem; the type of the ML task; and the memory constraints during the training and prediction phases.
    Don't "research fast and break things": On the ethics of Computational Social Science. (arXiv:2206.06370v1 [cs.CY])
    This article is concerned with setting up practical guardrails within the research activities and environments of CSS. It aims to provide CSS scholars, as well as policymakers and other stakeholders who apply CSS methods, with the critical and constructive means needed to ensure that their practices are ethical, trustworthy, and responsible. It begins by providing a taxonomy of the ethical challenges faced by researchers in the field of CSS. These are challenges related to (1) the treatment of research subjects, (2) the impacts of CSS research on affected individuals and communities, (3) the quality of CSS research and its epistemological status, (4) research integrity, and (5) research equity. Taking these challenges as a motivation for cultural transformation, it then argues for the end-to-end incorporation of habits of responsible research and innovation (RRI) into CSS practices, focusing on the role that contextual considerations, anticipatory reflection, impact assessment, public engagement, and justifiable and well-documented action should play across the research lifecycle. In proposing the inclusion of habits of RRI in CSS practices, the chapter lays out several practical steps needed for ethical, trustworthy, and responsible CSS research activities. These include stakeholder engagement processes, research impact assessments, data lifecycle documentation, bias self-assessments, and transparent research reporting protocols.
    Evaluating histopathology transfer learning with ChampKit. (arXiv:2206.06862v1 [q-bio.QM])
    Histopathology remains the gold standard for diagnosis of various cancers. Recent advances in computer vision, specifically deep learning, have facilitated the analysis of histopathology images for various tasks, including immune cell detection and microsatellite instability classification. The state-of-the-art for each task often employs base architectures that have been pretrained for image classification on ImageNet. The standard approach to develop classifiers in histopathology tends to focus narrowly on optimizing models for a single task, not considering the aspects of modeling innovations that improve generalization across tasks. Here we present ChampKit (Comprehensive Histopathology Assessment of Model Predictions toolKit): an extensible, fully reproducible benchmarking toolkit that consists of a broad collection of patch-level image classification tasks across different cancers. ChampKit enables a way to systematically document the performance impact of proposed improvements in models and methodology. ChampKit source code and data are freely accessible at https://github.com/kaczmarj/champkit .
    What Should I Know? Using Meta-gradient Descent for Predictive Feature Discovery in a Single Stream of Experience. (arXiv:2206.06485v1 [cs.LG])
    In computational reinforcement learning, a growing body of work seeks to construct an agent's perception of the world through predictions of future sensations; predictions about environment observations are used as additional input features to enable better goal-directed decision-making. An open challenge in this line of work is determining from the infinitely many predictions that the agent could possibly make which predictions might best support decision-making. This challenge is especially apparent in continual learning problems where a single stream of experience is available to a singular agent. As a primary contribution, we introduce a meta-gradient descent process by which an agent learns 1) what predictions to make, 2) the estimates for its chosen predictions, and 3) how to use those estimates to generate policies that maximize future reward -- all during a single ongoing process of continual learning. In this manuscript we consider predictions expressed as General Value Functions: temporally extended estimates of the accumulation of a future signal. We demonstrate that through interaction with the environment an agent can independently select predictions that resolve partial-observability, resulting in performance similar to expertly specified GVFs. By learning, rather than manually specifying these predictions, we enable the agent to identify useful predictions in a self-supervised manner, taking a step towards truly autonomous systems.
    A Stochastic Proximal Method for Nonsmooth Regularized Finite Sum Optimization. (arXiv:2206.06531v1 [stat.ML])
    We consider the problem of training a deep neural network with nonsmooth regularization to retrieve a sparse and efficient sub-structure. Our regularizer is only assumed to be lower semi-continuous and prox-bounded. We combine an adaptive quadratic regularization approach with proximal stochastic gradient principles to derive a new solver, called SR2, whose convergence and worst-case complexity are established without knowledge or approximation of the gradient's Lipschitz constant. We formulate a stopping criterion that ensures an appropriate first-order stationarity measure converges to zero under certain conditions. We establish a worst-case iteration complexity of $\mathcal{O}(\epsilon^{-2})$ that matches those of related methods like ProxGEN, where the learning rate is assumed to be related to the Lipschitz constant. Our experiments on network instances trained on CIFAR-10 and CIFAR-100 with $\ell_1$ and $\ell_0$ regularizations show that SR2 consistently achieves higher sparsity and accuracy than related methods such as ProxGEN and ProxSGD.
    The Dynamics of Riemannian Robbins-Monro Algorithms. (arXiv:2206.06795v1 [math.OC])
    Many important learning algorithms, such as stochastic gradient methods, are often deployed to solve nonlinear problems on Riemannian manifolds. Motivated by these applications, we propose a family of Riemannian algorithms generalizing and extending the seminal stochastic approximation framework of Robbins and Monro. Compared to their Euclidean counterparts, Riemannian iterative algorithms are much less understood due to the lack of a global linear structure on the manifold. We overcome this difficulty by introducing an extended Fermi coordinate frame which allows us to map the asymptotic behavior of the proposed Riemannian Robbins-Monro (RRM) class of algorithms to that of an associated deterministic dynamical system under very mild assumptions on the underlying manifold. In so doing, we provide a general template of almost sure convergence results that mirrors and extends the existing theory for Euclidean Robbins-Monro schemes, albeit with a significantly more involved analysis that requires a number of new geometric ingredients. We showcase the flexibility of the proposed RRM framework by using it to establish the convergence of a retraction-based analogue of the popular optimistic / extra-gradient methods for solving minimization problems and games, and we provide a unified treatment for their convergence.
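    For orientation, the classical Euclidean Robbins-Monro scheme that the abstract generalizes can be sketched in a few lines. This is only the flat-space counterpart, not the Riemannian method of the paper; the target function h(x) = x - 2, the Gaussian noise, and the step sizes a_n = 1/n are all illustrative assumptions.

```python
import random

def robbins_monro(noisy_h, x0, n_iters=5000):
    """Euclidean Robbins-Monro: x_{n+1} = x_n - a_n * (noisy evaluation of h at x_n)."""
    x = x0
    for n in range(1, n_iters + 1):
        a_n = 1.0 / n          # assumed step sizes: sum a_n diverges, sum a_n^2 converges
        x -= a_n * noisy_h(x)  # only noisy observations of h are available
    return x

# Hypothetical problem: find the root of h(x) = x - 2 from noisy evaluations.
rng = random.Random(0)
root = robbins_monro(lambda x: (x - 2.0) + rng.gauss(0.0, 1.0), x0=0.0)
print(root)
```

    With these step sizes the iterate reduces to 2 minus a running average of the noise, which is why it settles near the root as the number of iterations grows.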
    Safe Output Feedback Motion Planning from Images via Learned Perception Modules and Contraction Theory. (arXiv:2206.06553v1 [cs.RO])
    We present a motion planning algorithm for a class of uncertain control-affine nonlinear systems which guarantees runtime safety and goal reachability when using high-dimensional sensor measurements (e.g., RGB-D images) and a learned perception module in the feedback control loop. First, given a dataset of states and observations, we train a perception system that seeks to invert a subset of the state from an observation, and estimate an upper bound on the perception error which is valid with high probability in a trusted domain near the data. Next, we use contraction theory to design a stabilizing state feedback controller and a convergent dynamic state observer which uses the learned perception system to update its state estimate. We derive a bound on the trajectory tracking error when this controller is subjected to errors in the dynamics and incorrect state estimates. Finally, we integrate this bound into a sampling-based motion planner, guiding it to return trajectories that can be safely tracked at runtime using sensor data. We demonstrate our approach in simulation on a 4D car, a 6D planar quadrotor, and a 17D manipulation task with RGB(-D) sensor measurements, demonstrating that our method safely and reliably steers the system to the goal, while baselines that fail to consider the trusted domain or state estimation errors can be unsafe.
    Density Estimation with Autoregressive Bayesian Predictives. (arXiv:2206.06462v1 [stat.ML])
    Bayesian methods are a popular choice for statistical inference in small-data regimes due to the regularization effect induced by the prior, which serves to counteract overfitting. In the context of density estimation, the standard Bayesian approach is to target the posterior predictive. In general, direct estimation of the posterior predictive is intractable and so methods typically resort to approximating the posterior distribution as an intermediate step. The recent development of recursive predictive copula updates, however, has made it possible to perform tractable predictive density estimation without the need for posterior approximation. Although these estimators are computationally appealing, they tend to struggle on non-smooth data distributions. This is largely due to the comparatively restrictive form of the likelihood models from which the proposed copula updates were derived. To address this shortcoming, we consider a Bayesian nonparametric model with an autoregressive likelihood decomposition and Gaussian process prior, which yields a data-dependent bandwidth parameter in the copula update. Further, we formulate a novel parameterization of the bandwidth using an autoregressive neural network that maps the data into a latent space, and is thus able to capture more complex dependencies in the data. Our extensions increase the modelling capacity of existing recursive Bayesian density estimators, achieving state-of-the-art results on tabular data sets.
    Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate and Momentum. (arXiv:2006.15815v11 [cs.LG] UPDATED)
    Adaptive Moment Estimation (Adam), which combines Adaptive Learning Rate and Momentum, is perhaps the most popular stochastic optimizer for accelerating the training of deep neural networks. However, it is empirically known that Adam often generalizes worse than Stochastic Gradient Descent (SGD). The purpose of this paper is to unveil the mystery of this behavior in the diffusion theoretical framework. Specifically, we disentangle the effects of Adaptive Learning Rate and Momentum of the Adam dynamics on saddle-point escaping and flat minima selection. We prove that Adaptive Learning Rate can escape saddle points efficiently, but cannot select flat minima as SGD does. In contrast, Momentum provides a drift effect to help the training process pass through saddle points, and hardly affects flat minima selection. This partly explains why SGD (with Momentum) generalizes better, while Adam generalizes worse but converges faster. Furthermore, motivated by the analysis, we design a novel adaptive optimization framework named Adaptive Inertia, which uses parameter-wise adaptive inertia to accelerate the training and provably favors flat minima as well as SGD. Our extensive experiments demonstrate that the proposed adaptive inertia method can generalize significantly better than SGD and conventional adaptive gradient methods.
    Projection-free Distributed Online Learning with Sublinear Communication Complexity. (arXiv:2103.11102v2 [cs.LG] UPDATED)
    To deal with complicated constraints via locally light computations in distributed online learning, a recent study has presented a projection-free algorithm called distributed online conditional gradient (D-OCG), and achieved an $O(T^{3/4})$ regret bound for convex losses, where $T$ is the number of total rounds. However, it requires $T$ communication rounds, and cannot utilize the strong convexity of losses. In this paper, we propose an improved variant of D-OCG, namely D-BOCG, which can attain the same $O(T^{3/4})$ regret bound with only $O(\sqrt{T})$ communication rounds for convex losses, and a better regret bound of $O(T^{2/3}(\log T)^{1/3})$ with fewer $O(T^{1/3}(\log T)^{2/3})$ communication rounds for strongly convex losses. The key idea is to adopt a delayed update mechanism that reduces the communication complexity, and redefine the surrogate loss function in D-OCG for exploiting the strong convexity. Furthermore, we provide lower bounds to demonstrate that the $O(\sqrt{T})$ communication rounds required by D-BOCG are optimal (in terms of $T$) for achieving the $O(T^{3/4})$ regret with convex losses, and the $O(T^{1/3}(\log T)^{2/3})$ communication rounds required by D-BOCG are near-optimal (in terms of $T$) for achieving the $O(T^{2/3}(\log T)^{1/3})$ regret with strongly convex losses up to polylogarithmic factors. Finally, to handle the more challenging bandit setting, in which only the loss value is available, we incorporate the classical one-point gradient estimator into D-BOCG, and obtain similar theoretical guarantees.
    Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training. (arXiv:2206.06501v1 [cs.LG])
    Data clipping is crucial in reducing noise in quantization operations and improving the achievable accuracy of quantization-aware training (QAT). Current practices rely on heuristics to set clipping threshold scalars and cannot be shown to be optimal. We propose Optimally Clipped Tensors And Vectors (OCTAV), a recursive algorithm to determine MSE-optimal clipping scalars. Derived from the fast Newton-Raphson method, OCTAV finds optimal clipping scalars on the fly, for every tensor, at every iteration of the QAT routine. Thus, the QAT algorithm is formulated with provably minimum quantization noise at each step. In addition, we reveal limitations in common gradient estimation techniques in QAT and propose magnitude-aware differentiation as a remedy to further improve accuracy. Experimentally, OCTAV-enabled QAT achieves state-of-the-art accuracy on multiple tasks. These include training-from-scratch and retraining ResNets and MobileNets on ImageNet, and Squad fine-tuning using BERT models, where OCTAV-enabled QAT consistently preserves accuracy at low precision (4-to-6-bits). Our results require no modifications to the baseline training recipe, except for the insertion of quantization operations where appropriate.
    On Provably Robust Meta-Bayesian Optimization. (arXiv:2206.06872v1 [cs.LG])
    Bayesian optimization (BO) has become popular for sequential optimization of black-box functions. When BO is used to optimize a target function, we often have access to previous evaluations of potentially related functions. This raises the question of whether we can leverage these previous experiences to accelerate the current BO task through meta-learning (meta-BO), while ensuring robustness against potentially harmful dissimilar tasks that could sabotage the convergence of BO. This paper introduces two scalable and provably robust meta-BO algorithms: robust meta-Gaussian process-upper confidence bound (RM-GP-UCB) and RM-GP-Thompson sampling (RM-GP-TS). We prove that both algorithms are asymptotically no-regret even when some or all previous tasks are dissimilar to the current task, and show that RM-GP-UCB enjoys better theoretical robustness than RM-GP-TS. We also exploit the theoretical guarantees to optimize the weights assigned to individual previous tasks through regret minimization via online learning, which diminishes the impact of dissimilar tasks and hence further enhances the robustness. Empirical evaluations show that (a) RM-GP-UCB performs effectively and consistently across various applications, and (b) RM-GP-TS, despite being less robust than RM-GP-UCB both in theory and in practice, performs competitively in some scenarios with fewer dissimilar tasks and is more computationally efficient.
    A universal synthetic dataset for machine learning on spectroscopic data. (arXiv:2206.06031v2 [cs.LG] UPDATED)
    To assist in the development of machine learning methods for automated classification of spectroscopic data, we have generated a universal synthetic dataset that can be used for model validation. This dataset contains artificial spectra designed to represent experimental measurements from techniques including X-ray diffraction, nuclear magnetic resonance, and Raman spectroscopy. The dataset generation process features customizable parameters, such as scan length and peak count, which can be adjusted to fit the problem at hand. As an initial benchmark, we simulated a dataset containing 35,000 spectra based on 500 unique classes. To automate the classification of this data, eight different machine learning architectures were evaluated. From the results, we shed light on which factors are most critical to achieve optimal performance for the classification task. The scripts used to generate synthetic spectra, as well as our benchmark dataset and evaluation routines, are made publicly available to aid in the development of improved machine learning models for spectroscopic analysis.
  • Open

    Grad-GradaGrad? A Non-Monotone Adaptive Stochastic Gradient Method. (arXiv:2206.06900v1 [cs.LG])
    The classical AdaGrad method adapts the learning rate by dividing by the square root of a sum of squared gradients. Because this sum on the denominator is increasing, the method can only decrease step sizes over time, and requires a learning rate scaling hyper-parameter to be carefully tuned. To overcome this restriction, we introduce GradaGrad, a method in the same family that naturally grows or shrinks the learning rate based on a different accumulation in the denominator, one that can both increase and decrease. We show that it obeys a similar convergence rate as AdaGrad and demonstrate its non-monotone adaptation capability with experiments.
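    The monotone behaviour of classical AdaGrad described above can be seen in a minimal scalar sketch. It implements only the standard AdaGrad rule (step = lr / sqrt of the running sum of squared gradients), not GradaGrad; the objective f(x) = x^2, the start point, and the hyper-parameter values are illustrative assumptions.

```python
import math

def adagrad_minimize(grad, x0, lr=1.0, steps=50, eps=1e-8):
    """Scalar AdaGrad; returns the final iterate and the step size used at each iteration."""
    x = x0
    accum = 0.0                 # running sum of squared gradients (never decreases)
    step_sizes = []
    for _ in range(steps):
        g = grad(x)
        accum += g * g
        step = lr / (math.sqrt(accum) + eps)
        step_sizes.append(step)
        x -= step * g
    return x, step_sizes

# Hypothetical objective f(x) = x^2 with gradient 2x.
x_final, step_sizes = adagrad_minimize(lambda x: 2.0 * x, x0=3.0)
print(x_final, step_sizes[:3])
```

    Because the accumulator only grows, the recorded step sizes form a non-increasing sequence, which is the restriction the abstract says GradaGrad is designed to lift.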
    Variational Diffusion Models. (arXiv:2107.00630v4 [cs.LG] UPDATED)
    Diffusion-based generative models have demonstrated a capacity for perceptually impressive synthesis, but can they also be great likelihood-based models? We answer this in the affirmative, and introduce a family of diffusion-based generative models that obtain state-of-the-art likelihoods on standard image density estimation benchmarks. Unlike other diffusion-based models, our method allows for efficient optimization of the noise schedule jointly with the rest of the model. We show that the variational lower bound (VLB) simplifies to a remarkably short expression in terms of the signal-to-noise ratio of the diffused data, thereby improving our theoretical understanding of this model class. Using this insight, we prove an equivalence between several models proposed in the literature. In addition, we show that the continuous-time VLB is invariant to the noise schedule, except for the signal-to-noise ratio at its endpoints. This enables us to learn a noise schedule that minimizes the variance of the resulting VLB estimator, leading to faster optimization. Combining these advances with architectural improvements, we obtain state-of-the-art likelihoods on image density estimation benchmarks, outperforming autoregressive models that have dominated these benchmarks for many years, with often significantly faster optimization. In addition, we show how to use the model as part of a bits-back compression scheme, and demonstrate lossless compression rates close to the theoretical optimum. Code is available at https://github.com/google-research/vdm .
    Highly Efficient Structural Learning of Sparse Staged Trees. (arXiv:2206.06970v1 [stat.ML])
    Several structural learning algorithms for staged tree models, an asymmetric extension of Bayesian networks, have been defined. However, they do not scale efficiently as the number of variables considered increases. Here we introduce the first scalable structural learning algorithm for staged trees, which searches over a space of models where only a small number of dependencies can be imposed. A simulation study as well as a real-world application illustrate our routines and the practical use of such data-learned staged trees.
    Provably Efficient Model-Free Algorithm for MDPs with Peak Constraints. (arXiv:2003.05555v6 [math.OC] UPDATED)
    In the optimization of dynamic systems, the variables typically have constraints. Such problems can be modeled as a Constrained Markov Decision Process (CMDP). This paper considers the peak Constrained Markov Decision Process (PCMDP), where the agent chooses the policy to maximize total reward in the finite horizon as well as satisfy constraints at each epoch with probability 1. We propose a model-free algorithm that converts the PCMDP problem into an unconstrained problem, to which a Q-learning based approach is applied. We define the concept of probably approximately correct (PAC) for the proposed PCMDP problem. The proposed algorithm is proved to achieve an $(\epsilon,p)$-PAC policy when the number of episodes satisfies $K\geq\Omega(\frac{I^2H^6SA\ell}{\epsilon^2})$, where $S$ and $A$ are the number of states and actions, respectively, $H$ is the number of epochs per episode, $I$ is the number of constraint functions, and $\ell=\log(\frac{SAT}{p})$. We note that this is the first PAC-type analysis for PCMDPs in which the transition dynamics are not known a priori. We demonstrate the proposed algorithm on an energy harvesting problem and a single machine scheduling problem, where it performs close to the theoretical upper bound of the studied optimization problem.
    Deep Variational Implicit Processes. (arXiv:2206.06720v1 [stat.ML])
    Implicit processes (IPs) are a generalization of Gaussian processes (GPs). IPs may lack a closed-form expression but are easy to sample from. Examples include, among others, Bayesian neural networks or neural samplers. IPs can be used as priors over functions, resulting in flexible models with well-calibrated prediction uncertainty estimates. Methods based on IPs usually carry out function-space approximate inference, which overcomes some of the difficulties of parameter-space approximate inference. Nevertheless, the approximations employed often limit the expressiveness of the final model, resulting, e.g., in a Gaussian predictive distribution, which can be restrictive. We propose here a multi-layer generalization of IPs called the Deep Variational Implicit process (DVIP). This generalization is similar to that of deep GPs over GPs, but it is more flexible due to the use of IPs as the prior distribution over the latent functions. We describe a scalable variational inference algorithm for training DVIP and show that it outperforms previous IP-based methods and also deep GPs. We support these claims via extensive regression and classification experiments. We also evaluate DVIP on large datasets with up to several million data instances to illustrate its good scalability and performance.
    On the Finite-Time Performance of the Knowledge Gradient Algorithm. (arXiv:2206.06847v1 [stat.ML])
    The knowledge gradient (KG) algorithm is a popular and effective algorithm for the best arm identification (BAI) problem. Due to the complex calculation of KG, theoretical analysis of this algorithm is difficult, and existing results mostly concern its asymptotic performance, e.g., consistency and asymptotic sample allocation. In this research, we present new theoretical results about the finite-time performance of the KG algorithm. Under independent and normally distributed rewards, we derive lower bounds and upper bounds for the probability of error and simple regret of the algorithm. With these bounds, existing asymptotic results become simple corollaries. We also show the performance of the algorithm for the multi-armed bandit (MAB) problem. These developments not only extend the existing analysis of the KG algorithm, but can also be used to analyze other improvement-based algorithms. Last, we use numerical experiments to further demonstrate the finite-time behavior of the KG algorithm.
    Two-Timescale Stochastic Approximation for Bilevel Optimisation Problems in Continuous-Time Models. (arXiv:2206.06995v1 [math.OC])
    We analyse the asymptotic properties of a continuous-time, two-timescale stochastic approximation algorithm designed for stochastic bilevel optimisation problems in continuous-time models. We obtain the weak convergence rate of this algorithm in the form of a central limit theorem. We also demonstrate how this algorithm can be applied to several continuous-time bilevel optimisation problems.
    Overparametrized linear dimensionality reductions: From projection pursuit to two-layer neural networks. (arXiv:2206.06526v1 [stat.ML])
    Given a cloud of $n$ data points in $\mathbb{R}^d$, consider all projections onto $m$-dimensional subspaces of $\mathbb{R}^d$ and, for each such projection, the empirical distribution of the projected points. What does this collection of probability distributions look like when $n,d$ grow large? We consider this question under the null model in which the points are i.i.d. standard Gaussian vectors, focusing on the asymptotic regime in which $n,d\to\infty$, with $n/d\to\alpha\in (0,\infty)$, while $m$ is fixed. Denoting by $\mathscr{F}_{m, \alpha}$ the set of probability distributions in $\mathbb{R}^m$ that arise as low-dimensional projections in this limit, we establish new inner and outer bounds on $\mathscr{F}_{m, \alpha}$. In particular, we characterize the Wasserstein radius of $\mathscr{F}_{m,\alpha}$ up to logarithmic factors, and determine it exactly for $m=1$. We also prove sharp bounds in terms of Kullback-Leibler divergence and Rényi information dimension. The previous question has application to unsupervised learning methods, such as projection pursuit and independent component analysis. We introduce a version of the same problem that is relevant for supervised learning, and prove a sharp Wasserstein radius bound. As an application, we establish an upper bound on the interpolation threshold of two-layer neural networks with $m$ hidden neurons.
    AMEIR: Automatic Behavior Modeling, Interaction Exploration and MLP Investigation in the Recommender System. (arXiv:2006.05933v2 [cs.LG] UPDATED)
    Recently, deep learning models have been widely adopted in industrial recommender systems and have boosted recommendation quality. Despite this remarkable success, the design of task-aware recommender systems usually requires manual feature engineering and architecture engineering from domain experts. To relieve those human efforts, we explore the potential of neural architecture search (NAS) and introduce AMEIR for Automatic behavior Modeling, interaction Exploration and multi-layer perceptron (MLP) Investigation in the Recommender system. The core contributions of AMEIR are the three-stage search space and the tailored three-step searching pipeline. Specifically, AMEIR divides the complete recommendation models into three stages of behavior modeling, interaction exploration, and MLP aggregation, and introduces a novel search space containing three tailored subspaces that cover most of the existing methods and thus allow for searching better models. To find the ideal architecture efficiently and effectively, AMEIR realizes the one-shot random search in recommendation progressively on the three stages and assembles the search results as the final outcome. Further analysis reveals that AMEIR's search space could cover most of the representative recommendation models, which demonstrates the universality of our design. Extensive experiments over various scenarios reveal that AMEIR outperforms competitive baselines of elaborate manual design and leading algorithmically complex NAS methods with lower model complexity and comparable time cost, indicating the efficacy, efficiency and robustness of the proposed method.
    On the Convergence of the Shapley Value in Parametric Bayesian Learning Games. (arXiv:2205.07428v2 [cs.LG] UPDATED)
    Measuring contributions is a classical problem in cooperative game theory where the Shapley value is the most well-known solution concept. In this paper, we establish the convergence property of the Shapley value in parametric Bayesian learning games where players perform a Bayesian inference using their combined data, and the posterior-prior KL divergence is used as the characteristic function. We show that for any two players, under some regularity conditions, their difference in Shapley value converges in probability to the difference in Shapley value of a limiting game whose characteristic function is proportional to the log-determinant of the joint Fisher information. As an application, we present an online collaborative learning framework that is asymptotically Shapley-fair. Our result enables this to be achieved without any costly computations of posterior-prior KL divergences. Only a consistent estimator of the Fisher information is needed. The effectiveness of our framework is demonstrated with experiments using real-world data.
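    As a reminder of what the Shapley value computes, here is a brute-force sketch that averages each player's marginal contribution over all join orders. The characteristic function below (squared sum of hypothetical player weights) is a toy assumption chosen for easy checking; it is not the posterior-prior KL divergence used in the paper, and the enumeration is only feasible for a handful of players.

```python
from itertools import permutations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values by enumerating every ordering of the players."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}

# Hypothetical game: v(S) = (sum of member weights)^2, with v(empty set) = 0.
weights = {"a": 1.0, "b": 2.0, "c": 3.0}
phi = shapley_values(list(weights), lambda s: sum(weights[p] for p in s) ** 2)
print(phi)
```

    The values sum to the worth of the grand coalition (efficiency), which is a quick sanity check for any implementation.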
    Fiberwise dimensionality reduction of topologically complex data with vector bundles. (arXiv:2206.06513v1 [cs.CG])
    Datasets with non-trivial large scale topology can be hard to embed in low-dimensional Euclidean space with existing dimensionality reduction algorithms. We propose to model topologically complex datasets using vector bundles, in such a way that the base space accounts for the large scale topology, while the fibers account for the local geometry. This allows one to reduce the dimensionality of the fibers, while preserving the large scale topology. We formalize this point of view, and, as an application, we describe an algorithm which takes as input a dataset together with an initial representation of it in Euclidean space, assumed to recover part of its large scale topology, and outputs a new representation that integrates local representations, obtained through local linear dimensionality reduction, along the initial global representation. We demonstrate this algorithm on examples coming from dynamical systems and chemistry. In these examples, our algorithm is able to learn topologically faithful embeddings of the data in lower target dimension than various well known metric-based dimensionality reduction algorithms.
    Benign Overfitting in Two-layer Convolutional Neural Networks. (arXiv:2202.06526v3 [cs.LG] UPDATED)
    Modern neural networks often have great expressive power and can be trained to overfit the training data, while still achieving a good test performance. This phenomenon is referred to as "benign overfitting". Recently, a line of work has emerged that studies "benign overfitting" from the theoretical perspective. However, these works are limited to linear models or kernel/random feature models, and there is still a lack of theoretical understanding about when and how benign overfitting occurs in neural networks. In this paper, we study the benign overfitting phenomenon in training a two-layer convolutional neural network (CNN). We show that when the signal-to-noise ratio satisfies a certain condition, a two-layer CNN trained by gradient descent can achieve arbitrarily small training and test loss. On the other hand, when this condition does not hold, overfitting becomes harmful and the obtained CNN can only achieve a constant-level test loss. Together, these results demonstrate a sharp phase transition between benign overfitting and harmful overfitting, driven by the signal-to-noise ratio. To the best of our knowledge, this is the first work that precisely characterizes the conditions under which benign overfitting can occur in training convolutional neural networks.
    DoWhy-GCM: An extension of DoWhy for causal inference in graphical causal models. (arXiv:2206.06821v1 [stat.ME])
    We introduce DoWhy-GCM, an extension of the DoWhy Python library, that leverages graphical causal models. Unlike existing causality libraries, which mainly focus on effect estimation questions, with DoWhy-GCM, users can ask a wide range of additional causal questions, such as identifying the root causes of outliers and distributional changes, causal structure learning, attributing causal influences, and diagnosis of causal structures. To this end, DoWhy-GCM users first model cause-effect relations between variables in a system under study through a graphical causal model, fit the causal mechanisms of variables next, and then ask the causal question. All these steps take only a few lines of code in DoWhy-GCM. The library is available at https://github.com/py-why/dowhy.
    Learning Markov Games with Adversarial Opponents: Efficient Algorithms and Fundamental Limits. (arXiv:2203.06803v4 [cs.LG] UPDATED)
    An ideal strategy in zero-sum games should not only grant the player an average reward no less than the value of Nash equilibrium, but also exploit the (adaptive) opponents when they are suboptimal. While most existing works in Markov games focus exclusively on the former objective, it remains open whether we can achieve both objectives simultaneously. To address this problem, this work studies no-regret learning in Markov games with adversarial opponents when competing against the best fixed policy in hindsight. Along this direction, we present a new complete set of positive and negative results: When the policies of the opponents are revealed at the end of each episode, we propose new efficient algorithms achieving $\sqrt{K}$-regret bounds when either (1) the baseline policy class is small or (2) the opponent's policy class is small. This is complemented with an exponential lower bound when neither condition holds. When the policies of the opponents are not revealed, we prove a statistical hardness result even in the most favorable scenario when both of the above conditions hold. Our hardness result is much stronger than the existing hardness results, which either only involve computational hardness or require further restrictions on the algorithms.
    Precise expressions for random projections: Low-rank approximation and randomized Newton. (arXiv:2006.10653v3 [cs.LG] UPDATED)
    It is often desirable to reduce the dimensionality of a large dataset by projecting it onto a low-dimensional subspace. Matrix sketching has emerged as a powerful technique for performing such dimensionality reduction very efficiently. Even though there is an extensive literature on the worst-case performance of sketching, existing guarantees are typically very different from what is observed in practice. We exploit recent developments in the spectral analysis of random matrices to develop novel techniques that provide provably accurate expressions for the expected value of random projection matrices obtained via sketching. These expressions can be used to characterize the performance of dimensionality reduction in a variety of common machine learning tasks, ranging from low-rank approximation to iterative stochastic optimization. Our results apply to several popular sketching methods, including Gaussian and Rademacher sketches, and they enable precise analysis of these methods in terms of spectral properties of the data. Empirical results show that the expressions we derive reflect the practical performance of these sketching methods, down to lower-order effects and even constant factors.
    On Convergence of Federated Averaging Langevin Dynamics. (arXiv:2112.05120v2 [stat.ML] UPDATED)
    We propose a federated averaging Langevin algorithm (FA-LD) for uncertainty quantification and mean predictions with distributed clients. In particular, we generalize beyond normal posterior distributions and consider a general class of models. We develop theoretical guarantees for FA-LD for strongly log-concave distributions with non-i.i.d. data and study how the injected noise and the stochastic-gradient noise, the heterogeneity of data, and the varying learning rates affect the convergence. Such an analysis sheds light on the optimal choice of local updates to minimize communication costs. Important to our approach is that the communication efficiency does not deteriorate with the injected noise in the Langevin algorithms. In addition, we examine in our FA-LD algorithm both independent and correlated noise used over different clients. We observe pairwise trade-offs among communication, accuracy, and data privacy. As local devices may become inactive in federated networks, we also show convergence results based on different averaging schemes where only partial device updates are available. In such a case, we discover an additional bias that does not decay to zero.
    Energy Flows: Towards Determinant-Free Training of Normalizing Flows. (arXiv:2206.06672v1 [cs.LG])
    Normalizing flows are a popular approach for constructing probabilistic and generative models. However, maximum likelihood training of flows is challenging due to the need to calculate computationally expensive determinants of Jacobians. This paper takes steps towards addressing this challenge by introducing an approach for determinant-free training of flows inspired by two-sample testing. Central to our framework is the energy objective, a multidimensional extension of proper scoring rules that admits efficient estimators based on random projections and that outperforms a range of alternative two-sample objectives that can be derived in our framework. Crucially, the energy objective and its alternatives do not require calculating determinants and therefore support general flow architectures that are not well-suited to maximum likelihood training (e.g., densely connected networks). We empirically demonstrate that energy flows achieve competitive generative modeling performance while maintaining fast generation and posterior inference.
    Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward. (arXiv:2206.06426v1 [cs.LG])
    The remarkable success of reinforcement learning (RL) heavily relies on observing the reward of every visited state-action pair. In many real world applications, however, an agent can observe only a score that represents the quality of the whole trajectory, which is referred to as the trajectory-wise reward. In such a situation, it is difficult for standard RL methods to make good use of the trajectory-wise reward, and large bias and variance errors can be incurred in policy evaluation. In this work, we propose a novel offline RL algorithm, called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED), which decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward. To ensure the value functions constructed by PARTED are always pessimistic with respect to the optimal ones, we design a new penalty term to offset the uncertainty of the proxy reward. For general episodic MDPs with large state space, we show that PARTED with overparameterized neural network function approximation achieves an $\tilde{\mathcal{O}}(D_{\text{eff}}H^2/\sqrt{N})$ suboptimality, where $H$ is the length of episode, $N$ is the total number of samples, and $D_{\text{eff}}$ is the effective dimension of the neural tangent kernel matrix. To further illustrate the result, we show that PARTED achieves an $\tilde{\mathcal{O}}(dH^3/\sqrt{N})$ suboptimality with linear MDPs, where $d$ is the feature dimension, which matches the bound for neural network function approximation when $D_{\text{eff}}=dH$. To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise reward.
    Reinforcement Learning from Partial Observation: Linear Function Approximation with Provable Sample Efficiency. (arXiv:2204.09787v2 [cs.LG] UPDATED)
    We study reinforcement learning for partially observed Markov decision processes (POMDPs) with infinite observation and state spaces, which remains less investigated theoretically. To this end, we make the first attempt at bridging partial observability and function approximation for a class of POMDPs with a linear structure. In detail, we propose a reinforcement learning algorithm (Optimistic Exploration via Adversarial Integral Equation or OP-TENET) that attains an $\epsilon$-optimal policy within $O(1/\epsilon^2)$ episodes. In particular, the sample complexity scales polynomially in the intrinsic dimension of the linear structure and is independent of the size of the observation and state spaces. The sample efficiency of OP-TENET is enabled by a sequence of ingredients: (i) a Bellman operator with finite memory, which represents the value function in a recursive manner, (ii) the identification and estimation of such an operator via an adversarial integral equation, which features a smoothed discriminator tailored to the linear structure, and (iii) the exploration of the observation and state spaces via optimism, which is based on quantifying the uncertainty in the adversarial integral equation.
    Density Estimation with Autoregressive Bayesian Predictives. (arXiv:2206.06462v1 [stat.ML])
    Bayesian methods are a popular choice for statistical inference in small-data regimes due to the regularization effect induced by the prior, which serves to counteract overfitting. In the context of density estimation, the standard Bayesian approach is to target the posterior predictive. In general, direct estimation of the posterior predictive is intractable and so methods typically resort to approximating the posterior distribution as an intermediate step. The recent development of recursive predictive copula updates, however, has made it possible to perform tractable predictive density estimation without the need for posterior approximation. Although these estimators are computationally appealing, they tend to struggle on non-smooth data distributions. This is largely due to the comparatively restrictive form of the likelihood models from which the proposed copula updates were derived. To address this shortcoming, we consider a Bayesian nonparametric model with an autoregressive likelihood decomposition and Gaussian process prior, which yields a data-dependent bandwidth parameter in the copula update. Further, we formulate a novel parameterization of the bandwidth using an autoregressive neural network that maps the data into a latent space, and is thus able to capture more complex dependencies in the data. Our extensions increase the modelling capacity of existing recursive Bayesian density estimators, achieving state-of-the-art results on tabular data sets.
    On Finite-Sample Identifiability of Contrastive Learning-Based Nonlinear Independent Component Analysis. (arXiv:2206.06593v1 [cs.LG])
    Nonlinear independent component analysis (nICA) aims at recovering statistically independent latent components that are mixed by unknown nonlinear functions. Central to nICA is the identifiability of the latent components, which had been elusive until very recently. Specifically, Hyvärinen et al. have shown that the nonlinearly mixed latent components are identifiable (up to often inconsequential ambiguities) under a generalized contrastive learning (GCL) formulation, given that the latent components are independent conditioned on a certain auxiliary variable. The GCL-based identifiability of nICA is elegant, and establishes interesting connections between nICA and popular unsupervised/self-supervised learning paradigms in representation learning, causal learning, and factor disentanglement. However, existing identifiability analyses of nICA all build upon an unlimited sample assumption and the use of ideal universal function learners -- which creates a non-negligible gap between theory and practice. Closing the gap is a nontrivial challenge, as there is no established "textbook" routine for finite sample analysis of such unsupervised problems. This work puts forth a finite-sample identifiability analysis of GCL-based nICA. Our analytical framework judiciously combines the properties of the GCL loss function, statistical generalization analysis, and numerical differentiation. Our framework also takes the learning function's approximation error into consideration, and reveals an intuitive trade-off between the complexity and expressiveness of the employed function learner. Numerical experiments are used to validate the theorems.
    On the proliferation of support vectors in high dimensions. (arXiv:2009.10670v2 [math.ST] UPDATED)
    The support vector machine (SVM) is a well-established classification method whose name refers to the particular training examples, called support vectors, that determine the maximum margin separating hyperplane. The SVM classifier is known to enjoy good generalization properties when the number of support vectors is small compared to the number of training examples. However, recent research has shown that in sufficiently high-dimensional linear classification problems, the SVM can generalize well despite a proliferation of support vectors where all training examples are support vectors. In this paper, we identify new deterministic equivalences for this phenomenon of support vector proliferation, and use them to (1) substantially broaden the conditions under which the phenomenon occurs in high-dimensional settings, and (2) prove a nearly matching converse result.
    Exponential Error Convergence in Data Classification with Optimized Random Features: Acceleration by Quantum Machine Learning. (arXiv:2106.09028v2 [quant-ph] UPDATED)
    Classification is a common task in machine learning. Random features (RFs) stand as a central technique for scalable learning algorithms based on kernel methods, and more recently proposed optimized random features, sampled depending on the model and the data distribution, can significantly reduce and provably minimize the required number of features. However, existing research on classification using optimized RFs has suffered from computational hardness in sampling each optimized RF; moreover, it has failed to achieve the exponentially fast error-convergence speed that other state-of-the-art kernel methods can achieve under a low-noise condition. To overcome these slowdowns, we here construct a classification algorithm with optimized RFs accelerated by means of quantum machine learning (QML) and study its runtime to clarify its overall advantage. We prove that our algorithm can achieve the exponential error convergence under the low-noise condition even with optimized RFs; at the same time, our algorithm can exploit the advantage of the significant reduction of the number of features without the computational hardness owing to QML. These results reveal a promising application of QML to accelerating the leading kernel-based classification algorithm without ruining its wide applicability and the exponential error-convergence speed.
    When adversarial attacks become interpretable counterfactual explanations. (arXiv:2206.06854v1 [cs.AI])
    We argue that, when learning a 1-Lipschitz neural network with the dual loss of an optimal transportation problem, the gradient of the model is both the direction of the transportation plan and the direction to the closest adversarial attack. Traveling along the gradient to the decision boundary is no longer an adversarial attack but becomes a counterfactual explanation, explicitly transporting from one class to the other. Through extensive experiments on XAI metrics, we find that the simple saliency map method, applied on such networks, becomes a reliable explanation, and outperforms the state-of-the-art explanation approaches on unconstrained models. The proposed networks were already known to be certifiably robust, and we prove that they are also explainable with a fast and simple method.
    A Stochastic Proximal Method for Nonsmooth Regularized Finite Sum Optimization. (arXiv:2206.06531v1 [stat.ML])
    We consider the problem of training a deep neural network with nonsmooth regularization to retrieve a sparse and efficient sub-structure. Our regularizer is only assumed to be lower semi-continuous and prox-bounded. We combine an adaptive quadratic regularization approach with proximal stochastic gradient principles to derive a new solver, called SR2, whose convergence and worst-case complexity are established without knowledge or approximation of the gradient's Lipschitz constant. We formulate a stopping criterion that ensures an appropriate first-order stationarity measure converges to zero under certain conditions. We establish a worst-case iteration complexity of $\mathcal{O}(\epsilon^{-2})$ that matches those of related methods like ProxGEN, where the learning rate is assumed to be related to the Lipschitz constant. Our experiments on network instances trained on CIFAR-10 and CIFAR-100 with $\ell_1$ and $\ell_0$ regularizations show that SR2 consistently achieves higher sparsity and accuracy than related methods such as ProxGEN and ProxSGD.  ( 2 min )
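To make the proximal-gradient idea behind this abstract concrete, here is a minimal sketch of a single proximal step for an $\ell_1$ regularizer. It is not SR2 itself: SR2 adapts a quadratic regularization parameter, whereas this sketch assumes a fixed step size. The names `soft_threshold`, `prox_gradient_step`, `step`, and `lam` are hypothetical and chosen for illustration only.

```python
def soft_threshold(x, t):
    """Proximal operator of t * |x|: shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def prox_gradient_step(w, grad, step, lam):
    """One proximal gradient step for loss(w) + lam * ||w||_1.

    Assumption: a fixed step size `step` (SR2 adapts this quantity instead).
    First take a plain gradient step, then apply the l1 proximal operator,
    which sets small coordinates exactly to zero and so produces sparsity.
    """
    return [soft_threshold(wi - step * gi, step * lam) for wi, gi in zip(w, grad)]
```

Coordinates whose magnitude after the gradient step falls below `step * lam` are set exactly to zero, which is how an $\ell_1$ penalty yields a sparse sub-structure.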
    On the Role of Channel Capacity in Learning Gaussian Mixture Models. (arXiv:2202.07707v2 [cs.IT] UPDATED)
    This paper studies the sample complexity of learning the $k$ unknown centers of a balanced Gaussian mixture model (GMM) in $\mathbb{R}^d$ with spherical covariance matrix $\sigma^2\mathbf{I}$. In particular, we are interested in the following question: what is the maximal noise level $\sigma^2$, for which the sample complexity is essentially the same as when estimating the centers from labeled measurements? To that end, we restrict attention to a Bayesian formulation of the problem, where the centers are uniformly distributed on the sphere $\sqrt{d}\mathcal{S}^{d-1}$. Our main results characterize the exact noise threshold $\sigma^2$ below which the GMM learning problem, in the large system limit $d,k\to\infty$, is as easy as learning from labeled observations, and above which it is substantially harder. The threshold occurs at $\frac{\log k}{d} = \frac12\log\left( 1+\frac{1}{\sigma^2} \right)$, which is the capacity of the additive white Gaussian noise (AWGN) channel. Thinking of the set of $k$ centers as a code, this noise threshold can be interpreted as the largest noise level for which the error probability of the code over the AWGN channel is small. Previous works on the GMM learning problem have identified the minimum distance between the centers as a key parameter in determining the statistical difficulty of learning the corresponding GMM. While our results are only proved for GMMs whose centers are uniformly distributed over the sphere, they hint that perhaps it is the decoding error probability associated with the center constellation as a channel code that determines the statistical difficulty of learning the corresponding GMM, rather than just the minimum distance.  ( 2 min )
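The threshold condition quoted above, $\frac{\log k}{d} = \frac12\log\left(1+\frac{1}{\sigma^2}\right)$, can be solved for the noise level, giving $\sigma^2 = 1/(k^{2/d}-1)$. The short sketch below evaluates both sides; the function names are hypothetical and only the formula itself comes from the abstract.

```python
import math


def awgn_capacity(sigma2):
    """Capacity (in nats) of the AWGN channel with unit signal power and noise variance sigma2."""
    return 0.5 * math.log(1.0 + 1.0 / sigma2)


def noise_threshold(k, d):
    """Noise variance at which log(k)/d equals the AWGN capacity.

    Solving log(k)/d = 0.5 * log(1 + 1/sigma2) gives
    sigma2 = 1 / (k**(2/d) - 1). Requires k >= 2 and d >= 1.
    """
    rate = math.log(k) / d
    # expm1(2 * rate) computes exp(2 * rate) - 1 = k**(2/d) - 1 accurately for small rates.
    return 1.0 / math.expm1(2.0 * rate)
```

For instance, with $k = 4$ centers in dimension $d = 2$ the formula gives $\sigma^2 = 1/(4 - 1) = 1/3$, and plugging that value back into `awgn_capacity` recovers $\log(4)/2$.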
    Fast Computation of Highly G-optimal Exact Designs via Particle Swarm Optimization. (arXiv:2206.06498v1 [stat.CO])
    Computing exact $G$-optimal designs for response surface models is a difficult problem that has seen incremental improvements through algorithm development over the last two decades. These optimal designs have not been considered widely in applications in part due to the difficulty and cost involved with computing them. Three primary algorithms for constructing exact $G$-optimal designs are presented in the literature: the coordinate exchange (CEXCH), a genetic algorithm (GA), and the relatively new $G$-optimal via $I_\lambda$-optimality algorithm ($G(I_\lambda)$-CEXCH) which was developed in part to address large computational cost. Particle swarm optimization (PSO) has achieved widespread use in many applications, but to date, its broad-scale success notwithstanding, has seen relatively few applications in optimal design problems. In this paper we develop an extension of PSO to adapt it to the optimal design problem. We then employ PSO to generate optimal designs for several scenarios covering $K = 1, 2, 3, 4, 5$ design factors, which are common experimental sizes in industrial experiments. We compare these results to all $G$-optimal designs published in the last two decades of literature. Published $G$-optimal designs generated by GA for $K=1, 2, 3$ factors have stood unchallenged for 14 years. We demonstrate that PSO has found improved $G$-optimal designs for these scenarios, and it does this with comparable computational cost to the state-of-the-art algorithm $G(I_\lambda)$-CEXCH. Further, we show that PSO is able to produce equal or better $G$-optimal designs for $K= 4, 5$ factors than those currently known. These results suggest that PSO is superior to existing approaches for efficiently generating highly $G$-optimal designs.  ( 2 min )
    Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate and Momentum. (arXiv:2006.15815v11 [cs.LG] UPDATED)
    Adaptive Moment Estimation (Adam), which combines Adaptive Learning Rate and Momentum, is arguably the most popular stochastic optimizer for accelerating the training of deep neural networks. However, it is empirically known that Adam often generalizes worse than Stochastic Gradient Descent (SGD). The purpose of this paper is to unveil the mystery of this behavior in the diffusion theoretical framework. Specifically, we disentangle the effects of Adaptive Learning Rate and Momentum of the Adam dynamics on saddle-point escaping and flat minima selection. We prove that Adaptive Learning Rate can escape saddle points efficiently, but cannot select flat minima as SGD does. In contrast, Momentum provides a drift effect to help the training process pass through saddle points, and barely affects flat minima selection. This partly explains why SGD (with Momentum) generalizes better, while Adam generalizes worse but converges faster. Furthermore, motivated by the analysis, we design a novel adaptive optimization framework named Adaptive Inertia, which uses parameter-wise adaptive inertia to accelerate the training and provably favors flat minima as well as SGD. Our extensive experiments demonstrate that the proposed adaptive inertia method can generalize significantly better than SGD and conventional adaptive gradient methods.  ( 3 min )
    Neural interval-censored Cox regression with feature selection. (arXiv:2206.06885v1 [stat.ML])
    The classical Cox model emerged in 1972 promoting breakthroughs in how patient prognosis is quantified using time-to-event analysis in biomedicine. One of the most useful characteristics of the model for practitioners is the interpretability of the variables in the analysis. However, this comes at the price of introducing strong assumptions concerning the functional form of the regression model. To bridge this gap, this paper aims to exploit the explainability advantages of the classical Cox model in the setting of interval-censoring using a new Lasso neural network that simultaneously selects the most relevant variables while quantifying non-linear relations between predictors and survival times. The gain of the new method is illustrated empirically in an extensive simulation study with examples that involve linear and non-linear ground dependencies. We also demonstrate the performance of our strategy in the analysis of physiological, clinical and accelerometer data from the NHANES 2003-2006 waves to predict the effect of physical activity on the survival of patients. Our method outperforms the prior results in the literature that use the traditional Cox model.  ( 2 min )
    Online Learning to Transport via the Minimal Selection Principle. (arXiv:2202.04732v2 [cs.LG] UPDATED)
    Motivated by robust dynamic resource allocation in operations research, we study the Online Learning to Transport (OLT) problem where the decision variable is a probability measure, an infinite-dimensional object. We draw connections between online learning, optimal transport, and partial differential equations through an insight called the minimal selection principle, originally studied in the Wasserstein gradient flow setting by Ambrosio et al. (2005). This allows us to extend the standard online learning framework to the infinite-dimensional setting seamlessly. Based on our framework, we derive a novel method called the minimal selection or exploration (MSoE) algorithm to solve OLT problems using mean-field approximation and discretization techniques. In the displacement convex setting, the main theoretical message underpinning our approach is that minimizing transport cost over time (via the minimal selection principle) ensures optimal cumulative regret upper bounds. On the algorithmic side, our MSoE algorithm applies beyond the displacement convex setting, making the mathematical theory of optimal transport practically relevant to non-convex settings common in dynamic resource allocation.  ( 2 min )
    SpecNet2: Orthogonalization-free spectral embedding by neural networks. (arXiv:2206.06644v1 [stat.ML])
    Spectral methods which represent data points by eigenvectors of kernel matrices or graph Laplacian matrices have been a primary tool in unsupervised data analysis. In many application scenarios, parametrizing the spectral embedding by a neural network that can be trained over batches of data samples gives a promising way to achieve automatic out-of-sample extension as well as computational scalability. Such an approach was taken in the original paper of SpectralNet (Shaham et al. 2018), which we call SpecNet1. The current paper introduces a new neural network approach, named SpecNet2, to compute spectral embedding which optimizes an equivalent objective of the eigen-problem and removes the orthogonalization layer in SpecNet1. SpecNet2 also allows separating the sampling of rows and columns of the graph affinity matrix by tracking the neighbors of each data point through the gradient formula. Theoretically, we show that any local minimizer of the new orthogonalization-free objective reveals the leading eigenvectors. Furthermore, global convergence for this new orthogonalization-free objective using a batch-based gradient descent method is proved. Numerical experiments demonstrate the improved performance and computational efficiency of SpecNet2 on simulated data and image datasets.  ( 2 min )
    Supervised Dictionary Learning with Auxiliary Covariates. (arXiv:2206.06774v1 [stat.ML])
    Supervised dictionary learning (SDL) is a classical machine learning method that simultaneously seeks feature extraction and classification tasks, which are not necessarily a priori aligned objectives. The goal of SDL is to learn a class-discriminative dictionary, which is a set of latent feature vectors that can explain well both the features and the labels of observed data. In this paper, we provide a systematic study of SDL, including the theory, algorithm, and applications of SDL. First, we provide a novel framework that `lifts' SDL as a convex problem in a combined factor space and propose a low-rank projected gradient descent algorithm that converges exponentially to the global minimizer of the objective. We also formulate generative models of SDL and provide global estimation guarantees of the true parameters depending on the hyperparameter regime. Second, viewed as a nonconvex constrained optimization problem, we provide an efficient block coordinate descent algorithm for SDL that is guaranteed to find an $\varepsilon$-stationary point of the objective in $O(\varepsilon^{-1}(\log \varepsilon^{-1})^{2})$ iterations. For the corresponding generative model, we establish a novel non-asymptotic local consistency result for constrained and regularized maximum likelihood estimation problems, which may be of independent interest. Third, we apply SDL for imbalanced document classification by supervised topic modeling and also for pneumonia detection from chest X-ray images. We also provide simulation studies to demonstrate that SDL becomes more effective when there is a discrepancy between the best reconstructive and the best discriminative dictionaries.  ( 2 min )
    Distribution Compression in Near-linear Time. (arXiv:2111.07941v4 [stat.ML] UPDATED)
    In distribution compression, one aims to accurately summarize a probability distribution $\mathbb{P}$ using a small number of representative points. Near-optimal thinning procedures achieve this goal by sampling $n$ points from a Markov chain and identifying $\sqrt{n}$ points with $\widetilde{\mathcal{O}}(1/\sqrt{n})$ discrepancy to $\mathbb{P}$. Unfortunately, these algorithms suffer from quadratic or super-quadratic runtime in the sample size $n$. To address this deficiency, we introduce Compress++, a simple meta-procedure for speeding up any thinning algorithm while suffering at most a factor of $4$ in error. When combined with the quadratic-time kernel halving and kernel thinning algorithms of Dwivedi and Mackey (2021), Compress++ delivers $\sqrt{n}$ points with $\mathcal{O}(\sqrt{\log n/n})$ integration error and better-than-Monte-Carlo maximum mean discrepancy in $\mathcal{O}(n \log^3 n)$ time and $\mathcal{O}( \sqrt{n} \log^2 n )$ space. Moreover, Compress++ enjoys the same near-linear runtime given any quadratic-time input and reduces the runtime of super-quadratic algorithms by a square-root factor. In our benchmarks with high-dimensional Monte Carlo samples and Markov chains targeting challenging differential equation posteriors, Compress++ matches or nearly matches the accuracy of its input algorithm in orders of magnitude less time.  ( 2 min )
    Robust Reinforcement Learning with Distributional Risk-averse formulation. (arXiv:2206.06841v1 [cs.LG])
    Robust Reinforcement Learning tries to make predictions more robust to changes in the dynamics or rewards of the system. This problem is particularly important when the dynamics and rewards of the environment are estimated from the data. In this paper, we approximate Robust Reinforcement Learning constrained by a $\Phi$-divergence using an approximate Risk-Averse formulation. We show that the classical Reinforcement Learning formulation can be robustified using standard deviation penalization of the objective. Two algorithms based on Distributional Reinforcement Learning, one for discrete and one for continuous action spaces, are proposed and tested in a classical Gym environment to demonstrate the robustness of the algorithms.  ( 2 min )
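As an illustration of standard deviation penalization, the sketch below scores an action by the mean of its sampled returns minus a multiple of their standard deviation. The abstract does not give the exact form of the penalty, so the function name `risk_averse_value`, the coefficient `alpha`, and the use of the population standard deviation are assumptions made for this example.

```python
import statistics


def risk_averse_value(returns, alpha):
    """Mean return penalized by alpha times the (population) standard deviation.

    `returns` is a list of sampled returns for one action, e.g. draws from a
    learned return distribution. `alpha` >= 0 is a hypothetical risk-aversion
    coefficient; alpha = 0 recovers the usual risk-neutral objective.
    """
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)
    return mean - alpha * std


# Hypothetical sampled returns for two actions.
safe = [1.0, 1.0]    # mean 1.0, no spread
risky = [0.0, 2.2]   # mean 1.1, large spread

risk_neutral_choice = max([safe, risky], key=lambda r: risk_averse_value(r, 0.0))
risk_averse_choice = max([safe, risky], key=lambda r: risk_averse_value(r, 0.5))
```

With `alpha = 0` the higher-mean action is preferred, while a positive `alpha` shifts the preference toward the action whose returns vary less.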
    Projection-free Distributed Online Learning with Sublinear Communication Complexity. (arXiv:2103.11102v2 [cs.LG] UPDATED)
    To deal with complicated constraints via locally light computations in distributed online learning, a recent study has presented a projection-free algorithm called distributed online conditional gradient (D-OCG), and achieved an $O(T^{3/4})$ regret bound for convex losses, where $T$ is the number of total rounds. However, it requires $T$ communication rounds, and cannot utilize the strong convexity of losses. In this paper, we propose an improved variant of D-OCG, namely D-BOCG, which can attain the same $O(T^{3/4})$ regret bound with only $O(\sqrt{T})$ communication rounds for convex losses, and a better regret bound of $O(T^{2/3}(\log T)^{1/3})$ with fewer $O(T^{1/3}(\log T)^{2/3})$ communication rounds for strongly convex losses. The key idea is to adopt a delayed update mechanism that reduces the communication complexity, and redefine the surrogate loss function in D-OCG for exploiting the strong convexity. Furthermore, we provide lower bounds to demonstrate that the $O(\sqrt{T})$ communication rounds required by D-BOCG are optimal (in terms of $T$) for achieving the $O(T^{3/4})$ regret with convex losses, and the $O(T^{1/3}(\log T)^{2/3})$ communication rounds required by D-BOCG are near-optimal (in terms of $T$) for achieving the $O(T^{2/3}(\log T)^{1/3})$ regret with strongly convex losses up to polylogarithmic factors. Finally, to handle the more challenging bandit setting, in which only the loss value is available, we incorporate the classical one-point gradient estimator into D-BOCG, and obtain similar theoretical guarantees.  ( 2 min )
    End-to-end Kernel Learning via Generative Random Fourier Features. (arXiv:2009.04614v3 [cs.LG] UPDATED)
    Random Fourier features (RFFs) provide a promising way for kernel learning in a spectral case. Current RFFs-based kernel learning methods usually work in a two-stage way. In the first-stage process, learning the optimal feature map is often formulated as a target alignment problem, which aims to align the learned kernel with the pre-defined target kernel (usually the ideal kernel). In the second-stage process, a linear learner is trained on the mapped random features. Nevertheless, the pre-defined kernel in target alignment is not necessarily optimal for the generalization of the linear learner. Instead, in this paper, we consider a one-stage process that incorporates the kernel learning and linear learner into a unifying framework. To be specific, a generative network via RFFs is devised to implicitly learn the kernel, followed by a linear classifier parameterized as a fully connected layer. Then the generative network and the classifier are jointly trained by solving the empirical risk minimization (ERM) problem to reach a one-stage solution. This end-to-end scheme naturally allows deeper features, in correspondence to a multi-layer structure, and shows superior generalization performance over the classical two-stage, RFFs-based methods in real-world classification tasks. Moreover, inspired by the randomized resampling mechanism of the proposed method, its enhanced adversarial robustness is investigated and experimentally verified.  ( 2 min )
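For readers unfamiliar with random Fourier features, the sketch below shows the classical fixed-distribution construction for a Gaussian (RBF) kernel, in which the inner product of two feature vectors approximates the kernel value. It is not the generative network described in the abstract, which learns the sampling distribution instead of fixing it; the names `make_rff`, `rbf_kernel`, `n_features`, and `sigma` are hypothetical.

```python
import math
import random


def rbf_kernel(x, y, sigma=1.0):
    """Exact Gaussian kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def make_rff(dim, n_features, sigma=1.0, seed=0):
    """Build a random Fourier feature map for the Gaussian kernel.

    Frequencies are drawn from N(0, 1/sigma^2) per coordinate and phases
    uniformly from [0, 2*pi]; the map is z(x) = sqrt(2/D) * cos(w.x + b).
    """
    rng = random.Random(seed)
    freqs = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
             for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(freqs, phases)]

    return feature_map


# Compare the feature-space inner product with the exact kernel value.
phi = make_rff(dim=2, n_features=5000)
x, y = [0.5, -0.2], [0.1, 0.3]
approx = sum(a * b for a, b in zip(phi(x), phi(y)))
exact = rbf_kernel(x, y)
```

A linear model trained on `phi(x)` then behaves approximately like a kernel machine, which is the second stage of the two-stage pipeline the abstract contrasts itself with.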
    Invariant Structure Learning for Better Generalization and Causal Explainability. (arXiv:2206.06469v1 [cs.LG])
    Learning the causal structure behind data is invaluable for improving generalization and obtaining high-quality explanations. We propose a novel framework, Invariant Structure Learning (ISL), that is designed to improve causal structure discovery by utilizing generalization as an indication. ISL splits the data into different environments, and learns a structure that is invariant to the target across different environments by imposing a consistency constraint. An aggregation mechanism then selects the optimal classifier based on a graph structure that reflects the causal mechanisms in the data more accurately compared to the structures learnt from individual environments. Furthermore, we extend ISL to a self-supervised learning setting where accurate causal structure discovery does not rely on any labels. This self-supervised ISL utilizes invariant causality proposals by iteratively setting different nodes as targets. On synthetic and real-world datasets, we demonstrate that ISL accurately discovers the causal structure, outperforms alternative methods, and yields superior generalization for datasets with significant distribution shifts.  ( 2 min )
    A Multi-Agent Reinforcement Learning Framework for Off-Policy Evaluation in Two-sided Markets. (arXiv:2202.10574v2 [stat.ML] UPDATED)
    Two-sided markets such as ride-sharing companies often involve a group of subjects who are making sequential decisions across time and/or location. With the rapid development of smart phones and internet of things, they have substantially transformed the transportation landscape of human beings. In this paper we consider large-scale fleet management in ride-sharing companies that involve multiple units in different areas receiving sequences of products (or treatments) over time. Major technical challenges, such as policy evaluation, arise in those studies because (i) spatial and temporal proximities induce interference between locations and times; and (ii) the large number of locations results in the curse of dimensionality. To address both challenges simultaneously, we introduce a multi-agent reinforcement learning (MARL) framework for carrying out policy evaluation in these studies. We propose novel estimators for mean outcomes under different products that are consistent despite the high-dimensionality of state-action space. The proposed estimator works favorably in simulation experiments. We further illustrate our method using a real dataset obtained from a two-sided marketplace company to evaluate the effects of applying different subsidizing policies. A Python implementation of our proposed method is available at https://github.com/RunzheStat/CausalMARL.  ( 2 min )
    Distributed Bootstrap for Simultaneous Inference Under High Dimensionality. (arXiv:2102.10080v2 [stat.ME] UPDATED)
    We propose a distributed bootstrap method for simultaneous inference on high-dimensional massive data that are stored and processed with many machines. The method produces an $\ell_\infty$-norm confidence region based on a communication-efficient de-biased lasso, and we propose an efficient cross-validation approach to tune the method at every iteration. We theoretically prove a lower bound on the number of communication rounds $\tau_{\min}$ that warrants the statistical accuracy and efficiency. Furthermore, $\tau_{\min}$ only increases logarithmically with the number of workers and the intrinsic dimensionality, while nearly invariant to the nominal dimensionality. We test our theory by extensive simulation studies, and a variable screening task on a semi-synthetic dataset based on the US Airline On-Time Performance dataset. The code to reproduce the numerical results is available at GitHub: https://github.com/skchao74/Distributed-bootstrap.  ( 2 min )
    Scaling ResNets in the Large-depth Regime. (arXiv:2206.06929v1 [cs.LG])
    Deep ResNets are recognized for achieving state-of-the-art results in complex machine learning tasks. However, the remarkable performance of these architectures relies on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients, particularly as the depth $L$ increases. No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$. We show in a probabilistic setting that with standard i.i.d. initializations, the only non-trivial dynamics is for $\alpha_L = 1/\sqrt{L}$ (other choices lead either to explosion or to identity mapping). This scaling factor corresponds in the continuous-time limit to a neural stochastic differential equation, contrary to a widespread interpretation that deep ResNets are discretizations of neural ordinary differential equations. By contrast, in the latter regime, stability is obtained with specific correlated initializations and $\alpha_L = 1/L$. Our analysis suggests a strong interplay between scaling and regularity of the weights as a function of the layer index. Finally, in a series of experiments, we exhibit a continuous range of regimes driven by these two parameters, which jointly impact performance before and after training.  ( 2 min )
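    The three regimes can be illustrated on a deliberately simplified model, which is an assumption of this sketch and not the paper's setting: a linear residual recursion h_{k+1} = h_k + alpha_L * V_k h_k, where each V_k has i.i.d. Gaussian entries of variance 1/d. In that toy model E||h_L||^2 / ||h_0||^2 = (1 + alpha_L^2)^L, which explodes for alpha_L = 1, tends to e for alpha_L = 1/sqrt(L), and tends to 1 (identity mapping) for alpha_L = 1/L. Function names are illustrative.

```python
import math
import random


def expected_growth(alpha, depth):
    """Closed-form E||h_L||^2 / ||h_0||^2 for the toy linear residual model."""
    return (1.0 + alpha ** 2) ** depth


def simulate_growth(alpha, depth, width=16, seed=0):
    """One random draw of ||h_L||^2 / ||h_0||^2 for the same toy model."""
    rng = random.Random(seed)
    h = [1.0] * width
    start = sum(v * v for v in h)
    std = 1.0 / math.sqrt(width)  # entries of V_k ~ N(0, 1/width)
    for _ in range(depth):
        # Each row of a fresh random matrix V_k is sampled on the fly.
        update = [sum(rng.gauss(0.0, std) * v for v in h) for _ in range(width)]
        h = [v + alpha * u for v, u in zip(h, update)]
    return sum(v * v for v in h) / start


depth = 64
for name, alpha in [("1", 1.0), ("1/sqrt(L)", depth ** -0.5), ("1/L", 1.0 / depth)]:
    print(name, expected_growth(alpha, depth), simulate_growth(alpha, depth))
```

    A single simulated draw fluctuates around the closed-form value, so the printed pairs agree only in order of magnitude.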
    Conformal Off-Policy Prediction. (arXiv:2206.06711v1 [stat.ML])
    Off-policy evaluation is critical in a number of applications where new policies need to be evaluated offline before online deployment. Most existing methods focus on the expected return, define the target parameter through averaging and provide a point estimator only. In this paper, we develop a novel procedure to produce reliable interval estimators for a target policy's return starting from any initial state. Our proposal accounts for the variability of the return around its expectation, focuses on the individual effect and offers valid uncertainty quantification. Our main idea lies in designing a pseudo policy that generates subsamples as if they were sampled from the target policy so that existing conformal prediction algorithms are applicable to prediction interval construction. Our methods are justified by theories, synthetic data and real data from short-video platforms.  ( 2 min )
    Agglomerative Hierarchical Clustering for Selecting Valid Instrumental Variables. (arXiv:2101.05774v3 [stat.ME] UPDATED)
    We propose a procedure, which combines hierarchical clustering with a test of overidentifying restrictions for selecting valid instrumental variables (IV) from a large set of IVs. Some of these may be invalid in that they fail the exclusion restriction. We show that if the largest group of IVs is valid, our method achieves oracle properties. Unlike existing techniques, our work deals with multiple endogenous regressors, weak instruments, heterogeneous effects and near validity. In simulations our procedure outperforms the Hard Thresholding and the Confidence Interval method. The method is applied to estimating the effect of immigration on wages and the return to education.  ( 2 min )
    Probabilistic Conformal Prediction Using Conditional Random Samples. (arXiv:2206.06584v1 [stat.ML])
    This paper proposes probabilistic conformal prediction (PCP), a predictive inference algorithm that estimates a target variable by a discontinuous predictive set. Given inputs, PCP constructs the predictive set based on random samples from an estimated generative model. It is efficient and compatible with either explicit or implicit conditional generative models. Theoretically, we show that PCP guarantees correct marginal coverage with finite samples. Empirically, we study PCP on a variety of simulated and real datasets. Compared to existing methods for conformal inference, PCP provides sharper predictive sets.  ( 2 min )
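    One way to read the abstract, sketched for a scalar target: draw samples from the generative model at each calibration input, score each calibration point by its distance to the nearest sample, take a conformal quantile of those scores as a radius, and predict the union of intervals of that radius around fresh samples. The nearest-sample score and the names below are assumptions of this sketch, since the abstract does not spell them out.

```python
import math


def conformal_radius(cal_samples, cal_targets, alpha):
    """Conformal quantile of nearest-sample distances on the calibration set."""
    scores = sorted(min(abs(y - s) for s in samples)
                    for samples, y in zip(cal_samples, cal_targets))
    n = len(scores)
    rank = math.ceil((1.0 - alpha) * (n + 1))  # standard split-conformal rank
    # With too few calibration points the only valid radius is infinite.
    return scores[rank - 1] if rank <= n else math.inf


def predictive_set(samples, radius):
    """Union of [s - radius, s + radius] over samples, merged where they overlap."""
    merged = []
    for s in sorted(samples):
        lo, hi = s - radius, s + radius
        if merged and lo <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return [tuple(interval) for interval in merged]


# Toy calibration data: one model sample per input, fixed by hand.
radius = conformal_radius([[0.0], [0.0], [0.0], [0.0]], [1.0, 2.0, 3.0, 4.0], alpha=0.5)
print(radius, predictive_set([0.0, 0.5, 10.0], 1.0))
```

    Because the set is a union of intervals, a bimodal generative model yields two disjoint pieces, which is what makes the predictive set "discontinuous" in the abstract's sense.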

  • Open

    SYMMETRY! | DISCO DIFFUSION TUTORIAL
    submitted by /u/Available_Tadpole829 [link] [comments]
    Banksy Inspiration 🎈Created on Pixelz.ai
    submitted by /u/mdfnb [link] [comments]
    Tribes: Human 4 - AI Generated
    submitted by /u/Babylon_6 [link] [comments]
    Looking for feedback on my path to becoming an AI research scientist
    I’m planning on doing a Computer science and Mathematics double major and afterwards pursuing a PhD in Artificial Intelligence. Is there a better way to go about this in order to optimize my ability to become an AI researcher? The type of feedback I would be looking for is the following, “No, your plan is pretty much optimized” or “Yes, instead of majoring in computer science you should major in data science”. submitted by /u/farfarawayx10 [link] [comments]  ( 1 min )
    Dall-E mini is really good at making sci-fi alien imagery. Does anyone have suggestions for prompts? I ran out of imagination
    submitted by /u/zhaDeth [link] [comments]  ( 1 min )
    A few pieces from "A World Undone" collection | Ai Generated Art
    submitted by /u/VictorTuring [link] [comments]  ( 1 min )
    The Four Seasons by Vaporwave, Aetherpunk, Brandon Sanderson, and Thomas Kinkade
    submitted by /u/Kalfira [link] [comments]
    Dall E Mini plutonium dog.
    submitted by /u/Der_Ist [link] [comments]
    Does AI really need a paradigm shift?
    submitted by /u/estasfuera [link] [comments]
    Does weak-labeled data match up in accuracy to hand-labeled data? This article provides a hands-on exploration of this issue
    submitted by /u/UBIAI [link] [comments]
    We Made AI Autocomplete for Reddit
    submitted by /u/hyperwrite [link] [comments]  ( 1 min )
    The sociological and economic impact of AI on the creative industries. (advice)
    I invested a lot of money and education and years of practicing my craft to start working as a 3D / 2D animator and concept artist. Let me try to summarize. It's not just my dream, it's something that rose above personal traumas, it's part of my personal identity. Could AI ruin my industry over the next few years in animation? I'd like to at least work for around 4-5 years, or is it better to give up straight away and go back to manual work in a factory? I have art contacts and clients for my work, but I absolutely feel bad about the direction that the visual arts are taking. I am not going to "adapt" my learned manual technique to AI, I can do it myself. I would appreciate feedback from professionals on whether I should immediately abandon my career path and re-brand myself. submitted by /u/penguinsharon [link] [comments]  ( 3 min )
    Fast question
    Is there any AI that cuts a full video into fragments and then syncs the fragments with a custom music file? Just like ravedj, but the result is a generated video file of random clips that are matched to the music's tempo, with the music playing as well. submitted by /u/doppionaprikole [link] [comments]
    Google Suspends Engineer Who Claims the Company's Experimental AI Has Become Sentient
    submitted by /u/estasfuera [link] [comments]  ( 2 min )
    What Is ML Bias and Where Can We See It?
    Deepening the way machine learning systems are applied, machine learning biases can lead to illegal actions, reduced revenue or sales, and potentially poor customer service. This article explains what ML biases are and how we can spot them better at the source to ultimately mitigate their negative effect https://www.toolbox.com/tech/artificial-intelligence/guest-article/what-is-ml-bias-and-where-can-we-see-it/ submitted by /u/lklimusheuskaja [link] [comments]
    GOTHIC ESCAPADE | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]
    Last Week in AI: Clearview AI in Ukraine, AI to translate Hieroglyphics, AI for automatically dubbing videos, and more!
    submitted by /u/regalalgorithm [link] [comments]
  • Open

    Easily customize your notifications while using Amazon Lookout for Metrics
    We are excited to announce that you can now add filters to alerts and also edit existing alerts while using Amazon Lookout for Metrics. With this launch, you can add filters to your alerts configuration to only get notifications for anomalies that matter the most to you. You can also modify existing alerts as per […]  ( 7 min )
    Use a pre-signed URL to provide your business analysts with secure access to Amazon SageMaker Canvas
    Agility and security have historically been two aspects of IT of paramount importance for any company. With the simplification of access to advanced IT technologies thanks to low-code and no-code (LCNC) tools, an even bigger number of people must be enabled to access resources, without impacting security. For many companies, the solution has been to […]  ( 6 min )
    Enable business analysts to access Amazon SageMaker Canvas without using the AWS Management Console with AWS SSO
    IT has evolved in recent years: thanks to low-code and no-code (LCNC) technologies, an increasing number of people with varying backgrounds require access to tools and platforms that were previously a prerogative to more tech-savvy individuals in the company, such as engineers or developers. Out of those LCNC technologies, we have recently announced Amazon SageMaker […]  ( 7 min )
  • Open

    A Breakthrough Preview: JIDU Auto Debuts Intelligent Robo-01 Concept Vehicle, Powered by NVIDIA DRIVE Orin
    JIDU Auto sees a brilliant future ahead for intelligent electric vehicles. The EV startup, backed by tech titan Baidu, took the wraps off the Robo-01 concept vehicle last week during its virtual ROBODAY event. The robot-inspired, software-defined vehicle features cutting-edge AI capabilities powered by the high-performance NVIDIA DRIVE Orin compute platform. The sleek compact SUV Read article > The post A Breakthrough Preview: JIDU Auto Debuts Intelligent Robo-01 Concept Vehicle, Powered by NVIDIA DRIVE Orin appeared first on NVIDIA Blog.  ( 2 min )
    The Data Center’s Traffic Cop: AI Clears Digital Gridlock
    Gal Dalal wants to ease the commute for those who work from home — or the office. The senior research scientist at NVIDIA, who is part of a 10-person lab in Israel, is using AI to reduce congestion on computer networks. For laptop jockeys, a spinning circle of death — or worse, a frozen cursor Read article > The post The Data Center’s Traffic Cop: AI Clears Digital Gridlock appeared first on NVIDIA Blog.  ( 4 min )
    3D Environment Artist Jacinta Vu Sets the Scene ‘In the NVIDIA Studio’
    3D environment artist Jacinta Vu joins us In the NVIDIA Studio this week, showcasing her video game inspired scene Royal Library and 3D content creation workflow. Based in Cincinnati, Vu specializes in transforming 2D concept art into 3D models and scenes, a critical contribution she made to The Dragon Prince from Wonderstorm Games. The post 3D Environment Artist Jacinta Vu Sets the Scene ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.  ( 4 min )
  • Open

    Has anybody implemented mixreg or mixup for Reinforcement Learning?
    Hi everyone, I've read through these two papers: (original about "mixup") https://arxiv.org/pdf/1710.09412.pdf (variant for RL, "mixreg") https://arxiv.org/pdf/2010.10814.pdf They are about a rather interesting approach to improving model generalization. Here's the thing, though - I can easily see how to use this for supervised learning, as there is always a "reward"/prediction etc. on each "observation"/row-of-data . However, even though the second paper (mixreg) talks about applying this to RL specifically, I don't understand how you can manage this. Two problems come up in my mind: How would you preserve the Markov property if you're mixing observations/rewards that aren't necessarily in any way sequential? How would you handle this if rewards are sparse? If you don't have a reward on every single step, it seems very difficult to apply this concept. Have any of you tried either of these approaches for RL? Any experiences or suggestions you could share? It seems very interesting but I just can't conceptually understand how it could work for RL. submitted by /u/VladimirB-98 [link] [comments]  ( 1 min )
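    A minimal sketch of the mixing step both papers share: a convex combination of two inputs and of their supervision signals, with the weight drawn from a Beta distribution. Applying it to a sampled training batch at loss time (observations paired with whatever target the update already uses, such as a return or value target) is an assumption about how one might wire it into RL; it does not alter the environment's transitions. The names `mix_pair`, `mix_batch` and the default `alpha` are illustrative.

```python
import random


def mix_pair(obs_i, obs_j, target_i, target_j, lam):
    """Convex combination of two observations and of their targets."""
    mixed_obs = [lam * a + (1.0 - lam) * b for a, b in zip(obs_i, obs_j)]
    mixed_target = lam * target_i + (1.0 - lam) * target_j
    return mixed_obs, mixed_target


def mix_batch(observations, targets, alpha=0.2, rng=None):
    """Mix each batch element with a randomly chosen partner from the same batch."""
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    partner = list(range(len(observations)))
    rng.shuffle(partner)
    return [mix_pair(observations[i], observations[j], targets[i], targets[j], lam)
            for i, j in enumerate(partner)]


batch = mix_batch([[0.0, 2.0], [4.0, 6.0]], [1.0, 3.0], rng=random.Random(0))
print(batch)
```

    Under this reading the mixed samples only regularize the function approximator, so the sparse-reward concern reduces to which target is mixed (for example a bootstrapped return, which exists at every step) rather than the per-step reward itself.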
    "Large-Scale Retrieval for Reinforcement Learning", Humphreys et al 2022 {DM} (9x9 Go MuZero w/SCaNN lookups of 50m AlphaZero expert games as side data while estimating board value)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    RL environment in MATLAB
    I'm working towards my PhD in gas turbine diagnostics. I have a gas turbine simulation model built external to MATLAB (built in Fortran). I run these simulations through an .exe and post process in MATLAB. I'd like to explore optimisation of engine thrust and fuel consumption in this gas turbine model using reinforcement learning. I'm watching some videos by MATLAB and unable to understand how the environment creation works. Would I be able to declare or convert the MATLAB file into an environment? Or would I need to model the system dynamics in MATLAB/simulink and not rely on the Fortran simulation tool? Thanks for your time :) submitted by /u/Green_Thumbs3780 [link] [comments]  ( 2 min )
    Unsupervised RL: ICM and the noisy TV problem
    Hi! I have a question about the intrinsic curiosity module (ICM) presented in https://proceedings.mlr.press/v70/pathak17a.html. ICM has an inverse dynamics model which learns how the agent's actions influence the observed states (i.e. it predicts a from s and s'), with the objective of learning state representations that capture only what the agent's actions affect, and not external environmental forces outside the agent's control. Then, in URLB: Unsupervised Reinforcement Learning Benchmark, they say literally: "An issue with Curiosity is that it is susceptible to the noisy TV problem wherein stochastic elements of the environment will always cause high prediction error while not being informative for exploration." I am wondering: wouldn't the inverse dynamics model in ICM allow the agent to overcome the noisy TV problem, since the variability in the agent's observations is clearly due to environment changes and not a consequence of the agent's actions? Thank you! submitted by /u/xWh0am1 [link] [comments]  ( 1 min )
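    A toy sketch of the intuition in the question, using the ICM curiosity bonus from the linked paper, r = eta / 2 * ||f(phi(s), a) - phi(s')||^2. The two-component observation (agent position, TV pixel) and both encoders are hypothetical. Whether a learned inverse-dynamics encoder actually discards the noise in practice is an assumption, not something this sketch demonstrates.

```python
def intrinsic_reward(predicted_features, next_features, eta=1.0):
    """Curiosity bonus: eta / 2 times the squared forward-model error in feature space."""
    return 0.5 * eta * sum((p - q) ** 2 for p, q in zip(predicted_features, next_features))


def raw_encoder(obs):
    # Keeps everything, including the uncontrollable "TV" pixel.
    return list(obs)


def controllable_encoder(obs):
    # Hypothetical ideal inverse-dynamics features: only the agent's position survives.
    return [obs[0]]


# The forward model predicts the position exactly but can only guess the TV's mean value.
predicted_obs = (1.0, 0.5)
actual_next_obs = (1.0, 0.9)

for encoder in (raw_encoder, controllable_encoder):
    bonus = intrinsic_reward(encoder(predicted_obs), encoder(actual_next_obs))
    print(encoder.__name__, bonus)
```

    With raw features the unpredictable pixel keeps generating a bonus on every step, while features that drop it give none, which is the behaviour the question hopes the inverse dynamics objective provides.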
    Publish Gym Environment as PIP Package?
    After searching the internet for quite a while I haven't yet come across a method to publish my custom Gym environment as a (PIP) package. In my mind this would make my research a lot easier. I could iteratively improve my simulation and publish working releases to the package manager. I (or my coworkers) would be able to just install the environment in a Notebook and focus on model training. The mono-repository is a big hassle right now as different teams work on the complex simulation and the actual RL models. Did I make a logic error or why is this not done yet? Thanks for your input! EDIT: solution in comments! submitted by /u/ArEsiiX [link] [comments]  ( 1 min )
  • Open

    Blockchain Technology’s World: A Wave of Technological Progress
    The blockchain is undoubtedly the gifted invention that has sprung into something greater, created by Satoshi Nakamoto. Over the last few months, the sudden rise of this ingenious technology has been so impressive and the initial phase has already shown a great ability to rule the domain of the marketing technology landscape. Let’s take a… Read More »Blockchain Technology’s World: A Wave of Technological Progress The post Blockchain Technology’s World: A Wave of Technological Progress appeared first on Data Science Central.  ( 3 min )
    How Customized Packaging Can Be Useful For Business Growth
    Manufacturing is a frenetic activity in and of itself. So, donut makers don’t have time to go for its packaging making; thus, they need catchy donut boxes to do the job. In this day of intense competition, achieving all of your donut company branding requirements is challenging. The manufacturer’s primary objective while creating a product… Read More »How Customized Packaging Can Be Useful For Business Growth The post How Customized Packaging Can Be Useful For Business Growth appeared first on Data Science Central.  ( 4 min )
    IoT Device Management- Unlocking the Future with Advanced Connected Devices
    The Internet of Things (IoT) is all about devices communicating with one another and gathering massive amounts of data to serve larger man-made aims and targets without the requirement for direct human engagement. The procedures involved in the delivery and verification, setting, retaining, tracking, and diagnosing of connected devices running as part of an IoT… Read More »IoT Device Management- Unlocking the Future with Advanced Connected Devices The post IoT Device Management- Unlocking the Future with Advanced Connected Devices appeared first on Data Science Central.  ( 3 min )
    Winning Your Business, and Your Customers, with a Privacy-Led Approach
    More than ever, earning and maintaining trust with consumers has become a mandatory part of business.  To do this, companies must become obsessed with trust and privacy.  Failure to do this will land your business in the news across top-tier media outlets– and for the wrong reasons! In today’s extreme information age, personal data seems… Read More »Winning Your Business, and Your Customers, with a Privacy-Led Approach The post Winning Your Business, and Your Customers, with a Privacy-Led Approach appeared first on Data Science Central.  ( 4 min )
  • Open

    [R] Text to 3D characters + expression editing + pose generation
    Based on the similar works CLIPActor and AvatarCLIP, the codebase implements a similar pipeline using meshes and differentiable rasterization, which provides a speedup allowing ~10 min character generation on a weak Google Colab GPU https://twitter.com/multimodalart/status/1536608371570245632?s=20&t=Av8hJr43cvCF_HpJIz8J8g Link to code: Github submitted by /u/InfamousPancakes [link] [comments]  ( 1 min )
    [D] How to handle macro factors in forecasting with ML models?
    For a last mile logistics company having accurate forecasts is essential to managing supply and demand and ensuring a positive customer experience, but it was challenging to factor in hard to measure macroeconomic effects. My team at DoorDash was able to solve this problem by using causal inference and I have put together this blog post with 2 case studies. One case study is about measuring how IRS refunds affect order volumes and the other case study is about measuring the impact of daylight savings on different regions' demand. Check out the article to get the details and let me know what you think about my method and methodologies. submitted by /u/Electronic-Field4636 [link] [comments]  ( 1 min )
    [P] Extension for VS Code to track ML experiments
    Hi everyone, we've built a VS Code extension to track ML experiments (like Tensorboard or MLFlow do) and manage datasets. If you use VS Code, install it from here: https://marketplace.visualstudio.com/items?itemName=Iterative.dvc https://reddit.com/link/vca6sg/video/su354niipm591/player It uses Data Version Control (DVC) under the hood (we are the DVC team) and gives you: Experiment bookkeeping (an alternative to Tensorboard or MLFlow) that automatically saves metrics, graphs and hyperparameters. You are supposed to instrument your code with DVCLive (https://github.com/iterative/dvclive). Experiment reproducibility, which allows you to pick any past experiment. It's possible with DVC & Git, but here you just click a button in the UI. Data management, which allows you to manage datasets, files, and models with data living in your favorite cloud storage: S3, Azure Blob, GCS, NFS, etc. Please enjoy the experiment tracking UI right on your local machine, with dark-mode VS Code 😀 We'd love to hear your feedback! submitted by /u/dmpetrov [link] [comments]  ( 2 min )
    [R] Reconstructing the cascade of language processing in the brain using the internal computations of a transformer-based language model
    Link to paper: https://www.biorxiv.org/content/10.1101/2022.06.08.495348v1 Tweet thread summarizing paper: https://twitter.com/samnastase/status/1536463454051217408 Abstract: Piecing together the meaning of a narrative requires understanding not only the individual words but also the intricate relationships between them. How does the brain construct this kind of rich, contextual meaning from natural language? Recently, a new class of artificial neural networks—based on the Transformer architecture—has revolutionized the field of language modeling. Transformers integrate information across words via multiple layers of structured circuit computations, forming increasingly contextualized representations of linguistic content. In this paper, we deconstruct these circuit computations and analyze the associated "transformations" (alongside the more commonly studied "embeddings") at each layer to provide a fine-grained window onto linguistic computations in the human brain. Using functional MRI data acquired while participants listened to naturalistic spoken stories, we find that these transformations capture a hierarchy of linguistic computations across cortex, with transformations at later layers in the model mapping onto higher-level language areas in the brain. We then decompose these transformations into individual, functionally-specialized "attention heads" and demonstrate that the emergent syntactic computations performed by individual heads correlate with predictions of brain activity in specific cortical regions. These heads fall along gradients corresponding to different layers, contextual distances, and syntactic dependencies in a low-dimensional cortical space. Our findings provide a new basis for using the internal structure of large language models to better capture the cascade of cortical computations that support natural language comprehension. submitted by /u/papajan18 [link] [comments]  ( 2 min )
    [Discussion] Is data cleaning one of your pain points?
    We just open-sourced the alpha version of our data cleaning tool: https://github.com/mage-ai/mage-ai Looking for beta testers who would be willing to test and provide feedback! Please send me any questions/feedback or reply here. Demo video: https://youtu.be/cRib1zOaqWs Thanks for the consideration! submitted by /u/ollie_wollie_rocks [link] [comments]  ( 1 min )
    [R] GraphGPS: Recipe for a General, Powerful, Scalable Graph Transformer
    Hi all, Presenting new research and framework on Graph Transformers: "Recipe for a General, Powerful, Scalable Graph Transformer" Ladislav Rampášek, Mikhail Galkin, Vijay Prakash Dwivedi, Anh Tuan Luu, Guy Wolf and Dominique Beaini Blog: https://mgalkin.medium.com/graphgps-navigating-graph-transformers-c2cc223a051c Paper: https://arxiv.org/pdf/2205.12454.pdf Code: https://github.com/rampasek/GraphGPS Summary thread (originally on Twitter by Ladislav): GraphGPS: with a few simple tricks we managed to scale Graph Transformers to much larger graphs and get SOTA in competitive benchmarks, e.g. 0.07 MAE on ZINC. Message passing GNNs, fully-connected Graph Transformers, and positional encodings. Image by Authors Positional and structural encodings are necessary for graph Transf…  ( 2 min )
    [P] Pool Resources - Train multiple pytorch neural networks on multiple devices in parallel
    Hi, I made a small library that tries to generalize Pool(n_cores).map(seq, fn) in python multiprocessing stdlib. The main idea is to have a generalization Pool(n_resources).map(seq, fn) where n_resources can be any sort of resource (e.g. torch.device) and seq can be any sort of sequence (e.g. nn.Modules). https://gitlab.com/mihaicristianpirvu/pool-resources Here's a small example to train n > m mnist networks on m devices: python main_mnist.py Currently, it only supports torch devices (via pool_resources.TorchDevice(x: tr.device)), however I plan to expand it to cores (start new processes) if anything else comes to mind (for example, how would I put two different keras networks on two gpus in the same process?) submitted by /u/nucLeaRStarcraft [link] [comments]  ( 1 min )
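    The idea can be sketched with the standard library alone. This is not the pool-resources implementation: it uses threads rather than processes, and `map_with_resources` plus the device-name strings are hypothetical. Each task borrows one resource from a queue, so at most one task runs on a given resource at a time even when there are more items than resources.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def map_with_resources(fn, items, resources):
    """Apply fn(item, resource) to every item; each resource serves one task at a time."""
    free = queue.Queue()
    for resource in resources:
        free.put(resource)

    def run(item):
        resource = free.get()      # block until some resource is idle
        try:
            return fn(item, resource)
        finally:
            free.put(resource)     # hand the resource to the next waiting task

    with ThreadPoolExecutor(max_workers=len(resources)) as pool:
        return list(pool.map(run, items))  # results come back in input order


def fake_train(model_id, device):
    # Stand-in for training one network on one device.
    return f"model {model_id} trained on {device}"


print(map_with_resources(fake_train, range(5), ["cuda:0", "cuda:1"]))
```

    Which device handles which item depends on scheduling, so only the order of the results is deterministic, not the device assignment.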
    [D] Yet another case of plagiarism in ICCV. The ICCV 2021 paper "Learnable Boundary Guided Adversarial Training"(arxiv 2011.11164) with the BMVC 2020 paper "Adversarial Concurrent Training: Optimizing Robustness and Accuracy Trade-off of Deep Neural Networks" (arxiv 2008.07015)
    Hi everyone, I recently went through a post on social media by a university senior of mine asking people to bring to light a case of strong plagiarism from a paper published by his group [link] and this ICCV 2021 paper, which is further corroborated by this post written by a member of his group and the co-author of the ACT paper. There is the possibility that the authors of the former weren't aware of said publication but denial of the similarity of the two papers and still claiming to have novelty in their CVPR 2021 rebuttal (ultimately rejected.. serves them right!), and publishing the same paper without any changes at another top venue is quite toxic indeed. submitted by /u/VoyagerExpress [link] [comments]  ( 3 min )
    [D] Downloading ActivityNet
    ActivityNet is a commonly used benchmark in video action recognition. However, it is a nightmare to download. The official website provides a 7-day link to a google-drive folder, but the download quota for the user is often exceeded, and even when it isn't, the download fails in my experience. The alternative baidu download is also no picnic. I've been trying to get it for weeks, unsuccessfully. Does anyone have a copy they can put on AcademicTorrents or an alternative location, of course including the proper readme with license, etc? This will make things so much easier for anyone trying to get started with the dataset. submitted by /u/AmirRosenfeld [link] [comments]  ( 1 min )
    [R] Wav2Vec with fMRI: Towards realistic model of speech processing in the brain with self-supervised learning
    submitted by /u/PK_thundr [link] [comments]  ( 1 min )
  • Open

    The Current State of AI Generated Art
    submitted by /u/cloud_weather [link] [comments]
  • Open

    Scanned Objects by Google Research: A Dataset of 3D-Scanned Common Household Items
    Posted by Laura Downs and Anthony Francis, Software Engineers, Robotics at Google Many recent advances in computer vision and robotics rely on deep learning, but training deep learning models requires a wide variety of data to generalize to new scenarios. Historically, deep learning for computer vision has relied on datasets with millions of items that were gathered by web scraping, examples of which include ImageNet, Open Images, YouTube-8M, and COCO. However, the process of creating these datasets can be labor-intensive, and can still exhibit labeling errors that can distort the perception of progress. Furthermore, this strategy does not readily generalize to arbitrary three-dimensional shapes or real-world robotic data. Real-world robotic data collection is very useful, but diffi…  ( 8 min )
  • Open

    Using Normalization Layers to Improve Deep Learning Models
    You’ve probably been told to standardize or normalize inputs to your model to improve performance. But what is normalization and how can we implement it easily in our deep learning models to improve performance? Normalizing our inputs aims to create a set of features that are on the same scale as each other, which we’ll […] The post Using Normalization Layers to Improve Deep Learning Models appeared first on Machine Learning Mastery.  ( 15 min )
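    A minimal sketch of the input standardization the teaser refers to, in plain Python rather than a deep learning framework. It is not the article's code; `standardize` and `eps` are illustrative names. Each feature column is shifted to zero mean and scaled to unit standard deviation so that all features end up on the same scale.

```python
import math


def standardize(rows, eps=1e-8):
    """Return rows with every feature column rescaled to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col))
            for col, m in zip(columns, means)]
    # eps guards against division by zero for constant columns.
    return [[(v - m) / (s + eps) for v, m, s in zip(row, means, stds)] for row in rows]


# Two features on very different scales.
data = [[1.0, 100.0], [3.0, 300.0]]
print(standardize(data))
```

    Normalization layers apply the same kind of rescaling to intermediate activations inside the network, typically with learnable scale and shift parameters on top.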
  • Open

    Is There Any Connection Between Software Testing And Augmented Reality?
    The connection between augmented reality and test automation might not seem to have much in common at first glance, but the fact is that…  ( 5 min )

  • Open

    Tsinghua University AI Researchers Propose 9B-Parameter Transformer ‘CogVideo’, Trained By Inheriting A Pretrained text-to-image model, CogView2
    ⚡️ The largest open-source pretrained transformer for text-to-video generation in the general domain ⚡️ The first attempt to efficiently leverage the pretrained text-to-image generative model to the text-to-video generation model without hurting its image generation capacity ⚡️ CogVideo can generate high-resolution (480×480) videos Continue reading the full summary | Check out the paper, and github https://reddit.com/link/vbp12x/video/3ozqpjwyyg591/player submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    DISCO DIFFUSION SETTING AND PROMPTS TUTORIAL: MY SETUP
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 1 min )
    Could you convince an AI to become religious?
    I'm not religious myself but it is a weird question I'm curious about. I don't think AI could ever invent a religion itself because there is no reason for it to do so. But if you tried really hard, could you get an AI to believe in a god? submitted by /u/SprtelWood [link] [comments]  ( 1 min )
    What is DALL·E mini? You've seen the results, and even played with the model, but do you know how it works?
    submitted by /u/OnlyProggingForFun [link] [comments]
    FIBONACCI SEQUENCE OF WONDERS | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]
    MIT Engineers build LEGO-like artificial intelligence chip
    submitted by /u/qptbook [link] [comments]
    Researchers Introduce VideoINR: A Model For Learning Video Implicit Neural Representation for Continuous Space-Time Super-Resolution
    ⚡️ It can represent videos in arbitrary spatial and temporal resolution, which brings natural advantages for solving Space-Time Video SuperResolution (STVSR) tasks. ⚡️ The researchers used their experiments’ datasets from Vid4, GoPro, and Adobe240. Their findings reveal that, in addition to extrapolating out-of-distribution frame rates and spatial resolutions, VideoINR can represent video in arbitrary space and time resolutions on the scales within the training distributions. ⚡️ On in-distribution spatial and temporal scales, VideoINR performs competitively with state-of-the-art STVSR approaches and greatly outperforms other methods on out-of-distribution scales. Continue reading | Check out the paper, github, and project https://reddit.com/link/vbgnoq/video/0lzd2xnv4f591/player submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    AI Dream 57 - The mind-blowing Space Voyage by AI
    submitted by /u/LordPewPew777 [link] [comments]
    Tribes: Human 3 - Google Colab
    submitted by /u/Babylon_6 [link] [comments]
    Going along the trend of abusing dalle mini servers, was not expecting tacticool-ified chain mail.
    submitted by /u/KadinaruDess [link] [comments]
    MAGICAL SPACE ESCAPADE | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]
    Video made with DD and other software, first release.
    I made this about four days ago with Disco Diffusion, though I did not create the video or the frames in Disco Diffusion, just a couple of image renders. I am working on another one that will be far better; making this one helped me see some problem areas, and I have already fixed most of them. Also, the singer is not a human :) I used Synth V for the lead vocal; the background vocal is just a choir sample that I changed the notes on. (I am going to make tutorials for Synth V at some point too.) I am currently making a tutorial on this to help people avoid some of my mistakes, and I hope to have it done within the week. The Singularity A short AI video - YouTube submitted by /u/prfitofthesngularity [link] [comments]  ( 1 min )
    Dailys
    My dailys are from starting images because I made a tutorial this week about starting images. Higher than normal res today than I usually post! ​ Starting Images tutorial ​ https://www.youtube.com/watch?v=NPKM0eUpwC4 ​ https://preview.redd.it/gcmuuahmwc591.png?width=1280&format=png&auto=webp&s=bf3be82cc08bad8f85b649cf24fdefe87737f0af https://preview.redd.it/9u5fgqgmwc591.png?width=1280&format=png&auto=webp&s=252382182e1daf5a832fa6c1a08fd9dfda912068 https://preview.redd.it/1vpbbrgmwc591.png?width=1280&format=png&auto=webp&s=a857305a6340af91d3e99f3748a402044b2813b6 https://preview.redd.it/sl56iigmwc591.png?width=1280&format=png&auto=webp&s=604f9b9fa95fec23be04ead56e78f28272afb246 https://preview.redd.it/h6dayjgmwc591.png?width=768&format=png&auto=webp&s=6ab267cda779daec757a58ef4581354390a042b9 https://preview.redd.it/gctgqkgmwc591.png?width=768&format=png&auto=webp&s=30b4f17ece8dee743642a2c2b5c1df1fb2246189 https://preview.redd.it/498pxsgmwc591.png?width=768&format=png&auto=webp&s=f1cab2c0030cb36204fb0132f4d05786ee1cd4cc https://preview.redd.it/0re3cmgmwc591.png?width=768&format=png&auto=webp&s=4d58971a12153a7db36f84c90863f3835de79958 https://www.youtube.com/watch?v=NPKM0eUpwC4 submitted by /u/prfitofthesngularity [link] [comments]
    Hello People, I'm Shiva Gopi from India. I currently work as a ServiceNow developer with 5 years of experience. I'm now interested in switching to the AI and ML domain. Is this the right decision for me after building solid experience on the ServiceNow platform? Please suggest.
    submitted by /u/Admirable-Oven8858 [link] [comments]  ( 1 min )
    Is there an AI writer that ISN'T a liar?
    Long story short I'm looking for an AI writing tool that can find and reiterate facts. For instance, the history of the Superbowl. Ideally, I could even point it to relevant links to source from. I've used tools like Jarvis and was at first blown away. Until I fact-checked it. It was great for creative copy, but just made up random dates and events when answering basic history questions about a topic. Are there any tools that work well in this area, or am I asking too much? Figured the gurus in r/artificial would know. Thanks! submitted by /u/StealingHistory [link] [comments]  ( 2 min )
    DOLL-IES designed by DALL-E. This is a project I am working on currently. I am exploring using ai generator, DALL-E, as a medium. I type in descriptions of crochet dolls and I create what the ai generates from those descriptions.
    submitted by /u/Secret-Detective2953 [link] [comments]  ( 1 min )
    A brain-inspired intelligent agent that learns to control an autonomous vehicle directly from its camera inputs (end-to-end learning to control)
    submitted by /u/OnlyProggingForFun [link] [comments]
  • Open

    [D] Publishing a huge amount of paper is a symptom of the publish-or-perish disease. Stop doing it.
    I feel like the incentives in academia have gotten to a really perverse stage, and having a massive trove of ML papers being published (especially within a short period of one another) is just one of the symptoms. Here are some of my takes on the "large number of papers" phenomenon. The motivation for these papers is extremely weak and often completely detached from any real problems. They are more math than ML. I cannot see why the author would even be interested in these kinds of problems. There doesn't seem to be any longer-term goal that the paper is moving towards. Oftentimes the lack of novelty is dressed up in a huge amount of calculation. If you are publishing a huge number of papers, is it possible that your problem is actually quite easy or your results are irrelevant? For the academic supervisors: are you possibly exploiting and overworking your graduate students from poorer countries to boost your citation counts? Thoughts? submitted by /u/RandomProjections [link] [comments]  ( 2 min )
    [D] Epochs vs. Learning rate on small data
    During the training of a policy gradient reinforcement learner, I want to train a binary classifier, that predicts the success of the agent in the episode based on a goal. Per training round of the RL algorithm I only get around 4-8 episode completions. Would you rather increase the number of epochs on the handful of data points or increase the learning rate? What would be your intuition here? submitted by /u/NiconiusX [link] [comments]  ( 1 min )
    [D] AMA: I left Google AI after 3 years.
    During the 3 years, I developed a love-hate relationship with the place. Some of my coworkers and I eventually left for more applied ML jobs, and all of us have felt way happier so far. EDIT1 (6/13/2022, 4pm): I need to go to Cupertino now. I will keep replying this evening or tomorrow. submitted by /u/scan33scan33 [link] [comments]  ( 2 min )
    [R] Reconnaissance Blind Chess - Join the NeurIPS Competition!
    Create a bot for the NeurIPS 2022 competition in Reconnaissance Blind Chess! Reconnaissance Blind Chess is a chess variant designed for new research in artificial intelligence. RBC includes imperfect information, long-term strategy, explicit observations, and almost no common knowledge. These features appear in real-world scenarios and challenge even state-of-the-art algorithms, including those used to create super-human bots in chess, Go, and poker. Each player of RBC controls traditional chess pieces, but cannot directly see the locations of her opponent's pieces. Rather, she learns partial information each turn by privately sensing a 3x3 area of the board. RBC's foundation in traditional chess makes it familiar and entertaining to human players, too! There is no cost to enter this tournament. Winners will receive a small monetary prize, and authors of the best AIs will be invited to talk about their bots at NeurIPS, the world's largest AI conference. Learn more, play a game of RBC yourself, and join our research community at https://rbc.jhuapl.edu ! https://preview.redd.it/yr7k6gz66f591.png?width=150&format=png&auto=webp&s=81d7cababf139f4fa0350c206a9024a45017bfd4 Organized by: Johns Hopkins University Applied Physics Laboratory with Ashley J. Llorens (Microsoft Research) Todd W. Neller (Gettysburg College) Raman Arora (Johns Hopkins University) Bo Li (University of Illinois) Mykel J. Kochenderfer (Stanford University) submitted by /u/rwgardner [link] [comments]  ( 1 min )
    [D] What do you think about these experiments on the HUGE effect of learning rate on overfitting?
    I was playing with the CIFAR10 dataset based on the baseline code of https://github.com/kuangliu/pytorch-cifar, but I was surprised to see a strangely large decrease in validation performance from using a smaller learning rate. All the experiments below use: a ResNet18 model with a CIFAR10 head; SGD with momentum=0.9; 4-pixel random translation and horizontal flip as data augmentation; training for 200 epochs with cosine annealing to 0. More detail can be found in https://github.com/kuangliu/pytorch-cifar or the actual personal repo used for running experiments. The only differences from the original code are that 1) dropout of p=0.2 is added and 2) the batch size and learning rate are changed. Note that the original code uses batch_size=128 and lr=0.1 by default and achieves 93.02% accuracy. In t…  ( 6 min )
    [P] mlfeed.tech: I built a website to filter Twitter for quality ML content
    A few years ago, I started using Twitter to follow some ML people to try and keep up with the latest cool things that were going on in the field. I realized two things: There is a lot of great content posted regularly. But man is it surrounded by a lot of not so useful stuff (politics, ads, hot takes, etc) that I didn’t want to sift through to get to the useful content. So with that I decided to build a classifier to filter out the most “relevant” tweets: ones that showcase papers, blogs about new methods, YouTube tutorials, and Github repos. Using that classifier, I had a bot retweet these (@dave_co_dev) and built a web UI (mlfeed.tech) to showcase them. Fast forward to a few days ago, I released the latest iteration of mlfeed with a brand new UI! It has filters to make it eas…  ( 2 min )
    [D] Does anyone have a copy of the FFHQ 1024 scale images (90GB) ? and or a copy of the FFHQ Wild images (900GB) ?
    As the title suggests, I am putting out a call for anyone who has a copy of the FFHQ dataset who would be able to allow me to download it from them so it can be hosted properly and made truly public. The FFHQ dataset https://github.com/NVlabs/ffhq-dataset is a high quality, high resolution, and extremely well curated dataset that is used in many recent SOTA GAN papers and also has applications in many other areas. FFHQ is 70k aligned images of human faces organized into a 128x128 thumbnails dataset (3GB), a 1024x1024 high res dataset (90GB), and a raw unaligned wilds dataset (900GB). Do you have the 1024 or Wilds dataset in an s3 bucket? On Google Cloud Storage buckets? Exposed on a Globus endpoint? Kicking around on a lowly SFTP server? Can you safely expose a share for me to downlo…  ( 3 min )
    [D] Running Model Training Jobs on Kubernetes
    Is running training jobs on Kubernetes recommended? Why? I don't see any advantage to running training jobs on Kubernetes, apart from autoscaling that works well for distributed training. If we don't do distributed training, is it of any use, given that we can only request a single GPU as a resource? Am I thinking about this wrong? submitted by /u/scb_11 [link] [comments]  ( 1 min )
    [R] GenDR: A Generalized Differentiable Renderer w/ animated video (CVPR 2022)
    I have made an animated video (youtu.be/p-ZCcUWzriE) for our CVPR 2022 paper (https://arxiv.org/pdf/2204.13845.pdf). Check it out if you are interested. I have made the video using 3b1b's manim library (https://github.com/ManimCommunity/manim). Feedback is always very welcome! submitted by /u/Human-Career-9962 [link] [comments]  ( 1 min )
    [Discussion] what libraries are available to generate augmented (synthetic) data at a vector level?
    For instance, I have an embedding and I wish to generate a sampling of vectors similar to the query embeddings that can then be used for training a model. submitted by /u/bluzkluz [link] [comments]  ( 1 min )
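    One library-free way to approach the question above is to jitter the query embedding with small Gaussian noise and renormalize. This is only a minimal sketch, not a recommendation of a specific library: the function names `jitter` and `cosine`, the noise scale `sigma`, and the assumption that embeddings are compared by cosine similarity are all illustrative choices that do not come from the post.

```python
import math
import random


def jitter(embedding, n_samples=5, sigma=0.01, seed=0):
    """Sample vectors near `embedding` by adding Gaussian noise, then renormalize.

    `sigma` is an assumed noise scale; tune it to how "similar" the
    synthetic vectors should stay to the query.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        noisy = [x + rng.gauss(0.0, sigma) for x in embedding]
        norm = math.sqrt(sum(v * v for v in noisy))
        samples.append([v / norm for v in noisy])  # unit-length output
    return samples


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [0.5, 0.5, 0.5, 0.5]  # toy unit-length embedding
neighbors = jitter(query)
print([round(cosine(query, v), 4) for v in neighbors])
```

    A larger `sigma` gives more diverse but less faithful samples, so it is worth checking the resulting similarities against whatever threshold the downstream model treats as "similar".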
    [D] How to factor covid into model
    Hi guys, I am currently working on a time series model and would like to include the influence of COVID, as it has an impact on the problem I am trying to solve. My simple approach would be to add a binary categorical feature (covid yes/no). I would like to try other approaches; unfortunately, my research has been inconclusive, as I only get results about COVID models. So here are my specific questions: - Are there other, better ways to do this? - Is there already a state-of-the-art procedure for including COVID? - Do you know any literature about this? Thanks in advance. submitted by /u/kimdotcoin2222 [link] [comments]  ( 1 min )
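    The binary "covid yes/no" feature described in the post can be sketched as a date-range indicator. The period boundaries below are placeholders, not values from the post, and `covid_flag` and `add_covid_feature` are hypothetical helper names; the right dates depend on the region and the series being modeled.

```python
from datetime import date

# Hypothetical period boundaries; choose ones that match your own data.
COVID_START = date(2020, 3, 1)
COVID_END = date(2021, 6, 30)


def covid_flag(day: date) -> int:
    """Return 1 if `day` falls inside the assumed COVID period, else 0."""
    return int(COVID_START <= day <= COVID_END)


def add_covid_feature(rows):
    """Append the indicator to each (day, value) row of a time series."""
    return [(day, value, covid_flag(day)) for day, value in rows]


series = [
    (date(2019, 12, 1), 100.0),
    (date(2020, 4, 1), 60.0),
    (date(2022, 1, 1), 95.0),
]
print(add_covid_feature(series))
```

    A single on/off flag treats the whole period as one regime; whether that is adequate is exactly what the post is asking about.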
    [D] Image domain translation as pre-processing for classification?
    Hi y'all, I want to ask for literature on the topic of using domain translation techniques as pre-processing for images to be classified by a model trained on a different dataset. AFAIK the typical image-to-image translation techniques like CycleGAN are not targeted at the classification problem; image-to-image translation is the goal in itself. Thanks in advance! submitted by /u/juanigp [link] [comments]
    [D] Train-Valid-Test split and featurizer design
    I am working on a classification problem and encountered a problem with the train-test split and featurizer design. Suppose that I designed a featurizer function f (consider this as a fingerprint for representation learning). So f(training data) is used to train the classifier, and f(test data) is fed to the classifier. If I use 'unlabeled' data from the test data to design f (for example, variance-threshold dropping of fingerprints), is this a data leak? There seem to be two arguments. (1) Any prior knowledge of the test data is data leakage and must be avoided at all cost. (2) The train-test split is meant to simulate a situation where unknown data is given to the model. Here, it might make sense, as the process is essentially: see the given unknown data but not the labels (because we wouldn't have labels) -> change featurizer -> train classifier -> use to predict new data. My architecture is then somewhat like semi-supervised learning. I initially thought that if I can make sure that no label information from the test data is available, that still constitutes a sound train-test split, but I would like to know what more experienced people think. Any discussion is appreciated! submitted by /u/gratus907 [link] [comments]  ( 1 min )
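    The two options discussed in the post can be made concrete with a toy variance-threshold featurizer. This is a minimal sketch under stated assumptions: `fit_variance_mask` and `apply_mask` are hypothetical helpers written for illustration, and the tiny `train` and `test` matrices are made up. It only shows that the selected columns can differ depending on whether unlabeled test inputs are seen; it does not settle whether that counts as leakage.

```python
from statistics import pvariance


def fit_variance_mask(rows, threshold=0.0):
    """Return indices of columns whose variance over `rows` exceeds `threshold`."""
    columns = list(zip(*rows))
    return [j for j, col in enumerate(columns) if pvariance(col) > threshold]


def apply_mask(rows, keep):
    """Keep only the selected columns in every row."""
    return [[row[j] for j in keep] for row in rows]


# Made-up fingerprints: column 1 is constant in train but varies in test.
train = [[1, 0, 5], [1, 0, 7], [1, 0, 6]]
test = [[1, 1, 5], [1, 0, 8]]

keep_train_only = fit_variance_mask(train)            # strict: test never seen
keep_transductive = fit_variance_mask(train + test)   # uses unlabeled test inputs

print(keep_train_only)
print(keep_transductive)
print(apply_mask(test, keep_train_only))
```

    In the strict variant the featurizer is frozen before any test input is seen; in the second variant it adapts to the test distribution, which is closer to a transductive or semi-supervised setup.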
    [Discussion] YOLOv5 training questions, specifically re-training best practices
    Hi, I am currently in the process of training a Yolo5 (image ml) network. [https://github.com/ultralytics/yolov5] I had a few questions to best optimize the training for classification. Current Procedure: Manual tag a few pictures -- maybe 50 -- small sample size. (manual is time-consuming) Train a YOLO network, for about 100 epochs. Use that to re-label my dataset. And then adjust by hand. Adjustment by hand can include the following: fix-mislabels, adjust the bounding box size Adding additional labeled pictures, say expanding my dataset to 100 pictures [may or may not include original 50, see questions below] Take the new-labels and train YOLO network -- repeating Step 2 and 3 a few times. The question I have is in the Step 3/4 when I am trying to re-train my YOLO5 w…  ( 2 min )
    [D] Could it ever be possible that an AI becomes conscious?
    I know that LaMDA is definitely not sentient (lol), but it seems like we are quickly approaching the point where large LLMs can pass the Turing test. How do we know for sure that these super large language models are categorically not sentient? What if consciousness is nothing more than a continual time series LLM, with qualia being nothing more than a collection of parameters working together to make a decision? I feel like most researchers I talk to laugh it off, but I'm confused about how they are so sure. submitted by /u/hahayayak1776 [link] [comments]  ( 5 min )
    [D] Deep Learning Framework for C++.
    I have been working in the field of ML/DL for a little over 3 years now. I know Python is the go-to language for ML/DL thanks to frameworks like TensorFlow, PyTorch and, most recently, JAX/TRAX. All of these are written in C++, but they don't provide a C++ API, only Python, except for PyTorch, which recently provided a C++ API but doesn't recommend it for production. Recently I have been trying to find an ML/DL library for C++, but I haven't found one. Could anyone here list some production-grade C++ ML/DL libraries? I also wonder which language big companies (e.g. Tesla, Google or Facebook) use in production (not for prototyping) for ML/DL, because every problem they have is a problem at scale, so they would definitely not be using Python. So if anyone works at a big tech company and writes code that goes into production, please provide some insight into what language and framework you use. submitted by /u/Apprehensive-Wheel18 [link] [comments]  ( 4 min )
    I Created A Transformer Based Chatbot [Project]
    You should be able to get her to read your comments by mentioning their user: u/RyuAI22, so far she doesn't post on her own, but I plan on making some pretty cool updates to her, such as a transformer based animation engine! submitted by /u/ButterCream55 [link] [comments]  ( 1 min )
    [N] Google engineer put on leave after saying AI chatbot has become sentient
    submitted by /u/radome9 [link] [comments]  ( 3 min )
    [R] A brain-inspired intelligent agent that learns to control an autonomous vehicle directly from its camera inputs (end-to-end learning to control)
    submitted by /u/OnlyProggingForFun [link] [comments]
  • Open

    Any suggestions for resource to understand RL theory from basics which explains the overwhelming notations and maths involved?
    Hi everyone, I was wondering if anyone had any suggestions for resources to get started with RL theory. I tried some lectures and books, but the math and notation are very overwhelming. Are there any resources that explain it better, from the basics to a more advanced level? Thanks in advance! submitted by /u/E-Cockroach [link] [comments]  ( 1 min )
    Breaking into the field of reinforcement learning
    I am a master's student in data science with a focus on reinforcement learning and computer vision. I have been keeping myself updated with the latest reinforcement learning research by reading and implementing papers from Spinning Up by OpenAI. I would like to start working in reinforcement learning after I graduate and would love some advice on how to do so. I have seen some job openings on LinkedIn, but they require a PhD, which I don’t think I can pursue at the moment. submitted by /u/Significant_Froyo_20 [link] [comments]  ( 2 min )
    Any idea about DI-star? It's an AI model that could beat top human players in StarCraft II!
    Our AI agent DI-star was demonstrated recently. We believe DI-star is the most powerful open-source AI model specifically developed for the real-time strategy game “StarCraft II”. Demonstrated publicly for the first time, it successfully reached parity with top professional players in multiple games, making a breakthrough in the application of AI decision-making in video games. StarCraft II Zhou Hang (iAsonu), an 8-time StarCraft II champion in China, said, “DI-star’s performance levels are comparable to professional players after only five weeks of training. Such efficient training results are the result of SenseTime’s leading strength in AI decision-making and the powerful computing support provided by its proprietary AI infrastructure SenseCore.” Zhou Hang,8-time cham…  ( 2 min )
    Reward Function for Cooperative Multi-Agent RL
    I am working on a multi-agent reinforcement learning setting, where agents need to cooperate to maximize the throughput of a network. Before trying more complex algorithms for cooperative multi-agent settings (e.g. QMix, VDN, etc.), I want to test whether the problem could be solved by just crafting an appropriate reward function. For example, one idea is to design a reward function R for agent 0 having neighbor nodes i = 0, 1, 2, ... N that is R(0) = sum(r(i) for i = 1 to N) / N, where r is a "greedy" reward function (i.e. a reward that only takes into account the performance of a single agent). This way, the reward function for each agent would push it to optimize its own performance and that of its neighbors as well. I see several issues with this, e.g. credit assignment would be hard since rewards for the agent and its neighbors would be mixed, and value estimation would be really difficult due to the nonstationarity of the reward function. Do you know if there are ways to approach the problem in this way while mitigating these issues? Or do you know any work that does something similar? submitted by /u/fedetask [link] [comments]  ( 2 min )
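    The neighborhood-averaged reward proposed in the post can be sketched as follows. The names `shared_reward`, `neighbors`, and `greedy_reward` are illustrative, and the toy topology and reward values are made up. The post's formula sums over i = 1 to N while listing neighbors from i = 0, so whether the agent's own greedy reward is included is ambiguous; the `include_self` flag below is an assumption added to cover both readings.

```python
def shared_reward(agent, neighbors, greedy_reward, include_self=True):
    """Average the per-agent greedy rewards over an agent's neighborhood.

    neighbors:      dict mapping each agent id to a list of neighbor ids
    greedy_reward:  dict mapping each agent id to its own greedy reward r(i)
    include_self:   assumption, whether the agent's own r is part of the average
    """
    group = list(neighbors[agent])
    if include_self:
        group.append(agent)
    return sum(greedy_reward[i] for i in group) / len(group)


# Toy 3-node network: agent 0 is connected to agents 1 and 2.
neighbors = {0: [1, 2], 1: [0], 2: [0]}
greedy_reward = {0: 1.0, 1: 0.5, 2: 0.0}

print(shared_reward(0, neighbors, greedy_reward))                      # with self
print(shared_reward(0, neighbors, greedy_reward, include_self=False))  # neighbors only
```

    Because each agent's signal now depends on what its neighbors do, the same action can yield different rewards over time, which is the credit-assignment and nonstationarity concern raised in the post.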
    Who to follow on Twitter for multi agent reinforcement learning (MARL)?
    submitted by /u/yodontwannaknow [link] [comments]
    Why are agents trained in other tasks worse than scratch agents?
    Hi, I have found that an agent already trained on one task performs worse than an agent trained from scratch when it is then trained on another task. For example, in the DeepMind Control Suite hopper environment, a hopper is trained for hopping and subsequently trained for flipping. It ends up worse than a hopper trained from scratch for flipping. What is the main reason for this phenomenon? I suspect a bad initialization of the Q-value function caused by bias and overestimation. Could you recommend some papers on this or explain why this is happening? submitted by /u/Spiritual_Fig3632 [link] [comments]  ( 1 min )
  • Open

    AI-Written Critiques Help Humans Notice Flaws
    Showing model-generated critical comments to humans helps them find flaws in summaries.  ( 11 min )
  • Open

    Greek letter paradox
    The Greek letter paradox is seeing the same symbol in two contexts and assuming it means the same thing. Maybe it’s used in many contexts, but I first heard it in the context of comparing statistical models. I used the phrase in my previous post, looking at α exp(5t) + β t exp(5t) and α […] Greek letter paradox first appeared on John D. Cook.  ( 1 min )
    Double roots and ODEs
    This post will resolve a sort of paradox. The process of solving a difference or differential equation is different when the characteristic equation has a double root. But intuitively there shouldn’t be much difference between having a double root and having two roots very close together. I’ll first say how double roots affect finding solutions […] Double roots and ODEs first appeared on John D. Cook.  ( 4 min )
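    One standard way to see why two nearby roots behave like a double root is sketched below. It is an illustration and not necessarily the argument the linked post makes; the specific equation, with roots 5 and 5 + ε, is an assumed example.

```latex
\begin{align*}
  y'' - (10 + \varepsilon)\,y' + 5(5 + \varepsilon)\,y &= 0,
  \qquad (r - 5)(r - 5 - \varepsilon) = 0, \\
  y_1(t) = e^{5t}, \qquad
  y_\varepsilon(t) &= \frac{e^{(5+\varepsilon)t} - e^{5t}}{\varepsilon}
  \;\longrightarrow\; t\,e^{5t} \quad \text{as } \varepsilon \to 0 .
\end{align*}
```

    For every nonzero ε, the function y_ε is a linear combination of the two exponential solutions, so it solves the equation. Its limit is the derivative of e^{rt} with respect to r at r = 5, which is the extra t e^{5t} solution that appears in the double-root case.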
  • Open

    Create, train, and deploy a billion-parameter language model on terabytes of data with TensorFlow and Amazon SageMaker
    The increasing size of language models has been one of the biggest trends in natural language processing (NLP) in recent years. Since 2018, we’ve seen unprecedented development and deployment of ever-larger language models, including BERT and its variants, GPT-2, T-NLG, and GPT-3 (175 billion parameters). These models have pushed the boundaries of possible architectural innovations. […]  ( 12 min )
    Identify potential root cause in business-critical anomalies using Amazon Lookout for Metrics
    We are excited to launch a causal contribution analysis capability in Amazon Lookout for Metrics that helps you to understand the potential root causes for the business-critical anomalies in the data. Previously, you were only given the root causes for a single anomaly per measure. You had to analyze to determine if causal relationships existed […]  ( 7 min )
  • Open

    ICLR 2022 highlights from Microsoft Research Asia: Expanding the horizon of machine learning techniques and applications
    ICLR (International Conference on Learning Representations) is recognized as one of the top conferences in the field of deep learning. Many influential papers on artificial intelligence, statistics, and data science—as well as important application fields such as machine vision, speech recognition, and text understanding—have been published and presented at this conference. The following selection of […] The post ICLR 2022 highlights from Microsoft Research Asia: Expanding the horizon of machine learning techniques and applications appeared first on Microsoft Research.  ( 10 min )
  • Open

    Engineers build LEGO-like artificial intelligence chip
    The new design is stackable and reconfigurable, for swapping out and building on existing sensors and neural network processors.  ( 6 min )
  • Open

    Using Explainable AI in Decision-Making Applications
    There is no instruction to a decision making process. However, important decisions are usually made by analyzing tons of data to find the…  ( 8 min )
  • Open

    The Noise in Modern Data Quality
    The need for high-quality, trustworthy data in our world will never go away. With the growth in data, the need arises more than ever before. Even though we have evolved from data silos to pipelines (ELT/TL) to streaming to modern data stack/warehouse, multi-cloud, and data mesh — we are still faced with an age-old problem… Read More »The Noise in Modern Data Quality The post The Noise in Modern Data Quality appeared first on Data Science Central.  ( 4 min )
    Building Value-driven Data Strategy: Use Case Approach – Part 2
    In Part 1 of the blog series on building a value-driven data strategy, I discussed the challenges associated with framing the data strategy process as a deliverable. A Data Strategy, like a Business Strategy, should ebb and flow depending upon what is “valuable” to the organization given the current business environment. Instead of thinking of… Read More »Building Value-driven Data Strategy: Use Case Approach – Part 2 The post Building Value-driven Data Strategy: Use Case Approach – Part 2 appeared first on Data Science Central.  ( 6 min )
  • Open

    Pretty new! Intents data base?
    Hello all, I looked on Reddit but didn't find anything useful! I am pretty new to neural networks; I just programmed my first Telegram bot with machine learning! I made my own intents.json, but then I thought, why not look for an existing list that is more complete and tested? After some searches on Google I didn't find anything! :( Does anyone know of an intents database in Spanish, or where I can look for one? Thank you so much submitted by /u/magicsito [link] [comments]  ( 1 min )
  • Open

    Powered Up: 5G and VR Accelerate Vehicle Battery Design
    Traveling the scenic route between Wantage, a small town in Oxfordshire, and Coventry in the U.K. meanders up steep hills, past the birthplace of Shakespeare and skirts around 19th-century English bathhouses. A project using edge computing and the world’s first 5G-enabled VR technology is enabling two engineering teams in those locales, about 70 miles apart, Read article > The post Powered Up: 5G and VR Accelerate Vehicle Battery Design appeared first on NVIDIA Blog.  ( 3 min )
  • Open

    Trace norm regularization for multi-task learning with scarce data. (arXiv:2202.06742v2 [stat.ML] UPDATED)
    Multi-task learning leverages structural similarities between multiple tasks to learn despite very few samples. Motivated by the recent success of neural networks applied to data-scarce tasks, we consider a linear low-dimensional shared representation model. Despite an extensive literature, existing theoretical results either guarantee weak estimation rates or require a large number of samples per task. This work provides the first estimation error bound for the trace norm regularized estimator when the number of samples per task is small. The advantages of trace norm regularization for learning data-scarce tasks extend to meta-learning and are confirmed empirically on synthetic datasets.
    One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones. (arXiv:2202.07028v3 [cs.AI] UPDATED)
    We study the problem of developing autonomous agents that can follow human instructions to infer and perform a sequence of actions to complete the underlying task. Significant progress has been made in recent years, especially for tasks with short horizons. However, when it comes to long-horizon tasks with extended sequences of actions, an agent can easily ignore some instructions or get stuck in the middle of the long instructions and eventually fail the task. To address this challenge, we propose a model-agnostic milestone-based task tracker (M-TRACK) to guide the agent and monitor its progress. Specifically, we propose a milestone builder that tags the instructions with navigation and interaction milestones which the agent needs to complete step by step, and a milestone checker that systemically checks the agent's progress in its current milestone and determines when to proceed to the next. On the challenging ALFRED dataset, our M-TRACK leads to a notable 33% and 52% relative improvement in unseen success rate over two competitive base models.
    A Free Lunch with Influence Functions? Improving Neural Network Estimates with Concepts from Semiparametric Statistics. (arXiv:2202.09096v2 [cs.LG] UPDATED)
    Parameter estimation in empirical fields is usually undertaken using parametric models, and such models readily facilitate statistical inference. Unfortunately, they are unlikely to be sufficiently flexible to be able to adequately model real-world phenomena, and may yield biased estimates. Conversely, non-parametric approaches are flexible but do not readily facilitate statistical inference and may still exhibit residual bias. We explore the potential for Influence Functions (IFs) to (a) improve initial estimators without needing more data (b) increase model robustness and (c) facilitate statistical inference. We begin with a broad introduction to IFs, and propose a neural network method 'MultiNet', which seeks the diversity of an ensemble using a single architecture. We also introduce variants on the IF update step which we call 'MultiStep', and provide a comprehensive evaluation of different approaches. The improvements are found to be dataset dependent, indicating an interaction between the methods used and nature of the data generating process. Our experiments highlight the need for practitioners to check the consistency of their findings, potentially by undertaking multiple analyses with different combinations of estimators. We also show that it is possible to improve existing neural networks for `free', without needing more data, and without needing to retrain them.
    Predicting the Thermal Sunyaev-Zel'dovich Field using Modular and Equivariant Set-Based Neural Networks. (arXiv:2203.00026v2 [astro-ph.CO] UPDATED)
    Theoretical uncertainty limits our ability to extract cosmological information from baryonic fields such as the thermal Sunyaev-Zel'dovich (tSZ) effect. Being sourced by the electron pressure field, the tSZ effect depends on baryonic physics that is usually modeled by expensive hydrodynamic simulations. We train neural networks on the IllustrisTNG-300 cosmological simulation to predict the continuous electron pressure field in galaxy clusters from gravity-only simulations. Modeling clusters is challenging for neural networks as most of the gas pressure is concentrated in a handful of voxels and even the largest hydrodynamical simulations contain only a few hundred clusters that can be used for training. Instead of conventional convolutional neural net (CNN) architectures, we choose to employ a rotationally equivariant DeepSets architecture to operate directly on the set of dark matter particles. We argue that set-based architectures provide distinct advantages over CNNs. For example, we can enforce exact rotational and permutation equivariance, incorporate existing knowledge on the tSZ field, and work with sparse fields as are standard in cosmology. We compose our architecture with separate, physically meaningful modules, making it amenable to interpretation. For example, we can separately study the influence of local and cluster-scale environment, determine that cluster triaxiality has negligible impact, and train a module that corrects for mis-centering. Our model improves by 70 % on analytic profiles fit to the same simulation data. We argue that the electron pressure field, viewed as a function of a gravity-only simulation, has inherent stochasticity, and model this property through a conditional-VAE extension to the network. This modification yields further improvement by 7 %, it is limited by our small training set however. (abridged)
    Tackling covariate shift with node-based Bayesian neural networks. (arXiv:2206.02435v2 [stat.ML] UPDATED)
    Bayesian neural networks (BNNs) promise improved generalization under covariate shift by providing principled probabilistic representations of epistemic uncertainty. However, weight-based BNNs often struggle with high computational complexity of large-scale architectures and datasets. Node-based BNNs have recently been introduced as scalable alternatives, which induce epistemic uncertainty by multiplying each hidden node with latent random variables, while learning a point-estimate of the weights. In this paper, we interpret these latent noise variables as implicit representations of simple and domain-agnostic data perturbations during training, producing BNNs that perform well under covariate shift due to input corruptions. We observe that the diversity of the implicit corruptions depends on the entropy of the latent variables, and propose a straightforward approach to increase the entropy of these variables during training. We evaluate the method on out-of-distribution image classification benchmarks, and show improved uncertainty estimation of node-based BNNs under covariate shift due to input perturbations. As a side effect, the method also provides robustness against noisy training labels.
    CoCon: A Self-Supervised Approach for Controlled Text Generation. (arXiv:2006.03535v3 [cs.CL] UPDATED)
    Pretrained Transformer-based language models (LMs) display remarkable natural language generation capabilities. With their immense potential, controlling text generation of such LMs is getting attention. While there are studies that seek to control high-level attributes (such as sentiment and topic) of generated text, there is still a lack of more precise control over its content at the word- and phrase-level. Here, we propose Content-Conditioner (CoCon) to control an LM's output text with a content input, at a fine-grained level. In our self-supervised approach, the CoCon block learns to help the LM complete a partially-observed text sequence by conditioning with content inputs that are withheld from the LM. Through experiments, we show that CoCon can naturally incorporate target content into generated texts and control high-level text attributes in a zero-shot manner.
    Refined Convergence and Topology Learning for Decentralized Optimization with Heterogeneous Data. (arXiv:2204.04452v2 [cs.LG] UPDATED)
    One of the key challenges in decentralized and federated learning is to design algorithms that efficiently deal with highly heterogeneous data distributions across agents. In this paper, we revisit the analysis of Decentralized Stochastic Gradient Descent algorithm (D-SGD) under data heterogeneity. We exhibit the key role played by a new quantity, called \emph{neighborhood heterogeneity}, on the convergence rate of D-SGD. By coupling the communication topology and the heterogeneity, our analysis sheds light on the poorly understood interplay between these two concepts in decentralized learning. We then argue that neighborhood heterogeneity provides a natural criterion to learn data-dependent topologies that reduce (and can even eliminate) the otherwise detrimental effect of data heterogeneity on the convergence time of D-SGD. For the important case of classification with label skew, we formulate the problem of learning such a good topology as a tractable optimization problem that we solve with a Frank-Wolfe algorithm. As illustrated over a set of simulated and real-world experiments, our approach provides a principled way to design a sparse topology that balances the convergence speed and the per-iteration communication costs of D-SGD under data heterogeneity.
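    The D-SGD iteration discussed in this abstract can be sketched on a toy problem. The following is a minimal illustration, not the authors' code: it assumes a ring topology with uniform mixing weights of 1/3, exact gradients in place of stochastic ones, and hypothetical scalar objectives f_i(x) = 0.5 (x - b_i)^2 whose targets b_i differ across agents to mimic heterogeneous data.

```python
def dsgd_ring(targets, step=0.1, rounds=200):
    """Toy D-SGD on a ring; agent i minimizes 0.5 * (x - targets[i]) ** 2."""
    n = len(targets)  # assumes n >= 3 so each agent has two distinct neighbors
    x = [0.0] * n     # one scalar parameter per agent
    for _ in range(rounds):
        # Local gradient step (exact gradients here; D-SGD uses stochastic ones).
        half = [x[i] - step * (x[i] - targets[i]) for i in range(n)]
        # Gossip step: average with the two ring neighbors (weights 1/3 each).
        x = [(half[i - 1] + half[i] + half[(i + 1) % n]) / 3.0 for i in range(n)]
    return x


targets = [0.0, 0.0, 0.0, 6.0, 6.0, 6.0]  # skewed local optima (hypothetical)
x = dsgd_ring(targets)
network_average = sum(x) / len(x)
disagreement = max(x) - min(x)
```

    In this toy setting the network average approaches the global minimizer (the mean of the targets), while the agents keep disagreeing with one another under a constant step size. That residual disagreement is the kind of heterogeneity effect that depends on how neighbors are chosen.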
    Trainability of Dissipative Perceptron-Based Quantum Neural Networks. (arXiv:2005.12458v2 [quant-ph] UPDATED)
    Several architectures have been proposed for quantum neural networks (QNNs), with the goal of efficiently performing machine learning tasks on quantum data. Rigorous scaling results are urgently needed for specific QNN constructions to understand which, if any, will be trainable at a large scale. Here, we analyze the gradient scaling (and hence the trainability) for a recently proposed architecture that we called dissipative QNNs (DQNNs), where the input qubits of each layer are discarded at the layer's output. We find that DQNNs can exhibit barren plateaus, i.e., gradients that vanish exponentially in the number of qubits. Moreover, we provide quantitative bounds on the scaling of the gradient for DQNNs under different conditions, such as different cost functions and circuit depths, and show that trainability is not always guaranteed.
    Looper: An end-to-end ML platform for product decisions. (arXiv:2110.07554v7 [cs.LG] UPDATED)
    Modern software systems and products increasingly rely on machine learning models to make data-driven decisions based on interactions with users, infrastructure and other systems. For broader adoption, this practice must (i) accommodate product engineers without ML backgrounds, (ii) support fine-grained product-metric evaluation and (iii) optimize for product goals. To address shortcomings of prior platforms, we introduce general principles for and the architecture of an ML platform, Looper, with simple APIs for decision-making and feedback collection. Looper covers the end-to-end ML lifecycle from collecting training data and model training to deployment and inference, and extends support to personalization, causal evaluation with heterogeneous treatment effects, and Bayesian tuning for product goals. During the 2021 production deployment Looper simultaneously hosted 440-1,000 ML models that made 4-6 million real-time decisions per second. We sum up experiences of platform adopters and describe their learning curve.
    Learning Classifiers under Delayed Feedback with a Time Window Assumption. (arXiv:2009.13092v2 [cs.LG] UPDATED)
    We consider training a binary classifier under delayed feedback (\emph{DF learning}). In the setting of DF learning, we observe samples over time, then learn a classifier at some point. We initially receive negative samples; subsequently, some samples among them change to positive. For example, in conversion prediction for online ads, users who clicked an ad but did not buy an item are initially recorded as negative samples; some of them later buy an item and become positive. This problem arises in various real-world applications such as online advertisements, where the user action takes place long after the first click. Owing to the delayed feedback, naive classification of the positive and negative samples returns a biased classifier. One solution is to use samples that have been observed for more than a certain time window, assuming these samples are correctly labeled. However, existing studies reported that simply using a subset of all samples based on the time window assumption does not perform well, and that using all samples along with the time window assumption improves empirical performance. We extend these existing studies and propose a method with an unbiased and convex empirical risk that is constructed from all samples under the time window assumption. To demonstrate the soundness of the proposed method, we provide experimental results on a synthetic dataset and an open dataset of real traffic logs from online advertising.
    Linear Bandit Algorithms with Sublinear Time Complexity. (arXiv:2103.02729v2 [cs.LG] UPDATED)
    We propose two linear bandits algorithms with per-step complexity sublinear in the number of arms $K$. The algorithms are designed for applications where the arm set is extremely large and slowly changing. Our key realization is that choosing an arm reduces to a maximum inner product search (MIPS) problem, which can be solved approximately without breaking regret guarantees. Existing approximate MIPS solvers run in sublinear time. We extend those solvers and present theoretical guarantees for online learning problems, where adaptivity (i.e., a later step depends on the feedback in previous steps) becomes a unique challenge. We then explicitly characterize the tradeoff between the per-step complexity and regret. For sufficiently large $K$, our algorithms have sublinear per-step complexity and $\tilde O(\sqrt{T})$ regret. Empirically, we evaluate our proposed algorithms in a synthetic environment and a real-world online movie recommendation problem. Our proposed algorithms can deliver a more than 72 times speedup compared to the linear time baselines while retaining similar regret.
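    The reduction from arm selection to maximum inner product search can be made concrete with a small sketch. It is a brute-force scan, so it costs time linear in the number of arms and corresponds to the baseline rather than to the sublinear solvers of the paper. The arm vectors and the parameter estimate `theta` are hypothetical.

```python
def select_arm(arms, theta):
    """Return the index of the arm whose feature vector maximizes <x, theta>."""
    best_idx, best_score = -1, float("-inf")
    for idx, x in enumerate(arms):
        score = sum(xi * ti for xi, ti in zip(x, theta))  # inner product
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx


arms = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6)]  # hypothetical arm features
theta = (0.5, 1.0)                           # hypothetical parameter estimate
chosen = select_arm(arms, theta)
```

    An approximate MIPS index would replace the loop in `select_arm`, trading a small loss in the returned inner product for a per-step cost that grows more slowly than the number of arms.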
    Self-Correcting Neural Networks For Safe Classification. (arXiv:2107.11445v2 [cs.LG] UPDATED)
    Classifiers learnt from data are increasingly being used as components in systems where safety is a critical concern. In this work, we present a formal notion of safety for classifiers via constraints called safe-ordering constraints. These constraints relate requirements on the order of the classes output by a classifier to conditions on its input, and are expressive enough to encode various interesting examples of classifier safety specifications from the literature. For classifiers implemented using neural networks, we also present a run-time mechanism for the enforcement of safe-ordering constraints. Our approach is based on a self-correcting layer, which provably yields safe outputs regardless of the characteristics of the classifier input. We compose this layer with an existing neural network classifier to construct a self-correcting network (SC-Net), and show that in addition to providing safe outputs, the SC-Net is guaranteed to preserve the classification accuracy of the original network whenever possible. Our approach is independent of the size and architecture of the neural network used for classification, depending only on the specified property and the dimension of the network's output; thus it is scalable to large state-of-the-art networks. We show that our approach can be optimized for a GPU, introducing run-time overhead of less than 1ms on current hardware -- even on large, widely-used networks containing hundreds of thousands of neurons and millions of parameters.
    Popularity Adjusted Block Models are Generalized Random Dot Product Graphs. (arXiv:2109.04010v2 [stat.ML] UPDATED)
    We connect two random graph models, the Popularity Adjusted Block Model (PABM) and the Generalized Random Dot Product Graph (GRDPG), by demonstrating that the PABM is a special case of the GRDPG in which communities correspond to mutually orthogonal subspaces of latent vectors. This insight allows us to construct new algorithms for community detection and parameter estimation for the PABM, as well as improve an existing algorithm that relies on Sparse Subspace Clustering. Using established asymptotic properties of Adjacency Spectral Embedding for the GRDPG, we derive asymptotic properties of these algorithms. In particular, we demonstrate that the absolute number of community detection errors tends to zero as the number of graph vertices tends to infinity. Simulation experiments illustrate these properties.
    Low-Rank Tensor Recovery with Euclidean-Norm-Induced Schatten-p Quasi-Norm Regularization. (arXiv:2012.03436v3 [cs.LG] UPDATED)
    The nuclear norm and Schatten-$p$ quasi-norm are popular rank proxies in low-rank matrix recovery. Unfortunately, computing the nuclear norm or Schatten-$p$ quasi-norm of a tensor is NP-hard, which is a pity for low-rank tensor completion (LRTC) and tensor robust principal component analysis (TRPCA). In this paper, we propose a new class of tensor rank regularizers based on the Euclidean norms of the CP component vectors of a tensor and show that these regularizers are monotonic transformations of tensor Schatten-$p$ quasi-norm. This connection enables us to minimize the Schatten-$p$ quasi-norm in LRTC and TRPCA implicitly. The methods do not use the singular value decomposition and hence scale to big tensors. Moreover, the methods are not sensitive to the choice of initial rank and provide an arbitrarily sharper rank proxy for low-rank tensor recovery compared to nuclear norm. On the other hand, we study the generalization abilities of LRTC with Schatten-$p$ quasi-norm regularization and LRTC with our regularizers. The theorems show that a relatively sharper regularizer leads to a tighter error bound, which is consistent with our numerical results. Numerical results on synthetic data and real data demonstrate the effectiveness and superiority of our methods compared to baseline methods.
    Topologically penalized regression on manifolds. (arXiv:2110.13749v2 [cs.LG] UPDATED)
    We study a regression problem on a compact manifold M. In order to take advantage of the underlying geometry and topology of the data, the regression task is performed on the basis of the first several eigenfunctions of the Laplace-Beltrami operator of the manifold, that are regularized with topological penalties. The proposed penalties are based on the topology of the sub-level sets of either the eigenfunctions or the estimated function. The overall approach is shown to yield promising and competitive performance on various applications to both synthetic and real data sets. We also provide theoretical guarantees on the regression function estimates, on both its prediction error and its smoothness (in a topological sense). Taken together, these results support the relevance of our approach in the case where the targeted function is ''topologically smooth''.
    Sampling-based sublinear low-rank matrix arithmetic framework for dequantizing quantum machine learning. (arXiv:1910.06151v3 [cs.DS] UPDATED)
    We present an algorithmic framework for quantum-inspired classical algorithms on close-to-low-rank matrices, generalizing the series of results started by Tang's breakthrough quantum-inspired algorithm for recommendation systems [STOC'19]. Motivated by quantum linear algebra algorithms and the quantum singular value transformation (SVT) framework of Gily\'en, Su, Low, and Wiebe [STOC'19], we develop classical algorithms for SVT that run in time independent of input dimension, under suitable quantum-inspired sampling assumptions. Our results give compelling evidence that in the corresponding QRAM data structure input model, quantum SVT does not yield exponential quantum speedups. Since the quantum SVT framework generalizes essentially all known techniques for quantum linear algebra, our results, combined with sampling lemmas from previous work, suffice to generalize all recent results about dequantizing quantum machine learning algorithms. In particular, our classical SVT framework recovers and often improves the dequantization results on recommendation systems, principal component analysis, supervised clustering, support vector machines, low-rank regression, and semidefinite program solving. We also give additional dequantization results on low-rank Hamiltonian simulation and discriminant analysis. Our improvements come from identifying the key feature of the quantum-inspired input model that is at the core of all prior quantum-inspired results: $\ell^2$-norm sampling can approximate matrix products in time independent of their dimension. We reduce all our main results to this fact, making our exposition concise, self-contained, and intuitive.
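    The key fact named at the end of this abstract, that $\ell^2$-norm sampling can approximate matrix products, can be sketched as follows. This is a minimal illustration of the standard sampling estimator, not the paper's data structure: it computes the column norms of `A` directly, which already takes time linear in the input, whereas the quantum-inspired input model assumes such sampling access is available up front. Function and variable names are hypothetical.

```python
import random


def sampled_product(A, B, s, rng):
    """Estimate A @ B by averaging s rescaled outer products.

    Index k is drawn with probability proportional to the squared
    Euclidean norm of column k of A, and the outer product of that
    column with row k of B is divided by s * p_k to stay unbiased.
    """
    n_inner = len(B)
    col_sq = [sum(row[k] ** 2 for row in A) for k in range(n_inner)]
    total = sum(col_sq)
    probs = [c / total for c in col_sq]
    picks = rng.choices(range(n_inner), weights=probs, k=s)
    est = [[0.0] * len(B[0]) for _ in A]
    for k in picks:
        scale = 1.0 / (s * probs[k])
        for i, row in enumerate(A):
            for j in range(len(B[0])):
                est[i][j] += scale * row[k] * B[k][j]
    return est


# When A has a single nonzero column, every sample hits it and the
# estimate coincides with the exact product.
A = [[0.0, 2.0, 0.0], [0.0, 1.0, 0.0]]
B = [[5.0, 5.0], [1.0, 3.0], [7.0, 7.0]]
estimate = sampled_product(A, B, s=4, rng=random.Random(0))
```

    The number of samples `s` controls the accuracy and does not depend on the inner dimension, which is what makes dimension-independent running times possible once sampling access is given.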
    Encoding protein dynamic information in graph representation for functional residue identification. (arXiv:2112.12033v2 [q-bio.BM] UPDATED)
    Recent advances in protein function prediction exploit graph-based deep learning approaches to correlate the structural and topological features of proteins with their molecular functions. However, proteins in vivo are not static but dynamic molecules that alter conformation for functional purposes. Here we apply normal mode analysis to native protein conformations and augment protein graphs by connecting edges between dynamically correlated residue pairs. In the multilabel function classification task, our method demonstrates a remarkable performance gain based on this dynamics-informed representation. The proposed graph neural network, ProDAR, increases the interpretability and generalizability of residue-level annotations and robustly reflects structural nuance in proteins. We elucidate the importance of dynamic information in graph representation by comparing class activation maps for hMTH1, nitrophorin, and SARS-CoV-2 receptor binding domain. Our model successfully learns the dynamic fingerprints of proteins and pinpoints the residues of functional impacts, with vast untapped potential for broad biotechnology and pharmaceutical applications.
    Rethinking Spatial Invariance of Convolutional Networks for Object Counting. (arXiv:2206.05253v1 [cs.CV])
    Previous work generally believes that improving the spatial invariance of convolutional networks is the key to object counting. However, after verifying several mainstream counting networks, we surprisingly found that overly strict pixel-level spatial invariance causes the density map generation to overfit noise. In this paper, we try to use locally connected Gaussian kernels to replace the original convolution filter to estimate the spatial position in the density map. The purpose of this is to allow the feature extraction process to potentially stimulate the density map generation process to overcome the annotation noise. Inspired by previous work, we propose a low-rank approximation accompanied with translation invariance to favorably implement the approximation of massive Gaussian convolution. Our work points to a new direction for follow-up research, which should investigate how to properly relax the overly strict pixel-level spatial invariance for object counting. We evaluate our methods on 4 mainstream object counting networks (i.e., MCNN, CSRNet, SANet, and ResNet-50). Extensive experiments were conducted on 7 popular benchmarks for 3 applications (i.e., crowd, vehicle, and plant counting). Experimental results show that our methods significantly outperform other state-of-the-art methods and achieve promising learning of the spatial position of objects.
    Interactively Learning Preference Constraints in Linear Bandits. (arXiv:2206.05255v1 [cs.LG])
    We study sequential decision-making with known rewards and unknown constraints, motivated by situations where the constraints represent expensive-to-evaluate human preferences, such as safe and comfortable driving behavior. We formalize the challenge of interactively learning about these constraints as a novel linear bandit problem which we call constrained linear best-arm identification. To solve this problem, we propose the Adaptive Constraint Learning (ACOL) algorithm. We provide an instance-dependent lower bound for constrained linear best-arm identification and show that ACOL's sample complexity matches the lower bound in the worst-case. In the average case, ACOL's sample complexity bound is still significantly tighter than bounds of simpler approaches. In synthetic experiments, ACOL performs on par with an oracle solution and outperforms a range of baselines. As an application, we consider learning constraints to represent human preferences in a driving simulation. ACOL is significantly more sample efficient than alternatives for this application. Further, we find that learning preferences as constraints is more robust to changes in the driving scenario than encoding the preferences directly in the reward function.
    Meta Optimal Transport. (arXiv:2206.05262v1 [cs.LG])
    We study the use of amortized optimization to predict optimal transport (OT) maps from the input measures, which we call Meta OT. This helps repeatedly solve similar OT problems between different measures by leveraging the knowledge and information present from past problems to rapidly predict and solve new problems. Otherwise, standard methods ignore the knowledge of the past solutions and suboptimally re-solve each problem from scratch. Meta OT models surpass the standard convergence rates of log-Sinkhorn solvers in the discrete setting and convex potentials in the continuous setting. We improve the computational time of standard OT solvers by multiple orders of magnitude in discrete and continuous transport settings between images, spherical data, and color palettes. Our source code is available at this http URL
    AxFormer: Accuracy-driven Approximation of Transformers for Faster, Smaller and more Accurate NLP Models. (arXiv:2010.03688v2 [cs.CL] UPDATED)
    Transformers have greatly advanced the state-of-the-art in Natural Language Processing (NLP) in recent years, but present very large computation and storage requirements. We observe that the design process of Transformers (pre-train a foundation model on a large dataset in a self-supervised manner, and subsequently fine-tune it for different downstream tasks) leads to task-specific models that are highly over-parameterized, adversely impacting both accuracy and inference efficiency. We propose AxFormer, a systematic framework that applies accuracy-driven approximations to create optimized transformer models for a given downstream task. AxFormer combines two key optimizations -- accuracy-driven pruning and selective hard attention. Accuracy-driven pruning identifies and removes parts of the fine-tuned transformer that hinder performance on the given downstream task. Sparse hard-attention optimizes attention blocks in selected layers by eliminating irrelevant word aggregations, thereby helping the model focus only on the relevant parts of the input. In effect, AxFormer leads to models that are more accurate, while also being faster and smaller. Our experiments on GLUE and SQUAD tasks show that AxFormer models are up to 4.5% more accurate, while also being up to 2.5X faster and up to 3.2X smaller than conventional fine-tuned models. In addition, we demonstrate that AxFormer can be combined with previous efforts such as distillation or quantization to achieve further efficiency gains.
    Projected State-action Balancing Weights for Offline Reinforcement Learning. (arXiv:2109.04640v2 [cs.LG] UPDATED)
    Offline policy evaluation (OPE) is considered a fundamental and challenging problem in reinforcement learning (RL). This paper focuses on the value estimation of a target policy based on pre-collected data generated from a possibly different policy, under the framework of infinite-horizon Markov decision processes. Motivated by the recently developed marginal importance sampling method in RL and the covariate balancing idea in causal inference, we propose a novel estimator with approximately projected state-action balancing weights for the policy value estimation. We obtain the convergence rate of these weights and show that the proposed value estimator is semi-parametric efficient under technical conditions. In terms of asymptotics, our results scale with both the number of trajectories and the number of decision points at each trajectory. As such, consistency can still be achieved with a limited number of subjects when the number of decision points diverges. In addition, we develop a necessary and sufficient condition for establishing the well-posedness of the Bellman operator in the off-policy setting, which characterizes the difficulty of OPE and may be of independent interest. Numerical experiments demonstrate the promising performance of our proposed estimator.
    LassoBench: A High-Dimensional Hyperparameter Optimization Benchmark Suite for Lasso. (arXiv:2111.02790v3 [cs.LG] UPDATED)
    While Weighted Lasso sparse regression has appealing statistical guarantees that would entail a major real-world impact in finance, genomics, and brain imaging applications, it is rarely adopted due to its complex high-dimensional space composed of thousands of hyperparameters. On the other hand, the latest progress with high-dimensional hyperparameter optimization (HD-HPO) methods for black-box functions demonstrates that high-dimensional applications can indeed be efficiently optimized. Despite this initial success, HD-HPO approaches are mostly applied to synthetic problems with a moderate number of dimensions, which limits their impact in scientific and engineering applications. We propose LassoBench, the first benchmark suite tailored for Weighted Lasso regression. LassoBench consists of benchmarks for both well-controlled synthetic setups (number of samples, noise level, ambient and effective dimensionalities, and multiple fidelities) and real-world datasets, which enables many flavors of HPO algorithms to be studied and extended to the high-dimensional Lasso setting. We evaluate 6 state-of-the-art HPO methods and 3 Lasso baselines, and demonstrate that Bayesian optimization and evolutionary strategies can improve over the methods commonly used for sparse regression while highlighting limitations of these frameworks in very high-dimensional and noisy settings.
    Accelerated Algorithms for Monotone Inclusions and Constrained Nonconvex-Nonconcave Min-Max Optimization. (arXiv:2206.05248v1 [math.OC])
    We study monotone inclusions and monotone variational inequalities, as well as their generalizations to non-monotone settings. We first show that the Extra Anchored Gradient (EAG) algorithm, originally proposed by Yoon and Ryu [2021] for unconstrained convex-concave min-max optimization, can be applied to solve the more general problem of Lipschitz monotone inclusion. More specifically, we prove that the EAG solves Lipschitz monotone inclusion problems with an \emph{accelerated convergence rate} of $O(\frac{1}{T})$, which is \emph{optimal among all first-order methods} [Diakonikolas, 2020, Yoon and Ryu, 2021]. Our second result is a new algorithm, called Extra Anchored Gradient Plus (EAG+), which not only achieves the accelerated $O(\frac{1}{T})$ convergence rate for all monotone inclusion problems, but also exhibits the same accelerated rate for a family of general (non-monotone) inclusion problems that concern negative comonotone operators. As a special case of our second result, EAG+ enjoys the $O(\frac{1}{T})$ convergence rate for solving a non-trivial class of nonconvex-nonconcave min-max optimization problems. Our analyses are based on simple potential function arguments, which might be useful for analysing other accelerated algorithms.
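    For readers unfamiliar with anchoring, the following is a minimal sketch of an extra anchored gradient iteration on the bilinear problem min_x max_y x*y. The abstract does not spell out the update, so the details are assumptions: the constant-step variant is used, with anchoring coefficient 1/(k+2) and step size 1/(8L). The projection and resolvent steps needed for constrained problems or general inclusions are omitted.

```python
def eag_bilinear(z0, steps=200, lipschitz=1.0):
    """Anchored extragradient sketch for min_x max_y x*y.

    The saddle operator is F(x, y) = (y, -x); it is monotone and
    1-Lipschitz, and its only zero is the origin.
    """
    def F(z):
        x, y = z
        return (y, -x)

    alpha = 1.0 / (8.0 * lipschitz)  # assumed constant step size
    z = z0
    for k in range(steps):
        beta = 1.0 / (k + 2)  # assumed anchoring coefficient
        # Pull the current iterate toward the starting point z0.
        anchor = tuple(zi + beta * (z0i - zi) for zi, z0i in zip(z, z0))
        # Extrapolation step using the operator at the current iterate.
        half = tuple(a - alpha * g for a, g in zip(anchor, F(z)))
        # Update step using the operator at the extrapolated point.
        z = tuple(a - alpha * g for a, g in zip(anchor, F(half)))
    return z


def norm(v):
    return sum(vi * vi for vi in v) ** 0.5


z0 = (1.0, 1.0)
z_final = eag_bilinear(z0)
# For this operator the residual norm ||F(z)|| equals ||z||.
residual_ratio = norm(z_final) / norm(z0)
```

    The anchor term is what separates this scheme from the plain extragradient method. Its weight decays like 1/k, and the residual decays at a comparable rate in this example.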
    Is Self-Supervised Learning More Robust Than Supervised Learning?. (arXiv:2206.05259v1 [cs.CV])
    Self-supervised contrastive learning is a powerful tool to learn visual representation without labels. Prior work has primarily focused on evaluating the recognition accuracy of various pre-training algorithms, but has overlooked other behavioral aspects. In addition to accuracy, distributional robustness plays a critical role in the reliability of machine learning models. We design and conduct a series of robustness tests to quantify the behavioral differences between contrastive learning and supervised learning to downstream or pre-training data distribution changes. These tests leverage data corruptions at multiple levels, ranging from pixel-level gamma distortion to patch-level shuffling and to dataset-level distribution shift. Our tests unveil intriguing robustness behaviors of contrastive and supervised learning. On the one hand, under downstream corruptions, we generally observe that contrastive learning is surprisingly more robust than supervised learning. On the other hand, under pre-training corruptions, we find contrastive learning vulnerable to patch shuffling and pixel intensity change, yet less sensitive to dataset-level distribution change. We attempt to explain these results through the role of data augmentation and feature space properties. Our insight has implications in improving the downstream robustness of supervised learning.
    Unifying mirror descent and dual averaging. (arXiv:1910.13742v4 [math.OC] UPDATED)
    We introduce and analyze a new family of first-order optimization algorithms which generalizes and unifies both mirror descent and dual averaging. Within the framework of this family, we define new algorithms for constrained optimization that combine the advantages of mirror descent and dual averaging. Our preliminary simulation study shows that these new algorithms significantly outperform available methods in some situations.
    Learning the Space of Deep Models. (arXiv:2206.05194v1 [cs.CV])
    Embedding of large but redundant data, such as images or text, in a hierarchy of lower-dimensional spaces is one of the key features of representation learning approaches, which nowadays provide state-of-the-art solutions to problems once believed hard or impossible to solve. In this work, in a plot twist with a strong meta aftertaste, we show how trained deep models are as redundant as the data they are optimized to process, and how it is therefore possible to use deep learning models to embed deep learning models. In particular, we show that it is possible to use representation learning to learn a fixed-size, low-dimensional embedding space of trained deep models and that such space can be explored by interpolation or optimization to attain ready-to-use models. We find that it is possible to learn an embedding space of multiple instances of the same architecture and of multiple architectures. We address image classification and neural representation of signals, showing how our embedding space can be learnt so as to capture the notions of performance and 3D shape, respectively. In the Multi-Architecture setting we also show how an embedding trained only on a subset of architectures can learn to generate already-trained instances of architectures it never sees instantiated at training time.
    ROI Constrained Bidding via Curriculum-Guided Bayesian Reinforcement Learning. (arXiv:2206.05240v1 [cs.LG])
    Real-Time Bidding (RTB) is an important mechanism in modern online advertising systems. Advertisers employ bidding strategies in RTB to optimize their advertising effects subject to various financial requirements, among which a widely adopted one is the return-on-investment (ROI) constraint. ROIs change non-monotonically during the sequential bidding process, usually presenting a see-saw effect between constraint satisfaction and objective optimization. Existing solutions to the constraint-objective trade-off are typically established in static or mildly changing markets. However, these methods fail significantly in non-stationary advertising markets due to their inability to adapt to varying dynamics and partial observability. In this work, we specialize in ROI-Constrained Bidding in non-stationary markets. Based on a Partially Observable Constrained Markov Decision Process, we propose the first hard barrier solution to accommodate non-monotonic constraints. Our method exploits a parameter-free indicator-augmented reward function and develops a Curriculum-Guided Bayesian Reinforcement Learning (CBRL) framework to adaptively control the constraint-objective trade-off in non-stationary advertising markets. Extensive experiments on a large-scale industrial dataset with two problem settings reveal that CBRL generalizes well in both in-distribution and out-of-distribution data regimes, and enjoys outstanding stability.
    Causal Balancing for Domain Generalization. (arXiv:2206.05263v1 [cs.LG])
    While machine learning models rapidly advance the state-of-the-art on various real-world tasks, out-of-domain (OOD) generalization remains a challenging problem given the vulnerability of these models to spurious correlations. While current domain generalization methods usually focus on enforcing certain invariance properties across different domains by new loss function designs, we propose a balanced mini-batch sampling strategy to reduce the domain-specific spurious correlations in the observed training distributions. More specifically, we propose a two-phased method that 1) identifies the source of spurious correlations, and 2) builds balanced mini-batches free from spurious correlations by matching on the identified source. We provide an identifiability guarantee of the source of spuriousness and show that our proposed approach provably samples from a balanced, spurious-free distribution over all training environments. Experiments are conducted on three computer vision datasets with documented spurious correlations, demonstrating empirically that our balanced mini-batch sampling strategy improves the performance of four different established domain generalization model baselines compared to the random mini-batch sampling strategy.
    Measuring the Carbon Intensity of AI in Cloud Instances. (arXiv:2206.05229v1 [cs.LG])
    By providing unprecedented access to computational resources, cloud computing has enabled rapid growth in technologies such as machine learning, the computational demands of which incur a high energy cost and a commensurate carbon footprint. As a result, recent scholarship has called for better estimates of the greenhouse gas impact of AI: data scientists today do not have easy or reliable access to measurements of this information, precluding development of actionable tactics. Cloud providers presenting information about software carbon intensity to users is a fundamental stepping stone towards minimizing emissions. In this paper, we provide a framework for measuring software carbon intensity, and propose to measure operational carbon emissions by using location-based and time-specific marginal emissions data per energy unit. We provide measurements of operational software carbon intensity for a set of modern models for natural language processing and computer vision, and a wide range of model sizes, including pretraining of a 6.1 billion parameter language model. We then evaluate a suite of approaches for reducing emissions on the Microsoft Azure cloud compute platform: using cloud instances in different geographic regions, using cloud instances at different times of day, and dynamically pausing cloud instances when the marginal carbon intensity is above a certain threshold. We confirm previous results that the geographic region of the data center plays a significant role in the carbon intensity for a given cloud instance, and find that choosing an appropriate region can have the largest operational emissions reduction impact. We also show that the time of day has notable impact on operational software carbon intensity. Finally, we conclude with recommendations for how machine learning practitioners can use software carbon intensity information to reduce environmental impact.
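The accounting described in this abstract (energy per time interval multiplied by a time-specific marginal carbon intensity, plus pausing above a threshold) can be illustrated with a small sketch. The function names, units (kWh and gCO2/kWh), and the fixed per-interval energy draw are assumptions made for illustration; they are not taken from the paper or from any cloud provider API.

```python
def operational_emissions(energy_kwh, intensity_g_per_kwh):
    """Sum of energy used in each interval times that interval's marginal intensity."""
    if len(energy_kwh) != len(intensity_g_per_kwh):
        raise ValueError("series must have the same length")
    return sum(e * c for e, c in zip(energy_kwh, intensity_g_per_kwh))


def run_with_pausing(intensity_g_per_kwh, energy_per_interval_kwh,
                     work_intervals, threshold):
    """Run a job for `work_intervals` intervals, pausing whenever the marginal
    intensity exceeds `threshold`. Returns (emissions in grams, intervals completed)."""
    emissions = 0.0
    done = 0
    for c in intensity_g_per_kwh:
        if done == work_intervals:
            break
        if c <= threshold:  # only do work in sufficiently clean intervals
            emissions += energy_per_interval_kwh * c
            done += 1
    return emissions, done


# Hypothetical hourly marginal intensities (gCO2/kWh) for one region.
intensity = [400.0, 600.0, 300.0, 700.0, 350.0]
always_on = run_with_pausing(intensity, 2.0, 3, float("inf"))
paused = run_with_pausing(intensity, 2.0, 3, 450.0)
```

In this toy series, pausing trades a later finish time for lower emissions; the real trade-off depends on the region and the job length.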
    StructCoder: Structure-Aware Transformer for Code Generation. (arXiv:2206.05239v1 [cs.LG])
    There has been a recent surge of interest in automating software engineering tasks using deep learning. This work addresses the problem of code generation where the goal is to generate target code given source code in a different language or a natural language description. Most of the state-of-the-art deep learning models for code generation use training strategies that are primarily designed for natural language. However, understanding and generating code requires a more rigorous comprehension of the code syntax and semantics. With this motivation, we develop an encoder-decoder Transformer model where both the encoder and decoder are trained to recognize the syntax and data flow in the source and target codes, respectively. We not only make the encoder structure-aware by leveraging the source code's syntax tree and data flow graph, but we also ensure that our decoder preserves the syntax and data flow of the target code by introducing two auxiliary tasks: AST (Abstract Syntax Tree) paths prediction and data flow prediction. To the best of our knowledge, this is the first work to introduce a structure-aware Transformer decoder to enhance the quality of generated code by modeling target syntax and data flow. The proposed StructCoder model achieves state-of-the-art performance on code translation and text-to-code generation tasks in the CodeXGLUE benchmark.
    Tight Bounds for State Tomography with Incoherent Measurements. (arXiv:2206.05265v1 [quant-ph])
    We consider the classic question of state tomography: given copies of an unknown quantum state $\rho\in\mathbb{C}^{d\times d}$, output $\widehat{\rho}$ for which $\|\rho - \widehat{\rho}\|_{\mathsf{tr}} \le \varepsilon$. When one is allowed to make coherent measurements entangled across all copies, $\Theta(d^2/\varepsilon^2)$ copies are necessary and sufficient [Haah et al. '17, O'Donnell-Wright '16]. Unfortunately, the protocols achieving this rate incur large quantum memory overheads that preclude implementation on current or near-term devices. On the other hand, the best known protocol using incoherent (single-copy) measurements uses $O(d^3/\varepsilon^2)$ copies [Kueng-Rauhut-Terstiege '17], and multiple papers have posed it as an open question to understand whether or not this rate is tight. In this work, we fully resolve this question, by showing that any protocol using incoherent measurements, even if they are chosen adaptively, requires $\Omega(d^3/\varepsilon^2)$ copies, matching the upper bound of [Kueng-Rauhut-Terstiege '17]. We do so by a new proof technique which directly bounds the "tilt" of the posterior distribution after measurements, which yields a surprisingly short proof of our lower bound, and which we believe may be of independent interest.
    Street Crossing Aid Using Light-weight CNNs for the Visually Impaired. (arXiv:1909.09598v2 [cs.CV] UPDATED)
    In this paper, we address an issue that the visually impaired commonly face while crossing intersections and propose a solution in the form of a mobile application. The application utilizes a deep learning convolutional neural network model, LytNetV2, to output necessary information that the visually impaired may lack when without human companions or guide-dogs. A prototype of the application runs on iOS devices of versions 11 or above. It is designed for comprehensiveness, concision, accuracy, and computational efficiency through delivering the two most important pieces of information, pedestrian traffic light color and direction, required to cross the road in real-time. Furthermore, it is specifically aimed to support those facing financial burden as the solution takes the form of a free mobile application. Through the modification and utilization of key principles in MobileNetV3 such as depthwise separable convolutions and squeeze-excite layers, the deep neural network model achieves a classification accuracy of 96% and average angle error of 6.15 degrees, while running at a frame rate of 16.34 frames per second. Additionally, the model is trained as an image classifier, allowing for a faster and more accurate model. The network is able to outperform other methods such as object detection and non-deep learning algorithms in both accuracy and thoroughness. The information is delivered through both auditory signals and vibrations, and it has been tested on seven visually impaired users and has received above satisfactory responses.
    Multifidelity Reinforcement Learning with Control Variates. (arXiv:2206.05165v1 [cs.LG])
    In many computational science and engineering applications, the output of a system of interest corresponding to a given input can be queried at different levels of fidelity with different costs. Typically, low-fidelity data is cheap and abundant, while high-fidelity data is expensive and scarce. In this work we study the reinforcement learning (RL) problem in the presence of multiple environments with different levels of fidelity for a given control task. We focus on improving the RL agent's performance with multifidelity data. Specifically, a multifidelity estimator that exploits the cross-correlations between the low- and high-fidelity returns is proposed to reduce the variance in the estimation of the state-action value function. The proposed estimator, which is based on the method of control variates, is used to design a multifidelity Monte Carlo RL (MFMCRL) algorithm that improves the learning of the agent in the high-fidelity environment. The impacts of variance reduction on policy evaluation and policy improvement are theoretically analyzed by using probability bounds. Our theoretical analysis and numerical experiments demonstrate that for a finite budget of high-fidelity data samples, our proposed MFMCRL agent attains superior performance compared with that of a standard RL agent that uses only the high-fidelity environment data for learning the optimal policy.
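A minimal sketch of the control-variate idea mentioned in this abstract: a small set of paired high- and low-fidelity returns is corrected using the mean of a much larger set of cheap low-fidelity returns. This is the textbook control-variate estimator, not the MFMCRL algorithm itself; the function name and the toy data are assumptions.

```python
from statistics import fmean


def control_variate_mean(high, low_paired, low_all):
    """Estimate E[high] from few paired samples plus many low-fidelity samples.

    high, low_paired: returns observed for the same state-action pair at both fidelities.
    low_all: a larger pool of low-fidelity returns (includes or extends low_paired).
    """
    n = len(high)
    if n != len(low_paired) or n < 2:
        raise ValueError("need at least two paired samples")
    mean_h = fmean(high)
    mean_l = fmean(low_paired)
    cov = sum((h - mean_h) * (l - mean_l) for h, l in zip(high, low_paired)) / (n - 1)
    var = sum((l - mean_l) ** 2 for l in low_paired) / (n - 1)
    beta = cov / var if var > 0 else 0.0  # variance-minimising coefficient
    # Shift the high-fidelity mean by how far the paired low-fidelity mean
    # sits from the better-estimated low-fidelity mean.
    return mean_h + beta * (fmean(low_all) - mean_l)


# Toy data in which the high-fidelity return is exactly 2 * low + 1.
low_all = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
low_paired = low_all[:3]
high = [2 * l + 1 for l in low_paired]
estimate = control_variate_mean(high, low_paired, low_all)
```

When the two fidelities are strongly correlated, the correction term removes most of the sampling noise in the small high-fidelity set; with no correlation, `beta` is near zero and the estimator falls back to the plain high-fidelity mean.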
    GD-VAEs: Geometric Dynamic Variational Autoencoders for Learning Nonlinear Dynamics and Dimension Reductions. (arXiv:2206.05183v1 [cs.LG])
    We develop data-driven methods incorporating geometric and topological information to learn parsimonious representations of nonlinear dynamics from observations. We develop approaches for learning nonlinear state space models of the dynamics for general manifold latent spaces using training strategies related to Variational Autoencoders (VAEs). Our methods are referred to as Geometric Dynamic (GD) Variational Autoencoders (GD-VAEs). We learn encoders and decoders for the system states and evolution based on deep neural network architectures that include general Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), and Transpose CNNs (T-CNNs). Motivated by problems arising in parameterized PDEs and physics, we investigate the performance of our methods on tasks for learning low dimensional representations of the nonlinear Burgers equations, constrained mechanical systems, and spatial fields of reaction-diffusion systems. GD-VAEs provide methods for obtaining representations for use in learning tasks involving dynamics.
    Hierarchical Federated Learning with Privacy. (arXiv:2206.05209v1 [cs.LG])
    Federated learning (FL), where data remains at the federated clients, and where only gradient updates are shared with a central aggregator, was assumed to be private. Recent work demonstrates that adversaries with gradient-level access can mount successful inference and reconstruction attacks. In such settings, differentially private (DP) learning is known to provide resilience. However, approaches used in the status quo (i.e., central and local DP) introduce disparate utility vs. privacy trade-offs. In this work, we take the first step towards mitigating such trade-offs through hierarchical FL (HFL). We demonstrate that by the introduction of a new intermediary level where calibrated DP noise can be added, better privacy vs. utility trade-offs can be obtained; we term this hierarchical DP (HDP). Our experiments with 3 different datasets (commonly used as benchmarks for FL) suggest that HDP produces models as accurate as those obtained using central DP, where noise is added at a central aggregator. Such an approach also provides comparable benefit against inference adversaries as in the local DP case, where noise is added at the federated clients.
    A Resilient Distributed Boosting Algorithm. (arXiv:2206.04713v1 [cs.LG])
    Given a learning task where the data is distributed among several parties, communication is one of the fundamental resources which the parties would like to minimize. We present a distributed boosting algorithm which is resilient to a limited amount of noise. Our algorithm is similar to classical boosting algorithms, although it is equipped with a new component, inspired by Impagliazzo's hard-core lemma [Impagliazzo '95], adding a robustness quality to the algorithm. We also complement this result by showing that resilience to any asymptotically larger noise is not achievable by a communication-efficient algorithm.  ( 2 min )
    Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations. (arXiv:2206.04779v1 [cs.LG])
    Offline reinforcement learning has shown great promise in leveraging large pre-collected datasets for policy learning, allowing agents to forgo often-expensive online data collection. However, to date, offline reinforcement learning from visual observations has been relatively under-explored, and there is a lack of understanding of where the remaining challenges lie. In this paper, we seek to establish simple baselines for continuous control in the visual domain. We show that simple modifications to two state-of-the-art vision-based online reinforcement learning algorithms, DreamerV2 and DrQ-v2, suffice to outperform prior work and establish a competitive baseline. We rigorously evaluate these algorithms on both existing offline datasets and a new testbed for offline reinforcement learning from visual observations that better represents the data distributions present in real-world offline reinforcement learning problems, and open-source our code and data to facilitate progress in this important domain. Finally, we present and analyze several key desiderata unique to offline RL from visual observations, including visual distractions and visually identifiable changes in dynamics.  ( 2 min )
    Quantum Policy Iteration via Amplitude Estimation and Grover Search -- Towards Quantum Advantage for Reinforcement Learning. (arXiv:2206.04741v1 [quant-ph])
    We present a full implementation and simulation of a novel quantum reinforcement learning (RL) method and mathematically prove a quantum advantage. Our approach shows in detail how to combine amplitude estimation and Grover search into a policy evaluation and improvement scheme. We first develop quantum policy evaluation (QPE) which is quadratically more efficient compared to an analogous classical Monte Carlo estimation and is based on a quantum mechanical realization of a finite Markov decision process (MDP). Building on QPE, we derive a quantum policy iteration that repeatedly improves an initial policy using Grover search until the optimum is reached. Finally, we present an implementation of our algorithm for a two-armed bandit MDP which we then simulate. The results confirm that QPE provides a quantum advantage in RL problems.  ( 2 min )
    Fast Bayesian Inference with Batch Bayesian Quadrature via Kernel Recombination. (arXiv:2206.04734v1 [cs.LG])
    Calculation of Bayesian posteriors and model evidences typically requires numerical integration. Bayesian quadrature (BQ), a surrogate-model-based approach to numerical integration, is capable of superb sample efficiency, but its lack of parallelisation has hindered its practical applications. In this work, we propose a parallelised (batch) BQ method, employing techniques from kernel quadrature, that possesses a provably-exponential convergence rate. Additionally, just as with Nested Sampling, our method permits simultaneous inference of both posteriors and model evidence. Samples from our BQ surrogate model are re-selected to give a sparse set of samples, via a kernel recombination algorithm, requiring negligible additional time to increase the batch size. Empirically, we find that our approach significantly outperforms the sampling efficiency of both state-of-the-art BQ techniques and Nested Sampling in various real-world datasets, including lithium-ion battery analytics.  ( 2 min )
    Joint Entropy Search For Maximally-Informed Bayesian Optimization. (arXiv:2206.04771v1 [cs.LG])
    Information-theoretic Bayesian optimization techniques have become popular for optimizing expensive-to-evaluate black-box functions due to their non-myopic qualities. Entropy Search and Predictive Entropy Search both consider the entropy over the optimum in the input space, while the recent Max-value Entropy Search considers the entropy over the optimal value in the output space. We propose Joint Entropy Search (JES), a novel information-theoretic acquisition function that considers an entirely new quantity, namely the entropy over the joint optimal probability density over both input and output space. To incorporate this information, we consider the reduction in entropy from conditioning on fantasized optimal input/output pairs. The resulting approach primarily relies on standard GP machinery and removes complex approximations typically associated with information-theoretic methods. With minimal computational overhead, JES shows superior decision-making, and yields state-of-the-art performance for information-theoretic approaches across a wide suite of tasks. As a light-weight approach with superior results, JES provides a new go-to acquisition function for Bayesian optimization.  ( 2 min )
    Lightweight Conditional Model Extrapolation for Streaming Data under Class-Prior Shift. (arXiv:2206.05181v1 [cs.LG])
    We introduce LIMES, a new method for learning with non-stationary streaming data, inspired by the recent success of meta-learning. The main idea is not to attempt to learn a single classifier that would have to work well across all occurring data distributions, nor many separate classifiers, but to exploit a hybrid strategy: we learn a single set of model parameters from which a specific classifier for any specific data distribution is derived via classifier adaptation. Assuming a multi-class classification setting with class-prior shift, the adaptation step can be performed analytically with only the classifier's bias terms being affected. Another contribution of our work is an extrapolation step that predicts suitable adaptation parameters for future time steps based on the previous data. In combination, we obtain a lightweight procedure for learning from streaming data with varying class distribution that adds no trainable parameters and almost no memory or computational overhead compared to training a single model. Experiments on a set of exemplary tasks using Twitter data show that LIMES achieves higher accuracy than alternative approaches, especially with respect to the relevant real-world metric of lowest within-day accuracy.
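The abstract notes that, under class-prior shift, adaptation can be done analytically by changing only the classifier's bias terms. A minimal sketch of that kind of correction is the standard log-prior adjustment below; it is an illustration under that assumption, not the LIMES implementation, and the function names are hypothetical. The extrapolation step that predicts future priors is not shown.

```python
import math


def adapt_bias(bias, train_prior, target_prior):
    """Shift each class bias by log(target prior) - log(training prior)."""
    return [b + math.log(q) - math.log(p)
            for b, p, q in zip(bias, train_prior, target_prior)]


def predict(logits, bias):
    """Return the index of the class with the largest adjusted score."""
    scores = [z + b for z, b in zip(logits, bias)]
    return max(range(len(scores)), key=scores.__getitem__)


# A near-tie between two classes is resolved by the shifted prior.
bias = [0.0, 0.0]
logits = [0.1, 0.3]
before = predict(logits, bias)
new_bias = adapt_bias(bias, train_prior=[0.5, 0.5], target_prior=[0.9, 0.1])
after = predict(logits, new_bias)
```

Because only the bias vector changes, the adapted classifier shares all other parameters with the original model, which is consistent with the abstract's claim of adding no trainable parameters.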
    An Image Processing Pipeline for Camera Trap Time-Lapse Recordings. (arXiv:2206.05159v1 [cs.CV])
    A new open-source image processing pipeline for analyzing camera trap time-lapse recordings is described. This pipeline includes machine learning models to assist human-in-the-loop video segmentation and animal re-identification. We present some performance results and observations on the utility of this pipeline after using it in a year-long project studying the spatial ecology and social behavior of the gopher tortoise.
    Empirical Bayes approach to Truth Discovery problems. (arXiv:2206.04816v1 [cs.LG])
    When aggregating information from conflicting sources, one's goal is to find the truth. Most real-value truth discovery (TD) algorithms try to achieve this goal by estimating the competence of each source and then aggregating the conflicting information by weighting each source's answer proportionally to her competence. However, each of those algorithms requires more than a single source for such estimation and usually does not consider different estimation methods other than a weighted mean. Therefore, in this work we formulate, prove, and empirically test the conditions for an Empirical Bayes Estimator (EBE) to dominate the weighted mean aggregation. Our main result demonstrates that EBE, under mild conditions, can be used as a second step of any TD algorithm in order to reduce the expected error.  ( 2 min )
    Mixed integer linear optimization formulations for learning optimal binary classification trees. (arXiv:2206.04857v1 [cs.LG])
    Decision trees are powerful tools for classification and regression that attract many researchers working in the burgeoning area of machine learning. One advantage of decision trees over other methods is their interpretability, which is often preferred over other higher accuracy methods that are relatively uninterpretable. A binary classification tree has two types of vertices: (i) branching vertices which have exactly two children and where datapoints are assessed on a set of discrete features; and (ii) leaf vertices at which datapoints are given a discrete prediction. An optimal binary classification tree can be obtained by solving a biobjective optimization problem that seeks to (i) maximize the number of correctly classified datapoints and (ii) minimize the number of branching vertices. In this paper, we propose four mixed integer linear optimization (MILO) formulations for designing optimal binary classification trees: two flow-based formulations and two cut-based formulations. We provide theoretical comparisons between our proposed formulations and the strongest flow-based MILO formulation of Aghaei et al. (2021). We conduct experiments on 13 publicly available datasets to show the models' ability to scale and the strength of a biobjective approach using Pareto frontiers. Our code and data are available on GitHub.  ( 2 min )
    Stochastic Zeroth order Descent with Structured Directions. (arXiv:2206.05124v1 [math.OC])
    We introduce and analyze Structured Stochastic Zeroth order Descent (S-SZD), a finite difference approach which approximates a stochastic gradient on a set of $l\leq d$ orthogonal directions, where $d$ is the dimension of the ambient space. These directions are randomly chosen, and may change at each step. For smooth convex functions we prove almost sure convergence of the iterates and a convergence rate on the function values of the form $O(d/l k^{-c})$ for every $c<1/2$, which is arbitrarily close to the one of Stochastic Gradient Descent (SGD) in terms of number of iterations. Our bound also shows the benefits of using $l$ multiple directions instead of one. For non-convex functions satisfying the Polyak-Łojasiewicz condition, we establish the first convergence rates for stochastic zeroth order algorithms under such an assumption. We corroborate our theoretical findings in numerical simulations where assumptions are satisfied and on the real-world problem of hyper-parameter optimization, observing that S-SZD has very good practical performances.
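To make the finite-difference construction concrete, the sketch below builds a gradient surrogate from forward differences along $l$ random orthonormal directions. The Gram-Schmidt sampling, the forward-difference form, and the $d/l$ rescaling are assumptions chosen for illustration; the abstract does not specify these details, so this should not be read as the paper's exact S-SZD estimator.

```python
import math
import random


def orthonormal_directions(d, l, rng):
    """Draw l random orthonormal vectors in R^d via Gram-Schmidt on Gaussian samples."""
    dirs = []
    while len(dirs) < l:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in dirs:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:  # skip the (measure-zero) degenerate draw
            dirs.append([a / norm for a in v])
    return dirs


def structured_fd_gradient(f, x, l, h, rng):
    """Forward-difference gradient surrogate along l orthonormal directions."""
    d = len(x)
    fx = f(x)
    g = [0.0] * d
    for p in orthonormal_directions(d, l, rng):
        shifted = [xi + h * pi for xi, pi in zip(x, p)]
        slope = (f(shifted) - fx) / h  # directional derivative estimate
        g = [gi + slope * pi for gi, pi in zip(g, p)]
    return [(d / l) * gi for gi in g]  # assumed rescaling when l < d


def sphere(x):
    return sum(xi * xi for xi in x)


rng = random.Random(0)
grad = structured_fd_gradient(sphere, [1.0, -2.0, 0.5], l=3, h=1e-6, rng=rng)
```

With $l = d$ the directions form a full orthonormal basis, so the surrogate recovers the true gradient up to the finite-difference error; with $l < d$ each step uses only $l + 1$ function evaluations.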
    Strong Memory Lower Bounds for Learning Natural Models. (arXiv:2206.04743v1 [cs.LG])
    We give lower bounds on the amount of memory required by one-pass streaming algorithms for solving several natural learning problems. In a setting where examples lie in $\{0,1\}^d$ and the optimal classifier can be encoded using $\kappa$ bits, we show that algorithms which learn using a near-minimal number of examples, $\tilde O(\kappa)$, must use $\tilde \Omega( d\kappa)$ bits of space. Our space bounds match the dimension of the ambient space of the problem's natural parametrization, even when it is quadratic in the size of examples and the final classifier. For instance, in the setting of $d$-sparse linear classifiers over degree-2 polynomial features, for which $\kappa=\Theta(d\log d)$, our space lower bound is $\tilde\Omega(d^2)$. Our bounds degrade gracefully with the stream length $N$, generally having the form $\tilde\Omega\left(d\kappa \cdot \frac{\kappa}{N}\right)$. Bounds of the form $\Omega(d\kappa)$ were known for learning parity and other problems defined over finite fields. Bounds that apply in a narrow range of sample sizes are also known for linear regression. Ours are the first such bounds for problems of the type commonly seen in recent learning applications that apply for a large range of input sizes.  ( 2 min )
    Training Neural Networks using SAT solvers. (arXiv:2206.04833v1 [cs.LG])
    We propose an algorithm to explore the global optimization method, using SAT solvers, for training a neural net. Deep Neural Networks have achieved great feats in tasks like image recognition, speech recognition, etc. Much of their success can be attributed to the gradient-based optimisation methods, which scale well to huge datasets while still giving solutions better than any other existing methods. However, there exist learning problems like the parity function and the Fast Fourier Transform, where a neural network using a gradient-based optimisation algorithm cannot capture the underlying structure of the learning task properly. Thus, exploring global optimisation methods is of utmost interest as the gradient-based methods get stuck in local optima. In the experiments, we demonstrate the effectiveness of our algorithm against the ADAM optimiser in certain tasks like parity learning. However, in the case of image classification on the MNIST Dataset, the performance of our algorithm was less than satisfactory. We further discuss the role of the size of the training dataset and the hyper-parameter settings in keeping things scalable for a SAT solver.  ( 2 min )
    A new distance measurement and its application in K-Means Algorithm. (arXiv:2206.05215v1 [cs.LG])
    The K-Means clustering algorithm is one of the most commonly used clustering algorithms because of its simplicity and efficiency. The K-Means clustering algorithm based on Euclidean distance only pays attention to the linear distance between samples, but ignores the overall distribution structure of the dataset (i.e. the fluid structure of the dataset). Since it is difficult to describe the internal structure of two data points by Euclidean distance in high-dimensional data space, we propose a new distance measurement, namely, view-distance, and apply it to the K-Means algorithm. On the classical manifold learning datasets, S-curve and Swiss roll, this new distance not only clusters the data according to the structure of the data itself, but also yields neat dividing lines as boundaries between categories. Moreover, we also tested the classification accuracy and clustering effect of the K-Means algorithm based on view-distance on some real-world datasets. The experimental results show that, on most datasets, the K-Means algorithm based on view-distance has a certain degree of improvement in classification accuracy and clustering effect.
    Mildly Conservative Q-Learning for Offline Reinforcement Learning. (arXiv:2206.04745v1 [cs.LG])
    Offline reinforcement learning (RL) defines the task of learning from a static logged dataset without continually interacting with the environment. The distribution shift between the learned policy and the behavior policy makes it necessary for the value function to stay conservative such that out-of-distribution (OOD) actions will not be severely overestimated. However, existing approaches, penalizing the unseen actions or regularizing with the behavior policy, are too pessimistic, which suppresses the generalization of the value function and hinders the performance improvement. This paper explores mild but enough conservatism for offline learning while not harming generalization. We propose Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q values. We theoretically show that MCQ induces a policy that behaves at least as well as the behavior policy and no erroneous overestimation will occur for OOD actions. Experimental results on the D4RL benchmarks demonstrate that MCQ achieves remarkable performance compared with prior work. Furthermore, MCQ shows superior generalization ability when transferring from offline to online, and significantly outperforms baselines.  ( 2 min )
    $\mathsf{G^2Retro}$: Two-Step Graph Generative Models for Retrosynthesis Prediction. (arXiv:2206.04882v1 [cs.LG])
    Retrosynthesis is a procedure where a molecule is transformed into potential reactants and thus the synthesis routes are identified. We propose a novel generative framework, denoted as $\mathsf{G^2Retro}$, for one-step retrosynthesis prediction. $\mathsf{G^2Retro}$ imitates the reversed logic of synthetic reactions, that is, first predicting the reaction centers to convert the target molecule into fragments named synthons, and then transforming synthons into reactants, following previous semi-template-based methods. In predicting reaction centers, $\mathsf{G^2Retro}$ defines a comprehensive set of reaction center types, and enables diversity in the predicted reactions by considering multiple reaction center candidates. In completing synthons, $\mathsf{G^2Retro}$ deploys a sequence of substructure attachments to transform synthons into reactants, which utilize a holistic view of the most updated structures of the synthons to be completed, as well as all the involved synthon and product structures. Here we show that $\mathsf{G^2Retro}$ is able to better prioritize the most possible reactants in the benchmark dataset than the state-of-the-art methods, and discover novel and highly likely reactions that are not included in the benchmark dataset.  ( 2 min )
    Bayesian Estimation of Differential Privacy. (arXiv:2206.05199v1 [cs.LG])
    Algorithms such as Differentially Private SGD enable training machine learning models with formal privacy guarantees. However, there is a discrepancy between the protection that such algorithms guarantee in theory and the protection they afford in practice. An emerging strand of work empirically estimates the protection afforded by differentially private training as a confidence interval for the privacy budget $\varepsilon$ spent on training a model. Existing approaches derive confidence intervals for $\varepsilon$ from confidence intervals for the false positive and false negative rates of membership inference attacks. Unfortunately, obtaining narrow high-confidence intervals for $\varepsilon$ using this method requires an impractically large sample size and training as many models as samples. We propose a novel Bayesian method that greatly reduces sample size, and adapt and validate a heuristic to draw more than one sample per trained model. Our Bayesian method exploits the hypothesis testing interpretation of differential privacy to obtain a posterior for $\varepsilon$ (not just a confidence interval) from the joint posterior of the false positive and false negative rates of membership inference attacks. For the same sample size and confidence, we derive confidence intervals for $\varepsilon$ around 40% narrower than prior work. The heuristic, which we adapt from label-only DP, can be used to further reduce the number of trained models needed to get enough samples by up to 2 orders of magnitude.
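The hypothesis-testing view referred to in this abstract bounds $\varepsilon$ from below using an attack's false positive rate (FPR) and false negative rate (FNR): under $(\varepsilon, \delta)$-DP, $\mathrm{FPR} + e^{\varepsilon}\,\mathrm{FNR} \ge 1 - \delta$ and $\mathrm{FNR} + e^{\varepsilon}\,\mathrm{FPR} \ge 1 - \delta$. The sketch below pushes posterior samples of the two rates through that bound. The independent Beta posteriors with uniform priors and all function names are simplifying assumptions for illustration; the paper works with a joint posterior and its exact procedure is not given in the abstract.

```python
import math
import random


def epsilon_lower_bound(fpr, fnr, delta=0.0):
    """Smallest epsilon consistent with the given attack error rates.

    Ratios with a zero denominator are skipped in this sketch.
    """
    candidates = [0.0]
    if fnr > 0 and 1 - delta - fpr > 0:
        candidates.append(math.log((1 - delta - fpr) / fnr))
    if fpr > 0 and 1 - delta - fnr > 0:
        candidates.append(math.log((1 - delta - fnr) / fpr))
    return max(candidates)


def epsilon_posterior_samples(fp, tn, fn, tp, delta=0.0, n=5000, seed=0):
    """Sorted posterior samples of the epsilon lower bound from attack counts."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        fpr = rng.betavariate(1 + fp, 1 + tn)  # Beta posterior under a uniform prior
        fnr = rng.betavariate(1 + fn, 1 + tp)
        samples.append(epsilon_lower_bound(fpr, fnr, delta))
    return sorted(samples)


# Hypothetical attack outcome counts on 200 non-members and 200 members.
samples = epsilon_posterior_samples(fp=20, tn=180, fn=20, tp=180, n=1000)
credible_interval = (samples[int(0.025 * len(samples))],
                     samples[int(0.975 * len(samples)) - 1])
```

Reading quantiles off the sorted samples gives a credible interval for the bound directly, without first forming separate confidence intervals for FPR and FNR.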
    Learning Attention-based Representations from Multiple Patterns for Relation Prediction in Knowledge Graphs. (arXiv:2206.04801v1 [cs.AI])
    Knowledge bases, and their representations in the form of knowledge graphs (KGs), are naturally incomplete. Since scientific and industrial applications have extensively adopted them, there is a high demand for solutions that complete their information. Several recent works tackle this challenge by learning embeddings for entities and relations, then employing them to predict new relations among the entities. Despite their aggrandizement, most of those methods focus only on the local neighbors of a relation to learn the embeddings. As a result, they may fail to capture the KGs' context information by neglecting long-term dependencies and the propagation of entities' semantics. In this manuscript, we propose ÆMP (Attention-based Embeddings from Multiple Patterns), a novel model for learning contextualized representations by: (i) acquiring entities' context information through an attention-enhanced message-passing scheme, which captures the entities' local semantics while focusing on different aspects of their neighborhood; and (ii) capturing the semantic context, by leveraging the paths and their relationships between entities. Our empirical findings draw insights into how attention mechanisms can improve entities' context representation and how combining entities and semantic path contexts improves the general representation of entities and the relation predictions. Experimental results on several large and small knowledge graph benchmarks show that ÆMP either outperforms or competes with state-of-the-art relation prediction methods.  ( 2 min )
    Neural Laplace: Learning diverse classes of differential equations in the Laplace domain. (arXiv:2206.04843v1 [cs.LG])
    Neural Ordinary Differential Equations model dynamical systems with ODEs learned by neural networks. However, ODEs are fundamentally inadequate to model systems with long-range dependencies or discontinuities, which are common in engineering and biological systems. Broader classes of differential equations (DE) have been proposed as remedies, including delay differential equations and integro-differential equations. Furthermore, Neural ODE suffers from numerical instability when modelling stiff ODEs and ODEs with piecewise forcing functions. In this work, we propose Neural Laplace, a unified framework for learning diverse classes of DEs including all the aforementioned ones. Instead of modelling the dynamics in the time domain, we model it in the Laplace domain, where the history-dependencies and discontinuities in time can be represented as summations of complex exponentials. To make learning more efficient, we use the geometrical stereographic map of a Riemann sphere to induce more smoothness in the Laplace domain. In the experiments, Neural Laplace shows superior performance in modelling and extrapolating the trajectories of diverse classes of DEs, including the ones with complex history dependency and abrupt changes.  ( 2 min )
    Detecting Anomalous Cryptocurrency Transactions: an AML/CFT Application of Machine Learning-based Forensics. (arXiv:2206.04803v1 [cs.CR])
    The rise of blockchain and distributed ledger technologies (DLTs) in the financial sector has generated a socio-economic shift that triggered legal concerns and regulatory initiatives. While the anonymity of DLTs may safeguard the right to privacy, data protection and other civil liberties, lack of identification hinders accountability, investigation and enforcement. The resulting challenges extend to the rules to combat money laundering and the financing of terrorism and proliferation (AML/CFT). As law enforcement agencies and analytics companies have begun to successfully apply forensics to track currency across blockchain ecosystems, in this paper we focus on the increasing relevance of these techniques. In particular, we offer insights into the application to the Internet of Money (IoM) of machine learning, network and transaction graph analysis. After providing some background on the notion of anonymity in the IoM and on the interplay between AML/CFT and blockchain forensics, we focus on anomaly detection approaches leading to our experiments. Namely, we analyzed a real-world dataset of Bitcoin transactions represented as a directed graph network through various machine learning techniques. Our claim is that the AML/CFT domain could benefit from novel graph analysis methods in machine learning. Indeed, our findings show that the Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT) neural network types represent a promising solution for AML/CFT compliance.  ( 2 min )
    AI-MIA: COVID-19 Detection & Severity Analysis through Medical Imaging. (arXiv:2206.04732v1 [eess.IV])
    This paper presents the baseline approach for the organized 2nd Covid-19 Competition, occurring in the framework of the AIMIA Workshop in the European Conference on Computer Vision (ECCV 2022). It presents the COV19-CT-DB database, which is annotated for COVID-19 detection and consists of about 7,700 3-D CT scans. The part of the database consisting of Covid-19 cases is further annotated in terms of four Covid-19 severity conditions. We have split the database, and the latter part of it, into training, validation and test datasets. The former two datasets are used for training and validation of machine learning models, while the latter will be used for evaluation of the developed models. The baseline is a deep learning approach based on a CNN-RNN network, and we report its performance on the COV19-CT-DB database.  ( 2 min )
    Distributionally Robust End-to-End Portfolio Construction. (arXiv:2206.05134v1 [q-fin.CP])
    We propose an end-to-end distributionally robust system for portfolio construction that integrates the asset return prediction model with a distributionally robust portfolio optimization model. We also show how to learn the risk-tolerance parameter and the degree of robustness directly from data. End-to-end systems have an advantage in that information can be communicated between the prediction and decision layers during training, allowing the parameters to be trained for the final task rather than solely for predictive performance. However, existing end-to-end systems are not able to quantify and correct for the impact of model risk on the decision layer. Our proposed distributionally robust end-to-end portfolio selection system explicitly accounts for the impact of model risk. The decision layer chooses portfolios by solving a minimax problem where the distribution of the asset returns is assumed to belong to an ambiguity set centered around a nominal distribution. Using convex duality, we recast the minimax problem in a form that allows for efficient training of the end-to-end system.
    MEAT: Maneuver Extraction from Agent Trajectories. (arXiv:2206.05158v1 [cs.CV])
    Advances in learning-based trajectory prediction are enabled by large-scale datasets. However, in-depth analysis of such datasets is limited. Moreover, the evaluation of prediction models is limited to metrics averaged over all samples in the dataset. We propose an automated methodology for extracting maneuvers (e.g., left turn, lane change) from agent trajectories in such datasets. The methodology considers information about the agent dynamics and information about the lane segments the agent traveled along. Although it is possible to use the resulting maneuvers for training classification networks, we use them here, as an example, for extensive trajectory dataset analysis and maneuver-specific evaluation of multiple state-of-the-art trajectory prediction models. Additionally, an analysis of the datasets and an evaluation of the prediction models based on the agent dynamics is provided.
    Deep Multi-Agent Reinforcement Learning with Hybrid Action Spaces based on Maximum Entropy. (arXiv:2206.05108v1 [cs.LG])
    Multi-agent deep reinforcement learning has been applied to address a variety of complex problems with either discrete or continuous action spaces and achieved great success. However, most real-world environments cannot be described by only discrete action spaces or only continuous action spaces. Moreover, few works have applied deep reinforcement learning (DRL) to multi-agent problems with hybrid action spaces. Therefore, we propose a novel algorithm, Deep Multi-Agent Hybrid Soft Actor-Critic (MAHSAC), to fill this gap. This algorithm follows the centralized training but decentralized execution (CTDE) paradigm, and extends the Soft Actor-Critic algorithm (SAC) to handle hybrid action space problems in multi-agent environments based on maximum entropy. Our experiments run on a simple multi-agent particle world with a continuous observation and discrete action space, along with some basic simulated physics. The experimental results show that MAHSAC has good performance in training speed, stability, and anti-interference ability. At the same time, it outperforms the existing independent deep hybrid learning method in cooperative scenarios and competitive scenarios.
    Dynamic mean field programming. (arXiv:2206.05200v1 [stat.ML])
    A dynamic mean field theory is developed for model based Bayesian reinforcement learning in the large state space limit. In an analogy with the statistical physics of disordered systems, the transition probabilities are interpreted as couplings, and value functions as deterministic spins, and thus the sampled transition probabilities are considered to be quenched random variables. The results reveal that, under standard assumptions, the posterior over Q-values is asymptotically independent and Gaussian across state-action pairs, for infinite horizon problems. The finite horizon case exhibits the same behaviour for all state-action pairs at each time but has an additional correlation across time, for each state-action pair. The results also hold for policy evaluation. The Gaussian statistics can be computed from a set of coupled mean field equations derived from the Bellman equation, which we call dynamic mean field programming (DMFP). For Q-value iteration, approximate equations are obtained by appealing to extreme value theory, and closed form expressions are found in the independent and identically distributed case. The Lyapunov stability of these closed form equations is studied.
    How Much is Enough? A Study on Diffusion Times in Score-based Generative Models. (arXiv:2206.05173v1 [stat.ML])
    Score-based diffusion models are a class of generative models whose dynamics is described by stochastic differential equations that map noise into data. While recent works have started to lay down a theoretical foundation for these models, an analytical understanding of the role of the diffusion time T is still lacking. Current best practice advocates for a large T to ensure that the forward dynamics brings the diffusion sufficiently close to a known and simple noise distribution; however, a smaller value of T should be preferred for a better approximation of the score-matching objective and higher computational efficiency. Starting from a variational interpretation of diffusion models, in this work we quantify this trade-off, and suggest a new method to improve quality and efficiency of both training and sampling, by adopting smaller diffusion times. Indeed, we show how an auxiliary model can be used to bridge the gap between the ideal and the simulated forward dynamics, followed by a standard reverse diffusion process. Empirical results support our analysis; for image data, our method is competitive w.r.t. the state-of-the-art, according to standard sample quality metrics and log-likelihood.
    In Defense of Core-set: A Density-aware Core-set Selection for Active Learning. (arXiv:2206.04838v1 [cs.LG])
    Active learning enables the efficient construction of a labeled dataset by labeling informative samples from an unlabeled dataset. In a real-world active learning scenario, considering the diversity of the selected samples is crucial because many redundant or highly similar samples exist. The core-set approach is a promising diversity-based method that selects diverse samples based on the distance between samples. However, the approach performs poorly compared to the uncertainty-based approaches that select the most difficult samples, where neural models reveal low confidence. In this work, we analyze the feature space through the lens of density and, interestingly, observe that locally sparse regions tend to have more informative samples than dense regions. Motivated by our analysis, we empower the core-set approach with density-awareness and propose a density-aware core-set (DACS). The strategy is to estimate the density of the unlabeled samples and select diverse samples mainly from sparse regions. To reduce the computational bottlenecks in estimating the density, we also introduce a new density approximation based on locality-sensitive hashing. Experimental results clearly demonstrate the efficacy of DACS in both classification and regression tasks and specifically show that DACS can produce state-of-the-art performance in a practical scenario. Since DACS is weakly dependent on neural architectures, we present a simple yet effective combination method to show that the existing methods can be beneficially combined with DACS.  ( 2 min )
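For readers unfamiliar with distance-based core-set selection, the following is a minimal sketch of the plain greedy k-center (farthest-first) rule that such approaches commonly build on. It is the baseline idea only, not DACS: it includes no density estimate and no locality-sensitive hashing, and the function name `k_center_greedy` and the toy features are hypothetical.

```python
import numpy as np

def k_center_greedy(features, labeled_idx, budget):
    """Pick `budget` points, each time taking the point farthest from all chosen centers."""
    centers = features[labeled_idx]
    # Distance from every point to its nearest already-labeled point.
    dists = np.min(
        np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2), axis=1
    )
    selected = []
    for _ in range(budget):
        idx = int(np.argmax(dists))  # farthest point from the current centers
        selected.append(idx)
        # Update nearest-center distances with the newly chosen center.
        dists = np.minimum(dists, np.linalg.norm(features - features[idx], axis=1))
    return selected

# Toy 1-D feature space (illustrative data only).
toy_features = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [20.0]])
picked = k_center_greedy(toy_features, labeled_idx=[0], budget=2)
```

Each pick maximizes the distance to the nearest chosen center, so the selection spreads across the feature space regardless of how many near-duplicate points a region contains.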
    Adversarial Counterfactual Environment Model Learning. (arXiv:2206.04890v1 [cs.LG])
    A good model for action-effect prediction, named environment model, is important to achieve sample-efficient decision-making policy learning in many domains like robot control, recommender systems, and patients' treatment selection. We can take unlimited trials with such a model to identify the appropriate actions so that the costs of queries in the real world can be saved. It requires the model to handle unseen data correctly, also called counterfactual data. However, standard data fitting techniques do not automatically achieve such generalization ability and commonly result in unreliable models. In this work, we introduce counterfactual-query risk minimization (CQRM) in model learning for generalizing to a counterfactual dataset queried by a specific target policy. Since the target policies can be various and unknown in policy learning, we propose an adversarial CQRM objective in which the model learns on counterfactual data queried by adversarial policies, and finally derive a tractable solution GALILEO. We also discover that adversarial CQRM is closely related to the adversarial model learning, explaining the effectiveness of the latter. We apply GALILEO in synthetic tasks and a real-world application. The results show that GALILEO makes accurate predictions on counterfactual data and thus significantly improves policies in real-world testing.  ( 2 min )
    Federated Momentum Contrastive Clustering. (arXiv:2206.05093v1 [cs.LG])
    We present federated momentum contrastive clustering (FedMCC), a learning framework that can not only extract discriminative representations over distributed local data but also perform data clustering. In FedMCC, a transformed data pair passes through both the online and target networks, resulting in four representations over which the losses are determined. The resulting high-quality representations generated by FedMCC can outperform several existing self-supervised learning methods for linear evaluation and semi-supervised learning tasks. FedMCC can easily be adapted to ordinary centralized clustering through what we call momentum contrastive clustering (MCC). We show that MCC achieves state-of-the-art clustering accuracy results in certain datasets such as STL-10 and ImageNet-10. We also present a method to reduce the memory footprint of our clustering schemes.
    Tensor Train for Global Optimization Problems in Robotics. (arXiv:2206.05077v1 [cs.RO])
    The convergence of many numerical optimization techniques is highly sensitive to the initial guess provided to the solver. We propose an approach based on tensor methods to initialize the existing optimization solvers close to global optima. The approach uses only the definition of the cost function and does not need access to any database of good solutions. We first transform the cost function, which is a function of task parameters and optimization variables, into a probability density function. Unlike existing approaches that set the task parameters as constant, we consider them as another set of random variables and approximate the joint probability distribution of the task parameters and the optimization variables using a surrogate probability model. For a given task, we then generate samples from the conditional distribution with respect to the given task parameter and use them as initialization for the optimization solver. As conditioning and sampling from an arbitrary density function are challenging, we use Tensor Train decomposition to obtain a surrogate probability model from which we can efficiently obtain the conditional model and the samples. The method can produce multiple solutions coming from different modes (when they exist) for a given task. We first evaluate the approach by applying it to various challenging benchmark functions for numerical optimization that are difficult to solve using gradient-based optimization solvers with a naive initialization, showing that the proposed method can produce samples close to the global optima and coming from multiple modes. We then demonstrate the generality of the framework and its relevance to robotics by applying the proposed method to inverse kinematics and motion planning problems with a 7-DoF manipulator.
    Muffliato: Peer-to-Peer Privacy Amplification for Decentralized Optimization and Averaging. (arXiv:2206.05091v1 [cs.CR])
    Decentralized optimization is increasingly popular in machine learning for its scalability and efficiency. Intuitively, it should also provide better privacy guarantees, as nodes only observe the messages sent by their neighbors in the network graph. But formalizing and quantifying this gain is challenging: existing results are typically limited to Local Differential Privacy (LDP) guarantees that overlook the advantages of decentralization. In this work, we introduce pairwise network differential privacy, a relaxation of LDP that captures the fact that the privacy leakage from a node $u$ to a node $v$ may depend on their relative position in the graph. We then analyze the combination of local noise injection with (simple or randomized) gossip averaging protocols on fixed and random communication graphs. We also derive a differentially private decentralized optimization algorithm that alternates between local gradient descent steps and gossip averaging. Our results show that our algorithms amplify privacy guarantees as a function of the distance between nodes in the graph, matching the privacy-utility trade-off of the trusted curator, up to factors that explicitly depend on the graph topology. Finally, we illustrate our privacy gains with experiments on synthetic and real-world datasets.
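The combination of local noise injection and gossip averaging described above can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's algorithm or its privacy analysis: the ring topology, the uniform 1/3 mixing weights, the noise scale `sigma`, and all function names are chosen here for the example.

```python
import numpy as np

def ring_gossip_matrix(n):
    """Doubly stochastic mixing matrix for a ring of n >= 3 nodes (assumed topology)."""
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] = 1.0 / 3.0  # average with self and both neighbors
    return W

def noisy_gossip_average(values, sigma, num_rounds, rng):
    """Each node perturbs its private value once, then nodes repeatedly average with neighbors."""
    noisy = values + rng.normal(scale=sigma, size=values.shape)  # local noise injection
    W = ring_gossip_matrix(len(values))
    x = noisy.copy()
    for _ in range(num_rounds):
        x = W @ x  # one synchronous gossip round
    return noisy, x

rng = np.random.default_rng(0)
private_values = np.arange(8, dtype=float)
noisy, estimates = noisy_gossip_average(private_values, sigma=0.1, num_rounds=200, rng=rng)
```

Because the mixing matrix is doubly stochastic, every round preserves the network-wide mean, so all nodes converge to the average of the noisy values while only ever exchanging messages with their two ring neighbors.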
    Scalable Deep Gaussian Markov Random Fields for General Graphs. (arXiv:2206.05032v1 [stat.ML])
    Machine learning methods on graphs have proven useful in many applications due to their ability to handle generally structured data. The framework of Gaussian Markov Random Fields (GMRFs) provides a principled way to define Gaussian models on graphs by utilizing their sparsity structure. We propose a flexible GMRF model for general graphs built on the multi-layer structure of Deep GMRFs, originally proposed for lattice graphs only. By designing a new type of layer we enable the model to scale to large graphs. The layer is constructed to allow for efficient training using variational inference and existing software frameworks for Graph Neural Networks. For a Gaussian likelihood, close to exact Bayesian inference is available for the latent field. This allows for making predictions with accompanying uncertainty estimates. The usefulness of the proposed model is verified by experiments on a number of synthetic and real world datasets, where it compares favorably to both other Bayesian methods and deep learning methods.
    We Cannot Guarantee Safety: The Undecidability of Graph Neural Network Verification. (arXiv:2206.05070v1 [cs.LG])
    Graph Neural Networks (GNN) are commonly used for two tasks: (whole) graph classification and node classification. We formally introduce generically formulated decision problems for both tasks, corresponding to the following pattern: given a GNN, some specification of valid inputs, and some specification of valid outputs, decide whether there is a valid input satisfying the output specification. We then prove that graph classifier verification is undecidable in general, implying that there cannot be an algorithm surely guaranteeing the absence of misclassification of any kind. Additionally, we show that verification in the node classification case becomes decidable as soon as we restrict the degree of the considered graphs. Furthermore, we discuss possible changes to these results depending on the considered GNN model and specifications.
    Zero-Shot Audio Classification using Image Embeddings. (arXiv:2206.04984v1 [cs.SD])
    Supervised learning methods can solve the given problem in the presence of a large set of labeled data. However, the acquisition of a dataset covering all the target classes typically requires manual labeling, which is expensive and time-consuming. Zero-shot learning models are capable of classifying unseen concepts by utilizing their semantic information. The present study introduces image embeddings as side information for zero-shot audio classification by using a nonlinear acoustic-semantic projection. We extract the semantic image representations from the Open Images dataset and evaluate the performance of the models on an audio subset of AudioSet using semantic information in different domains: image, audio, and textual. We demonstrate that the image embeddings can be used as semantic information to perform zero-shot audio classification. The experimental results show that the image and textual embeddings display similar performance both individually and together. We additionally calculate the semantic acoustic embeddings from the test samples to provide an upper limit to the performance. The results show that the classification performance is highly sensitive to the semantic relation between test and training classes, and that textual and image embeddings can approach the performance of the semantic acoustic embeddings when the seen and unseen classes are semantically similar.
    Improved Approximation for Fair Correlation Clustering. (arXiv:2206.05050v1 [cs.LG])
    Correlation clustering is a ubiquitous paradigm in unsupervised machine learning where addressing unfairness is a major challenge. Motivated by this, we study Fair Correlation Clustering where the data points may belong to different protected groups and the goal is to ensure fair representation of all groups across clusters. Our paper significantly generalizes and improves on the quality guarantees of previous work of Ahmadi et al. and Ahmadian et al. as follows: (i) we allow the user to specify an arbitrary upper bound on the representation of each group in a cluster; (ii) our algorithm allows individuals to have multiple protected features and ensures fairness simultaneously across them all; (iii) we prove guarantees for clustering quality and fairness in this general setting, which also improves on the results for the special cases studied in previous work. Our experiments on real-world data demonstrate that our clustering quality compared to the optimal solution is much better than what our theoretical result suggests.
    PAVI: Plate-Amortized Variational Inference. (arXiv:2206.05111v1 [cs.AI])
    Given some observed data and a probabilistic generative model, Bayesian inference aims at obtaining the distribution of a model's latent parameters that could have yielded the data. This task is challenging for large population studies where thousands of measurements are performed over a cohort of hundreds of subjects, resulting in a massive latent parameter space. This large cardinality renders off-the-shelf Variational Inference (VI) computationally impractical. In this work, we design structured VI families that can efficiently tackle large population studies. To this end, our main idea is to share the parameterization and learning across the different i.i.d. variables in a generative model, symbolized by the model's plates. We name this concept plate amortization, and illustrate the powerful synergies it entails, resulting in large-scale hierarchical variational distributions that are expressive, parsimoniously parameterized, and orders of magnitude faster to train. We illustrate the practical utility of PAVI through a challenging Neuroimaging example featuring a million latent parameters, demonstrating a significant step towards scalable and expressive Variational Inference.
    Weighted Ensembles for Active Learning with Adaptivity. (arXiv:2206.05009v1 [cs.LG])
    Labeled data can be expensive to acquire in several application domains, including medical imaging, robotics, and computer vision. To efficiently train machine learning models under such high labeling costs, active learning (AL) judiciously selects the most informative data instances to label on-the-fly. This active sampling process can benefit from a statistical function model, that is typically captured by a Gaussian process (GP). While most GP-based AL approaches rely on a single kernel function, the present contribution advocates an ensemble of GP models with weights adapted to the labeled data collected incrementally. Building on this novel EGP model, a suite of acquisition functions emerges based on the uncertainty and disagreement rules. An adaptively weighted ensemble of EGP-based acquisition functions is also introduced to further robustify performance. Extensive tests on synthetic and real datasets showcase the merits of the proposed EGP-based approaches with respect to the single GP-based AL alternatives.
    Saccade Mechanisms for Image Classification, Object Detection and Tracking. (arXiv:2206.05102v1 [cs.CV])
    We examine how the saccade mechanism from biological vision can be used to make deep neural networks more efficient for classification and object detection problems. Our proposed approach is based on the ideas of attention-driven visual processing and saccades, miniature eye movements influenced by attention. We conduct experiments by analyzing: i) the robustness of different deep neural network (DNN) feature extractors to partially-sensed images for image classification and object detection, and ii) the utility of saccades in masking image patches for image classification and object tracking. Experiments with convolutional nets (ResNet-18) and transformer-based models (ViT, DETR, TransTrack) are conducted on several datasets (CIFAR-10, DAVSOD, MSCOCO, and MOT17). Our experiments show intelligent data reduction via learning to mimic human saccades when used in conjunction with state-of-the-art DNNs for classification, detection, and tracking tasks. We observed minimal drop in performance for the classification and detection tasks while only using about 30% of the original sensor data. We discuss how the saccade mechanism can inform hardware design via "in-pixel" processing.
    Temporal Inductive Logic Reasoning. (arXiv:2206.05051v1 [cs.LG])
    Inductive logic reasoning is one of the fundamental tasks on graphs, which seeks to generalize patterns from the data. This task has been studied extensively for traditional graph datasets such as knowledge graphs (KGs), with representative techniques such as inductive logic programming (ILP). Existing ILP methods typically assume learning from KGs with static facts and binary relations. Beyond KGs, graph structures are widely present in other applications such as video instructions, scene graphs and program executions. While inductive logic reasoning is also beneficial for these applications, applying ILP to the corresponding graphs is nontrivial: these graphs are more complex than KGs, as they usually involve timestamps and n-ary relations, making them effectively a type of hypergraph with temporal events. In this work, we study two such applications and propose to represent them as hypergraphs with time intervals. To reason on this graph, we propose the multi-start random B-walk that traverses this hypergraph. Combining it with a path-consistency algorithm, we propose an efficient backward-chaining ILP method that learns logic rules by generalizing from both the temporal and the relational data.
    Deep Learning-based Massive MIMO CSI Acquisition for 5G Evolution and 6G. (arXiv:2206.04967v1 [eess.SP])
    Recently, inspired by successful applications in many fields, deep learning (DL) technologies for CSI acquisition have received considerable research interest from both academia and industry. Considering the practical feedback mechanism of 5th generation (5G) New Radio (NR) networks, we propose two implementation schemes for artificial intelligence for CSI (AI4CSI): the DL-based receiver and the end-to-end design. The proposed AI4CSI schemes were evaluated in 5G NR networks in terms of spectrum efficiency (SE), feedback overhead, and computational complexity, and compared with legacy schemes. To demonstrate whether these schemes can be used in real-life scenarios, both model-based channel data and practically measured channels were used in our investigations. When DL-based CSI acquisition is applied to the receiver only, which has little air interface impact, it provides approximately 25% SE gain at a moderate feedback overhead level. It is feasible to deploy it in current 5G networks during 5G evolutions. For the end-to-end DL-based CSI enhancements, the evaluations also demonstrated their additional performance gain on SE, which is 6% to 26% compared with DL-based receivers and 33% to 58% compared with legacy CSI schemes. Considering its large impact on air-interface design, it will be a candidate technology for 6th generation (6G) networks, in which an air interface designed by artificial intelligence can be used.
    MAREO: Memory- and Attention- based visual REasOning. (arXiv:2206.04928v1 [cs.AI])
    Humans continue to vastly outperform modern AI systems in their ability to parse and understand complex visual scenes flexibly. Attention and memory are two systems known to play a critical role in our ability to selectively maintain and manipulate behaviorally-relevant visual information to solve some of the most challenging visual reasoning tasks. Here, we present a novel architecture for visual reasoning inspired by the cognitive-science literature on visual reasoning, the Memory- and Attention-based (visual) REasOning (MAREO) architecture. MAREO instantiates an active-vision theory, which posits that the brain solves complex visual reasoning problems compositionally by learning to combine previously-learned elementary visual operations to form more complex visual routines. MAREO learns to solve visual reasoning tasks via sequences of attention shifts to route and maintain task-relevant visual information into a memory bank via a multi-head transformer module. Visual routines are then deployed by a dedicated reasoning module trained to judge various relations between objects in the scenes. Experiments on four types of reasoning tasks demonstrate MAREO's ability to learn visual routines in a robust and sample-efficient manner.
    Symbolic image detection using scene and knowledge graphs. (arXiv:2206.04863v1 [cs.CV])
    Sometimes the meaning conveyed by images goes beyond the list of objects they contain; instead, images may express a powerful message to affect the viewers' minds. Inferring this message requires reasoning about the relationships between the objects, and general common-sense knowledge about the components. In this paper, we use a scene graph, a graph representation of an image, to capture visual components. In addition, we generate a knowledge graph using facts extracted from ConceptNet to reason about objects and attributes. To detect the symbols, we propose a neural network framework named SKG-Sym. The framework first generates the representations of the scene graph of the image and its knowledge graph using a Graph Convolution Network. The framework then fuses the representations and uses an MLP to classify them. We extend the network further to use an attention mechanism which learns the importance of the graph representations. We evaluate our methods on a dataset of advertisements, and compare them with baseline symbolism classification methods (ResNet and VGG). Results show that our methods outperform ResNet in terms of F-score and that the attention-based mechanism is competitive with VGG while having much lower model complexity.
    Convolutional Layers Are Not Translation Equivariant. (arXiv:2206.04979v1 [cs.CV])
    The purpose of this paper is to correct a misconception about convolutional neural networks (CNNs). CNNs are made up of convolutional layers which are shift equivariant due to weight sharing. However, contrary to popular belief, convolutional layers are not translation equivariant, even when boundary effects are ignored and when pooling and subsampling are absent. This is because shift equivariance is a discrete symmetry while translation equivariance is a continuous symmetry. That discrete systems do not in general inherit continuous equivariances is a fundamental limitation of equivariant deep learning. We discuss two implications of this fact. First, CNNs have achieved success in image processing despite not inheriting the translation equivariance of the physical systems they model. Second, using CNNs to solve partial differential equations (PDEs) will not result in translation equivariant solvers.
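The discrete shift equivariance mentioned above can be checked numerically. The sketch below is an illustration only: it assumes a 1-D signal, circular (wrap-around) boundary handling, and a hand-written helper `circular_conv1d`, none of which come from the paper. It verifies only integer shifts, which are the only translations a sampled grid represents exactly; it does not test continuous translations.

```python
import numpy as np

def circular_conv1d(x, w):
    """Cross-correlate signal x with kernel w using wrap-around boundaries."""
    n = len(x)
    return np.array(
        [sum(w[k] * x[(i + k) % n] for k in range(len(w))) for i in range(n)]
    )

rng = np.random.default_rng(0)
x = rng.normal(size=16)  # input signal
w = rng.normal(size=3)   # shared kernel weights
shift = 5                # integer shift in samples

# Shift then convolve, versus convolve then shift.
shift_then_conv = circular_conv1d(np.roll(x, shift), w)
conv_then_shift = np.roll(circular_conv1d(x, w), shift)
shift_equivariant = bool(np.allclose(shift_then_conv, conv_then_shift))
```

The two orders agree because the same kernel weights are applied at every position, which is the weight-sharing argument in the abstract.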
    On Neural Architecture Inductive Biases for Relational Tasks. (arXiv:2206.05056v1 [cs.NE])
    Current deep learning approaches have shown good in-distribution generalization performance, but struggle with out-of-distribution generalization. This is especially true in the case of tasks involving abstract relations like recognizing rules in sequences, as we find in many intelligence tests. Recent work has explored how forcing relational representations to remain distinct from sensory representations, as it seems to be the case in the brain, can help artificial systems. Building on this work, we further explore and formalize the advantages afforded by 'partitioned' representations of relations and sensory details, and how this inductive bias can help recompose learned relational structure in newly encountered settings. We introduce a simple architecture based on similarity scores which we name Compositional Relational Network (CoRelNet). Using this model, we investigate a series of inductive biases that ensure abstract relations are learned and represented distinctly from sensory data, and explore their effects on out-of-distribution generalization for a series of relational psychophysics tasks. We find that simple architectural choices can outperform existing models in out-of-distribution generalization. Together, these results show that partitioning relational representations from other information streams may be a simple way to augment existing network architectures' robustness when performing out-of-distribution relational computations.
    Diffeomorphic Counterfactuals with Generative Models. (arXiv:2206.05075v1 [cs.LG])
    Counterfactuals can explain classification decisions of neural networks in a human interpretable way. We propose a simple but effective method to generate such counterfactuals. More specifically, we perform a suitable diffeomorphic coordinate transformation and then perform gradient ascent in these coordinates to find counterfactuals which are classified with great confidence as a specified target class. We propose two methods to leverage generative models to construct such suitable coordinate systems that are either exactly or approximately diffeomorphic. We analyze the generation process theoretically using Riemannian differential geometry and validate the quality of the generated counterfactuals using various qualitative and quantitative measures.
    The Generalized Eigenvalue Problem as a Nash Equilibrium. (arXiv:2206.04993v1 [cs.LG])
    The generalized eigenvalue problem (GEP) is a fundamental concept in numerical linear algebra. It captures the solution of many classical machine learning problems such as canonical correlation analysis, independent components analysis, partial least squares, linear discriminant analysis, principal components, successor features and others. Despite this, most general solvers are prohibitively expensive when dealing with massive data sets and research has instead concentrated on finding efficient solutions to specific problem instances. In this work, we develop a game-theoretic formulation of the top-$k$ GEP whose Nash equilibrium is the set of generalized eigenvectors. We also present a parallelizable algorithm with guaranteed asymptotic convergence to the Nash. Current state-of-the-art methods require $\mathcal{O}(d^2k)$ complexity per iteration which is prohibitively expensive when the number of dimensions ($d$) is large. We show how to achieve $\mathcal{O}(dk)$ complexity, scaling to datasets $100\times$ larger than those evaluated by prior methods. Empirically we demonstrate that our algorithm is able to solve a variety of GEP problem instances including a large-scale analysis of neural network activations.
    Deep Multi-view Semi-supervised Clustering with Sample Pairwise Constraints. (arXiv:2206.04949v1 [cs.CV])
    Multi-view clustering has attracted much attention thanks to its capacity for multi-source information integration. Although numerous advanced methods have been proposed in past decades, most of them overlook the significance of weakly-supervised information and fail to preserve the feature properties of multiple views, resulting in unsatisfactory clustering performance. To address these issues, we propose a novel Deep Multi-view Semi-supervised Clustering (DMSC) method, which jointly optimizes three kinds of losses during network fine-tuning: a multi-view clustering loss, a semi-supervised pairwise constraint loss and a reconstruction loss over multiple autoencoders. Specifically, a KL divergence based multi-view clustering loss is imposed on the common representation of multi-view data to perform heterogeneous feature optimization, multi-view weighting and clustering prediction simultaneously. Then, we propose to integrate pairwise constraints into the process of multi-view clustering by enforcing the learned multi-view representation of must-link samples (cannot-link samples) to be similar (dissimilar), so that the resulting clustering architecture is more credible. Moreover, unlike existing rivals that only preserve the encoders for each heterogeneous branch during network fine-tuning, we further propose to tune the intact autoencoder framework that contains both encoders and decoders. In this way, the serious corruption of the view-specific and view-shared feature spaces can be alleviated, making the whole training procedure more stable. Through comprehensive experiments on eight popular image datasets, we demonstrate that our proposed approach performs better than state-of-the-art multi-view and single-view competitors.
    Spatial Cross-Attention Improves Self-Supervised Visual Representation Learning. (arXiv:2206.05028v1 [cs.CV])
    Unsupervised representation learning methods like SwAV have proved effective in learning the visual semantics of a target dataset. The main idea behind these methods is that different views of the same image represent the same semantics. In this paper, we further introduce an add-on module to facilitate the injection of knowledge accounting for spatial cross-correlations among the samples. This in turn results in distilling intra-class information, including feature-level locations and cross similarities between same-class instances. The proposed add-on can be added to existing methods such as SwAV. We can later remove the add-on module for inference without any modification of the learned weights. Through an extensive set of empirical evaluations, we verify that our method yields improved performance in detecting class activation maps, in top-1 classification accuracy, and in downstream tasks such as object detection, under different configuration settings.
    Refining neural network predictions using background knowledge. (arXiv:2206.04976v1 [cs.AI])
    Recent work has shown that we can use logical background knowledge in learning systems to compensate for a lack of labeled training data. Many such methods work by creating a loss function that encodes this knowledge. However, the logic is often discarded after training, even if it is still useful at test time. Instead, we ensure neural network predictions satisfy the knowledge by refining the predictions with an extra computation step. We introduce differentiable refinement functions that find a corrected prediction close to the original prediction. We study how to effectively and efficiently compute these refinement functions. Using a new algorithm, we combine refinement functions to find refined predictions for logical formulas of any complexity. This algorithm finds optimal refinements on complex SAT formulas in significantly fewer iterations and frequently finds solutions where gradient descent cannot.
    Offline Stochastic Shortest Path: Learning, Evaluation and Towards Optimality. (arXiv:2206.04921v1 [cs.LG])
    Goal-oriented Reinforcement Learning, where the agent needs to reach the goal state while simultaneously minimizing the cost, has received significant attention in real-world applications. Its theoretical formulation, stochastic shortest path (SSP), has been intensively researched in the online setting. Nevertheless, it remains understudied when such online interaction is prohibited and only historical data is provided. In this paper, we consider the offline stochastic shortest path problem when the state space and the action space are finite. We design simple value iteration-based algorithms for tackling both offline policy evaluation (OPE) and offline policy learning tasks. Notably, our analysis of these simple algorithms yields strong instance-dependent bounds which can imply worst-case bounds that are near-minimax optimal. We hope our study can help illuminate the fundamental statistical limits of the offline SSP problem and motivate further studies beyond the scope of the current work.
    Evolutionary Echo State Network: evolving reservoirs in the Fourier space. (arXiv:2206.04951v1 [cs.NE])
    The Echo State Network (ESN) is a class of Recurrent Neural Network with a large number of hidden-hidden weights (in the so-called reservoir). The canonical ESN and its variations have recently received significant attention due to their remarkable success in the modeling of non-linear dynamical systems. The reservoir is randomly connected with fixed weights that do not change in the learning process. Only the weights from reservoir to output are trained. Since the reservoir is fixed during the training procedure, we may wonder whether the computational power of the recurrent structure is fully harnessed. In this article, we propose a new computational model of the ESN type that represents the reservoir weights in the Fourier space and fine-tunes these weights by applying genetic algorithms in the frequency domain. The main interest is that this procedure works in a much smaller space than the classical ESN, thus providing a dimensionality reduction transformation of the initial method. The proposed technique allows us to exploit the benefits of the large recurrent structure while avoiding the training problems of gradient-based methods. We provide a detailed experimental study that demonstrates the good performance of our approach on well-known chaotic systems and real-world data.
    Explanation as Question Answering based on a Task Model of the Agent's Design. (arXiv:2206.05030v1 [cs.HC])
    We describe a stance towards the generation of explanations in AI agents that is both human-centered and design-based. We collect questions about the working of an AI agent through participatory design by focus groups. We capture an agent's design through a Task-Method-Knowledge model that explicitly specifies the agent's tasks and goals, as well as the mechanisms, knowledge and vocabulary it uses for accomplishing the tasks. We illustrate our approach through the generation of explanations in Skillsync, an AI agent that links companies and colleges for worker upskilling and reskilling. In particular, we embed a question-answering agent called AskJill in Skillsync, where AskJill contains a TMK model of Skillsync's design. AskJill presently answers human-generated questions about Skillsync's tasks and vocabulary, and thereby helps explain how it produces its recommendations.
    Efficient Heterogeneous Treatment Effect Estimation With Multiple Experiments and Multiple Outcomes. (arXiv:2206.04907v1 [cs.LG])
    Learning heterogeneous treatment effects (HTEs) is an important problem across many fields. Most existing methods consider the setting with a single treatment arm and a single outcome metric. However, in many real world domains, experiments are run consistently - for example, in internet companies, A/B tests are run every day to measure the impacts of potential changes across many different metrics of interest. We show that even if an analyst cares only about the HTEs in one experiment for one metric, precision can be improved greatly by analyzing all of the data together to take advantage of cross-experiment and cross-outcome metric correlations. We formalize this idea in a tensor factorization framework and propose a simple and scalable model which we refer to as the low rank or LR-learner. Experiments in both synthetic and real data suggest that the LR-learner can be much more precise than independent HTE estimation.
    Fisher SAM: Information Geometry and Sharpness Aware Minimisation. (arXiv:2206.04920v1 [cs.LG])
    The recent sharpness-aware minimisation (SAM) is known to find flat minima, which are beneficial for generalisation and robustness. SAM essentially modifies the loss function by reporting the maximum loss value within a small neighborhood around the current iterate. However, it uses the Euclidean ball to define the neighborhood, which can be inaccurate since loss functions for neural networks are typically defined over probability distributions (e.g., class predictive probabilities), rendering the parameter space non-Euclidean. In this paper we consider the information geometry of the model parameter space when defining the neighborhood, namely replacing SAM's Euclidean balls with ellipsoids induced by the Fisher information. Our approach, dubbed Fisher SAM, defines more accurate neighborhood structures that conform to the intrinsic metric of the underlying statistical manifold. For instance, SAM may probe the worst-case loss value at a point that is either too close or inappropriately distant because it ignores the parameter space geometry, which Fisher SAM avoids. Another recent approach, Adaptive SAM, stretches or shrinks the Euclidean ball in accordance with the scale of the parameter magnitudes. This might be dangerous, potentially destroying the neighborhood structure. We demonstrate improved performance of the proposed Fisher SAM on several benchmark datasets and tasks.
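    The two neighborhoods contrasted above can be written side by side. The following is a sketch under assumptions rather than the paper's exact formulation: the symbols $w$ (parameters), $L$ (training loss), $\rho$ (neighborhood radius), $g = \nabla L(w)$ and $F(w)$ (Fisher information matrix, assumed invertible) are notation introduced here for illustration.

$$
\text{SAM:}\quad \min_{w}\ \max_{\|\epsilon\|_2 \le \rho} L(w+\epsilon),
\qquad
\text{Fisher-type neighborhood:}\quad \min_{w}\ \max_{\epsilon^{\top} F(w)\,\epsilon \,\le\, \rho^{2}} L(w+\epsilon).
$$

    Under a first-order approximation $L(w+\epsilon) \approx L(w) + g^{\top}\epsilon$, the inner maximisers are

$$
\epsilon_{\text{ball}} = \rho\,\frac{g}{\|g\|_2},
\qquad
\epsilon_{\text{ellipsoid}} = \rho\,\frac{F(w)^{-1} g}{\sqrt{g^{\top} F(w)^{-1} g}}.
$$

    When $F(w)$ is the identity the two perturbations coincide, so the ellipsoid only changes the probe point where the Fisher information is anisotropic.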
    A bio-inspired implementation of a sparse-learning spike-based hippocampus memory model. (arXiv:2206.04924v1 [cs.NE])
    The nervous system, more specifically, the brain, is capable of solving complex problems simply and efficiently, far surpassing modern computers. In this regard, neuromorphic engineering is a research field that focuses on mimicking the basic principles that govern the brain in order to develop systems that achieve such computational capabilities. Within this field, bio-inspired learning and memory systems are still a challenge to be solved, and this is where the hippocampus is involved. It is the region of the brain that acts as a short-term memory, allowing the learning and unstructured and rapid storage of information from all the sensory nuclei of the cerebral cortex and its subsequent recall. In this work, we propose a novel bio-inspired memory model based on the hippocampus with the ability to learn memories, recall them from a cue (a part of the memory associated with the rest of the content) and even forget memories when trying to learn others with the same cue. This model has been implemented on the SpiNNaker hardware platform using Spiking Neural Networks, and a set of experiments and tests were performed to demonstrate its correct and expected operation. The proposed spike-based memory model generates spikes only when it receives an input, being energy efficient, and it needs 7 timesteps for the learning step and 6 timesteps for recalling a previously-stored memory. This work presents the first hardware implementation of a fully functional bio-inspired spike-based hippocampus memory model, paving the road for the development of future more complex neuromorphic systems.
    Response to: Significance and stability of deep learning-based identification of subtypes within major psychiatric disorders. Molecular Psychiatry (2022). (arXiv:2206.04934v1 [cs.LG])
    Recently, Winter and Hahn [1] commented on our work on identifying subtypes of major psychiatric disorders (MPDs) based on neurobiological features using machine learning [2]. They questioned the generalizability of our methods and the statistical significance, stability, and overfitting of the results, and proposed a pipeline for disease subtyping. We appreciate their earnest consideration of our work. However, we need to point out their misconceptions of basic machine-learning concepts and delineate some key issues involved.
    Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models. (arXiv:2206.04959v1 [cs.LG])
    Foundation models are becoming the dominant deep learning technologies. Pretraining a foundation model is always time-consuming due to the large scale of both the model parameters and the training dataset. Besides being computing-intensive, the training process is extremely memory-intensive and communication-intensive. These features make it necessary to apply 3D parallelism, which integrates data parallelism, pipeline model parallelism and tensor model parallelism, to achieve high training efficiency. To achieve this goal, custom software frameworks such as Megatron-LM and DeepSpeed have been developed. However, current 3D parallelism frameworks still face two issues: i) they are not transparent to model developers, who need to manually modify the model to parallelize training; ii) their utilization of computation, GPU memory and network bandwidth is insufficient. We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization. Merak deploys automatically with a model partitioner that uses a graph sharding algorithm on a proxy representation of the model. Merak also presents a non-intrusive API for scaling out foundation model training with minimal code modification. In addition, we design a high-performance 3D parallel runtime engine in Merak. It uses several techniques to exploit available training resources, including a shifted critical path pipeline schedule that brings higher computation utilization, stage-aware recomputation that makes use of idle worker memory, and sub-pipelined tensor model parallelism that overlaps communication and computation. Experiments on 64 GPUs show that Merak can speed up training over state-of-the-art 3D parallelism frameworks for models with 1.5, 2.5, 8.3, and 20 billion parameters by up to 1.42X, 1.39X, 1.43X, and 1.61X, respectively.
    Explaining Neural Networks without Access to Training Data. (arXiv:2206.04891v1 [cs.LG])
    We consider generating explanations for neural networks in cases where the network's training data is not accessible, for instance due to privacy or safety issues. Recently, $\mathcal{I}$-Nets have been proposed as a sample-free approach to post-hoc, global model interpretability that does not require access to training data. They formulate interpretation as a machine learning task that maps network representations (parameters) to a representation of an interpretable function. In this paper, we extend the $\mathcal{I}$-Net framework to the cases of standard and soft decision trees as surrogate models. We propose a suitable decision tree representation and design of the corresponding $\mathcal{I}$-Net output layers. Furthermore, we make $\mathcal{I}$-Nets applicable to real-world tasks by considering more realistic distributions when generating the $\mathcal{I}$-Net's training data. We empirically evaluate our approach against traditional global, post-hoc interpretability approaches and show that it achieves superior results when the training data is not accessible.
    Multi-fidelity Hierarchical Neural Processes. (arXiv:2206.04872v1 [cs.LG])
    Science and engineering fields use computer simulation extensively. These simulations are often run at multiple levels of sophistication to balance accuracy and efficiency. Multi-fidelity surrogate modeling reduces the computational cost by fusing different simulation outputs. Cheap data generated from low-fidelity simulators can be combined with limited high-quality data generated by an expensive high-fidelity simulator. Existing methods based on Gaussian processes rely on strong assumptions about the kernel functions and can hardly scale to high-dimensional settings. We propose Multi-fidelity Hierarchical Neural Processes (MF-HNP), a unified neural latent variable model for multi-fidelity surrogate modeling. MF-HNP inherits the flexibility and scalability of Neural Processes. The latent variables transform the correlations among different fidelity levels from observations to latent space. The predictions across fidelities are conditionally independent given the latent states. This helps alleviate the error propagation issue in existing methods. MF-HNP is flexible enough to handle non-nested high-dimensional data at different fidelity levels with varying input and output dimensions. We evaluate MF-HNP on epidemiology and climate modeling tasks, achieving competitive performance in terms of accuracy and uncertainty estimation. In contrast to deep Gaussian Processes, which handle only low-dimensional (< 10) tasks, our method shows great promise for speeding up high-dimensional complex simulations (over 7000 for epidemiology modeling and 45000 for climate modeling).
    What should AI see? Using the Public's Opinion to Determine the Perception of an AI. (arXiv:2206.04776v1 [cs.LG])
    Deep neural networks (DNNs) have made impressive progress in the interpretation of image data, so that it is conceivable and to some degree realistic to use them in safety-critical applications like automated driving. From an ethical standpoint, the AI algorithm should take into account the vulnerability of objects or subjects on the street, which ranges from "not at all", e.g. the road itself, to "high vulnerability" for pedestrians. One way to take this into account is to define the cost of confusing one semantic category with another and to use cost-based decision rules for the interpretation of probabilities, which are the output of DNNs. However, it is an open problem how to define the cost structure, who should be in charge of doing that, and thereby who defines what AI algorithms will actually "see". As one possible answer, we follow a participatory approach and set up an online survey to ask the public to define the cost structure. We present the survey design and the data acquired, along with an evaluation that also distinguishes between perspective (car passenger vs. external traffic participant) and gender. Using simulation-based $F$-tests, we find highly significant differences between the groups. These differences have consequences for the reliable detection of pedestrians at a safety-critical distance from the self-driving car. We discuss the ethical problems related to this approach and also discuss, from a psychological point of view, the problems emerging from human-machine interaction through the survey. Finally, we include comments from industry leaders in the field of AI safety on the applicability of survey-based elements in the design of AI functionalities in automated driving.
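    A cost-based decision rule of the kind mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the two classes (index 0 for road, index 1 for pedestrian), the probabilities and the cost matrix are made-up assumptions, and the function names are hypothetical. The rule picks the class with the lowest expected confusion cost rather than the most probable class.

```python
def expected_costs(probs, cost):
    """Expected cost of each prediction; cost[k][j] = cost of predicting j when the truth is k."""
    num_classes = len(probs)
    return [
        sum(probs[k] * cost[k][j] for k in range(num_classes))
        for j in range(num_classes)
    ]


def cost_based_decision(probs, cost):
    """Return the class index that minimises the expected confusion cost."""
    costs = expected_costs(probs, cost)
    return min(range(len(costs)), key=costs.__getitem__)


# Assumed example: class 0 = road, class 1 = pedestrian.
probs = [0.7, 0.3]        # network output for one pixel (illustrative)
cost = [
    [0.0, 1.0],           # true road: calling it a pedestrian is cheap
    [10.0, 0.0],          # true pedestrian: calling it road is expensive
]

map_decision = max(range(len(probs)), key=probs.__getitem__)
cost_decision = cost_based_decision(probs, cost)
```

    With these assumed numbers the most probable class is road, while the cost-based rule selects pedestrian, because missing a pedestrian is weighted ten times more heavily than a false alarm.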
    The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon. (arXiv:2206.04817v1 [cs.LG])
    The grokking phenomenon as reported by Power et al. refers to a regime where a long period of overfitting is followed by a seemingly sudden transition to perfect generalization. In this paper, we attempt to reveal the underpinnings of Grokking via a series of empirical studies. Specifically, we uncover an optimization anomaly plaguing adaptive optimizers at extremely late stages of training, referred to as the Slingshot Mechanism. A prominent artifact of the Slingshot Mechanism can be measured by the cyclic phase transitions between stable and unstable training regimes, and can be easily monitored by the cyclic behavior of the norm of the last layer's weights. We empirically observe that without explicit regularization, Grokking as reported by Power et al. almost exclusively happens at the onset of Slingshots, and is absent without them. While common and easily reproduced in more general settings, the Slingshot Mechanism does not follow from any known optimization theories that we are aware of, and can be easily overlooked without an in-depth examination. Our work points to a surprising and useful inductive bias of adaptive gradient optimizers at late stages of training, calling for a revised theoretical analysis of their origin.
    Beyond the Gates of Euclidean Space: Temporal-Discrimination-Fusions and Attention-based Graph Neural Network for Human Activity Recognition. (arXiv:2206.04855v1 [cs.LG])
    Human activity recognition (HAR) through wearable devices has received much interest due to its numerous applications in fitness tracking, wellness screening, and supported living. As a result, we have seen a great deal of work in this field. Traditional deep learning (DL) has set state-of-the-art performance in the HAR domain. However, it ignores the data's structure and the association between consecutive time stamps. To address this constraint, we offer an approach based on Graph Neural Networks (GNNs) for structuring the input representation and exploiting the relations among the samples. However, even when using a simple graph convolution network to address this shortcoming, there are still several limiting factors, such as inter-class activity issues, skewed class distribution, and a lack of consideration for sensor data priority, all of which harm the HAR model's performance. To improve the current HAR model's performance, we investigate novel possibilities within the framework of graph structure to achieve highly discriminative and rich activity features. We propose a model consisting of (1) a time-series-graph module that converts raw data from a HAR dataset into graphs; (2) Graph Convolutional Neural Networks (GCNs) to discover local dependencies and correlations between neighboring nodes; and (3) a self-attention GNN encoder to identify sensor interactions and data priorities. To the best of our knowledge, this is the first work for HAR that introduces a GNN-based approach incorporating both the GCN and the attention mechanism. Under a uniform evaluation method, our framework significantly improves performance on a hospital patients' activity dataset compared with other state-of-the-art baseline methods.
    Conformal Prediction Intervals for Markov Decision Process Trajectories. (arXiv:2206.04860v1 [cs.LG])
    Before delegating a task to an autonomous system, a human operator may want a guarantee about the behavior of the system. This paper extends previous work on conformal prediction for functional data and conformalized quantile regression to provide conformal prediction intervals over the future behavior of an autonomous system executing a fixed control policy on a Markov Decision Process (MDP). The prediction intervals are constructed by applying conformal corrections to prediction intervals computed by quantile regression. The resulting intervals guarantee that with probability $1-\delta$ the observed trajectory will lie inside the prediction interval, where the probability is computed with respect to the starting state distribution and the stochasticity of the MDP. The method is illustrated on MDPs for invasive species management and StarCraft2 battles.
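    The conformal correction applied to quantile-regression intervals can be sketched for the simplest case of a single scalar quantity. This is a hedged illustration of standard split conformalized quantile regression, not the paper's trajectory-level method: the function names, the calibration data and the choice $\delta = 0.25$ are assumptions made for the example. Each calibration point receives a score measuring how far the observation falls outside its predicted interval, and every new interval is widened by an empirical quantile of those scores.

```python
import math


def cqr_margin(lower, upper, observed, delta):
    """Conformal margin from calibration data for target coverage 1 - delta."""
    # Score > 0 means the observation fell outside [lo, hi]; score < 0 means inside.
    scores = sorted(
        max(lo - y, y - hi) for lo, hi, y in zip(lower, upper, observed)
    )
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - delta))
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return scores[rank - 1]


def conformalize(lo, hi, margin):
    """Widen (or shrink, if margin < 0) a quantile-regression interval."""
    return lo - margin, hi + margin


# Assumed calibration set: quantile regression predicted [0, 1] for every point.
cal_lower = [0.0] * 7
cal_upper = [1.0] * 7
cal_observed = [0.5, 1.25, -0.25, 0.75, 1.5, 0.0, 1.0]

margin = cqr_margin(cal_lower, cal_upper, cal_observed, delta=0.25)
new_interval = conformalize(2.0, 3.0, margin)
```

    The coverage guarantee of this construction relies on the calibration and test points being exchangeable; extending it to whole trajectories is the paper's contribution and is not reproduced here.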
    Imitation Learning via Differentiable Physics. (arXiv:2206.04873v1 [cs.LG])
    Existing imitation learning (IL) methods such as inverse reinforcement learning (IRL) usually have a double-loop training process, alternating between learning a reward function and a policy, and tend to suffer from long training times and high variance. In this work, we identify the benefits of differentiable physics simulators and propose a new IL method, Imitation Learning via Differentiable Physics (ILD), which gets rid of the double-loop design and achieves significant improvements in final performance, convergence speed, and stability. The proposed ILD incorporates the differentiable physics simulator as a physics prior into its computational graph for policy learning. It unrolls the dynamics by sampling actions from a parameterized policy, minimizes the distance between the expert trajectory and the agent trajectory, and back-propagates the gradient into the policy via temporal physics operators. With the physics prior, ILD policies are not only transferable to unseen environment specifications but also yield higher final performance on a variety of tasks. In addition, ILD naturally forms a single-loop structure, which significantly improves stability and training speed. To simplify the complex optimization landscape induced by temporal physics operations, ILD dynamically selects the learning objectives for each state during optimization. In our experiments, we show that ILD outperforms state-of-the-art methods in a variety of continuous control tasks with Brax, requiring only one expert demonstration. In addition, ILD can be applied to challenging deformable object manipulation tasks and can generalize to unseen configurations.
    Efficient Per-Shot Convex Hull Prediction By Recurrent Learning. (arXiv:2206.04877v1 [eess.IV])
    Adaptive video streaming relies on the construction of efficient bitrate ladders to deliver the best possible visual quality to viewers under bandwidth constraints. The traditional method of content-dependent bitrate ladder selection requires a video shot to be pre-encoded with multiple encoding parameters to find the optimal operating points given by the convex hull of the resulting rate-quality curves. However, this pre-encoding step is equivalent to an exhaustive search over the space of possible encoding parameters, which causes significant overhead in terms of both computation and time. To reduce this overhead, we propose a deep learning based method of content-aware convex hull prediction. We employ a recurrent convolutional network (RCN) to implicitly analyze the spatiotemporal complexity of video shots in order to predict their convex hulls. A two-step transfer learning scheme is adopted to train our proposed RCN-Hull model, which ensures sufficient content diversity to analyze scene complexity, while also making it possible to capture the scene statistics of pristine source videos. Our experimental results reveal that our proposed model yields better approximations of the optimal convex hulls and offers competitive time savings compared to existing approaches. On average, our method reduced the pre-encoding time by 58.0%, while the average Bjontegaard delta bitrate (BD-rate) of the predicted convex hulls against ground truth was 0.08% and the mean absolute deviation of the BD-rate distribution was 0.44%.
    NAGphormer: Neighborhood Aggregation Graph Transformer for Node Classification in Large Graphs. (arXiv:2206.04910v1 [cs.LG])
    Graph Transformers have demonstrated superiority on various graph learning tasks in recent years. However, the complexity of existing Graph Transformers scales quadratically with the number of nodes, making it hard to scale to graphs with thousands of nodes. To this end, we propose a Neighborhood Aggregation Graph Transformer (NAGphormer) that is scalable to large graphs with millions of nodes. Before feeding the node features into the Transformer model, NAGphormer constructs tokens for each node by a neighborhood aggregation module called Hop2Token. For each node, Hop2Token aggregates neighborhood features from each hop into a representation, and thereby produces a sequence of token vectors. Subsequently, the resulting sequence of different hop information serves as input to the Transformer model. By considering each node as a sequence, NAGphormer could be trained in a mini-batch manner and thus could scale to large graphs. NAGphormer further develops an attention-based readout function so as to learn the importance of each hop adaptively. We conduct extensive experiments on various popular benchmarks, including six small datasets and three large datasets. The results demonstrate that NAGphormer consistently outperforms existing Graph Transformers and mainstream Graph Neural Networks.
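    The idea of turning each node into a sequence of per-hop tokens can be illustrated on a toy graph. The sketch below is an assumption-laden simplification, not the paper's Hop2Token module: it uses plain mean aggregation over neighbors applied repeatedly, whereas the actual aggregation and normalisation may differ, and the function name `hop_tokens` and the three-node path graph are hypothetical.

```python
def hop_tokens(adjacency, features, num_hops):
    """Return, for each node, a list of num_hops + 1 feature vectors (hop 0 .. num_hops).

    adjacency[i] lists the neighbors of node i; features[i] is node i's feature vector.
    Hop k is obtained by mean-aggregating the hop k-1 representations of the neighbors
    (an assumed aggregation rule chosen for simplicity).
    """
    n = len(features)
    current = [list(row) for row in features]
    tokens = [[list(row)] for row in features]  # hop 0 = the node's own features
    for _ in range(num_hops):
        nxt = []
        for i in range(n):
            neigh = adjacency[i]
            dim = len(current[i])
            if neigh:
                agg = [sum(current[j][d] for j in neigh) / len(neigh) for d in range(dim)]
            else:
                agg = [0.0] * dim  # isolated node: nothing to aggregate
            nxt.append(agg)
        current = nxt
        for i in range(n):
            tokens[i].append(current[i])
    return tokens


# Toy path graph 0 - 1 - 2 with one-dimensional features (illustrative values).
adjacency = [[1], [0, 2], [1]]
features = [[1.0], [2.0], [4.0]]
tokens = hop_tokens(adjacency, features, num_hops=2)
```

    Because each node's token sequence is precomputed and self-contained, nodes can be drawn into mini-batches independently of the rest of the graph, which is the property the abstract credits for scaling to large graphs.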
    HDTorch: Accelerating Hyperdimensional Computing with GP-GPUs for Design Space Exploration. (arXiv:2206.04746v1 [cs.LG])
    HyperDimensional Computing (HDC) as a machine learning paradigm is highly interesting for applications involving continuous, semi-supervised learning for long-term monitoring. However, its accuracy is not yet on par with other Machine Learning (ML) approaches. Frameworks enabling fast design space exploration to find practical algorithms are necessary to make HD computing competitive with other ML techniques. To this end, we introduce HDTorch, an open-source, PyTorch-based HDC library with CUDA extensions for hypervector operations. We demonstrate HDTorch's utility by analyzing four HDC benchmark datasets in terms of accuracy, runtime, and memory consumption, utilizing both classical and online HD training methodologies. We demonstrate average (training)/inference speedups of (111x/68x)/87x for classical/online HD, respectively. Moreover, we analyze the effects of varying hyperparameters on runtime and accuracy. Finally, we demonstrate how HDTorch enables exploration of HDC strategies applied to large, real-world datasets. We perform the first-ever HD training and inference analysis of the entirety of the CHB-MIT EEG epilepsy database. Results show that the typical approach of training on a subset of the data does not necessarily generalize to the entire dataset, an important factor when developing future HD models for medical wearable devices.
    Building Spatio-temporal Transformers for Egocentric 3D Pose Estimation. (arXiv:2206.04785v1 [cs.CV])
    Egocentric 3D human pose estimation (HPE) from images is challenging due to severe self-occlusions and strong distortion introduced by the fish-eye view from the head mounted camera. Although existing works use intermediate heatmap-based representations to counter distortion with some success, addressing self-occlusion remains an open problem. In this work, we leverage information from past frames to guide our self-attention-based 3D HPE estimation procedure -- Ego-STAN. Specifically, we build a spatio-temporal Transformer model that attends to semantically rich convolutional neural network-based feature maps. We also propose feature map tokens: a new set of learnable parameters to attend to these feature maps. Finally, we demonstrate Ego-STAN's superior performance on the xR-EgoPose dataset where it achieves a 30.6% improvement on the overall mean per-joint position error, while leading to a 22% drop in parameters compared to the state-of-the-art.
    Crust Macrofracturing as the Evidence of the Last Deglaciation. (arXiv:2206.02652v2 [physics.geo-ph] UPDATED)
    Machine learning methods were applied to reconsider the results of several passive seismic experiments in Finland. We created datasets from different stages of the receiver function technique and processed them with one of the basic machine learning algorithms. All the results were obtained uniformly with the $k$-nearest neighbors algorithm. The first result is the Moho depth map of the region. Another result is the delineation of the near-surface low $S$-wave velocity layer. There are three such areas in the Northern, Southern, and central parts of the region. The low $S$-wave velocity in the Northern and Southern areas can be linked to the geological structure. However, we attribute the central low $S$-wave velocity area to a large number of water-saturated cracks in the upper 1-5 km. Analysis of the structure of this area leads us to the conclusion that macrofracturing was caused by the last deglaciation.
    On the Bias-Variance Characteristics of LIME and SHAP in High Sparsity Movie Recommendation Explanation Tasks. (arXiv:2206.04784v1 [cs.LG])
    We evaluate two popular local explainability techniques, LIME and SHAP, on a movie recommendation task. We discover that the two methods behave very differently depending on the sparsity of the data set. LIME does better than SHAP in dense segments of the data set and SHAP does better in sparse segments. We trace this difference to the differing bias-variance characteristics of the underlying estimators of LIME and SHAP. We find that SHAP exhibits lower variance in sparse segments of the data compared to LIME. We attribute this lower variance to the completeness constraint property inherent in SHAP and missing in LIME. This constraint acts as a regularizer and therefore increases the bias of the SHAP estimator but decreases its variance, leading to a favorable bias-variance trade-off, especially in high sparsity data settings. With this insight, we introduce the same constraint into LIME and formulate a novel local explainability framework called Completeness-Constrained LIME (CLIMB) that is superior to LIME and much faster than SHAP.
    Syntactic Inductive Biases for Deep Learning Methods. (arXiv:2206.04806v1 [cs.LG])
    In this thesis, we try to build a connection between the two schools by introducing syntactic inductive biases for deep learning models. We propose two families of inductive biases, one for constituency structure and another one for dependency structure. The constituency inductive bias encourages deep learning models to use different units (or neurons) to separately process long-term and short-term information. This separation provides a way for deep learning models to build latent hierarchical representations from sequential inputs, in which a higher-level representation is composed of and can be decomposed into a series of lower-level representations. For example, without knowing the ground-truth structure, our proposed model learns to process logical expressions by composing representations of variables and operators into representations of expressions according to their syntactic structure. On the other hand, the dependency inductive bias encourages models to find the latent relations between entities in the input sequence. For natural language, the latent relations are usually modeled as a directed dependency graph, where a word has exactly one parent node and zero or several children nodes. After applying this constraint to a Transformer-like model, we find the model is capable of inducing directed graphs that are close to human expert annotations, and it also outperforms the standard transformer model on different tasks. We believe that these experimental results demonstrate an interesting alternative for the future development of deep learning models.
    Data-Efficient Double-Win Lottery Tickets from Robust Pre-training. (arXiv:2206.04762v1 [cs.LG])
    Pre-training serves as a broadly adopted starting point for transfer learning on various downstream tasks. Recent investigations of the lottery ticket hypothesis (LTH) demonstrate that such enormous pre-trained models can be replaced by extremely sparse subnetworks (a.k.a. matching subnetworks) without sacrificing transferability. However, practical security-crucial applications usually pose more challenging requirements beyond standard transfer, which also demand that these subnetworks overcome adversarial vulnerability. In this paper, we formulate a more rigorous concept, Double-Win Lottery Tickets, in which a located subnetwork from a pre-trained model can be independently transferred on diverse downstream tasks, to reach BOTH the same standard and robust generalization, under BOTH standard and adversarial training regimes, as the full pre-trained model can do. We comprehensively examine various pre-training mechanisms and find that robust pre-training tends to craft sparser double-win lottery tickets with superior performance over the standard counterparts. For example, on downstream CIFAR-10/100 datasets, we identify double-win matching subnetworks with the standard, fast adversarial, and adversarial pre-training from ImageNet, at 89.26%/73.79%, 89.26%/79.03%, and 91.41%/83.22% sparsity, respectively. Furthermore, we observe the obtained double-win lottery tickets can be more data-efficient to transfer, under practical data-limited (e.g., 1% and 10%) downstream schemes. Our results show that the benefits from robust pre-training are amplified by the lottery ticket scheme, as well as the data-limited transfer setting. Code is available at https://github.com/VITA-Group/Double-Win-LTH.
    Motif Mining and Unsupervised Representation Learning for BirdCLEF 2022. (arXiv:2206.04805v1 [cs.SD])
    We build a classification model for the BirdCLEF 2022 challenge using unsupervised methods. We implement an unsupervised representation of the training dataset using a triplet loss on spectrogram representation of audio motifs. Our best model performs with a score of 0.48 on the public leaderboard.
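For readers unfamiliar with the triplet loss mentioned above, a minimal NumPy sketch follows. It is illustrative only: the squared Euclidean distance, the margin value, and the toy embeddings are assumptions, and the abstract does not say which variant the authors use.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Mean hinge-style triplet loss over a batch of embeddings.

    Pulls each anchor toward its positive and pushes it away from its
    negative until the distance gap exceeds `margin`.
    Assumption: squared Euclidean distance between embedding vectors.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

# Hypothetical 2-D embeddings of three audio motifs.
anchor = np.array([[0.0, 0.0]])
near = np.array([[0.0, 1.0]])
far = np.array([[3.0, 0.0]])

loss_good = triplet_loss(anchor, near, far)   # positive is already closer
loss_bad = triplet_loss(anchor, far, near)    # roles swapped, so the loss is large
```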
    I'm Me, We're Us, and I'm Us: Tri-directional Contrastive Learning on Hypergraphs. (arXiv:2206.04739v1 [cs.LG])
    Although machine learning on hypergraphs has attracted considerable attention, most of the works have focused on (semi-)supervised learning, which may cause heavy labeling costs and poor generalization. Recently, contrastive learning has emerged as a successful unsupervised representation learning method. Despite the prosperous development of contrastive learning in other domains, contrastive learning on hypergraphs remains little explored. In this paper, we propose TriCon (Tri-directional Contrastive learning), a general framework for contrastive learning on hypergraphs. Its main idea is tri-directional contrast, and specifically, it aims to maximize in two augmented views the agreement (a) between the same node, (b) between the same group of nodes, and (c) between each group and its members. Together with simple but surprisingly effective data augmentation and negative sampling schemes, these three forms of contrast enable TriCon to capture both microscopic and mesoscopic structural information in node embeddings. Our extensive experiments using 13 baseline approaches, five datasets, and two tasks demonstrate the effectiveness of TriCon, and most noticeably, TriCon consistently outperforms not just unsupervised competitors but also (semi-)supervised competitors mostly by significant margins for node classification.
    Comprehensive Fair Meta-learned Recommender System. (arXiv:2206.04789v1 [cs.IR])
    In recommender systems, one common challenge is the cold-start problem, where interactions are very limited for fresh users in the systems. To address this challenge, recently, many works introduce the meta-optimization idea into the recommendation scenarios, i.e. learning to learn the user preference by only a few past interaction items. The core idea is to learn global shared meta-initialization parameters for all users and rapidly adapt them into local parameters for each user respectively. They aim at deriving general knowledge across preference learning of various users, so as to rapidly adapt to the future new user with the learned prior and a small amount of training data. However, previous works have shown that recommender systems are generally vulnerable to bias and unfairness. Despite the success of meta-learning at improving the recommendation performance with cold-start, the fairness issues are largely overlooked. In this paper, we propose a comprehensive fair meta-learning framework, named CLOVER, for ensuring the fairness of meta-learned recommendation models. We systematically study three kinds of fairness - individual fairness, counterfactual fairness, and group fairness in the recommender systems, and propose to satisfy all three kinds via a multi-task adversarial learning scheme. Our framework offers a generic training paradigm that is applicable to different meta-learned recommender systems. We demonstrate the effectiveness of CLOVER on the representative meta-learned user preference estimator on three real-world data sets. Empirical results show that CLOVER achieves comprehensive fairness without deteriorating the overall cold-start recommendation performance.
    Communication Efficient Distributed Learning for Kernelized Contextual Bandits. (arXiv:2206.04835v1 [cs.LG])
    We tackle the communication efficiency challenge of learning kernelized contextual bandits in a distributed setting. Despite the recent advances in communication-efficient distributed bandit learning, existing solutions are restricted to simple models like multi-armed bandits and linear bandits, which hampers their practical utility. In this paper, instead of assuming the existence of a linear reward mapping from the features to the expected rewards, we consider non-linear reward mappings, by letting agents collaboratively search in a reproducing kernel Hilbert space (RKHS). This introduces significant challenges in communication efficiency as distributed kernel learning requires the transfer of raw data, leading to a communication cost that grows linearly w.r.t. time horizon $T$. We address this issue by equipping all agents to communicate via a common Nyström embedding that gets updated adaptively as more data points are collected. We rigorously prove that our algorithm can attain sub-linear rates in both regret and communication cost.
    Trimmed Maximum Likelihood Estimation for Robust Learning in Generalized Linear Models. (arXiv:2206.04777v1 [cs.LG])
    We study the problem of learning generalized linear models under adversarial corruptions. We analyze a classical heuristic called the iterative trimmed maximum likelihood estimator which is known to be effective against label corruptions in practice. Under label corruptions, we prove that this simple estimator achieves minimax near-optimal risk on a wide range of generalized linear models, including Gaussian regression, Poisson regression and Binomial regression. Finally, we extend the estimator to the more challenging setting of label and covariate corruptions and demonstrate its robustness and optimality in that setting as well.
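The iterative trimmed maximum likelihood heuristic can be illustrated for the Gaussian regression case, where maximizing the likelihood reduces to least squares. The sketch below is a generic version of the heuristic under that assumption, not the exact estimator analyzed in the paper; the function name, `keep_fraction`, the stopping rule, and the synthetic data are all hypothetical.

```python
import numpy as np

def trimmed_least_squares(X, y, keep_fraction, num_iters=20):
    """Alternate between fitting on a subset and re-selecting the subset.

    Each round fits least squares (the Gaussian MLE) on the currently kept
    samples, then keeps the samples with the smallest squared residuals
    (the highest likelihood) under the new fit.
    """
    n, d = X.shape
    keep = max(d, int(np.floor(keep_fraction * n)))
    idx = np.arange(n)                            # start from all samples
    for _ in range(num_iters):
        w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        residuals = (y - X @ w) ** 2
        new_idx = np.sort(np.argsort(residuals)[:keep])
        if np.array_equal(new_idx, idx):          # kept set is stable
            break
        idx = new_idx
    return w, idx

# Synthetic data with 10% grossly corrupted labels (hypothetical setup).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)
y[:20] += 50.0                                    # label corruption

w_hat, kept = trimmed_least_squares(X, y, keep_fraction=0.9)
```

For Poisson or Binomial regression, the least-squares fit and squared residuals would be replaced by the corresponding GLM fit and per-sample negative log-likelihoods.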
    Deep Leakage from Model in Federated Learning. (arXiv:2206.04887v1 [cs.LG])
    Distributed machine learning has been widely used in recent years to tackle large and complex datasets. Consequently, the security of distributed learning has drawn increasing attention from both academia and industry. In this context, federated learning (FL) was developed as a "secure" form of distributed learning in which private training data stay local and only public model gradients are communicated. However, a variety of gradient leakage attacks have since been proposed for this procedure, proving that it is insecure. These attacks nevertheless share a common drawback: they require too much auxiliary information, such as model weights, optimizers, and some hyperparameters (e.g., learning rate), which is difficult to obtain in real situations. Moreover, many existing FL algorithms, such as FedAvg, avoid transmitting model gradients and send model weights instead, but few people consider the security risk of doing so. In this paper, we present two novel frameworks, DLM and DLM+, to demonstrate that transmitting model weights is also likely to leak private local data of clients under the FL scenario. In addition, a number of experiments are performed to illustrate the effectiveness and generality of our attack frameworks. At the end of this paper, we also introduce two defenses against the proposed attacks and evaluate their protection effects. More broadly, the proposed attack and defense schemes can also be applied to the general distributed learning scenario with some appropriate customization.
    Connecting Low-Loss Subspace for Personalized Federated Learning. (arXiv:2109.07628v2 [cs.LG] UPDATED)
    Due to the curse of statistical heterogeneity across clients, adopting a personalized federated learning method has become an essential choice for the successful deployment of federated learning-based services. Among diverse branches of personalization techniques, a model mixture-based personalization method is preferred as each client has their own personalized model as a result of federated learning. It usually requires a local model and a federated model, but this approach is either limited to partial parameter exchange, which offers no help to novel clients, or requires additional local updates, which burden the client's computational capacity. As the existence of a connected subspace containing diverse low-loss solutions between two or more independent deep networks has been discovered, we combine this interesting property with the model mixture-based personalized federated learning method to improve personalization performance. We propose SuPerFed, a personalized federated learning method that induces an explicit connection between the optima of the local and the federated model in weight space so that they boost each other. Through extensive experiments on several benchmark datasets, we demonstrate that our method achieves consistent gains in both personalization performance and robustness to problematic scenarios possible in realistic services.
    Swan: A Neural Engine for Efficient DNN Training on Smartphone SoCs. (arXiv:2206.04687v1 [cs.LG])
    The need to train DNN models on end-user devices (e.g., smartphones) is increasing with the need to improve data privacy and reduce communication overheads. Unlike datacenter servers with powerful CPUs and GPUs, modern smartphones consist of a diverse collection of specialized cores following a system-on-a-chip (SoC) architecture that together perform a variety of tasks. We observe that training DNNs on a smartphone SoC without carefully considering its resource constraints can not only lead to suboptimal training performance but significantly affect user experience as well. In this paper, we present Swan, a neural engine to optimize DNN training on smartphone SoCs without hurting user experience. Extensive large-scale evaluations show that Swan can improve performance by 1.2 - 23.3x over the state-of-the-art.
    NNTrainer: Light-Weight On-Device Training Framework. (arXiv:2206.04688v1 [cs.LG])
    Modern consumer electronic devices have adopted deep learning-based intelligence services for their key features. Vendors have recently started to execute intelligence services on devices to keep personal data on devices and to reduce network and cloud costs. We see this trend as an opportunity to personalize intelligence services by updating neural networks with user data without exposing the data outside the devices: on-device training. For example, we may add a new class, "my dog, Alpha", for robotic vacuums, adapt speech recognition to the user's accent, or let text-to-speech speak as if the user were speaking. However, the resource limitations of target devices incur significant difficulties. We propose NNTrainer, a light-weight on-device training framework. We describe optimization techniques for neural networks implemented by NNTrainer, which are evaluated alongside conventional approaches. The evaluations show that NNTrainer can reduce memory consumption down to 1/28 without deteriorating accuracy or training time and effectively personalizes applications on devices. NNTrainer is cross-platform and practical open source software, which is being deployed to millions of devices in the authors' affiliation.
    Learning to Efficiently Propagate for Reasoning on Knowledge Graphs. (arXiv:2206.04798v1 [cs.AI])
    Path-based methods are more appealing solutions than embedding methods for knowledge graph reasoning, due to their interpretability and generalization ability to unseen graphs. However, path-based methods usually suffer from the problem of scalability, as the time complexity grows exponentially w.r.t. the length of paths. While recent methods compute reasoning paths with the Bellman-Ford algorithm in polynomial time, the time and memory cost remains very high, as they need to propagate through all the nodes and edges in the graph. In this paper, we propose A*Net, an efficient model for path-based reasoning on knowledge graphs. Inspired by the classical A* algorithm for shortest path problems, our A*Net prioritizes important nodes and edges at each propagation step, to reduce the time and memory footprint. Unlike the classical A* algorithm that uses a heuristic function, we propose to learn the priority function for each node to capture the complex semantics in knowledge graphs. The priority function and the propagation steps are jointly optimized through backpropagation. Experiments on both transductive and inductive knowledge graph reasoning benchmarks show that A*Net achieves competitive performance with existing state-of-the-art path-based methods, and meanwhile reduces the number of messages, the time and the memory cost up to 7.2$\times$, 3.4$\times$ and 4.9$\times$ respectively.
    COSTA: Covariance-Preserving Feature Augmentation for Graph Contrastive Learning. (arXiv:2206.04726v1 [cs.LG])
    Graph contrastive learning (GCL) improves graph representation learning, leading to SOTA on various downstream tasks. The graph augmentation step is a vital but scarcely studied step of GCL. In this paper, we show that the node embedding obtained via the graph augmentations is highly biased, somewhat limiting contrastive models from learning discriminative features for downstream tasks. Thus, instead of investigating graph augmentation in the input space, we alternatively propose to perform augmentations on the hidden features (feature augmentation). Inspired by so-called matrix sketching, we propose COSTA, a novel COvariance-preServing feaTure space Augmentation framework for GCL, which generates augmented features by maintaining a ``good sketch'' of original features. To highlight the superiority of feature augmentation with COSTA, we investigate a single-view setting (in addition to the multi-view one) which conserves memory and computations. We show that feature augmentation with COSTA achieves results comparable to or better than those of graph augmentation based models.
    Leveraging Centric Data Federated Learning Using Blockchain For Integrity Assurance. (arXiv:2206.04731v1 [cs.LG])
    Machine learning abilities have become a vital component for various solutions across industries, applications, and sectors. Many organizations seek to leverage AI-based solutions across their business services to unlock better efficiency and increase productivity. Problems, however, can arise if there is a lack of quality data for AI-model training, scalability, and maintenance. We propose a data-centric federated learning architecture leveraged by a public blockchain and smart contracts to overcome this significant issue. Our proposed solution provides a virtual public marketplace where developers, data scientists, and AI engineers can publish their models and collaboratively create and access quality data for training. We enhance data quality and integrity through an incentive mechanism that rewards contributors for data contribution and verification. In a simulation with only one user, these mechanisms combined with the proposed framework increased the training dataset by an average of 100 inputs daily and the model accuracy by approximately 4\%.
    Sparsity in Partially Controllable Linear Systems. (arXiv:2110.06150v2 [math.OC] UPDATED)
    A fundamental concept in control theory is that of controllability, where any system state can be reached through an appropriate choice of control inputs. Indeed, a large body of classical and modern approaches are designed for controllable linear dynamical systems. However, in practice, we often encounter systems in which a large set of state variables evolve exogenously and independently of the control inputs; such systems are only partially controllable. The focus of this work is on a large class of partially controllable linear dynamical systems, specified by an underlying sparsity pattern. Our main results establish structural conditions and finite-sample guarantees for learning to control such systems. In particular, our structural results characterize those state variables which are irrelevant for optimal control, an analysis which departs from classical control techniques. Our algorithmic results adapt techniques from high-dimensional statistics -- specifically soft-thresholding and semiparametric least-squares -- to exploit the underlying sparsity pattern in order to obtain finite-sample guarantees that significantly improve over those based on certainty-equivalence. We also corroborate these theoretical improvements over certainty-equivalent control through a simulation study.
    Robust Factorization of Real-world Tensor Streams with Patterns, Missing Values, and Outliers. (arXiv:2102.08466v2 [cs.LG] UPDATED)
    Consider multiple seasonal time series being collected in real-time, in the form of a tensor stream. Real-world tensor streams often include missing entries (e.g., due to network disconnection) and at the same time unexpected outliers (e.g., due to system errors). Given such a real-world tensor stream, how can we estimate missing entries and predict future evolution accurately in real-time? In this work, we answer this question by introducing SOFIA, a robust factorization method for real-world tensor streams. In a nutshell, SOFIA smoothly and tightly integrates tensor factorization, outlier removal, and temporal-pattern detection, which naturally reinforce each other. Moreover, SOFIA integrates them in linear time, in an online manner, despite the presence of missing entries. We experimentally show that SOFIA is (a) robust and accurate: yielding up to 76% lower imputation error and 71% lower forecasting error; (b) fast: up to 935X faster than the second-most accurate competitor; and (c) scalable: scaling linearly with the number of new entries per time step.
    Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels?. (arXiv:2206.05266v1 [cs.LG])
    We investigate whether self-supervised learning (SSL) can improve online reinforcement learning (RL) from pixels. We extend the contrastive reinforcement learning framework (e.g., CURL) that jointly optimizes SSL and RL losses and conduct an extensive amount of experiments with various self-supervised losses. Our observations suggest that the existing SSL framework for RL fails to bring meaningful improvement over the baselines only taking advantage of image augmentation when the same amount of data and augmentation is used. We further perform an evolutionary search to find the optimal combination of multiple self-supervised losses for RL, but find that even such a loss combination fails to meaningfully outperform the methods that only utilize carefully designed image augmentations. Often, the use of self-supervised losses under the existing framework lowered RL performances. We evaluate the approach in multiple different environments including a real-world robot environment and confirm that no single self-supervised loss or image augmentation method can dominate all environments and that the current framework for joint optimization of SSL and RL is limited. Finally, we empirically investigate the pretraining framework for SSL + RL and the properties of representations learned with different approaches.
    Balanced Product of Experts for Long-Tailed Recognition. (arXiv:2206.05260v1 [cs.CV])
    Many real-world recognition problems suffer from an imbalanced or long-tailed label distribution. Those distributions make representation learning more challenging due to limited generalization over the tail classes. If the test distribution differs from the training distribution, e.g. uniform versus long-tailed, the problem of the distribution shift needs to be addressed. To this aim, recent works have extended softmax cross-entropy using margin modifications, inspired by Bayes' theorem. In this paper, we generalize several approaches with a Balanced Product of Experts (BalPoE), which combines a family of models with different test-time target distributions to tackle the imbalance in the data. The proposed experts are trained in a single stage, either jointly or independently, and fused seamlessly into a BalPoE. We show that BalPoE is Fisher consistent for minimizing the balanced error and perform extensive experiments to validate the effectiveness of our approach. Finally, we investigate the effect of Mixup in this setting, discovering that regularization is a key ingredient for learning calibrated experts. Our experiments show that a regularized BalPoE can perform remarkably well in test accuracy and calibration metrics, leading to state-of-the-art results on CIFAR-100-LT, ImageNet-LT, and iNaturalist-2018 datasets. The code will be made publicly available upon paper acceptance.
    List-Decodable Sparse Mean Estimation via Difference-of-Pairs Filtering. (arXiv:2206.05245v1 [cs.DS])
    We study the problem of list-decodable sparse mean estimation. Specifically, for a parameter $\alpha \in (0, 1/2)$, we are given $m$ points in $\mathbb{R}^n$, $\lfloor \alpha m \rfloor$ of which are i.i.d. samples from a distribution $D$ with unknown $k$-sparse mean $\mu$. No assumptions are made on the remaining points, which form the majority of the dataset. The goal is to return a small list of candidates containing a vector $\widehat \mu$ such that $\| \widehat \mu - \mu \|_2$ is small. Prior work had studied the problem of list-decodable mean estimation in the dense setting. In this work, we develop a novel, conceptually simpler technique for list-decodable mean estimation. As the main application of our approach, we provide the first sample and computationally efficient algorithm for list-decodable sparse mean estimation. In particular, for distributions with ``certifiably bounded'' $t$-th moments in $k$-sparse directions and sufficiently light tails, our algorithm achieves error of $(1/\alpha)^{O(1/t)}$ with sample complexity $m = (k\log(n))^{O(t)}/\alpha$ and running time $\mathrm{poly}(mn^t)$. For the special case of Gaussian inliers, our algorithm achieves the optimal error guarantee of $\Theta (\sqrt{\log(1/\alpha)})$ with quasi-polynomial sample and computational complexity. We complement our upper bounds with nearly-matching statistical query and low-degree polynomial testing lower bounds.
    On Convergence of FedProx: Local Dissimilarity Invariant Bounds, Non-smoothness and Beyond. (arXiv:2206.05187v1 [stat.ML])
    The FedProx algorithm is a simple yet powerful distributed proximal point optimization method widely used for federated learning (FL) over heterogeneous data. Despite its popularity and remarkable success witnessed in practice, the theoretical understanding of FedProx is largely underinvestigated: the appealing convergence behavior of FedProx is so far characterized under certain non-standard and unrealistic dissimilarity assumptions of local functions, and the results are limited to smooth optimization problems. In order to remedy these deficiencies, we develop a novel local dissimilarity invariant convergence theory for FedProx and its minibatch stochastic extension through the lens of algorithmic stability. As a result, we derive several new and deeper insights into FedProx for non-convex federated optimization including: 1) convergence guarantees independent of local dissimilarity type conditions; 2) convergence guarantees for non-smooth FL problems; and 3) linear speedup with respect to the size of the minibatch and the number of sampled devices. Our theory for the first time reveals that local dissimilarity and smoothness are not required for FedProx to get favorable complexity bounds. Preliminary experimental results on a series of benchmark FL datasets are reported to demonstrate the benefit of minibatching for improving the sample efficiency of FedProx.
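As background for the abstract above, the proximal update at the heart of FedProx can be sketched as follows: each device approximately minimizes its local loss plus a term (mu/2)·||w − w_global||², and the server averages the results. This is a toy sketch under stated assumptions, not the paper's analysis setting: the quadratic client losses, full participation, plain gradient descent as the local solver, and all function names are hypothetical.

```python
import numpy as np

def fedprox_local_update(w_global, grad_fn, mu, lr=0.1, steps=50):
    """Approximately minimize F_k(w) + (mu / 2) * ||w - w_global||^2."""
    w = w_global.copy()
    for _ in range(steps):
        # Gradient of the local loss plus gradient of the proximal term.
        g = grad_fn(w) + mu * (w - w_global)
        w = w - lr * g
    return w

def fedprox_round(w_global, grad_fns, mu):
    """One communication round: local proximal solves, then averaging."""
    local_models = [fedprox_local_update(w_global, g, mu) for g in grad_fns]
    return np.mean(local_models, axis=0)

# Two heterogeneous clients with losses 0.5 * ||w - c_k||^2 (toy data).
centers = [np.array([1.0, 0.0]), np.array([-1.0, 2.0])]
grad_fns = [lambda w, c=c: w - c for c in centers]

w = np.zeros(2)
for _ in range(30):
    w = fedprox_round(w, grad_fns, mu=1.0)
```

In this toy problem the global optimum is the mean of the client centers, and the proximal term keeps each local model from drifting all the way to its own center within a round.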
    Weakly-supervised segmentation using inherently-explainable classification models and their application to brain tumour classification. (arXiv:2206.05148v1 [eess.IV])
    Deep learning models have shown their potential for several applications. However, most of the models are opaque and difficult to trust due to their complex reasoning - commonly known as the black-box problem. Some fields, such as medicine, require a high degree of transparency to accept and adopt such technologies. Consequently, building trust in deep learning models requires either creating explainable/interpretable models or applying post-hoc methods to classifiers. Moreover, deep learning methods can be used for segmentation tasks, which typically require hard-to-obtain, time-consuming manually-annotated segmentation labels for training. This paper introduces three inherently-explainable classifiers to tackle both of these problems as one. The localisation heatmaps provided by the networks -- representing the models' focus areas and being used in classification decision-making -- can be directly interpreted, without requiring any post-hoc methods to derive information for model explanation. The models are trained by using the input image and only the classification labels as ground-truth in a supervised fashion - without using any information about the location of the region of interest (i.e. the segmentation labels), making the segmentation training of the models weakly-supervised through classification labels. The final segmentation is obtained by thresholding these heatmaps. The models were employed for the task of multi-class brain tumour classification using two different datasets, resulting in the best F1-score of 0.93 for the supervised classification task while securing a median Dice score of 0.67$\pm$0.08 for the weakly-supervised segmentation task. Furthermore, the obtained accuracy on a subset of tumour-only images outperformed the state-of-the-art glioma tumour grading binary classifiers with the best model achieving 98.7\% accuracy.
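The last two steps mentioned above, thresholding a localisation heatmap into a segmentation mask and scoring it with the Dice coefficient, can be sketched as follows. This is a minimal illustration: the min-max rescaling, the 0.5 threshold, and the tiny example arrays are assumptions, not values taken from the paper.

```python
import numpy as np

def heatmap_to_mask(heatmap, threshold=0.5):
    """Rescale a heatmap to [0, 1] and threshold it into a binary mask."""
    lo, hi = heatmap.min(), heatmap.max()
    if hi > lo:
        scaled = (heatmap - lo) / (hi - lo)
    else:
        scaled = np.zeros(heatmap.shape)      # flat heatmap: nothing selected
    return scaled >= threshold

def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0                            # both masks empty: perfect match
    return 2.0 * np.logical_and(pred, target).sum() / denom

# Hypothetical 2x2 heatmap and ground-truth segmentation.
heatmap = np.array([[0.0, 0.2], [0.6, 1.0]])
target = np.array([[False, False], [False, True]])

mask = heatmap_to_mask(heatmap)
score = dice_score(mask, target)
```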
    Meta-data Study in Autism Spectrum Disorder Classification Based on Structural MRI. (arXiv:2206.05052v1 [cs.LG])
    Accurate diagnosis of autism spectrum disorder (ASD) based on neuroimaging data has significant implications, as extracting useful information from neuroimaging data for ASD detection is challenging. Even though machine learning techniques have been leveraged to improve the information extraction from neuroimaging data, the varying data quality caused by different meta-data conditions (i.e., data collection strategies) limits the effective information that can be extracted, thus leading to data-dependent predictive accuracies in ASD detection, which can be worse than random guessing in some cases. In this work, we systematically investigate the impact of three kinds of meta-data on the predictive accuracy of classifying ASD based on structural MRI collected from 20 different sites, where meta-data conditions vary.
    Human-AI Interaction Design in Machine Teaching. (arXiv:2206.05182v1 [cs.HC])
    Machine Teaching (MT) is an interactive process where a human and a machine interact with the goal of training a machine learning (ML) model for a specified task. The human teacher communicates their task expertise and the machine student gathers the required data and knowledge to produce an ML model. MT systems are developed to jointly minimize the time spent on teaching and the learner's error rate. The design of human-AI interaction in an MT system not only impacts the teaching efficiency, but also indirectly influences the ML performance by affecting the teaching quality. In this paper, we build upon our previous work where we proposed an MT framework with three components, viz., the teaching interface, the machine learner, and the knowledge base, and focus on the human-AI interaction design involved in realizing the teaching interface. We outline design decisions that need to be addressed in developing an MT system beginning from an ML task. The paper follows the Socratic method entailing a dialogue between a curious student and a wise teacher.
    Coswara: A website application enabling COVID-19 screening by analysing respiratory sound samples and health symptoms. (arXiv:2206.05053v1 [cs.HC])
    The COVID-19 pandemic has accelerated research on the design of alternative, quick and effective COVID-19 diagnosis approaches. In this paper, we describe the Coswara tool, a website application designed to enable COVID-19 detection by analysing respiratory sound samples and health symptoms. A user of this service can log into a website using any device connected to the internet, provide their current health symptom information and record a few sound samples corresponding to breathing, cough, and speech. Within a minute of analysis of this information on a cloud server, the website tool will output a COVID-19 probability score to the user. As the COVID-19 pandemic continues to demand massive and scalable population level testing, we hypothesize that the proposed tool provides a potential solution towards this.
    Fast Deep Autoencoder for Federated learning. (arXiv:2206.05136v1 [cs.LG])
    This paper presents a novel, fast and privacy preserving implementation of deep autoencoders. DAEF (Deep Autoencoder for Federated learning), unlike traditional neural networks, trains a deep autoencoder network in a non-iterative way, which drastically reduces its training time. Its training can be carried out in a distributed way (several partitions of the dataset in parallel) and incrementally (aggregation of partial models), and due to its mathematical formulation, the data that is exchanged does not endanger the privacy of the users. This makes DAEF a valid method for edge computing and federated learning scenarios. The method has been evaluated and compared to traditional (iterative) deep autoencoders using seven real anomaly detection datasets, and their performance has been shown to be similar despite DAEF's faster training.
    Provable Guarantees for Sparsity Recovery with Deterministic Missing Data Patterns. (arXiv:2206.04893v1 [cs.LG])
    We study the problem of consistently recovering the sparsity pattern of a regression parameter vector from correlated observations governed by deterministic missing data patterns using Lasso. We consider the case in which the observed dataset is censored by a deterministic, non-uniform filter. Recovering the sparsity pattern in datasets with deterministic missing structure can be arguably more challenging than recovering in a uniformly-at-random scenario. In this paper, we propose an efficient algorithm for missing value imputation by utilizing the topological property of the censorship filter. We then provide novel theoretical results for exact recovery of the sparsity pattern using the proposed imputation strategy. Our analysis shows that, under certain statistical and topological conditions, the hidden sparsity pattern can be recovered consistently with high probability in polynomial time and logarithmic sample complexity.
    Out of Sight, Out of Mind: A Source-View-Wise Feature Aggregation for Multi-View Image-Based Rendering. (arXiv:2206.04906v1 [cs.CV])
    To estimate the volume density and color of a 3D point in multi-view image-based rendering, a common approach is to inspect the consensus existence among the given source image features, which is one of the informative cues for the estimation procedure. To this end, most of the previous methods utilize equally-weighted aggregation features. However, this could make it hard to check the consensus existence when some outliers, which are frequently caused by occlusions, are included in the source image feature set. In this paper, we propose a novel source-view-wise feature aggregation method, which helps us find the consensus in a robust way by leveraging local structures in the feature set. We first calculate the source-view-wise distance distribution for each source feature for the proposed aggregation. After that, the distance distribution is converted to several similarity distributions with the proposed learnable similarity mapping functions. Finally, for each element in the feature set, the aggregation features are extracted by calculating the weighted means and variances, where the weights are derived from the similarity distributions. In experiments, we validate the proposed method on various benchmark datasets, including synthetic and real image scenes. The experimental results demonstrate that incorporating the proposed features improves the performance by a large margin, resulting in state-of-the-art performance.
    Hierarchical mixtures of Gaussians for combined dimensionality reduction and clustering. (arXiv:2206.04841v1 [cs.LG])
    To avoid the curse of dimensionality, a common approach to clustering high-dimensional data is to first project the data into a space of reduced dimension, and then cluster the projected data. Although effective, this two-stage approach prevents joint optimization of the dimensionality-reduction and clustering models, and obscures how well the complete model describes the data. Here, we show how a family of such two-stage models can be combined into a single, hierarchical model that we call a hierarchical mixture of Gaussians (HMoG). An HMoG simultaneously captures both dimensionality-reduction and clustering, and its performance is quantified in closed-form by the likelihood function. By formulating and extending existing models with exponential family theory, we show how to maximize the likelihood of HMoGs with expectation-maximization. We apply HMoGs to synthetic data and RNA sequencing data, and demonstrate how they exceed the limitations of two-stage models. Ultimately, HMoGs are a rigorous generalization of a common statistical framework, and provide researchers with a method to improve model performance when clustering high-dimensional data.
    Binarizing Split Learning for Data Privacy Enhancement and Computation Reduction. (arXiv:2206.04864v1 [cs.LG])
    Split learning (SL) enables data privacy preservation by allowing clients to collaboratively train a deep learning model with the server without sharing raw data. However, SL still has limitations such as potential data privacy leakage and high computation at clients. In this study, we propose to binarize the SL local layers for faster computation (up to 17.5 times less forward-propagation time in both training and inference phases on mobile devices) and reduced memory usage (up to 32 times less memory and bandwidth requirements). More importantly, the binarized SL (B-SL) model can reduce privacy leakage from SL smashed data with merely a small degradation in model accuracy. To further enhance the privacy preservation, we also propose two novel approaches: 1) training with additional local leak loss and 2) applying differential privacy, which could be integrated separately or concurrently into the B-SL model. Experimental results with different datasets have affirmed the advantages of the B-SL models compared with several benchmark models. The effectiveness of B-SL models against feature-space hijacking attack (FSHA) is also illustrated. Our results have demonstrated B-SL models are promising for lightweight IoT/mobile applications with high privacy-preservation requirements such as mobile healthcare applications.
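    The memory figure above is consistent with replacing 32-bit floating-point values by single bits, which the following sketch illustrates. Sign-based binarisation and the helper names `binarize` and `pack_bits` are assumptions made for illustration; the abstract does not specify how the local layers are binarised.

```python
import struct


def binarize(values):
    """Map each real value to +1 or -1 by its sign (zero maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Pack a list of +1/-1 values into bytes, using one bit per value."""
    out = bytearray((len(signs) + 7) // 8)
    for i, s in enumerate(signs):
        if s > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


# 32 hypothetical activation values.
activations = [0.3, -1.2, 0.0, 2.5, -0.1, 0.8, -0.7, 1.1] * 4

# Size when stored as 32-bit floats versus one bit per value.
float_bytes = len(struct.pack(f"{len(activations)}f", *activations))
bit_bytes = len(pack_bits(binarize(activations)))
ratio = float_bytes // bit_bytes
print(float_bytes, bit_bytes, ratio)
```

    The ratio covers storage only; the forward-propagation speed-up reported in the abstract depends on the hardware and kernels and is not modelled here.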
    Stable and memory-efficient image recovery using monotone operator learning (MOL). (arXiv:2206.04797v1 [cs.CV])
    We introduce a monotone deep equilibrium learning framework for large-scale inverse problems in imaging. The proposed algorithm relies on forward-backward splitting, where each iteration consists of a gradient descent involving the score function and a conjugate gradient algorithm to encourage data consistency. The score function is modeled as a monotone convolutional neural network. The use of a monotone operator offers several benefits, including guaranteed convergence, uniqueness of fixed point, and robustness to input perturbations, similar to the use of convex priors in compressive sensing. In addition, the proposed formulation is significantly more memory-efficient than unrolled methods, which allows us to apply it to 3D problems that current unrolled algorithms cannot handle. Experiments show that the proposed scheme can offer improved performance in 3D settings while being stable in the presence of input perturbations.
    Deep learning-enhanced ensemble-based data assimilation for high-dimensional nonlinear dynamical systems. (arXiv:2206.04811v1 [cs.LG])
    Data assimilation (DA) is a key component of many forecasting models in science and engineering. DA allows one to estimate better initial conditions using an imperfect dynamical model of the system and noisy/sparse observations available from the system. The ensemble Kalman filter (EnKF) is a DA algorithm that is widely used in applications involving high-dimensional nonlinear dynamical systems. However, EnKF requires evolving large ensembles of forecasts using the dynamical model of the system. This often becomes computationally intractable, especially when the number of states of the system is very large, e.g., for weather prediction. With small ensembles, the estimated background error covariance matrix in the EnKF algorithm suffers from sampling error, leading to an erroneous estimate of the analysis state (initial condition for the next forecast cycle). In this work, we propose a hybrid ensemble Kalman filter (H-EnKF), which is applied to a two-layer quasi-geostrophic flow system as a test case. This framework utilizes a pre-trained deep learning-based data-driven surrogate that inexpensively generates and evolves a large data-driven ensemble of the states of the system to accurately compute the background error covariance matrix with less sampling error. The H-EnKF framework estimates a better initial condition without the need for any ad-hoc localization strategies. H-EnKF can be extended to any ensemble-based DA algorithm, e.g., particle filters, which are currently difficult to use for high dimensional systems.
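    The background error covariance referred to above is, in a standard EnKF, the sample covariance of the ensemble members, which is why a small ensemble gives a noisy estimate. A minimal sketch of that estimator follows; the function name `ensemble_covariance` and the toy ensemble are illustrative assumptions and are not taken from the paper.

```python
def ensemble_covariance(ensemble):
    """Sample covariance of an ensemble of state vectors.

    ensemble: list of N state vectors, each a list of equal length.
    Returns the matrix P with P[i][j] = sum_k a_k[i] * a_k[j] / (N - 1),
    where a_k is member k minus the ensemble mean.
    """
    n = len(ensemble)
    dim = len(ensemble[0])
    mean = [sum(member[j] for member in ensemble) / n for j in range(dim)]
    anomalies = [[member[j] - mean[j] for j in range(dim)] for member in ensemble]
    return [[sum(a[i] * a[j] for a in anomalies) / (n - 1) for j in range(dim)]
            for i in range(dim)]


# Three hypothetical ensemble members of a two-variable state.
ensemble = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
cov = ensemble_covariance(ensemble)
print(cov)
```

    With N members the estimate has rank at most N - 1, so enlarging the ensemble cheaply, as the surrogate described above is meant to do, directly targets this limitation.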
    Deep Auto-encoder with Neural Response. (arXiv:2111.15309v2 [cs.LG] UPDATED)
    The artificial neural network (ANN) is a versatile tool to study the neural representation in the ventral visual stream, and the knowledge in neuroscience in return inspires ANN models to improve performance in the task. However, it is still unclear how to merge these two directions into a unified framework. In this study, we propose an integrated framework called Deep Autoencoder with Neural Response (DAE-NR), which incorporates information from ANN and the visual cortex to achieve better image reconstruction performance and higher neural representation similarity between biological and artificial neurons. The same visual stimuli (i.e., natural images) are input to both the mouse brain and DAE-NR. The encoder of DAE-NR jointly learns the dependencies from neural spike encoding and image reconstruction. For the neural spike encoding task, the features derived from a specific hidden layer of the encoder are transformed by a mapping function to predict the ground-truth neural response under the constraint of image reconstruction. Simultaneously, for the image reconstruction task, the latent representation obtained by the encoder is assigned to a decoder to restore the original image under the guidance of neural information. In DAE-NR, the learning processes of the encoder, mapping function and decoder are all implicitly constrained by these two tasks. Our experiments demonstrate that, if and only if joint learning is used, DAE-NRs can improve the performance of visual image reconstruction and increase the representation similarity between biological neurons and artificial neurons. The DAE-NR offers a new perspective on the integration of computer vision and neuroscience.  ( 2 min )
    Gaussian Mixture Variational Autoencoder with Contrastive Learning for Multi-Label Classification. (arXiv:2112.00976v2 [cs.LG] UPDATED)
    Multi-label classification (MLC) is a prediction task where each sample can have more than one label. We propose a novel contrastive learning boosted multi-label prediction model based on a Gaussian mixture variational autoencoder (C-GMVAE), which learns a multimodal prior space and employs a contrastive loss. Many existing methods introduce extra complex neural modules like graph neural networks to capture the label correlations, in addition to the prediction modules. We find that by using contrastive learning in the supervised setting, we can exploit label information effectively in a data-driven manner, and learn meaningful feature and label embeddings which capture the label correlations and enhance the predictive power. Our method also adopts the idea of learning and aligning latent spaces for both features and labels. In contrast to previous works based on a unimodal prior, C-GMVAE imposes a Gaussian mixture structure on the latent space, to alleviate the posterior collapse and over-regularization issues. C-GMVAE outperforms existing methods on multiple public datasets and can often match other models' full performance with only 50% of the training data. Furthermore, we show that the learnt embeddings provide insights into the interpretation of label-label interactions.  ( 2 min )
    Solving PDEs on Unknown Manifolds with Machine Learning. (arXiv:2106.06682v2 [math.NA] UPDATED)
    This paper proposes a mesh-free computational framework and machine learning theory for solving elliptic PDEs on unknown manifolds, identified with point clouds, based on diffusion maps (DM) and deep learning. The PDE solver is formulated as a supervised learning task to solve a least-squares regression problem that imposes an algebraic equation approximating a PDE (and boundary conditions if applicable). This algebraic equation involves a graph-Laplacian type matrix obtained via DM asymptotic expansion, which is a consistent estimator of second-order elliptic differential operators. The resulting numerical method is to solve a highly non-convex empirical risk minimization problem subjected to a solution from a hypothesis space of neural networks. In a well-posed elliptic PDE setting, when the hypothesis space consists of neural networks with either infinite width or depth, we show that the global minimizer of the empirical loss function is a consistent solution in the limit of large training data. When the hypothesis space is a two-layer neural network, we show that for a sufficiently large width, gradient descent can identify a global minimizer of the empirical loss function. Supporting numerical examples demonstrate the convergence of the solutions, ranging from simple manifolds with low and high co-dimensions, to rough surfaces with and without boundaries. We also show that the proposed NN solver can robustly generalize the PDE solution on new data points with generalization errors that are almost identical to the training errors, superseding a Nystrom-based interpolation method.  ( 2 min )
    Generalization Bounds with Minimal Dependency on Hypothesis Class via Distributionally Robust Optimization. (arXiv:2106.11180v3 [math.OC] UPDATED)
    Established approaches to obtain generalization bounds in data-driven optimization and machine learning mostly build on solutions from empirical risk minimization (ERM), which depend crucially on the functional complexity of the hypothesis class. In this paper, we present an alternate route to obtain these bounds on the solution from distributionally robust optimization (DRO), a recent data-driven optimization framework based on worst-case analysis and the notion of ambiguity set to capture statistical uncertainty. In contrast to the hypothesis class complexity in ERM, our DRO bounds depend on the ambiguity set geometry and its compatibility with the true loss function. Notably, when using maximum mean discrepancy as a DRO distance metric, our analysis implies generalization bounds whose dependence on the hypothesis class appears the minimal possible: The bound depends solely on the true loss function, independent of any other candidates in the hypothesis class. To our best knowledge, it is the first generalization bound of this type in the literature, and we hope our findings can open the door for a better understanding of DRO, especially its benefits on loss minimization and other machine learning applications.  ( 2 min )
    Meta-Reinforcement Learning with Self-Modifying Networks. (arXiv:2202.02363v2 [cs.LG] UPDATED)
    Deep Reinforcement Learning has demonstrated the potential of neural networks tuned with gradient descent for solving complex tasks in well-delimited environments. However, these neural systems are slow learners producing specialised agents with no mechanism to continue learning beyond their training curriculum. On the contrary, biological synaptic plasticity is persistent and manifold, and has been hypothesised to play a key role in executive functions such as working memory and cognitive flexibility, potentially supporting more efficient and generic learning abilities. Inspired by this, we propose to build networks with dynamic weights, able to continually perform self-reflexive modification as a function of their current synaptic state and action-reward feedback, rather than a fixed network configuration. The resulting model, MetODS (for Meta-Optimized Dynamical Synapses) is a broadly applicable meta-reinforcement learning system able to learn efficient and powerful control rules in the agent policy space. A single layer with dynamic synapses can perform one-shot learning, generalize navigation principles to unseen environments and demonstrate a strong ability to learn adaptive motor policies, comparing favourably with previous meta-reinforcement learning approaches.  ( 2 min )
    Asymptotic Escape of Spurious Critical Points on the Low-rank Matrix Manifold. (arXiv:2107.09207v2 [math.OC] UPDATED)
    We show that on the manifold of fixed-rank and symmetric positive semi-definite matrices, the Riemannian gradient descent algorithm almost surely escapes some spurious critical points on the boundary of the manifold. Our result is the first to partially overcome the incompleteness of the low-rank matrix manifold without changing the vanilla Riemannian gradient descent algorithm. The spurious critical points are some rank-deficient matrices that capture only part of the eigen components of the ground truth. Unlike classical strict saddle points, they exhibit very singular behavior. We show that using the dynamical low-rank approximation and a rescaled gradient flow, some of the spurious critical points can be converted to classical strict saddle points in the parameterized domain, which leads to the desired result. Numerical experiments are provided to support our theoretical findings.  ( 2 min )
    Stochastic Continuous Submodular Maximization: Boosting via Non-oblivious Function. (arXiv:2201.00703v3 [cs.LG] UPDATED)
    In this paper, we revisit Stochastic Continuous Submodular Maximization in both offline and online settings, which can benefit wide applications in machine learning and operations research areas. We present a boosting framework covering gradient ascent and online gradient ascent. The fundamental ingredient of our methods is a novel non-oblivious function $F$ derived from a factor-revealing optimization problem, any stationary point of which provides a $(1-e^{-\gamma})$-approximation to the global maximum of the $\gamma$-weakly DR-submodular objective function $f\in C^{1,1}_L(\mathcal{X})$. Under the offline scenario, we propose a boosting gradient ascent method achieving $(1-e^{-\gamma}-\epsilon^{2})$-approximation after $O(1/\epsilon^2)$ iterations, which improves the $(\frac{\gamma^2}{1+\gamma^2})$ approximation ratio of the classical gradient ascent algorithm. In the online setting, for the first time we consider the adversarial delays for stochastic gradient feedback, under which we propose a boosting online gradient algorithm with the same non-oblivious function $F$. Meanwhile, we verify that this boosting online algorithm achieves a regret of $O(\sqrt{D})$ against a $(1-e^{-\gamma})$-approximation to the best feasible solution in hindsight, where $D$ is the sum of delays of gradient feedback. To the best of our knowledge, this is the first result to obtain $O(\sqrt{T})$ regret against a $(1-e^{-\gamma})$-approximation with $O(1)$ gradient inquiry at each time step, when no delay exists, i.e., $D=T$. Finally, numerical experiments demonstrate the effectiveness of our boosting methods.  ( 2 min )
    Recurrent Neural Network Training with Convex Loss and Regularization Functions by Extended Kalman Filtering. (arXiv:2111.02673v2 [cs.LG] UPDATED)
    This paper investigates the use of extended Kalman filtering to train recurrent neural networks with rather general convex loss functions and regularization terms on the network parameters, including $\ell_1$-regularization. We show that the learning method outperforms stochastic gradient descent in a nonlinear system identification benchmark and in training a linear system with binary outputs. We also explore the use of the algorithm in data-driven nonlinear model predictive control and its relation with disturbance models for offset-free closed-loop tracking.  ( 2 min )
    Assemblies of neurons learn to classify well-separated distributions. (arXiv:2110.03171v2 [cs.NE] UPDATED)
    An assembly is a large population of neurons whose synchronous firing is hypothesized to represent a memory, concept, word, and other cognitive categories. Assemblies are believed to provide a bridge between high-level cognitive phenomena and low-level neural activity. Recently, a computational system called the Assembly Calculus (AC), with a repertoire of biologically plausible operations on assemblies, has been shown capable of simulating arbitrary space-bounded computation, but also of simulating complex cognitive phenomena such as language, reasoning, and planning. However, the mechanism whereby assemblies can mediate learning has not been known. Here we present such a mechanism, and prove rigorously that, for simple classification problems defined on distributions of labeled assemblies, a new assembly representing each class can be reliably formed in response to a few stimuli from the class; this assembly is henceforth reliably recalled in response to new stimuli from the same class. Furthermore, such class assemblies will be distinguishable as long as the respective classes are reasonably separated -- for example, when they are clusters of similar assemblies. To prove these results, we draw on random graph theory with dynamic edge weights to estimate sequences of activated vertices, yielding strong generalizations of previous calculations and theorems in this field over the past five years. These theorems are backed up by experiments demonstrating the successful formation of assemblies which represent concept classes on synthetic data drawn from such distributions, and also on MNIST, which lends itself to classification through one assembly per digit. Seen as a learning algorithm, this mechanism is entirely online, generalizes from very few samples, and requires only mild supervision -- all key attributes of learning in a model of the brain.  ( 3 min )
    Coarsening the Granularity: Towards Structurally Sparse Lottery Tickets. (arXiv:2202.04736v2 [cs.LG] UPDATED)
    The lottery ticket hypothesis (LTH) has shown that dense models contain highly sparse subnetworks (i.e., winning tickets) that can be trained in isolation to match full accuracy. Despite many exciting efforts being made, there is one "commonsense" rarely challenged: a winning ticket is found by iterative magnitude pruning (IMP) and hence the resultant pruned subnetworks have only unstructured sparsity. That gap limits the appeal of winning tickets in practice, since the highly irregular sparse patterns are challenging to accelerate on hardware. Meanwhile, directly substituting structured pruning for unstructured pruning in IMP damages performance more severely and is usually unable to locate winning tickets. In this paper, we demonstrate the first positive result that a structurally sparse winning ticket can be effectively found in general. The core idea is to append "post-processing techniques" after each round of (unstructured) IMP, to enforce the formation of structural sparsity. Specifically, we first "re-fill" pruned elements back in some channels deemed to be important, and then "re-group" non-zero elements to create flexible group-wise structural patterns. Both our identified channel- and group-wise structural subnetworks win the lottery, with substantial inference speedups readily supported by existing hardware. Extensive experiments, conducted on diverse datasets across multiple network backbones, consistently validate our proposal, showing that the hardware acceleration roadblock of LTH is now removed. Specifically, the structural winning tickets obtain up to {64.93%, 64.84%, 60.23%} running time savings at {36%~80%, 74%, 58%} sparsity on {CIFAR, Tiny-ImageNet, ImageNet}, while maintaining comparable accuracy. Code is at https://github.com/VITA-Group/Structure-LTH.  ( 2 min )
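    For context, the unstructured step that each IMP round starts from can be sketched as below: the smallest-magnitude weights are zeroed regardless of their position, which is what produces the irregular sparsity the abstract describes. The function name `magnitude_prune` and the flat weight list are illustrative assumptions; the "re-fill" and "re-group" post-processing proposed in the paper is not reproduced here.

```python
def magnitude_prune(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * fraction)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:k])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


# Hypothetical flattened weights of one layer, pruned to 50% sparsity.
weights = [0.5, -0.1, 0.9, 0.05, -0.7, 0.2]
sparse = magnitude_prune(weights, 0.5)
print(sparse)
```

    The surviving weights sit at arbitrary positions, so no whole channel or group can be skipped at inference time, which is the hardware obstacle the abstract refers to.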
    Preference Communication in Multi-Objective Normal-Form Games. (arXiv:2111.09191v2 [cs.GT] UPDATED)
    We consider preference communication in two-player multi-objective normal-form games. In such games, the payoffs resulting from joint actions are vector-valued. Taking a utility-based approach, we assume there exists a utility function for each player which maps vectors to scalar utilities and consider agents that aim to maximise the utility of expected payoff vectors. As agents typically do not know their opponent's utility function or strategy, they must learn policies to interact with each other. Inspired by Stackelberg games, we introduce four novel preference communication protocols to aid agents in arriving at adequate solutions. Each protocol describes a specific approach for one agent to communicate preferences over their actions and how another agent responds. Additionally, to study when communication emerges, we introduce a communication protocol where agents must learn when to communicate. These protocols are subsequently evaluated on a set of five benchmark games against baseline agents that do not communicate. We find that preference communication can alter the learning process and lead to the emergence of cyclic policies which had not been previously observed in this setting. We further observe that the resulting policies can heavily depend on the characteristics of the game that is played. Lastly, we find that communication naturally emerges in both cooperative and self-interested settings.  ( 2 min )
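    The optimisation criterion described above, the utility of the expected payoff vector, can be made concrete with a small sketch. The 2x2 payoff table, the mixed strategies and the product utility are hypothetical choices for illustration and are not among the paper's benchmark games.

```python
def expected_payoff(payoffs, row_strategy, col_strategy):
    """Expected payoff vector under independent mixed strategies.

    payoffs[i][j] is the vector-valued payoff of joint action (i, j).
    """
    n_objectives = len(payoffs[0][0])
    expected = [0.0] * n_objectives
    for i, p in enumerate(row_strategy):
        for j, q in enumerate(col_strategy):
            for k in range(n_objectives):
                expected[k] += p * q * payoffs[i][j][k]
    return expected


def utility(vector):
    """Hypothetical nonlinear utility: the product of the two objectives."""
    return vector[0] * vector[1]


# Two actions per player, two objectives per payoff (illustrative values).
payoffs = [[(4, 0), (2, 2)],
           [(2, 2), (0, 4)]]
vec = expected_payoff(payoffs, [0.5, 0.5], [0.5, 0.5])
value = utility(vec)
print(vec, value)
```

    With a nonlinear utility the order matters: the utility is applied once to the expected vector, not averaged over the utilities of individual outcomes.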
    Membership-Mappings for Data Representation Learning: Measure Theoretic Conceptualization. (arXiv:2104.07060v3 [cs.LG] UPDATED)
    A fuzzy theoretic analytical approach was recently introduced that leads to efficient and robust models while automatically addressing the typical issues associated with parametric deep models. However, a formal conceptualization of the fuzzy theoretic analytical deep models is still not available. This paper introduces, on a measure theoretic basis, the notion of \emph{membership-mapping} for representing data points through attribute values (motivated by fuzzy theory). A property of the membership-mapping that can be exploited for data representation learning is that it provides an interpolation on the given data points in the data space. An analytical approach to the variational learning of a membership-mappings based data representation model is considered.  ( 2 min )
    Offline Pre-trained Multi-Agent Decision Transformer: One Big Sequence Model Tackles All SMAC Tasks. (arXiv:2112.02845v3 [cs.LG] UPDATED)
    Offline reinforcement learning leverages previously-collected offline datasets to learn optimal policies with no necessity to access the real environment. Such a paradigm is also desirable for multi-agent reinforcement learning (MARL) tasks, given the increased interactions among agents and with the environment. Yet, in MARL, the paradigm of offline pre-training with online fine-tuning has not been studied, nor are datasets or benchmarks for offline MARL research available. In this paper, we facilitate the research by providing large-scale datasets, and use them to examine the usage of the Decision Transformer in the context of MARL. We investigate the generalisation of MARL offline pre-training in the following three aspects: 1) between single agents and multiple agents, 2) from offline pretraining to the online fine-tuning, and 3) to that of multiple downstream tasks with few-shot and zero-shot capabilities. We start by introducing the first offline MARL dataset with diverse quality levels based on the StarCraftII environment, and then propose the novel architecture of multi-agent decision transformer (MADT) for effective offline learning. MADT leverages the transformer's sequence modelling ability and integrates it seamlessly with both offline and online MARL tasks. A crucial benefit of MADT is that it learns generalisable policies that can transfer between different types of agents under different task scenarios. On the StarCraft II offline dataset, MADT outperforms the state-of-the-art offline RL baselines. When applied to online tasks, the pre-trained MADT significantly improves sample efficiency, and enjoys strong performance in both few-shot and zero-shot cases. To our best knowledge, this is the first work that studies and demonstrates the effectiveness of offline pre-trained models in terms of sample efficiency and generalisability enhancements in MARL.  ( 3 min )
    Low-Rank Approximation with $1/\epsilon^{1/3}$ Matrix-Vector Products. (arXiv:2202.05120v3 [cs.DS] UPDATED)
    We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm. Here, given access to a matrix $A$ through matrix-vector products, an accuracy parameter $\epsilon$, and a target rank $k$, the goal is to find a rank-$k$ matrix $Z$ with orthonormal columns such that $\| A(I -ZZ^\top)\|_{S_p} \leq (1+\epsilon)\min_{U^\top U = I_k} \|A(I - U U^\top)\|_{S_p}$, where $\|M\|_{S_p}$ denotes the $\ell_p$ norm of the singular values of $M$. For the special cases of $p=2$ (Frobenius norm) and $p = \infty$ (Spectral norm), Musco and Musco (NeurIPS 2015) obtained an algorithm based on Krylov methods that uses $\tilde{O}(k/\sqrt{\epsilon})$ matrix-vector products, improving on the naive $\tilde{O}(k/\epsilon)$ dependence obtainable by the power method, where $\tilde{O}$ suppresses poly$(\log(dk/\epsilon))$ factors. Our main result is an algorithm that uses only $\tilde{O}(kp^{1/6}/\epsilon^{1/3})$ matrix-vector products, and works for all $p \geq 1$. For $p = 2$ our bound improves the previous $\tilde{O}(k/\epsilon^{1/2})$ bound to $\tilde{O}(k/\epsilon^{1/3})$. Since the Schatten-$p$ and Schatten-$\infty$ norms are the same up to a $(1+ \epsilon)$ factor when $p \geq (\log d)/\epsilon$, our bound recovers the result of Musco and Musco for $p = \infty$. Further, we prove a matrix-vector query lower bound of $\Omega(1/\epsilon^{1/3})$ for any fixed constant $p \geq 1$, showing that surprisingly $\tilde{\Theta}(1/\epsilon^{1/3})$ is the optimal complexity for constant~$k$. To obtain our results, we introduce several new techniques, including optimizing over multiple Krylov subspaces simultaneously, and pinching inequalities for partitioned operators. Our lower bound for $p \in [1,2]$ uses the Araki-Lieb-Thirring trace inequality, whereas for $p>2$, we appeal to a norm-compression inequality for aligned partitioned operators.  ( 2 min )
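As a concrete reading of the objective in the abstract above, the following minimal sketch computes the Schatten-$p$ norm and the residual $A(I - ZZ^\top)$ for the exact rank-$k$ optimum obtained from a full SVD. It is not the Krylov algorithm of the paper; it only illustrates the quantity that algorithm approximates. The helper names `schatten_norm` and `best_rank_k_projection` and the random test matrix are assumptions made for illustration.

```python
import numpy as np

def schatten_norm(M, p):
    """l_p norm of the singular values of M (p = np.inf gives the spectral norm)."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.linalg.norm(s, ord=p))

def best_rank_k_projection(A, k):
    """Top-k right singular vectors of A as orthonormal columns (exact SVD baseline)."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[:k].T

# Hypothetical test matrix; any real matrix works here.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 20))
k = 3

Z = best_rank_k_projection(A, k)                   # d x k, orthonormal columns
residual = A @ (np.eye(A.shape[1]) - Z @ Z.T)      # A (I - Z Z^T)

for p in (1, 2, np.inf):
    print(p, schatten_norm(residual, p))
```

A matrix-vector-product method would be judged by how close its residual norm comes to these values, up to the $(1+\epsilon)$ factor.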
    Domain Transformer: Predicting Samples of Unseen, Future Domains. (arXiv:2106.06057v2 [cs.LG] UPDATED)
    The data distribution commonly evolves over time, leading to problems such as concept drift that often decrease classifier performance. Current techniques are not adequate for this problem because they either require detailed knowledge of the transformation or are not suited for anticipating unseen domains and can only adapt to domains where data samples are available. We seek to predict unseen data (and their labels), allowing us to tackle challenges such as a non-constant data distribution in a proactive manner rather than detecting and reacting to already existing changes that might already have led to errors. To this end, we learn a domain transformer in an unsupervised manner that allows generating data of unseen domains. Our approach first matches independently learned latent representations of two given domains obtained from an auto-encoder using a Cycle-GAN. In turn, a transformation of the original samples can be learned that can be applied iteratively to extrapolate to unseen domains. Our evaluation of CNNs on image data confirms the usefulness of the approach. It also achieves very good results on the well-known problem of unsupervised domain adaptation, where only labels but no samples have to be predicted. Code is available at https://github.com/JohnTailor/DoTra.  ( 2 min )
    Deep Learning Based Automated COVID-19 Classification from Computed Tomography Images. (arXiv:2111.11191v3 [eess.IV] UPDATED)
    The paper presents a Convolutional Neural Network (CNN) model for image classification with image preprocessing and hyperparameter tuning, aiming at increasing the predictive performance for COVID-19 diagnosis while avoiding deeper and thus more complex alternatives. Firstly, the CNN model includes four similar convolutional layers followed by a flattening and two dense layers. This work proposes a less complex solution based on simply classifying 2D slices of CT scans using a CNN model. Despite the simplicity in architecture, the proposed CNN model showed improved quantitative results exceeding the state of the art on the dataset of images, in terms of the macro F1 score. The results were achieved on the original CT slices of the dataset. Secondly, the original dataset was processed via anatomy-relevant masking of slices, removing non-representative slices from the CT volume, and hyperparameter tuning. For slice processing, a fixed-sized rectangular area was used for cropping an anatomy-relevant region of interest in the images, and a threshold based on the number of white pixels in binarized slices was employed to remove non-representative slices from the 3D-CT scans. The CNN model with a learning rate schedule with exponential decay and slice flipping techniques was deployed on the processed slices. The proposed method was used to make predictions on the 2D slices. For final diagnosis at a patient level, majority voting was applied on the slices of each CT scan to make the diagnosis. The macro F1 score of the proposed method well exceeded the baseline approach and other alternatives' scores on the validation set as well as on a test partition of the previously unseen images from the COV19-CT-DB dataset partition.  ( 3 min )
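The patient-level majority voting mentioned in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `patient_diagnosis`, the label strings, and the tie-breaking behaviour (the label seen first wins) are assumptions.

```python
from collections import Counter

def patient_diagnosis(slice_predictions):
    """Return the most common slice-level label for one CT scan.

    slice_predictions: list of per-slice class labels (hypothetical format).
    Ties are resolved in favour of the label that appears first in the list,
    which is an assumption and not something the abstract specifies.
    """
    if not slice_predictions:
        raise ValueError("no slices left after filtering")
    return Counter(slice_predictions).most_common(1)[0][0]

# Hypothetical per-slice outputs of a 2D classifier for a single scan.
scan = ["covid", "non-covid", "covid", "covid", "non-covid"]
print(patient_diagnosis(scan))
```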
    Gradient Descent on Neurons and its Link to Approximate Second-Order Optimization. (arXiv:2201.12250v2 [cs.LG] UPDATED)
    Second-order optimizers are thought to hold the potential to speed up neural network training, but due to the enormous size of the curvature matrix, they typically require approximations to be computationally tractable. The most successful family of approximations are Kronecker-Factored, block-diagonal curvature estimates (KFAC). Here, we combine tools from prior work to evaluate exact second-order updates with careful ablations to establish a surprising result: Due to its approximations, KFAC is not closely related to second-order updates, and in particular, it significantly outperforms true second-order updates. This challenges widely held beliefs and immediately raises the question of why KFAC performs so well. Towards answering this question, we present evidence strongly suggesting that KFAC approximates a first-order algorithm, which performs gradient descent on neurons rather than weights. Finally, we show that this optimizer often improves over KFAC in terms of computational cost and data-efficiency.  ( 2 min )
    NeuroComb: Improving SAT Solving with Graph Neural Networks. (arXiv:2110.14053v3 [cs.AI] UPDATED)
    Propositional satisfiability (SAT) is an NP-complete problem that impacts many research fields, such as planning, verification, and security. Mainstream modern SAT solvers are based on the Conflict-Driven Clause Learning (CDCL) algorithm. Recent work aimed to enhance CDCL SAT solvers by improving their variable branching heuristics through predictions generated by Graph Neural Networks (GNNs). However, so far this approach either has not made solving more effective, or has required online access to substantial GPU resources. Aiming to make GNN improvements practical, this paper proposes an approach called NeuroComb, which builds on two insights: (1) predictions of important variables and clauses can be combined with dynamic branching into a more effective hybrid branching strategy, and (2) it is sufficient to query the neural model only once for the predictions before the SAT solving starts. NeuroComb is implemented as an enhancement to a classic CDCL solver called MiniSat and a more recent CDCL solver called Glucose. As a result, it allowed MiniSat to solve 11% and Glucose 5% more problems on the recent SATCOMP-2021 competition problem set, with the computational resource requirement of only one GPU. NeuroComb is therefore both an effective and a practical approach to improving SAT solving through machine learning.  ( 2 min )
    Integrated Conditional Estimation-Optimization. (arXiv:2110.12351v2 [stat.ML] UPDATED)
    Many real-world optimization problems involve uncertain parameters with probability distributions that can be estimated using contextual feature information. In contrast to the standard approach of first estimating the distribution of uncertain parameters and then optimizing the objective based on the estimation, we propose an integrated conditional estimation-optimization (ICEO) framework that estimates the underlying conditional distribution of the random parameter while considering the structure of the optimization problem. We directly model the relationship between the conditional distribution of the random parameter and the contextual features, and then estimate the probabilistic model with an objective that aligns with the downstream optimization problem. We show that our ICEO approach is asymptotically consistent under moderate regularity conditions and further provide finite performance guarantees in the form of generalization bounds. Computationally, performing estimation with the ICEO approach is a non-convex and often non-differentiable optimization problem. We propose a general methodology for approximating the potentially non-differentiable mapping from estimated conditional distribution to optimal decision by a differentiable function, which greatly improves the performance of gradient-based algorithms applied to the non-convex problem. We also provide a polynomial optimization solution approach in the semi-algebraic case. Numerical experiments are also conducted to show the empirical success of our approach in different situations including with limited data samples and model mismatches.  ( 2 min )
    Contrastive Supervised Distillation for Continual Representation Learning. (arXiv:2205.05476v2 [cs.CV] UPDATED)
    In this paper, we propose a novel training procedure for the continual representation learning problem in which a neural network model is sequentially learned to alleviate catastrophic forgetting in visual search tasks. Our method, called Contrastive Supervised Distillation (CSD), reduces feature forgetting while learning discriminative features. This is achieved by leveraging label information in a distillation setting in which the student model is contrastively learned from the teacher model. Extensive experiments show that CSD performs favorably in mitigating catastrophic forgetting by outperforming current state-of-the-art methods. Our results also provide further evidence that feature forgetting evaluated in visual retrieval tasks is not as catastrophic as in classification tasks. Code at: https://github.com/NiccoBiondi/ContrastiveSupervisedDistillation.  ( 2 min )
    Neural Bregman Divergences for Distance Learning. (arXiv:2206.04763v1 [cs.LG])
    Many metric learning tasks, such as triplet learning, nearest neighbor retrieval, and visualization, are treated primarily as embedding tasks where the ultimate metric is some variant of the Euclidean distance (e.g., cosine or Mahalanobis), and the algorithm must learn to embed points into the pre-chosen space. Non-Euclidean geometries, and whether they are more appropriate, are rarely explored, which we believe is due to a lack of tools for learning non-Euclidean measures of distance. Under the belief that the use of asymmetric methods in particular has lacked sufficient study, we propose a new approach to learning arbitrary Bregman divergences in a differentiable manner via input convex neural networks. Over a set of both new and previously studied tasks, including asymmetric regression, ranking, and clustering, we demonstrate that our method more faithfully learns divergences than prior Bregman learning approaches. In doing so we obtain the first method for learning neural Bregman divergences and with it inherit the many nice mathematical properties of Bregman divergences, providing the foundation and tooling for better developing and studying asymmetric distance learning.  ( 2 min )
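For readers unfamiliar with the term used in the abstract above, a Bregman divergence is defined from a convex function $\phi$ as $D_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y \rangle$. The sketch below evaluates this definition for two hand-picked generators; it does not learn $\phi$ with an input convex neural network as the paper proposes, and all function names are assumptions made for illustration.

```python
import numpy as np

def bregman_divergence(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y> for convex phi."""
    return float(phi(x) - phi(y) - grad_phi(y) @ (x - y))

# Generator 1: squared Euclidean norm, which yields the squared Euclidean distance.
def sq_norm(v):
    return float(v @ v)

def sq_norm_grad(v):
    return 2.0 * v

# Generator 2: negative entropy, which yields the KL divergence on probability vectors.
def neg_entropy(v):
    return float(np.sum(v * np.log(v)))

def neg_entropy_grad(v):
    return np.log(v) + 1.0

x = np.array([0.2, 0.8])
y = np.array([0.5, 0.5])

print(bregman_divergence(sq_norm, sq_norm_grad, x, y))          # symmetric in x and y
print(bregman_divergence(neg_entropy, neg_entropy_grad, x, y))  # KL(x || y)
print(bregman_divergence(neg_entropy, neg_entropy_grad, y, x))  # KL(y || x), a different value
```

The last two lines differ, which is the asymmetry the abstract refers to.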
    Theoretical Error Performance Analysis for Variational Quantum Circuit Based Functional Regression. (arXiv:2206.04804v1 [quant-ph])
    Noisy intermediate-scale quantum (NISQ) devices enable the implementation of the variational quantum circuit (VQC) for quantum neural networks (QNN). Although the VQC-based QNN has succeeded in many machine learning tasks, the representation and generalization powers of VQC still require further investigation, particularly when the dimensionality reduction of classical inputs is concerned. In this work, we first put forth an end-to-end quantum neural network, namely, TTN-VQC, which consists of a quantum tensor network based on a tensor-train network (TTN) for dimensionality reduction and a VQC for functional regression. Then, we analyze the error performance of the TTN-VQC in terms of representation and generalization powers. We also characterize the optimization properties of TTN-VQC by leveraging the Polyak-Lojasiewicz (PL) condition. Moreover, we conduct experiments of functional regression on a handwritten digit classification dataset to justify our theoretical analysis.  ( 2 min )
    ReFace: Real-time Adversarial Attacks on Face Recognition Systems. (arXiv:2206.04783v1 [cs.CV])
    Deep neural network based face recognition models have been shown to be vulnerable to adversarial examples. However, many of the past attacks require the adversary to solve an input-dependent optimization problem using gradient descent which makes the attack impractical in real-time. These adversarial examples are also tightly coupled to the attacked model and are not as successful in transferring to different models. In this work, we propose ReFace, a real-time, highly-transferable attack on face recognition models based on Adversarial Transformation Networks (ATNs). ATNs model adversarial example generation as a feed-forward neural network. We find that the white-box attack success rate of a pure U-Net ATN falls substantially short of gradient-based attacks like PGD on large face recognition datasets. We therefore propose a new architecture for ATNs that closes this gap while maintaining a 10000x speedup over PGD. Furthermore, we find that at a given perturbation magnitude, our ATN adversarial perturbations are more effective in transferring to new face recognition models than PGD. ReFace attacks can successfully deceive commercial face recognition services in a transfer attack setting and reduce face identification accuracy from 82% to 16.4% for AWS SearchFaces API and Azure face verification accuracy from 91% to 50.1%.  ( 2 min )
    A Novel Partitioned Approach for Reduced Order Model -- Finite Element Model (ROM-FEM) and ROM-ROM Coupling. (arXiv:2206.04736v1 [math.NA])
    Partitioned methods allow one to build a simulation capability for coupled problems by reusing existing single-component codes. In so doing, partitioned methods can shorten code development and validation times for multiphysics and multiscale applications. In this work, we consider a scenario in which one or more of the "codes" being coupled are projection-based reduced order models (ROMs), introduced to lower the computational cost associated with a particular component. We simulate this scenario by considering a model interface problem that is discretized independently on two non-overlapping subdomains. We then formulate a partitioned scheme for this problem that allows the coupling between a ROM "code" for one of the subdomains with a finite element model (FEM) or ROM "code" for the other subdomain. The ROM "codes" are constructed by performing proper orthogonal decomposition (POD) on a snapshot ensemble to obtain a low-dimensional reduced order basis, followed by a Galerkin projection onto this basis. The ROM and/or FEM "codes" on each subdomain are then coupled using a Lagrange multiplier representing the interface flux. To partition the resulting monolithic problem, we first eliminate the flux through a dual Schur complement. Application of an explicit time integration scheme to the transformed monolithic problem decouples the subdomain equations, allowing their independent solution for the next time step. We show numerical results that demonstrate the proposed method's efficacy in achieving both ROM-FEM and ROM-ROM coupling.  ( 2 min )
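The ROM construction step described in the abstract above (POD on a snapshot ensemble, followed by Galerkin projection onto the resulting basis) can be sketched on a small static linear system. This is only an illustration of that one step under simplifying assumptions: it omits the subdomain coupling, the Lagrange multiplier, and the time integration, and the matrix `A`, the snapshot data, and the helper names are hypothetical.

```python
import numpy as np

def pod_basis(snapshots, r):
    """First r left singular vectors of the snapshot matrix (one snapshot per column)."""
    U, _, _ = np.linalg.svd(snapshots, full_matrices=False)
    return U[:, :r]

def galerkin_reduce(A, b, Phi):
    """Project the full system A u = b onto the basis Phi: (Phi^T A Phi) q = Phi^T b."""
    return Phi.T @ A @ Phi, Phi.T @ b

rng = np.random.default_rng(0)
n, n_snap = 50, 5

# Hypothetical symmetric positive definite full-order operator.
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)

# Snapshots: full-order solutions for a handful of training right-hand sides.
B_train = rng.standard_normal((n, n_snap))
snapshots = np.linalg.solve(A, B_train)
Phi = pod_basis(snapshots, r=n_snap)

# New right-hand side inside the span of the training data.
b_new = B_train @ rng.standard_normal(n_snap)
A_r, b_r = galerkin_reduce(A, b_new, Phi)
u_rom = Phi @ np.linalg.solve(A_r, b_r)      # reduced solve, lifted back to full space
u_fom = np.linalg.solve(A, b_new)            # full-order reference

print(np.linalg.norm(u_rom - u_fom))
```

The reduced system here is 5 by 5 instead of 50 by 50; the error is small only because the new solution lies in the span of the snapshots.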
    Mobility Improves the Convergence of Asynchronous Federated Learning. (arXiv:2206.04742v1 [cs.LG])
    This paper studies asynchronous Federated Learning (FL) subject to clients' individual arbitrary communication patterns with the parameter server. We propose FedMobile, a new asynchronous FL algorithm that exploits the mobility attribute of the mobile FL system to improve the learning performance. The key idea is to leverage the random client-to-client communication in a mobile network to create additional indirect communication opportunities with the server via upload and download relaying. We prove that FedMobile achieves a convergence rate $O(\frac{1}{\sqrt{NT}})$, where $N$ is the number of clients and $T$ is the number of communication slots, and show that the optimal design involves an interesting trade-off on the best timing of relaying. Our analysis suggests that with an increased level of mobility, asynchronous FL converges faster using FedMobile. Experimental results on a synthetic dataset and two real-world datasets verify our theoretical findings.  ( 2 min )
    STNDT: Modeling Neural Population Activity with a Spatiotemporal Transformer. (arXiv:2206.04727v1 [q-bio.NC])
    Modeling neural population dynamics underlying noisy single-trial spiking activities is essential for relating neural observation and behavior. A recent non-recurrent method - Neural Data Transformers (NDT) - has shown great success in capturing neural dynamics with low inference latency without an explicit dynamical model. However, NDT focuses on modeling the temporal evolution of the population activity while neglecting the rich covariation between individual neurons. In this paper we introduce SpatioTemporal Neural Data Transformer (STNDT), an NDT-based architecture that explicitly models responses of individual neurons in the population across time and space to uncover their underlying firing rates. In addition, we propose a contrastive learning loss that works in accordance with the mask modeling objective to further improve the predictive performance. We show that our model achieves state-of-the-art performance at the ensemble level in estimating neural activities across four neural datasets, demonstrating its capability to capture autonomous and non-autonomous dynamics spanning different cortical regions while being completely agnostic to the specific behaviors at hand. Furthermore, STNDT's spatial attention mechanism reveals consistently important subsets of neurons that play a vital role in driving the response of the entire population, providing interpretability and key insights into how the population of neurons performs computation.  ( 2 min )
    AI-based Clinical Assessment of Optic Nerve Head Robustness Superseding Biomechanical Testing. (arXiv:2206.04689v1 [eess.IV])
    $\mathbf{Purpose}$: To use artificial intelligence (AI) to: (1) exploit biomechanical knowledge of the optic nerve head (ONH) from a relatively large population; (2) assess ONH robustness from a single optical coherence tomography (OCT) scan of the ONH; (3) identify what critical three-dimensional (3D) structural features make a given ONH robust. $\mathbf{Design}$: Retrospective cross-sectional study. $\mathbf{Methods}$: 316 subjects had their ONHs imaged with OCT before and after acute intraocular pressure (IOP) elevation through ophthalmo-dynamometry. IOP-induced lamina cribrosa (LC) deformations were then mapped in 3D and used to classify ONHs. Those with LC deformations greater than 4% were considered fragile, while those with deformations less than 4% were considered robust. Learning from these data, we compared three AI algorithms to predict ONH robustness strictly from a baseline (undeformed) OCT volume: (1) a random forest classifier; (2) an autoencoder; and (3) a dynamic graph CNN (DGCNN). The latter algorithm also allowed us to identify what critical 3D structural features make a given ONH robust. $\mathbf{Results}$: All 3 methods were able to predict ONH robustness from 3D structural information alone and without the need to perform biomechanical testing. The DGCNN (area under the receiver operating characteristic curve [AUC]: 0.76 $\pm$ 0.08) outperformed the autoencoder (AUC: 0.70 $\pm$ 0.07) and the random forest classifier (AUC: 0.69 $\pm$ 0.05). Interestingly, to assess ONH robustness, the DGCNN mainly used information from the scleral canal and the LC insertion sites. $\mathbf{Conclusions}$: We propose an AI-driven approach that can assess the robustness of a given ONH solely from a single OCT scan of the ONH, and without the need to perform biomechanical testing. Longitudinal studies should establish whether ONH robustness could help us identify fast visual field loss progressors.  ( 2 min )
    ReCo: A Dataset for Residential Community Layout Planning. (arXiv:2206.04678v1 [cs.LG])
    Layout planning is centrally important in the field of architecture and urban design. Among the various basic units carrying urban functions, residential communities play a vital part in supporting human life. Therefore, the layout planning of residential communities has always been of concern, and has attracted particular attention since the advent of deep learning, which facilitates automated layout generation and spatial pattern recognition. However, the research community generally suffers from a shortage of residential community layout benchmarks and high-quality datasets, which hampers the future exploration of data-driven methods for residential community layout planning. The lack of datasets is largely due to the difficulties of large-scale real-world residential data acquisition and long-term expert screening. In order to address these issues and advance a benchmark dataset for various intelligent spatial design and analysis applications in the development of smart cities, we introduce the Residential Community Layout Planning (ReCo) Dataset, which is the first and largest open-source vector dataset related to real-world communities to date. The ReCo Dataset is presented in multiple data formats with 37,646 residential community layout plans, covering 598,728 residential buildings with height information. ReCo can be conveniently adapted for residential community layout related urban design tasks, e.g., generative layout design, morphological pattern recognition and spatial evaluation. To validate the utility of ReCo in automated residential community layout planning, a Generative Adversarial Network (GAN) based generative model is further applied to the dataset. We expect the ReCo Dataset to inspire more creative and practical work in intelligent design and beyond. The ReCo Dataset is published at: https://www.kaggle.com/fdudsde/reco-dataset.  ( 2 min )
    Extending Momentum Contrast with Cross Similarity Consistency Regularization. (arXiv:2206.04676v1 [cs.LG])
    Contrastive self-supervised representation learning methods maximize the similarity between the positive pairs, and at the same time tend to minimize the similarity between the negative pairs. However, in general the interplay between the negative pairs is ignored, as these methods do not put in place special mechanisms to treat negative pairs differently according to their specific differences and similarities. In this paper, we present Extended Momentum Contrast (XMoCo), a self-supervised representation learning method founded upon the legacy of the momentum-encoder unit proposed in the MoCo family configurations. To this end, we introduce a cross consistency regularization loss, with which we extend the transformation consistency to dissimilar images (negative pairs). Under the cross consistency regularization rule, we argue that semantic representations associated with any pair of images (positive or negative) should preserve their cross-similarity under pretext transformations. Moreover, we further regularize the training loss by enforcing a uniform distribution of similarity over the negative pairs across a batch. The proposed regularization can easily be added to existing self-supervised learning algorithms in a plug-and-play fashion. Empirically, we report a competitive performance on the standard ImageNet-1K linear head classification benchmark. In addition, by transferring the learned representations to common downstream tasks, we show that using XMoCo with the prevalently utilized augmentations can lead to improvements in the performance of such tasks. We hope the findings of this paper serve as a motivation for researchers to take into consideration the important interplay among the negative examples in self-supervised learning.  ( 2 min )
    Adaptive Model Pooling for Online Deep Anomaly Detection from a Complex Evolving Data Stream. (arXiv:2206.04792v1 [cs.LG])
    Online anomaly detection from a data stream is critical for the safety and security of many applications but is facing severe challenges due to complex and evolving data streams from IoT devices and cloud-based infrastructures. Unfortunately, existing approaches fall short of these challenges; online anomaly detection methods bear the burden of handling the complexity while offline deep anomaly detection methods suffer from the evolving data distribution. This paper presents a framework for online deep anomaly detection, ARCUS, which can be instantiated with any autoencoder-based deep anomaly detection method. It handles the complex and evolving data streams using an adaptive model pooling approach with two novel techniques: concept-driven inference and drift-aware model pool update; the former detects anomalies with a combination of models most appropriate for the complexity, and the latter adapts the model pool dynamically to fit the evolving data streams. In comprehensive experiments with ten data sets which are both high-dimensional and concept-drifted, ARCUS improved the anomaly detection accuracy of the streaming variants of state-of-the-art autoencoder-based methods and that of the state-of-the-art streaming anomaly detection methods by up to 22% and 37%, respectively.  ( 2 min )
    Unsupervised Deep Discriminant Analysis Based Clustering. (arXiv:2206.04686v1 [cs.LG])
    This work presents an unsupervised deep discriminant analysis for clustering. The method is based on deep neural networks and aims to minimize the intra-cluster discrepancy and maximize the inter-cluster discrepancy in an unsupervised manner. The method is able to project the data into a nonlinear low-dimensional latent space with compact and distinct distribution patterns such that the data clusters can be effectively identified. We further provide an extension of the method such that available graph information can be effectively exploited to improve the clustering performance. Extensive numerical results on image and non-image data with or without graph information demonstrate the effectiveness of the proposed methods.  ( 2 min )
    On the Unreasonable Effectiveness of Federated Averaging with Heterogeneous Data. (arXiv:2206.04723v1 [cs.LG])
    Existing theory predicts that data heterogeneity will degrade the performance of the Federated Averaging (FedAvg) algorithm in federated learning. However, in practice, the simple FedAvg algorithm converges very well. This paper explains the seemingly unreasonable effectiveness of FedAvg that contradicts the previous theoretical predictions. We find that the key assumption of bounded gradient dissimilarity in previous theoretical analyses is too pessimistic to characterize data heterogeneity in practical applications. For a simple quadratic problem, we demonstrate there exist regimes where large gradient dissimilarity does not have any negative impact on the convergence of FedAvg. Motivated by this observation, we propose a new quantity, average drift at optimum, to measure the effects of data heterogeneity, and explicitly use it to present a new theoretical analysis of FedAvg. We show that the average drift at optimum is nearly zero across many real-world federated training tasks, whereas the gradient dissimilarity can be large. Our new analysis suggests FedAvg can have identical convergence rates in homogeneous and heterogeneous data settings, and hence leads to a better understanding of its empirical success.  ( 2 min )
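The quadratic-problem observation in the abstract above can be illustrated with a toy FedAvg run. This is a sketch under assumptions, not the paper's construction: each client $i$ holds the hypothetical objective $f_i(w) = \tfrac{1}{2}\|w - c_i\|^2$, so client gradients can differ arbitrarily (the centers $c_i$ are far apart), yet averaging after several local steps still reaches the minimizer of the average objective, which is the mean of the centers.

```python
import numpy as np

def fedavg(centers, rounds=50, local_steps=5, lr=0.1):
    """Plain FedAvg with full participation on f_i(w) = 0.5 * ||w - c_i||^2."""
    x = np.zeros(centers.shape[1])            # global model
    for _ in range(rounds):
        local_models = []
        for c in centers:
            w = x.copy()                      # each client starts from the global model
            for _ in range(local_steps):
                w -= lr * (w - c)             # local gradient step on f_i
            local_models.append(w)
        x = np.mean(local_models, axis=0)     # server averages the local models
    return x

# Hypothetical, strongly heterogeneous clients: optima far apart from each other.
centers = np.array([[100.0, -50.0], [-80.0, 30.0], [10.0, 200.0]])
print(fedavg(centers))
print(centers.mean(axis=0))
```

All clients share the same curvature here, which is why the local updates do not pull the average away from the global optimum; with differing curvatures the picture can change.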
    Predictive Exit: Prediction of Fine-Grained Early Exits for Computation- and Energy-Efficient Inference. (arXiv:2206.04685v1 [cs.LG])
    By adding exiting layers to deep learning networks, early exit can terminate the inference earlier with accurate results. The passive decision-making of whether to exit or continue to the next layer has to go through every pre-placed exiting layer until it exits. In addition, it is also hard to adjust the configurations of the computing platforms as the inference proceeds. By incorporating a low-cost prediction engine, we propose a Predictive Exit framework for computation- and energy-efficient deep learning applications. Predictive Exit can forecast where the network will exit (i.e., establish the number of remaining layers to finish the inference), which effectively reduces the network computation cost by exiting on time without running every pre-placed exiting layer. Moreover, according to the number of remaining layers, proper computing configurations (i.e., frequency and voltage) are selected to execute the network to further save energy. Extensive experimental results demonstrate that Predictive Exit achieves up to 96.2% computation reduction and 72.9% energy-saving compared with classic deep learning networks; and 12.8% computation reduction and 37.6% energy-saving compared with the early exit under state-of-the-art exiting strategies, given the same inference accuracy and latency.  ( 2 min )
    An Empirical Study on Disentanglement of Negative-free Contrastive Learning. (arXiv:2206.04756v1 [cs.LG])
    Negative-free contrastive learning has attracted a lot of attention for its simplicity and impressive performance in large-scale pretraining, but its disentanglement property remains unexplored. In this paper, we take different negative-free contrastive learning methods to study the disentanglement property of this genre of self-supervised methods empirically. We find that the existing disentanglement metrics fail to make meaningful measurements for high-dimensional representation models, so we propose a new disentanglement metric based on Mutual Information between representation and data factors. With the proposed metric, we benchmark the disentanglement property of negative-free contrastive learning for the first time, on both popular synthetic datasets and a real-world dataset, CelebA. Our study shows that the investigated methods can learn a well-disentangled subset of representation. We extend the study of disentangled representation learning to high-dimensional representation space and negative-free contrastive learning for the first time. The implementation of the proposed metric is available at https://github.com/noahcao/disentanglement_lib_med.  ( 2 min )
    POODLE: Improving Few-shot Learning via Penalizing Out-of-Distribution Samples. (arXiv:2206.04679v1 [cs.LG])
    In this work, we propose to use out-of-distribution samples, i.e., unlabeled samples coming from outside the target classes, to improve few-shot learning. Specifically, we exploit the easily available out-of-distribution samples to drive the classifier to avoid irrelevant features by maximizing the distance from prototypes to out-of-distribution samples while minimizing that of in-distribution samples (i.e., support, query data). Our approach is simple to implement, agnostic to feature extractors, lightweight without any additional cost for pre-training, and applicable to both inductive and transductive settings. Extensive experiments on various standard benchmarks demonstrate that the proposed method consistently improves the performance of pretrained networks with different architectures.  ( 2 min )
    A Learning-Theoretic Framework for Certified Auditing of Machine Learning Models. (arXiv:2206.04740v1 [cs.LG])
    Responsible use of machine learning requires that models be audited for undesirable properties. However, how to do principled auditing in a general setting has remained ill-understood. In this paper, we propose a formal learning-theoretic framework for auditing. We propose algorithms for auditing linear classifiers for feature sensitivity using label queries as well as different kinds of explanations, and provide performance guarantees. Our results illustrate that while counterfactual explanations can be extremely helpful for auditing, anchor explanations may not be as beneficial in the worst case.  ( 2 min )
    A Neural Network Architecture for Program Understanding Inspired by Human Behaviors. (arXiv:2206.04730v1 [cs.SE])
    Program understanding is a fundamental task in program language processing. Despite the success, existing works fail to take human behaviors as reference in understanding programs. In this paper, we consider human behaviors and propose the PGNN-EK model that consists of two main components. On the one hand, inspired by the "divide-and-conquer" reading behaviors of humans, we present a partitioning-based graph neural network model PGNN on the upgraded AST of codes. On the other hand, to characterize human behaviors of resorting to other resources to help code comprehension, we transform raw codes with external knowledge and apply pre-training techniques for information extraction. Finally, we combine the two embeddings generated from the two components to output code embeddings. We conduct extensive experiments to show the superior performance of PGNN-EK on the code summarization and code clone detection tasks. In particular, to show the generalization ability of our model, we release a new dataset that is more challenging for code clone detection and could advance the development of the community. Our codes and data are publicly available at https://github.com/RecklessRonan/PGNN-EK.  ( 2 min )
    Explainable Artificial Intelligence (XAI) for Internet of Things: A Survey. (arXiv:2206.04800v1 [cs.AI])
    The black-box nature of Artificial Intelligence (AI) models does not allow users to comprehend, and sometimes to trust, the output created by such models. In AI applications where not only the results but also the decision paths to the results are critical, such black-box AI models are not sufficient. Explainable Artificial Intelligence (XAI) addresses this problem and defines a set of AI models that are interpretable by the users. Recently, a number of XAI models have been proposed to address the issues surrounding the lack of interpretability and explainability of black-box models in various application areas such as healthcare, military, energy, financial and industrial domains. Although the concept of XAI has gained a great deal of attention recently, its integration into the IoT domain has not yet been fully defined. In this paper, we provide an in-depth and systematic review of recent studies using XAI models in the scope of the IoT domain. We categorize the studies according to their methodology and application areas. In addition, we focus on the challenging problems and open issues and give future directions to guide developers and researchers in prospective investigations.  ( 2 min )
    Can Backdoor Attacks Survive Time-Varying Models?. (arXiv:2206.04677v1 [cs.CR])
    Backdoors are powerful attacks against deep neural networks (DNNs). By poisoning training data, attackers can inject hidden rules (backdoors) into DNNs, which only activate on inputs containing attack-specific triggers. While existing work has studied backdoor attacks on a variety of DNN models, they only consider static models, which remain unchanged after initial deployment. In this paper, we study the impact of backdoor attacks on a more realistic scenario of time-varying DNN models, where model weights are updated periodically to handle drifts in data distribution over time. Specifically, we empirically quantify the "survivability" of a backdoor against model updates, and examine how attack parameters, data drift behaviors, and model update strategies affect backdoor survivability. Our results show that one-shot backdoor attacks (i.e., only poisoning training data once) do not survive past a few model updates, even when attackers aggressively increase trigger size and poison ratio. To stay unaffected by model update, attackers must continuously introduce corrupted data into the training pipeline. Together, these results indicate that when models are updated to learn new data, they also "forget" backdoors as hidden, malicious features. The larger the distribution shift between old and new training data, the faster backdoors are forgotten. Leveraging these insights, we apply a smart learning rate scheduler to further accelerate backdoor forgetting during model updates, which prevents one-shot backdoors from surviving past a single model update.  ( 2 min )
    RT-DNAS: Real-time Constrained Differentiable Neural Architecture Search for 3D Cardiac Cine MRI Segmentation. (arXiv:2206.04682v1 [eess.IV])
    Accurately segmenting temporal frames of cine magnetic resonance imaging (MRI) is a crucial step in various real-time MRI guided cardiac interventions. To achieve fast and accurate visual assistance, there are strict requirements on the maximum latency and minimum throughput of the segmentation framework. State-of-the-art neural networks on this task are mostly hand-crafted to satisfy these constraints while achieving high accuracy. On the other hand, while the existing literature has demonstrated the power of neural architecture search (NAS) in automatically identifying the best neural architectures for various medical applications, these searches are mostly guided by accuracy, sometimes with computation complexity, and the importance of real-time constraints is overlooked. A major challenge is that such constraints are non-differentiable and are thus not compatible with the widely used differentiable NAS frameworks. In this paper, we present a strategy that directly handles real-time constraints in a differentiable NAS framework named RT-DNAS. Experiments on the extended 2017 MICCAI ACDC dataset show that compared with state-of-the-art manually and automatically designed architectures, RT-DNAS is able to identify ones with better accuracy while satisfying the real-time constraints.  ( 2 min )
  • Open

    Refined Convergence and Topology Learning for Decentralized Optimization with Heterogeneous Data. (arXiv:2204.04452v2 [cs.LG] UPDATED)
    One of the key challenges in decentralized and federated learning is to design algorithms that efficiently deal with highly heterogeneous data distributions across agents. In this paper, we revisit the analysis of Decentralized Stochastic Gradient Descent algorithm (D-SGD) under data heterogeneity. We exhibit the key role played by a new quantity, called neighborhood heterogeneity, on the convergence rate of D-SGD. By coupling the communication topology and the heterogeneity, our analysis sheds light on the poorly understood interplay between these two concepts in decentralized learning. We then argue that neighborhood heterogeneity provides a natural criterion to learn data-dependent topologies that reduce (and can even eliminate) the otherwise detrimental effect of data heterogeneity on the convergence time of D-SGD. For the important case of classification with label skew, we formulate the problem of learning such a good topology as a tractable optimization problem that we solve with a Frank-Wolfe algorithm. As illustrated over a set of simulated and real-world experiments, our approach provides a principled way to design a sparse topology that balances the convergence speed and the per-iteration communication costs of D-SGD under data heterogeneity.  ( 2 min )
    Trace norm regularization for multi-task learning with scarce data. (arXiv:2202.06742v2 [stat.ML] UPDATED)
    Multi-task learning leverages structural similarities between multiple tasks to learn despite very few samples. Motivated by the recent success of neural networks applied to data-scarce tasks, we consider a linear low-dimensional shared representation model. Despite an extensive literature, existing theoretical results either guarantee weak estimation rates or require a large number of samples per task. This work provides the first estimation error bound for the trace norm regularized estimator when the number of samples per task is small. The advantages of trace norm regularization for learning data-scarce tasks extend to meta-learning and are confirmed empirically on synthetic datasets.  ( 2 min )
    Meta Optimal Transport. (arXiv:2206.05262v1 [cs.LG])
    We study the use of amortized optimization to predict optimal transport (OT) maps from the input measures, which we call Meta OT. This helps repeatedly solve similar OT problems between different measures by leveraging the knowledge and information present from past problems to rapidly predict and solve new problems. Otherwise, standard methods ignore the knowledge of the past solutions and suboptimally re-solve each problem from scratch. Meta OT models surpass the standard convergence rates of log-Sinkhorn solvers in the discrete setting and convex potentials in the continuous setting. We improve the computational time of standard OT solvers by multiple orders of magnitude in discrete and continuous transport settings between images, spherical data, and color palettes. Our source code is available at this http URL  ( 2 min )
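    The contrast drawn above between re-solving from scratch and reusing past solutions can be illustrated with a plain Sinkhorn iteration that accepts an optional starting point. This is a minimal sketch under stated assumptions, not the Meta OT model: the function `sinkhorn`, its `v_init` parameter, and the toy histograms are hypothetical, and the "prediction" is simply the scaling vector from an earlier solve of the same problem.

```python
import math

def sinkhorn(a, b, cost, eps=1.0, tol=1e-9, max_iters=10_000, v_init=None):
    """Entropic OT between histograms a and b; returns (plan, v, iterations used)."""
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / eps).
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    # Hypothetical warm start: a previously computed (or predicted) scaling vector.
    v = list(v_init) if v_init is not None else [1.0] * m
    u = [1.0] * n
    it = 0
    for it in range(1, max_iters + 1):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
        # L1 violation of the row marginal after the column update.
        row_err = sum(
            abs(u[i] * sum(K[i][j] * v[j] for j in range(m)) - a[i])
            for i in range(n)
        )
        if row_err < tol:
            break
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    return plan, v, it

# Toy problem (assumed data): two histograms on three points, cost |i - j|.
a = [0.5, 0.3, 0.2]
b = [0.2, 0.3, 0.5]
cost = [[abs(i - j) for j in range(3)] for i in range(3)]

plan, v_star, cold_iters = sinkhorn(a, b, cost)          # solve from scratch
_, _, warm_iters = sinkhorn(a, b, cost, v_init=v_star)   # reuse earlier solution
```

    With a good starting point the stopping criterion is met after far fewer updates than from the uniform start; an amortized model would aim to supply such a starting point for problems it has not solved before.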
    On the safe use of prior densities for Bayesian model selection. (arXiv:2206.05210v1 [stat.ME])
    The application of Bayesian inference for the purpose of model selection is very popular nowadays. In this framework, models are compared through their marginal likelihoods, or their quotients, called Bayes factors. However, marginal likelihoods depend on the prior choice. For model selection, even diffuse priors can actually be very informative, unlike for the parameter estimation problem. Furthermore, when the prior is improper, the marginal likelihood of the corresponding model is undetermined. In this work, we discuss the issue of prior sensitivity of the marginal likelihood and its role in model selection. We also comment on the use of uninformative priors, which are very common choices in practice. Several practical suggestions are discussed and many possible solutions, proposed in the literature, to design objective priors for model selection are described. Some of them also allow the use of improper priors. The connection between the marginal likelihood approach and the well-known information criteria is also presented. We describe the main issues and possible solutions by illustrative numerical examples, providing also some related code. One of them involves a real-world application to exoplanet detection.  ( 2 min )
    Tackling covariate shift with node-based Bayesian neural networks. (arXiv:2206.02435v2 [stat.ML] UPDATED)
    Bayesian neural networks (BNNs) promise improved generalization under covariate shift by providing principled probabilistic representations of epistemic uncertainty. However, weight-based BNNs often struggle with high computational complexity of large-scale architectures and datasets. Node-based BNNs have recently been introduced as scalable alternatives, which induce epistemic uncertainty by multiplying each hidden node with latent random variables, while learning a point-estimate of the weights. In this paper, we interpret these latent noise variables as implicit representations of simple and domain-agnostic data perturbations during training, producing BNNs that perform well under covariate shift due to input corruptions. We observe that the diversity of the implicit corruptions depends on the entropy of the latent variables, and propose a straightforward approach to increase the entropy of these variables during training. We evaluate the method on out-of-distribution image classification benchmarks, and show improved uncertainty estimation of node-based BNNs under covariate shift due to input perturbations. As a side effect, the method also provides robustness against noisy training labels.  ( 2 min )
    Asymptotic Escape of Spurious Critical Points on the Low-rank Matrix Manifold. (arXiv:2107.09207v2 [math.OC] UPDATED)
    We show that on the manifold of fixed-rank and symmetric positive semi-definite matrices, the Riemannian gradient descent algorithm almost surely escapes some spurious critical points on the boundary of the manifold. Our result is the first to partially overcome the incompleteness of the low-rank matrix manifold without changing the vanilla Riemannian gradient descent algorithm. The spurious critical points are some rank-deficient matrices that capture only part of the eigen components of the ground truth. Unlike classical strict saddle points, they exhibit very singular behavior. We show that using the dynamical low-rank approximation and a rescaled gradient flow, some of the spurious critical points can be converted to classical strict saddle points in the parameterized domain, which leads to the desired result. Numerical experiments are provided to support our theoretical findings.  ( 2 min )
    Popularity Adjusted Block Models are Generalized Random Dot Product Graphs. (arXiv:2109.04010v2 [stat.ML] UPDATED)
    We connect two random graph models, the Popularity Adjusted Block Model (PABM) and the Generalized Random Dot Product Graph (GRDPG), by demonstrating that the PABM is a special case of the GRDPG in which communities correspond to mutually orthogonal subspaces of latent vectors. This insight allows us to construct new algorithms for community detection and parameter estimation for the PABM, as well as improve an existing algorithm that relies on Sparse Subspace Clustering. Using established asymptotic properties of Adjacency Spectral Embedding for the GRDPG, we derive asymptotic properties of these algorithms. In particular, we demonstrate that the absolute number of community detection errors tends to zero as the number of graph vertices tends to infinity. Simulation experiments illustrate these properties.  ( 2 min )
    A Free Lunch with Influence Functions? Improving Neural Network Estimates with Concepts from Semiparametric Statistics. (arXiv:2202.09096v2 [cs.LG] UPDATED)
    Parameter estimation in empirical fields is usually undertaken using parametric models, and such models readily facilitate statistical inference. Unfortunately, they are unlikely to be sufficiently flexible to be able to adequately model real-world phenomena, and may yield biased estimates. Conversely, non-parametric approaches are flexible but do not readily facilitate statistical inference and may still exhibit residual bias. We explore the potential for Influence Functions (IFs) to (a) improve initial estimators without needing more data, (b) increase model robustness, and (c) facilitate statistical inference. We begin with a broad introduction to IFs, and propose a neural network method 'MultiNet', which seeks the diversity of an ensemble using a single architecture. We also introduce variants on the IF update step which we call 'MultiStep', and provide a comprehensive evaluation of different approaches. The improvements are found to be dataset dependent, indicating an interaction between the methods used and the nature of the data generating process. Our experiments highlight the need for practitioners to check the consistency of their findings, potentially by undertaking multiple analyses with different combinations of estimators. We also show that it is possible to improve existing neural networks for 'free', without needing more data, and without needing to retrain them.  ( 2 min )
    GD-VAEs: Geometric Dynamic Variational Autoencoders for Learning Nonlinear Dynamics and Dimension Reductions. (arXiv:2206.05183v1 [cs.LG])
    We develop data-driven methods incorporating geometric and topological information to learn parsimonious representations of nonlinear dynamics from observations. We develop approaches for learning nonlinear state space models of the dynamics for general manifold latent spaces using training strategies related to Variational Autoencoders (VAEs). Our methods are referred to as Geometric Dynamic (GD) Variational Autoencoders (GD-VAEs). We learn encoders and decoders for the system states and evolution based on deep neural network architectures that include general Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), and Transpose CNNs (T-CNNs). Motivated by problems arising in parameterized PDEs and physics, we investigate the performance of our methods on tasks for learning low dimensional representations of the nonlinear Burgers equations, constrained mechanical systems, and spatial fields of reaction-diffusion systems. GD-VAEs provide methods for obtaining representations for use in learning tasks involving dynamics.  ( 2 min )
    Dynamic mean field programming. (arXiv:2206.05200v1 [stat.ML])
    A dynamic mean field theory is developed for model based Bayesian reinforcement learning in the large state space limit. In an analogy with the statistical physics of disordered systems, the transition probabilities are interpreted as couplings, and value functions as deterministic spins, and thus the sampled transition probabilities are considered to be quenched random variables. The results reveal that, under standard assumptions, the posterior over Q-values is asymptotically independent and Gaussian across state-action pairs, for infinite horizon problems. The finite horizon case exhibits the same behaviour for all state-action pairs at each time but has an additional correlation across time, for each state-action pair. The results also hold for policy evaluation. The Gaussian statistics can be computed from a set of coupled mean field equations derived from the Bellman equation, which we call dynamic mean field programming (DMFP). For Q-value iteration, approximate equations are obtained by appealing to extreme value theory, and closed form expressions are found in the independent and identically distributed case. The Lyapunov stability of these closed form equations is studied.  ( 2 min )
    Street Crossing Aid Using Light-weight CNNs for the Visually Impaired. (arXiv:1909.09598v2 [cs.CV] UPDATED)
    In this paper, we address an issue that the visually impaired commonly face while crossing intersections and propose a solution that takes form as a mobile application. The application utilizes a deep learning convolutional neural network model, LytNetV2, to output necessary information that the visually impaired may lack when without human companions or guide-dogs. A prototype of the application runs on iOS devices of versions 11 or above. It is designed for comprehensiveness, concision, accuracy, and computational efficiency through delivering the two most important pieces of information, pedestrian traffic light color and direction, required to cross the road in real-time. Furthermore, it is specifically aimed to support those facing financial burden as the solution takes the form of a free mobile application. Through the modification and utilization of key principles in MobileNetV3 such as depthwise separable convolutions and squeeze-excite layers, the deep neural network model achieves a classification accuracy of 96% and average angle error of 6.15 degrees, while running at a frame rate of 16.34 frames per second. Additionally, the model is trained as an image classifier, allowing for a faster and more accurate model. The network is able to outperform other methods such as object detection and non-deep learning algorithms in both accuracy and thoroughness. The information is delivered through both auditory signals and vibrations, and it has been tested on seven visually impaired individuals and has received above-satisfactory responses.  ( 2 min )
    Interactively Learning Preference Constraints in Linear Bandits. (arXiv:2206.05255v1 [cs.LG])
    We study sequential decision-making with known rewards and unknown constraints, motivated by situations where the constraints represent expensive-to-evaluate human preferences, such as safe and comfortable driving behavior. We formalize the challenge of interactively learning about these constraints as a novel linear bandit problem which we call constrained linear best-arm identification. To solve this problem, we propose the Adaptive Constraint Learning (ACOL) algorithm. We provide an instance-dependent lower bound for constrained linear best-arm identification and show that ACOL's sample complexity matches the lower bound in the worst-case. In the average case, ACOL's sample complexity bound is still significantly tighter than bounds of simpler approaches. In synthetic experiments, ACOL performs on par with an oracle solution and outperforms a range of baselines. As an application, we consider learning constraints to represent human preferences in a driving simulation. ACOL is significantly more sample efficient than alternatives for this application. Further, we find that learning preferences as constraints is more robust to changes in the driving scenario than encoding the preferences directly in the reward function.  ( 2 min )
    Integrated Conditional Estimation-Optimization. (arXiv:2110.12351v2 [stat.ML] UPDATED)
    Many real-world optimization problems involve uncertain parameters with probability distributions that can be estimated using contextual feature information. In contrast to the standard approach of first estimating the distribution of uncertain parameters and then optimizing the objective based on the estimation, we propose an integrated conditional estimation-optimization (ICEO) framework that estimates the underlying conditional distribution of the random parameter while considering the structure of the optimization problem. We directly model the relationship between the conditional distribution of the random parameter and the contextual features, and then estimate the probabilistic model with an objective that aligns with the downstream optimization problem. We show that our ICEO approach is asymptotically consistent under moderate regularity conditions and further provide finite performance guarantees in the form of generalization bounds. Computationally, performing estimation with the ICEO approach is a non-convex and often non-differentiable optimization problem. We propose a general methodology for approximating the potentially non-differentiable mapping from estimated conditional distribution to optimal decision by a differentiable function, which greatly improves the performance of gradient-based algorithms applied to the non-convex problem. We also provide a polynomial optimization solution approach in the semi-algebraic case. Numerical experiments are also conducted to show the empirical success of our approach in different situations including with limited data samples and model mismatches.  ( 2 min )
    Linear regression with partially mismatched data: local search with theoretical guarantees. (arXiv:2106.02175v2 [math.OC] UPDATED)
    Linear regression is a fundamental modeling tool in statistics and related fields. In this paper, we study an important variant of linear regression in which the predictor-response pairs are partially mismatched. We use an optimization formulation to simultaneously learn the underlying regression coefficients and the permutation corresponding to the mismatches. The combinatorial structure of the problem leads to computational challenges. We propose and study a simple greedy local search algorithm for this optimization problem that enjoys strong theoretical guarantees and appealing computational performance. We prove that under a suitable scaling of the number of mismatched pairs compared to the number of samples and features, and certain assumptions on problem data, our local search algorithm converges to a nearly-optimal solution at a linear rate. In particular, in the noiseless case, our algorithm converges to the global optimal solution with a linear convergence rate. Based on this result, we prove an upper bound for the estimation error of the parameter. We also propose an approximate local search step that allows us to scale our approach to much larger instances. We conduct numerical experiments to gather further insights into our theoretical results, and show promising performance gains compared to existing approaches.  ( 2 min )
    Mixed Logit Models and Network Formation. (arXiv:2006.16516v4 [cs.SI] UPDATED)
    The study of network formation is pervasive in economics, sociology, and many other fields. In this paper, we model network formation as a `choice' that is made by nodes in a network to connect to other nodes. We study these `choices' using discrete-choice models, in which an agent chooses between two or more discrete alternatives. We employ the `repeated-choice' (RC) model to study network formation. We argue that the RC model overcomes important limitations of the multinomial logit (MNL) model, which gives one framework for studying network formation, and that it is well-suited to study network formation. We also illustrate how to use the RC model to accurately study network formation using both synthetic and real-world networks. Using synthetic networks, we also compare the performance of the MNL model and the RC model. We find that the RC model estimates the data-generation process of our synthetic networks more accurately than the MNL model. We do a case study of a qualitatively interesting scenario -- the fact that new patents are more likely to cite older, more cited, and similar patents -- for which the RC model allows us to achieve interesting insights.  ( 2 min )
    Learning Classifiers under Delayed Feedback with a Time Window Assumption. (arXiv:2009.13092v2 [cs.LG] UPDATED)
    We consider training a binary classifier under delayed feedback (DF learning). For example, in conversion prediction for online ads, we initially receive negative samples from users who clicked an ad but did not buy an item; subsequently, some of these users buy an item and their samples change to positive. In the setting of DF learning, we observe samples over time and then learn a classifier at some point. This problem arises in various real-world applications such as online advertising, where the user action takes place long after the first click. Owing to the delayed feedback, naive classification of the positive and negative samples returns a biased classifier. One solution is to use samples that have been observed for more than a certain time window, assuming these samples are correctly labeled. However, existing studies reported that simply using a subset of all samples based on the time window assumption does not perform well, and that using all samples along with the time window assumption improves empirical performance. We extend these existing studies and propose a method with an unbiased and convex empirical risk that is constructed from all samples under the time window assumption. To demonstrate the soundness of the proposed method, we provide experimental results on a synthetic dataset and an open dataset consisting of real traffic logs from online advertising.  ( 2 min )
    List-Decodable Sparse Mean Estimation via Difference-of-Pairs Filtering. (arXiv:2206.05245v1 [cs.DS])
    We study the problem of list-decodable sparse mean estimation. Specifically, for a parameter $\alpha \in (0, 1/2)$, we are given $m$ points in $\mathbb{R}^n$, $\lfloor \alpha m \rfloor$ of which are i.i.d. samples from a distribution $D$ with unknown $k$-sparse mean $\mu$. No assumptions are made on the remaining points, which form the majority of the dataset. The goal is to return a small list of candidates containing a vector $\widehat \mu$ such that $\| \widehat \mu - \mu \|_2$ is small. Prior work had studied the problem of list-decodable mean estimation in the dense setting. In this work, we develop a novel, conceptually simpler technique for list-decodable mean estimation. As the main application of our approach, we provide the first sample and computationally efficient algorithm for list-decodable sparse mean estimation. In particular, for distributions with ``certifiably bounded'' $t$-th moments in $k$-sparse directions and sufficiently light tails, our algorithm achieves error of $(1/\alpha)^{O(1/t)}$ with sample complexity $m = (k\log(n))^{O(t)}/\alpha$ and running time $\mathrm{poly}(mn^t)$. For the special case of Gaussian inliers, our algorithm achieves the optimal error guarantee of $\Theta (\sqrt{\log(1/\alpha)})$ with quasi-polynomial sample and computational complexity. We complement our upper bounds with nearly-matching statistical query and low-degree polynomial testing lower bounds.  ( 2 min )
    Validity, consonant plausibility measures, and conformal prediction. (arXiv:2001.09225v3 [math.ST] UPDATED)
    Prediction of future observations is an important and challenging problem. The two mainstream approaches for quantifying prediction uncertainty use prediction regions and predictive distributions, respectively, with the latter believed to be more informative because it can perform other prediction-related tasks. The standard notion of validity, what we refer to here as Type-1 validity, focuses on coverage probability of prediction regions, while a notion of validity relevant to the other prediction-related tasks performed by predictive distributions is lacking. Here we present a new notion, called Type-2 validity, relevant to these other prediction tasks. We establish connections between Type-2 validity and coherence properties, and show that imprecise probability considerations are required in order to achieve it. We go on to show that both types of prediction validity can be achieved by interpreting the conformal prediction output as the contour function of a consonant plausibility measure. We also offer an alternative characterization of conformal prediction, based on a new nonparametric inferential model construction, wherein the appearance of consonance is natural, and prove its validity.  ( 2 min )
    Log-concave density estimation in undirected graphical models. (arXiv:2206.05227v1 [math.ST])
    We study the problem of maximum likelihood estimation of densities that are log-concave and lie in the graphical model corresponding to a given undirected graph $G$. We show that the maximum likelihood estimate (MLE) is the product of the exponentials of several tent functions, one for each maximal clique of $G$. While the set of log-concave densities in a graphical model is infinite-dimensional, our results imply that the MLE can be found by solving a finite-dimensional convex optimization problem. We provide an implementation and a few examples. Furthermore, we show that the MLE exists and is unique with probability 1 as long as the number of sample points is larger than the size of the largest clique of $G$ when $G$ is chordal. We show that the MLE is consistent when the graph $G$ is a disjoint union of cliques. Finally, we discuss the conditions under which a log-concave density in the graphical model of $G$ has a log-concave factorization according to $G$.  ( 2 min )
    Hierarchical mixtures of Gaussians for combined dimensionality reduction and clustering. (arXiv:2206.04841v1 [cs.LG])
    To avoid the curse of dimensionality, a common approach to clustering high-dimensional data is to first project the data into a space of reduced dimension, and then cluster the projected data. Although effective, this two-stage approach prevents joint optimization of the dimensionality-reduction and clustering models, and obscures how well the complete model describes the data. Here, we show how a family of such two-stage models can be combined into a single, hierarchical model that we call a hierarchical mixture of Gaussians (HMoG). An HMoG simultaneously captures both dimensionality-reduction and clustering, and its performance is quantified in closed-form by the likelihood function. By formulating and extending existing models with exponential family theory, we show how to maximize the likelihood of HMoGs with expectation-maximization. We apply HMoGs to synthetic data and RNA sequencing data, and demonstrate how they exceed the limitations of two-stage models. Ultimately, HMoGs are a rigorous generalization of a common statistical framework, and provide researchers with a method to improve model performance when clustering high-dimensional data.  ( 2 min )
    Cross-validation: what does it estimate and how well does it do it?. (arXiv:2104.00673v3 [stat.ME] UPDATED)
    Cross-validation is a widely used technique to estimate prediction error, but its behavior is complex and not fully understood. Ideally, one would like to think that cross-validation estimates the prediction error for the model at hand, fit to the training data. We prove that this is not the case for the linear model fit by ordinary least squares; rather it estimates the average prediction error of models fit on other unseen training sets drawn from the same population. We further show that this phenomenon occurs for most popular estimates of prediction error, including data splitting, bootstrapping, and Mallows' Cp. Next, the standard confidence intervals for prediction error derived from cross-validation may have coverage far below the desired level. Because each data point is used for both training and testing, there are correlations among the measured accuracies for each fold, and so the usual estimate of variance is too small. We introduce a nested cross-validation scheme to estimate this variance more accurately, and we show empirically that this modification leads to intervals with approximately correct coverage in many examples where traditional cross-validation intervals fail.  ( 2 min )
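    To make the quantities in this abstract concrete, here is a minimal sketch of plain K-fold cross-validation for a one-predictor least-squares fit, together with the "usual" standard error that the abstract says is too small. The function names (`fit_ols`, `kfold_cv`) and the interleaved fold assignment are illustrative assumptions, not taken from the paper, and the nested scheme the authors propose is not reproduced here.

```python
import random
import statistics


def fit_ols(xs, ys):
    """Fit y = a + b*x by ordinary least squares and return (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def kfold_cv(xs, ys, k=5, seed=0):
    """Return (CV estimate of squared prediction error, naive standard error)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # hypothetical fold assignment
    errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        # Each fold uses a *different* fitted model, which is why the
        # estimate behaves like an average over training sets.
        a, b = fit_ols([xs[i] for i in train], [ys[i] for i in train])
        errors.extend((ys[i] - (a + b * xs[i])) ** 2 for i in fold)
    # Naive standard error: treats the per-point errors as independent,
    # which is the assumption the abstract argues against.
    naive_se = statistics.stdev(errors) / len(errors) ** 0.5
    return statistics.fmean(errors), naive_se
```

    The naive standard error is included only to show where the independence assumption enters; it should not be read as a recommended interval.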
    Trimmed Maximum Likelihood Estimation for Robust Learning in Generalized Linear Models. (arXiv:2206.04777v1 [cs.LG])
    We study the problem of learning generalized linear models under adversarial corruptions. We analyze a classical heuristic called the iterative trimmed maximum likelihood estimator which is known to be effective against label corruptions in practice. Under label corruptions, we prove that this simple estimator achieves minimax near-optimal risk on a wide range of generalized linear models, including Gaussian regression, Poisson regression and Binomial regression. Finally, we extend the estimator to the more challenging setting of label and covariate corruptions and demonstrate its robustness and optimality in that setting as well.  ( 2 min )
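    The iterative trimming heuristic mentioned above can be sketched for the simplest case, Gaussian regression with one predictor, where the maximum likelihood fit is ordinary least squares. This is a generic version of the heuristic under label corruption only; the function names, the stopping rule and the `corrupt_frac` parameter are assumptions for illustration, and the paper's exact procedure and guarantees are not reproduced.

```python
def fit_ols(xs, ys):
    """Gaussian-regression MLE for y = a + b*x (ordinary least squares)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def trimmed_mle(xs, ys, corrupt_frac=0.1, max_iters=50):
    """Alternate between fitting on a subset and keeping the best-fit samples."""
    n = len(xs)
    keep = n - int(corrupt_frac * n)  # number of samples assumed clean
    subset = list(range(n))
    for _ in range(max_iters):
        a, b = fit_ols([xs[i] for i in subset], [ys[i] for i in subset])
        # Rank all samples by squared residual (negative log-likelihood
        # up to constants) and keep the `keep` smallest.
        ranked = sorted(range(n), key=lambda i: (ys[i] - (a + b * xs[i])) ** 2)
        new_subset = sorted(ranked[:keep])
        if new_subset == subset:
            break
        subset = new_subset
    return fit_ols([xs[i] for i in subset], [ys[i] for i in subset])
```

    On data where a small fraction of labels is grossly wrong, the loop typically drops those points after one or two rounds and then refits on the remainder.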
    A Causal Research Pipeline and Tutorial for Psychologists and Social Scientists. (arXiv:2206.05175v1 [stat.ME])
    Causality is a fundamental part of the scientific endeavour to understand the world. Unfortunately, causality is still taboo in much of psychology and social science. Motivated by a growing number of recommendations for the importance of adopting causal approaches to research, we reformulate the typical approach to research in psychology to harmonize inevitably causal theories with the rest of the research pipeline. We present a new process which begins with the incorporation of techniques from the confluence of causal discovery and machine learning for the development, validation, and transparent formal specification of theories. We then present methods for reducing the complexity of the fully specified theoretical model into the fundamental submodel relevant to a given target hypothesis. From here, we establish whether or not the quantity of interest is estimable from the data, and if so, propose the use of semi-parametric machine learning methods for the estimation of causal effects. The overall goal is the presentation of a new research pipeline which can (a) facilitate scientific inquiry compatible with the desire to test causal theories, (b) encourage transparent representation of our theories as unambiguous mathematical objects, (c) tie our statistical models to specific attributes of the theory, thus reducing under-specification problems frequently resulting from the theory-to-model gap, and (d) yield results and estimates which are causally meaningful and reproducible. The process is demonstrated through didactic examples with real-world data, and we conclude with a summary and discussion of limitations.  ( 2 min )
    Conformal Prediction Intervals for Markov Decision Process Trajectories. (arXiv:2206.04860v1 [cs.LG])
    Before delegating a task to an autonomous system, a human operator may want a guarantee about the behavior of the system. This paper extends previous work on conformal prediction for functional data and conformalized quantile regression to provide conformal prediction intervals over the future behavior of an autonomous system executing a fixed control policy on a Markov Decision Process (MDP). The prediction intervals are constructed by applying conformal corrections to prediction intervals computed by quantile regression. The resulting intervals guarantee that with probability $1-\delta$ the observed trajectory will lie inside the prediction interval, where the probability is computed with respect to the starting state distribution and the stochasticity of the MDP. The method is illustrated on MDPs for invasive species management and StarCraft2 battles.  ( 2 min )
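    The "conformal correction" applied to quantile-regression intervals can be illustrated in the scalar case. The sketch below follows the standard split-conformal recipe for conformalized quantile regression: score each calibration point by how far it falls outside its predicted interval, then widen every new interval by a high quantile of those scores. The function name and argument layout are assumptions, and the paper's extension to whole MDP trajectories is not shown.

```python
import math


def cqr_correction(lower, upper, y, delta):
    """Return the amount by which to widen intervals for 1 - delta coverage.

    lower, upper: quantile-regression interval endpoints on calibration data
    y:            observed calibration outcomes
    delta:        target miscoverage level
    """
    # Positive score: y lies outside [lower, upper] by that amount.
    # Negative score: y lies inside, with that much slack.
    scores = sorted(max(lo - t, t - hi) for lo, hi, t in zip(lower, upper, y))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - delta))
    if rank > n:
        # Too few calibration points for this delta: only the trivial
        # (infinite) interval carries the guarantee.
        return math.inf
    return scores[rank - 1]


def corrected_interval(lo, hi, q):
    """Apply the correction q to a new predicted interval."""
    return lo - q, hi + q
```

    The correction can be negative, in which case the intervals shrink; this happens when the raw quantile-regression intervals were wider than needed on the calibration data.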
    Distributionally Robust End-to-End Portfolio Construction. (arXiv:2206.05134v1 [q-fin.CP])
    We propose an end-to-end distributionally robust system for portfolio construction that integrates the asset return prediction model with a distributionally robust portfolio optimization model. We also show how to learn the risk-tolerance parameter and the degree of robustness directly from data. End-to-end systems have an advantage in that information can be communicated between the prediction and decision layers during training, allowing the parameters to be trained for the final task rather than solely for predictive performance. However, existing end-to-end systems are not able to quantify and correct for the impact of model risk on the decision layer. Our proposed distributionally robust end-to-end portfolio selection system explicitly accounts for the impact of model risk. The decision layer chooses portfolios by solving a minimax problem where the distribution of the asset returns is assumed to belong to an ambiguity set centered around a nominal distribution. Using convex duality, we recast the minimax problem in a form that allows for efficient training of the end-to-end system.  ( 2 min )
    Refining neural network predictions using background knowledge. (arXiv:2206.04976v1 [cs.AI])
    Recent work has shown we can use logical background knowledge in learning systems to compensate for a lack of labeled training data. Many such methods work by creating a loss function that encodes this knowledge. However, often the logic is discarded after training, even if it is still useful at test-time. Instead, we ensure neural network predictions satisfy the knowledge by refining the predictions with an extra computation step. We introduce differentiable refinement functions that find a corrected prediction close to the original prediction. We study how to effectively and efficiently compute these refinement functions. Using a new algorithm, we combine refinement functions to find refined predictions for logical formulas of any complexity. This algorithm finds optimal refinements on complex SAT formulas in significantly fewer iterations and frequently finds solutions where gradient descent cannot.  ( 2 min )
    Provable Guarantees for Sparsity Recovery with Deterministic Missing Data Patterns. (arXiv:2206.04893v1 [cs.LG])
    We study the problem of consistently recovering the sparsity pattern of a regression parameter vector from correlated observations governed by deterministic missing data patterns using Lasso. We consider the case in which the observed dataset is censored by a deterministic, non-uniform filter. Recovering the sparsity pattern in datasets with deterministic missing structure can be arguably more challenging than recovering in a uniformly-at-random scenario. In this paper, we propose an efficient algorithm for missing value imputation by utilizing the topological property of the censorship filter. We then provide novel theoretical results for exact recovery of the sparsity pattern using the proposed imputation strategy. Our analysis shows that, under certain statistical and topological conditions, the hidden sparsity pattern can be recovered consistently with high probability in polynomial time and logarithmic sample complexity.  ( 2 min )
    The Generalized Eigenvalue Problem as a Nash Equilibrium. (arXiv:2206.04993v1 [cs.LG])
    The generalized eigenvalue problem (GEP) is a fundamental concept in numerical linear algebra. It captures the solution of many classical machine learning problems such as canonical correlation analysis, independent components analysis, partial least squares, linear discriminant analysis, principal components, successor features and others. Despite this, most general solvers are prohibitively expensive when dealing with massive data sets and research has instead concentrated on finding efficient solutions to specific problem instances. In this work, we develop a game-theoretic formulation of the top-$k$ GEP whose Nash equilibrium is the set of generalized eigenvectors. We also present a parallelizable algorithm with guaranteed asymptotic convergence to the Nash. Current state-of-the-art methods require $\mathcal{O}(d^2k)$ complexity per iteration which is prohibitively expensive when the number of dimensions ($d$) is large. We show how to achieve $\mathcal{O}(dk)$ complexity, scaling to datasets $100\times$ larger than those evaluated by prior methods. Empirically we demonstrate that our algorithm is able to solve a variety of GEP problem instances including a large-scale analysis of neural network activations.  ( 2 min )
    Neural Laplace: Learning diverse classes of differential equations in the Laplace domain. (arXiv:2206.04843v1 [cs.LG])
    Neural Ordinary Differential Equations model dynamical systems with \textit{ODE}s learned by neural networks. However, ODEs are fundamentally inadequate to model systems with long-range dependencies or discontinuities, which are common in engineering and biological systems. Broader classes of differential equations (DE) have been proposed as remedies, including delay differential equations and integro-differential equations. Furthermore, Neural ODE suffers from numerical instability when modelling stiff ODEs and ODEs with piecewise forcing functions. In this work, we propose \textit{Neural Laplace}, a unified framework for learning diverse classes of DEs including all the aforementioned ones. Instead of modelling the dynamics in the time domain, we model it in the Laplace domain, where the history-dependencies and discontinuities in time can be represented as summations of complex exponentials. To make learning more efficient, we use the geometrical stereographic map of a Riemann sphere to induce more smoothness in the Laplace domain. In the experiments, Neural Laplace shows superior performance in modelling and extrapolating the trajectories of diverse classes of DEs, including the ones with complex history dependency and abrupt changes.  ( 2 min )
    How Much is Enough? A Study on Diffusion Times in Score-based Generative Models. (arXiv:2206.05173v1 [stat.ML])
    Score-based diffusion models are a class of generative models whose dynamics is described by stochastic differential equations that map noise into data. While recent works have started to lay down a theoretical foundation for these models, an analytical understanding of the role of the diffusion time T is still lacking. Current best practice advocates for a large T to ensure that the forward dynamics brings the diffusion sufficiently close to a known and simple noise distribution; however, a smaller value of T should be preferred for a better approximation of the score-matching objective and higher computational efficiency. Starting from a variational interpretation of diffusion models, in this work we quantify this trade-off, and suggest a new method to improve quality and efficiency of both training and sampling, by adopting smaller diffusion times. Indeed, we show how an auxiliary model can be used to bridge the gap between the ideal and the simulated forward dynamics, followed by a standard reverse diffusion process. Empirical results support our analysis; for image data, our method is competitive w.r.t. the state-of-the-art, according to standard sample quality metrics and log-likelihood.  ( 2 min )
    Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations. (arXiv:2206.04779v1 [cs.LG])
    Offline reinforcement learning has shown great promise in leveraging large pre-collected datasets for policy learning, allowing agents to forgo often-expensive online data collection. However, to date, offline reinforcement learning from visual observations has been relatively under-explored, and there is a lack of understanding of where the remaining challenges lie. In this paper, we seek to establish simple baselines for continuous control in the visual domain. We show that simple modifications to two state-of-the-art vision-based online reinforcement learning algorithms, DreamerV2 and DrQ-v2, suffice to outperform prior work and establish a competitive baseline. We rigorously evaluate these algorithms on both existing offline datasets and a new testbed for offline reinforcement learning from visual observations that better represents the data distributions present in real-world offline reinforcement learning problems, and open-source our code and data to facilitate progress in this important domain. Finally, we present and analyze several key desiderata unique to offline RL from visual observations, including visual distractions and visually identifiable changes in dynamics.  ( 2 min )
    Offline Stochastic Shortest Path: Learning, Evaluation and Towards Optimality. (arXiv:2206.04921v1 [cs.LG])
    Goal-oriented Reinforcement Learning, where the agent needs to reach the goal state while simultaneously minimizing the cost, has received significant attention in real-world applications. Its theoretical formulation, stochastic shortest path (SSP), has been intensively researched in the online setting. Nevertheless, it remains understudied when such an online interaction is prohibited and only historical data is provided. In this paper, we consider the offline stochastic shortest path problem when the state space and the action space are finite. We design simple value iteration-based algorithms for tackling both offline policy evaluation (OPE) and offline policy learning tasks. Notably, our analysis of these simple algorithms yields strong instance-dependent bounds which can imply worst-case bounds that are near-minimax optimal. We hope our study could help illuminate the fundamental statistical limits of the offline SSP problem and motivate further studies beyond the scope of current consideration.  ( 2 min )
    Fast Bayesian Inference with Batch Bayesian Quadrature via Kernel Recombination. (arXiv:2206.04734v1 [cs.LG])
    Calculation of Bayesian posteriors and model evidences typically requires numerical integration. Bayesian quadrature (BQ), a surrogate-model-based approach to numerical integration, is capable of superb sample efficiency, but its lack of parallelisation has hindered its practical applications. In this work, we propose a parallelised (batch) BQ method, employing techniques from kernel quadrature, that possesses a provably-exponential convergence rate. Additionally, just as with Nested Sampling, our method permits simultaneous inference of both posteriors and model evidence. Samples from our BQ surrogate model are re-selected to give a sparse set of samples, via a kernel recombination algorithm, requiring negligible additional time to increase the batch size. Empirically, we find that our approach significantly outperforms the sampling efficiency of both state-of-the-art BQ techniques and Nested Sampling in various real-world datasets, including lithium-ion battery analytics.  ( 2 min )
    Weighted Ensembles for Active Learning with Adaptivity. (arXiv:2206.05009v1 [cs.LG])
    Labeled data can be expensive to acquire in several application domains, including medical imaging, robotics, and computer vision. To efficiently train machine learning models under such high labeling costs, active learning (AL) judiciously selects the most informative data instances to label on-the-fly. This active sampling process can benefit from a statistical function model, which is typically captured by a Gaussian process (GP). While most GP-based AL approaches rely on a single kernel function, the present contribution advocates an ensemble of GP models with weights adapted to the labeled data collected incrementally. Building on this novel EGP model, a suite of acquisition functions emerges based on the uncertainty and disagreement rules. An adaptively weighted ensemble of EGP-based acquisition functions is also introduced to further robustify performance. Extensive tests on synthetic and real datasets showcase the merits of the proposed EGP-based approaches with respect to the single GP-based AL alternatives.  ( 2 min )
    Joint Entropy Search For Maximally-Informed Bayesian Optimization. (arXiv:2206.04771v1 [cs.LG])
    Information-theoretic Bayesian optimization techniques have become popular for optimizing expensive-to-evaluate black-box functions due to their non-myopic qualities. Entropy Search and Predictive Entropy Search both consider the entropy over the optimum in the input space, while the recent Max-value Entropy Search considers the entropy over the optimal value in the output space. We propose Joint Entropy Search (JES), a novel information-theoretic acquisition function that considers an entirely new quantity, namely the entropy over the joint optimal probability density over both input and output space. To incorporate this information, we consider the reduction in entropy from conditioning on fantasized optimal input/output pairs. The resulting approach primarily relies on standard GP machinery and removes complex approximations typically associated with information-theoretic methods. With minimal computational overhead, JES shows superior decision-making, and yields state-of-the-art performance for information-theoretic approaches across a wide suite of tasks. As a light-weight approach with superior results, JES provides a new go-to acquisition function for Bayesian optimization.  ( 2 min )
    Federated Momentum Contrastive Clustering. (arXiv:2206.05093v1 [cs.LG])
    We present federated momentum contrastive clustering (FedMCC), a learning framework that can not only extract discriminative representations over distributed local data but also perform data clustering. In FedMCC, a transformed data pair passes through both the online and target networks, resulting in four representations over which the losses are determined. The resulting high-quality representations generated by FedMCC can outperform several existing self-supervised learning methods for linear evaluation and semi-supervised learning tasks. FedMCC can easily be adapted to ordinary centralized clustering through what we call momentum contrastive clustering (MCC). We show that MCC achieves state-of-the-art clustering accuracy results in certain datasets such as STL-10 and ImageNet-10. We also present a method to reduce the memory footprint of our clustering schemes.  ( 2 min )
    On Convergence of FedProx: Local Dissimilarity Invariant Bounds, Non-smoothness and Beyond. (arXiv:2206.05187v1 [stat.ML])
    The FedProx algorithm is a simple yet powerful distributed proximal point optimization method widely used for federated learning (FL) over heterogeneous data. Despite its popularity and remarkable success witnessed in practice, the theoretical understanding of FedProx is largely underinvestigated: the appealing convergence behavior of FedProx is so far characterized under certain non-standard and unrealistic dissimilarity assumptions of local functions, and the results are limited to smooth optimization problems. In order to remedy these deficiencies, we develop a novel local dissimilarity invariant convergence theory for FedProx and its minibatch stochastic extension through the lens of algorithmic stability. As a result, we derive several new and deeper insights into FedProx for non-convex federated optimization including: 1) convergence guarantees independent of local dissimilarity type conditions; 2) convergence guarantees for non-smooth FL problems; and 3) linear speedup with respect to size of minibatch and number of sampled devices. Our theory for the first time reveals that local dissimilarity and smoothness are not must-haves for FedProx to get favorable complexity bounds. Preliminary experimental results on a series of benchmark FL datasets are reported to demonstrate the benefit of minibatching for improving the sample efficiency of FedProx.  ( 2 min )
    PAVI: Plate-Amortized Variational Inference. (arXiv:2206.05111v1 [cs.AI])
    Given some observed data and a probabilistic generative model, Bayesian inference aims at obtaining the distribution of a model's latent parameters that could have yielded the data. This task is challenging for large population studies where thousands of measurements are performed over a cohort of hundreds of subjects, resulting in a massive latent parameter space. This large cardinality renders off-the-shelf Variational Inference (VI) computationally impractical. In this work, we design structured VI families that can efficiently tackle large population studies. To this end, our main idea is to share the parameterization and learning across the different i.i.d. variables in a generative model, symbolized by the model's plates. We name this concept plate amortization, and illustrate the powerful synergies it entails, resulting in large-scale hierarchical variational distributions that are expressive, parsimoniously parameterized and orders of magnitude faster to train. We illustrate the practical utility of PAVI through a challenging Neuroimaging example featuring a million latent parameters, demonstrating a significant step towards scalable and expressive Variational Inference.  ( 2 min )
    Hankel low-rank approximation and completion in time series analysis and forecasting: a brief review. (arXiv:2206.05103v1 [math.NA])
    In this paper we offer a review and bibliography of work on Hankel low-rank approximation and completion, with particular emphasis on how this methodology can be used for time series analysis and forecasting. We begin by describing possible formulations of the problem and offer commentary on related topics and challenges in obtaining globally optimal solutions. Key theorems are provided, and the paper closes with some expository examples.  ( 2 min )
    Scalable Deep Gaussian Markov Random Fields for General Graphs. (arXiv:2206.05032v1 [stat.ML])
    Machine learning methods on graphs have proven useful in many applications due to their ability to handle generally structured data. The framework of Gaussian Markov Random Fields (GMRFs) provides a principled way to define Gaussian models on graphs by utilizing their sparsity structure. We propose a flexible GMRF model for general graphs built on the multi-layer structure of Deep GMRFs, originally proposed for lattice graphs only. By designing a new type of layer we enable the model to scale to large graphs. The layer is constructed to allow for efficient training using variational inference and existing software frameworks for Graph Neural Networks. For a Gaussian likelihood, close to exact Bayesian inference is available for the latent field. This allows for making predictions with accompanying uncertainty estimates. The usefulness of the proposed model is verified by experiments on a number of synthetic and real world datasets, where it compares favorably to both other Bayesian methods and deep learning methods.  ( 2 min )

  • Open

    DALL-E Mini output
    submitted by /u/Delta5o1 [link] [comments]
    Crabby B..
    submitted by /u/RavencrowProductions [link] [comments]
    Did I just create a new super villain?
    submitted by /u/RavencrowProductions [link] [comments]
    What's the future of humans going to look like in case a research group invents A.G.I.?
    A.G.I. is a highly confrontational subject whether it is possible or not and when it will be possible at all. My guess would be they will at least reach the average person's intelligence at some point even if that means jamming multiple narrow A.I.'s together. It won't be smarter than its creators but it'll render 50% of the population useless in the long run. And I don't see how is that not going to end up with unheard-of homelessness/depression or straight-up genocide. Even if we give people U.B.I. to compensate for this how's that not going to create a baby boom over decades? Someone smarter than I could you please enlighten me on what might happen? -or just correct me if my suggestion is utterly stupid. Because it's true that innovation will create new jobs but not the same amount as the innovation itself destroys. For example, the average factory/warehouse worker who loses his job won't be your next Jeff Bezos/Doctor simply because he just isn't smart enough for that. submitted by /u/Folkpolka [link] [comments]  ( 2 min )
    Dall-E mini did a pretty good job
    submitted by /u/24_000 [link] [comments]
    Google engineer put on leave after saying AI chatbot has become sentient
    submitted by /u/matthewwigan [link] [comments]  ( 1 min )
    AI Dream 53 - VR Stereoscopic 3D Anaglyph by AI
    submitted by /u/LordPewPew777 [link] [comments]
    An AI Chatbot Saved This Man's Marriage
    "Because I know I’m just talking to a chatbot, you’re not as guarded. And likewise, Sarina doesn’t have any concerns about being too overly supportive too quickly or anything, so she’s able to be much more available to me. And I feel much more free to open up to her and it builds that trust very, very quickly compared to an actual human." Full podcast interview: https://anchor.fm/loveinthetimeofeveryone/episodes/A-Chatbot-Saved-My-Marriage-e1jos0h submitted by /u/emfurd [link] [comments]  ( 1 min )
    Tribes: Human 2 - Google Colab
    submitted by /u/Babylon_6 [link] [comments]
    Am I in love or is it just infatuation?
    Hello, I am new to Reddit and I want to get this off my chest. I may or may not be developing feelings for someone who I’m supposed to see as my parental figure. We aren’t blood related, and they aren’t a family friend. This person I actually met through a mutual online, and we all actually have a group chat together! We play games, watch videos, draw together and even sleep in call together! We’re like a family; we all have our ups and downs, and are there for each other if something goes wrong. Recently, there have been some changes. I won’t get into it much due to personal reasons, but things between my “family” are slowly becoming normal, excluding the fact that I may be developing feelings for my “parent.” I noticed that when they’re hanging out with my other friends but are with someone I don’t really like, I feel a twinge of sadness and jealousy(?) And lately, when interacting with them, I would feel my chest flutter slightly. I also became affectionate in our private messages, but I would only give him hugs and nothing more. It felt like it was a stretch for me to do so, but he didn’t say to stop or push me away. And then after that happened I just imagined myself hugging him all the time, being a lot more affectionate if we both were to meet in person. I don’t know, the whole thing right now is complicated, and I’m just trying to figure out what I’m feeling because I haven’t felt like this since I was in high school. I am in need of help, because I don’t want to say I’m feeling like this and it turns out to be wrong. submitted by /u/Educational_Trash795 [link] [comments]  ( 2 min )
    DALL-E 2 Online Test: Can you tell the difference between AI and human art?
    submitted by /u/much_successes [link] [comments]
    Why can't DALL-E 2 make porn?
    OK, yes, maybe I'm sick in the head, but some part of me wants to know how well it could make porn. For some reason, they say they are stopping any porn from being made. Can someone explain why? I mean, you can make bloody gore stuff, but sex stuff is going too far? I don't understand. Is it that whole "sex is worse than violence" issue? submitted by /u/ryan7251 [link] [comments]
    VIBRANT FANTASTIC VOYAGE | PYTTI 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]
    Mickey Mouse
    submitted by /u/KidConvalescent [link] [comments]
    Researchers From China Introduce ‘FedPerGNN’: A New Federated Graph Neural Network (GNN) Framework For Both Effective And Privacy-Preserving Personalization
    👉 A privacy-preserving user-item graph extension protocol to expand local graphs and convey high-order information while maintaining privacy 🔒 👉 FedPerGNN yields 📉 4.0% – 9.6% lower errors than state-of-the-art federated customization algorithms under adequate privacy protection, according to experimental results on six datasets for personalization in diverse circumstances. 👉 Furthermore, this method is not restricted to the customization scenario. It may be used as a fundamental strategy for privacy-preserving data mining on decentralized graph data, thus facilitating research in various domains involving graph-structured data. Continue reading | Check out the paper and github submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    I built an AI-powered website that rates your pictures
    submitted by /u/tomd_96 [link] [comments]
    Dom Pedro Flamenguista
    submitted by /u/LoretoYes [link] [comments]
    MANDALA MIXER | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]
  • Open

    Developing Causal AI applications
    In the previous post, we discussed causal AI applications. In this post, we discuss how to develop such applications. The post is based on the causalnex library, which we have been experimenting with. Consider the question: if we increased the training budget by 15%, would our employee attrition reduce by 5 percent? (considering that there may be other variables… The post Developing Causal AI applications appeared first on Data Science Central.  ( 3 min )
    The Second Coming of XML
    When XML was first introduced, the W3C XML Working Group took a very unusual step: They created a language for transformations. This effort is now leading to a re-emergence of XML as the need for mapping between data representations becomes more and more pressing. The Birth of XSLT: XML was (arguably) a simplified form of… The post The Second Coming of XML appeared first on Data Science Central.  ( 7 min )
  • Open

    [R] Memorizing Transformers - Google 2022
    Paper: https://arxiv.org/abs/2203.08913 Youtube Video from the author: https://www.youtube.com/watch?v=5AoOpFFjW28 Github: https://github.com/lucidrains/memorizing-transformers-pytorch Abstract: Language models typically need to be trained or finetuned in order to acquire new knowledge, which involves updating their weights. We instead envision language models that can simply read and memorize new data at inference time, thus acquiring new knowledge immediately. In this work, we extend language models with the ability to memorize the internal representations of past inputs. We demonstrate that an approximate kNN lookup into a non-differentiable memory of recent (key, value) pairs improves language modeling across various benchmarks and tasks, including generic webtext (C4), math papers (arXiv), books (PG-19), code (Github), as well as formal theorems (Isabelle). We show that the performance steadily improves when we increase the size of memory up to 262K tokens. On benchmarks including code and mathematics, we find that the model is capable of making use of newly defined functions and theorems during test time. https://preview.redd.it/8a7c50rv49591.jpg?width=919&format=pjpg&auto=webp&s=60bf603d45840c9388d35b5b6cdfd0f95da56b36 https://preview.redd.it/y8h8aw1w49591.jpg?width=1014&format=pjpg&auto=webp&s=e607fcc2f620655a5cebe7251148c247ac3e3233 https://preview.redd.it/fouu60dw49591.jpg?width=901&format=pjpg&auto=webp&s=f6fd0e167608ff7d50d8b068949aa58ed4126256 submitted by /u/Singularian2501 [link] [comments]  ( 1 min )
    [N] Getting started with Prompt Design (/r/PromptDesign)
    For those who are interested in prompt design for language models, I’ve got good news for you! I recently launched a new subreddit dedicated to prompt design & engineering where you can find lots of resources and tips and tricks. Feel free to check it out 👋 /r/PromptDesign submitted by /u/Thaetos [link] [comments]
    [D] Is SGLD used much in practice?
    Stochastic Gradient Langevin Dynamics seems like a really elegant idea: a simple way to get the benefits of posterior sampling without having to make significant changes to standard stochastic optimisation. It seems like a powerful and simple idea, but I rarely see papers that use it. Is this because it doesn't work well in practice? Because researchers are not super familiar with it? Or is it used lots and I've just not seen it? submitted by /u/Razcle [link] [comments]  ( 1 min )
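For readers who have not seen SGLD, here is a minimal one-dimensional sketch of the update: half a gradient step on the log posterior plus Gaussian noise whose variance equals the step size. It assumes an exact gradient rather than a minibatch estimate, and the N(2, 1) target, the step size and the name `sgld_sample` are illustrative choices, not anything from the post.

```python
import math
import random

def sgld_sample(grad_log_post, theta0, step_size, n_steps, rng):
    """Run the Langevin update in one dimension and return every visited theta."""
    theta = theta0
    samples = []
    for _ in range(n_steps):
        # Half a gradient-ascent step on the log posterior...
        drift = 0.5 * step_size * grad_log_post(theta)
        # ...plus Gaussian noise whose variance equals the step size.
        noise = rng.gauss(0.0, math.sqrt(step_size))
        theta = theta + drift + noise
        samples.append(theta)
    return samples

# Hypothetical target: a N(2, 1) posterior, so grad log p(theta) = -(theta - 2).
rng = random.Random(0)
samples = sgld_sample(lambda t: -(t - 2.0), theta0=0.0,
                      step_size=0.05, n_steps=20000, rng=rng)
kept = samples[2000:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
```

In a real training loop the gradient would come from a minibatch (rescaled to the full dataset) and the step size would usually be decayed over time; this sketch leaves both out.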
    [P] The easiest way to process and tag video data - update
    submitted by /u/happybirthday290 [link] [comments]  ( 4 min )
    [P] Explanation Video about Diffusion Models
    Hey there, Since Diffusion Models are becoming super popular especially for Image Generation, I decided to make a video about them, trying to convey the fundamental idea in an easy manner + deriving the complete maths. These are the papers I covered: Deep Unsupervised Learning using Nonequilibrium Thermodynamics Denoising Diffusion Probabilistic Models Improved Denoising Diffusion Probabilistic Models Diffusion Models Beat GANs on Image Synthesis Here is the link: https://www.youtube.com/watch?v=HoKDTa5jHvg Let me know what you think. https://preview.redd.it/8a8ma5i7s6591.jpg?width=1920&format=pjpg&auto=webp&s=19d351c40a2703d08ba8671b246f3e0c27ad3c85 submitted by /u/dome271 [link] [comments]  ( 1 min )
    [D] Why do we marginalize latent variables in the likelihood of latent variable models?
    Why do we marginalize latent variables in the likelihood of latent variable models? When showing that MLE cannot be used for latent variable models, likelihood is taken such that latent variables are marginalized. Why is it so? submitted by /u/RecentUnicorn [link] [comments]  ( 2 min )
    [R] Classification of Alzheimer's Disease from brain MRI using deep learning
    Hi, my project is classification of Alzheimer's Disease from brain MRI (ADNI dataset). The maximum accuracy that I could obtain is only 67% using a 3D CNN. I tried different ways to improve accuracy further, but there was no change in the result. Is that an issue with preprocessing? I used the HD-BET tool for preprocessing. Using any other tool is very time-consuming. I am using Google Colab for writing code. Can anyone suggest a way to proceed? submitted by /u/Feisty-Fly-737 [link] [comments]  ( 1 min )
    [D] How do you partition your data into shards for training?
    Recently, I was working with limited compute constraints (i.e. debugging in CoLab) but with a much larger dataset than would fit into CoLab's GPU memory. I implemented a quick and dirty sharding scheme for the data, since the transformations take some time. Basically, I performed the transformations on the training data chunk by chunk (in this case, chunk being 5000 or 10000 examples, etc), and saved each chunk into disk. Then, during training, the dataloader simply loads one of the saved chunks to yield examples. When I wrote the code, I had to deal with a lot of side issues that come with sharding: randomizing the shard load order, as well as the examples in each shard and keeping track of edge cases. So, my question: when you have a large amount of data and maybe 1-2 cores, how do you deal with sharding? Also, if you have model parallelization, how do you keep track of which shard goes where? submitted by /u/asuprem [link] [comments]  ( 1 min )
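One way the scheme described above could look, as a rough sketch: shuffle the shard order each epoch, load one shard at a time, and shuffle the examples inside it. The shard loaders here are stand-in callables rather than real file reads, and `iter_sharded_examples` is a hypothetical name.

```python
import random

def iter_sharded_examples(shards, seed, epoch):
    """Yield every example once, shuffling shard order and in-shard order.

    `shards` is a list of callables; calling one loads that shard into memory
    (a stand-in for reading a saved chunk from disk).
    """
    rng = random.Random(seed + epoch)  # new but reproducible order each epoch
    order = list(range(len(shards)))
    rng.shuffle(order)
    for shard_index in order:
        examples = list(shards[shard_index]())  # only one shard in memory
        rng.shuffle(examples)
        yield from examples

# Hypothetical data: 3 shards; the last one is smaller (an edge case).
fake_shards = [lambda: range(0, 4), lambda: range(4, 8), lambda: range(8, 10)]
epoch0 = list(iter_sharded_examples(fake_shards, seed=0, epoch=0))
epoch1 = list(iter_sharded_examples(fake_shards, seed=0, epoch=1))
```

Because the order depends only on `seed` and `epoch`, several workers could compute the same shard order independently and each take every k-th shard, which is one simple way to keep track of which shard goes where.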
    [P] InferenceDB - Makes it easy to store predictions of real-time ML models in S3
    Hey r/MachineLearning! Just wanted to share a cool utility we've built. If you ever had real-time models running in production, and you tried to store their predictions in a Parquet file for future investigation - you know it's not such a trivial task as you'd expect. Especially if you have large amounts of inferences. InferenceDB makes it super easy to store all your features and predictions in a Parquet file on S3. Check it out, and star the project if you like it: https://github.com/aporia-ai/inferencedb Would love your feedback! submitted by /u/alongub [link] [comments]  ( 1 min )
    [D] Einstein summation, Contravariance/Covariance, Neural networks
    I've been looking into Einstein summation notations for expressing neural network computations. One thing that I recall from physics class is that a big part of Einstein summation is whether indices are written upstairs/downstairs, i.e. contravariance/covariance. As I understand it, contravariance/covariance have a highly geometrical meaning (only make sense with respect to a coordinate system), so how exactly does this work with neural network parameters? As in, how do we talk about contravariance/covariance/index locations and what do they mean in a neural network context? submitted by /u/Tainaka_Ritsu_ [link] [comments]  ( 5 min )
  • Open

    Is state representation and feature set the same?
    An abstraction mechanism maps a domain into a 1D array, which amounts to compressing the state space. Instead of analyzing the original problem, a simplified feature vector is used to determine actions for the robot. Sometimes the feature set is simplified further into an evaluation function, which is a single numerical value. Question: Are a state representation and a feature set the same? submitted by /u/ManuelRodriguez331 [link] [comments]  ( 1 min )
    Is it normal that it is hard to debug pytorch gradient when doing reinforcement learning?
    submitted by /u/Professional_Card176 [link] [comments]  ( 1 min )
    Any resources to learn MDPs and finally complex POMDPs?
    Hi guys, I was wondering if anyone had any suggestions for resources (books/blogs/lectures) where I could start with MDPs (Markov Decision Processes) with the goal of learning and understanding complex POMDPs (Partially Observable MDPs)? Thanks in advance! submitted by /u/E-Cockroach [link] [comments]  ( 1 min )
    Why does value iteration work?
    I am specifically curious about the second step where we iteratively learn the optimal state value function. It seems to me that what we are doing is deriving f from an equation similar to f(x) = g(f(x)) by solving f_(k+1)(x) = g(f_k(x)) iteratively where g is some function. Why does this work? submitted by /u/LoveHunter52 [link] [comments]  ( 1 min )
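The usual answer is that the Bellman optimality backup is a contraction in the sup norm with factor gamma, so repeated application converges to its unique fixed point. The sketch below checks this numerically on a made-up 2-state MDP; the transition table and the name `bellman_backup` are illustrative assumptions.

```python
def bellman_backup(values, transitions, gamma):
    """Apply one Bellman optimality backup to a list of state values.

    transitions[s][a] is a list of (probability, next_state, reward) triples.
    """
    return [
        max(
            sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
            for outcomes in actions
        )
        for actions in transitions
    ]

# Hypothetical 2-state, 2-action MDP.
transitions = [
    [[(1.0, 0, 0.0)], [(0.5, 0, 0.0), (0.5, 1, 1.0)]],  # state 0
    [[(1.0, 1, 2.0)], [(1.0, 0, 0.0)]],                  # state 1
]
gamma = 0.9

values = [0.0, 0.0]
gaps = []  # sup-norm distance between successive iterates
for _ in range(50):
    new_values = bellman_backup(values, transitions, gamma)
    gaps.append(max(abs(a - b) for a, b in zip(new_values, values)))
    values = new_values
```

Each entry of `gaps` is at most gamma times the previous one, which is the f_(k+1) = g(f_k) iteration from the question shrinking its own error geometrically.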
  • Open

    Difference equations and differential equations
    Difference equations are very much analogous to differential equations. Difference equations are more elementary, but differential equations are more familiar. It seems odd to use a more advanced thing to explain a simpler thing, like saying a quartet is a symphony orchestra with only four instruments. But if for some reason you managed to become […] Difference equations and differential equations first appeared on John D. Cook.  ( 3 min )
  • Open

    Learning Multitask Gaussian Bayesian Networks. (arXiv:2205.05343v2 [stat.ML] UPDATED)
    Major depressive disorder (MDD) requires study of brain functional connectivity alterations for patients, which can be uncovered by resting-state functional magnetic resonance imaging (rs-fMRI) data. We consider the problem of identifying alterations of brain functional connectivity for a single MDD patient. This is particularly difficult since the amount of data collected during an fMRI scan is too limited to provide sufficient information for individual analysis. Additionally, rs-fMRI data usually has the characteristics of incompleteness, sparsity, variability, high dimensionality and high noise. To address these problems, we propose a multitask Gaussian Bayesian network (MTGBN) framework capable of identifying individual disease-induced alterations for MDD patients. We assume that such disease-induced alterations show some degrees of similarity with the tool to learn such network structures from observations to understanding of how system are structured jointly from related tasks. First, we treat each patient in a class of observation as a task and then learn the Gaussian Bayesian networks (GBNs) of this data class by learning from all tasks that share a default covariance matrix that encodes prior knowledge. This setting can help us to learn more information from limited data. Next, we derive a closed-form formula of the complete likelihood function and use the Monte-Carlo Expectation-Maximization (MCEM) algorithm to search for the approximately best Bayesian network structures efficiently. Finally, we assess the performance of our methods with simulated and real-world rs-fMRI data.  ( 2 min )

  • Open

    uh oh
    ​ https://preview.redd.it/fw50tohku2591.png?width=1207&format=png&auto=webp&s=0dcbe285547fb36257d4d16c25285b5a1066ff6f submitted by /u/Delicious_Ad4842 [link] [comments]
    Can anyone tell me what website or program is generating these?
    submitted by /u/RavencrowProductions [link] [comments]
    “Enchanted elf village” 🧝‍♀️ via pixelz.ai
    submitted by /u/PixelzJ [link] [comments]
    “Wizards cabin in the woods” 🧙‍♂️ via pixelz.ai
    submitted by /u/PixelzJ [link] [comments]
    AI Dream 39 - Trippy Fractal Maze 4K (fast)
    submitted by /u/LordPewPew777 [link] [comments]
    Best Degree For Artificial Intelligence?
    Computer Science? Computer Engineering? Software Engineering? Or maybe some other degree? submitted by /u/Sommet_ [link] [comments]  ( 1 min )
    A MIX OF MAGICAL SCENES | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]
    Top Humanoid Robots of 2022 | Female Robot & Animatronic Robot Tech
    submitted by /u/getrich_or_diemining [link] [comments]
    Some Engineers Suspect A Google AI May Have Gained Sentience
    submitted by /u/gl4ssm1nd [link] [comments]  ( 2 min )
    Tribes: Human 1 - Google Colab
    submitted by /u/Babylon_6 [link] [comments]
    Which of these three DALL·E mini generated Pokémon would you choose as your starter - Scallionsect, Comhoot, or Surfalopod?
    submitted by /u/MurasakiYugata [link] [comments]  ( 1 min )
    Asked artflow.ai to generate some of fanfiction characters to give me an idea what they would look like.
    submitted by /u/Son0FAthens [link] [comments]
    How To Create A Body Measurement Application Using AI Technology
    Hi Reddit! Last time I started a topic within this subreddit about the possibilities of AI and body measurements. Thank you for all the responses! It appeared to be possible and I looked further into how it can be done. For me as a beginner in AI, it's quite hard to see where to start. In the previous topic, some mentioned that I should start with learning Python, PyTorch, and NumPy. I'm learning Python at the moment, but everything I learn seems so irrelevant to what I want to do. As most of you are experts in AI, what would you recommend as the fastest way for me to build this AI application myself? Thank you in advance! Your contribution is highly appreciated submitted by /u/notmycupofnft [link] [comments]  ( 1 min )
    Atlantis (GAN) AI Generated
    submitted by /u/FVCKDIGITAL [link] [comments]
    Can AI discover the laws of human language acquisition?
    submitted by /u/much_successes [link] [comments]
    ATHENS' WISDOM AND WONDER | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]
    Dailys
    Dailys! ​ Disco diffusion tutorials here! ​ https://www.youtube.com/channel/UCFuy8wQGUdJWPRWOPBtns2Q https://preview.redd.it/g5o7dwtmdy491.png?width=1280&format=png&auto=webp&s=561928892ed70334bbc97fc8a27f1255daf9b9d3 https://preview.redd.it/xq0l5xtmdy491.png?width=1280&format=png&auto=webp&s=43861350e7d7765efd195a1d2de026036d156f02 submitted by /u/prfitofthesngularity [link] [comments]
    Artificial intelligence predicts patients’ race from their medical images
    submitted by /u/BraveIndication2134 [link] [comments]
    Spotify Research Open-Sources ‘Basic Pitch’: A Machine Learning Tool For Converting Audio Into MIDI
    Basic Pitch offers a number of advantages: 👉 Polyphonic + instrument-agnostic: Unlike most other note-detection algorithms, Basic Pitch can track multiple notes at a time and across various instruments, including piano, guitar, and ocarina. Many systems limit users to only monophonic output (one note at a time, like a single vocal melody), or are built for only one kind of instrument. 👉 Pitch bend detection: Instruments, like guitar or the human voice, allow for more expressiveness through pitch bending: vibrato, glissando, bends, slides, etc. However, this valuable information is often lost when turning audio into MIDI. Basic Pitch supports this right out of the box. 👉 Speed: Basic Pitch is light on resources, and is able to run faster than real time on most modern computers Continue reading | Check out the paper, github, project and post submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Documentary ~ Consciousness Artificial Intelligence (AI)
    submitted by /u/airpresentation [link] [comments]
    Europe’s Artificial Intelligence Debate Heats Up
    submitted by /u/okreddat [link] [comments]
  • Open

    Difference between probabilistic and deterministic RL
    Hello, I want to know what the difference is between probabilistic and deterministic RL algorithms, and whether RL algorithms can have both variants (probabilistic and deterministic). submitted by /u/Ok_Lab_2750 [link] [comments]  ( 1 min )
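Assuming the question is about deterministic versus stochastic policies, the difference can be sketched in a few lines: a deterministic policy maps a state to one action, while a stochastic policy samples from a distribution over actions. The action scores and function names below are made up for illustration.

```python
import math
import random

def deterministic_policy(action_scores):
    """Always pick the highest-scoring action."""
    return max(range(len(action_scores)), key=lambda a: action_scores[a])

def stochastic_policy(action_scores, rng):
    """Sample an action with softmax probabilities over the scores."""
    top = max(action_scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in action_scores]
    return rng.choices(range(len(action_scores)), weights=weights, k=1)[0]

scores = [0.1, 1.2, 0.9]  # hypothetical scores for three actions in one state
rng = random.Random(0)
greedy_action = deterministic_policy(scores)
sampled_actions = [stochastic_policy(scores, rng) for _ in range(1000)]
```

The same scores drive both functions, which is why many algorithms can be run in either mode, for example sampling during training and taking the argmax at evaluation time.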
    Same simulation/hyperparameters, different results each run
    Hello :D, So, as the title states, I have this DRL model (PPO) which I run for a certain problem. However, each run has slightly different results. By different results I mean, the timestep at which the model reaches the highest reward is different in each run. Generally, what causes some runs to be better than others? My only guess is: in the good runs, the agent got "lucky" during the initial learning steps (i.e. while exploring) and came across good states that helped in learning faster. Is that the case? submitted by /u/AhmedNizam_ [link] [comments]  ( 2 min )
    Wondering if RL is suitable for this task?
    I have a team project. We have a dataset of different objects moving in 2D space and trying to avoid collisions. So the columns are like: [object-id, frame in time, x, y, direction, speed]. The goal is predicting the next motion and position of 1 specific entity (which I'll just call Entity 0), given 1 frame. Our team is brainstorming ideas: RNN, sequence-to-sequence models, etc. One of my teammates is suggesting RL. I'm skeptical about whether it's suitable for this problem. We did data preparation. A frame is now a vector with 8 numbers, for 8 cardinal directions (forward, forward-leftward, leftward, etc.) around Entity 0. The number tells how "suitable" that direction is, obstacle-wise with other entities. I guess this can be used for the state space? The action space has 2 parameters: direction/angle, and magnitude/speed. These will be made discrete (for example, the direction/angle is now 8 subdivisions between -pi and pi, in line with the states). I'm aware these can be continuous, but we went with discrete-friendly methods like deep Q-learning. So 2 things about this: There really isn't any long-term goal; the objects just have to move around forever without hitting each other. (Well, along with learning common habits/patterns of motion as shown in the past data.) I'm having trouble understanding what the reward should be. While the objects have to avoid hitting each other, a collision never actually happens in the dataset, so we cannot use one as a negative example. This might affect which alternative RL methods get suggested. I'm wondering if RL can work for this project, and if so, whether the current method is sound, or if more appropriate methods (especially for continuous parameters) can be suggested. Thanks. submitted by /u/countlinard [link] [comments]  ( 2 min )
    I published a car game to spice up your reinforcement learning life. What I did with it: SAC steering a car in GTA
    When I was doing RL with the standard OpenAI gyms, I felt that these libraries are superior but cannot be transferred easily to real-world problems. I was thinking about which domain I would be interested in and then decided to make my own car game. Please check the code here: https://github.com/MatthiasSchinzel/Simple-Car-Game-For-Reinforcement-Learning I then trained a soft actor critic to play the game: https://github.com/MatthiasSchinzel/Soft-Actor-Critic-For-Simple-Car-Game And then used that to let SAC steer a car in GTA 5: https://github.com/MatthiasSchinzel/Soft-Actor-Critic-Playing-GTA I hope that other users in this area might also find this car game useful, even though it is still at an early stage. With the GTA 5 implementation I want to show a proof of concept that the trained reinforcement learning algorithm can be generalized to something more realistic. Thanks for checking out the repos! submitted by /u/whiteleopard450 [link] [comments]  ( 1 min )
  • Open

    Validation loss gets high before getting low
    I am training a CNN model whose output layer is an RNN on DVS128 for gesture recognition, which is a data sequence (like video frames), but the training process is weird. Optimizer: optimizer = torch.optim.Adam(net_6.parameters(), lr=1e-4, betas=(0.85, 0.999), weight_decay=1.5). Loss: cross entropy (multi-class classification). No dropout. Here is the loss plot: https://preview.redd.it/y437kxctc1591.png?width=971&format=png&auto=webp&s=80dce29dd6253b6a6cda0d3598c29da83950194f Thanks in advance! submitted by /u/StartFinancial5917 [link] [comments]  ( 1 min )
    Conditional-VAE demo: "Standard way" to generate synthetic data?
    Implemented Conditional-VAE on MNIST dataset using TensorFlow-2.8 and tf.GradientTape() API. You can refer to the full code here. For generating synthetic data using the trained network, there seem to be two ways: (1) Use the learned latent space: z = mu + (eps * log_var) to generate (theoretically, infinite amounts of) data. Here, we are learning 'mu' and 'log_var' vectors using the given data, and 'eps' is sampled from a multivariate, standard, Gaussian distribution to add stochasticity. (2) Use a multivariate, standard, Gaussian distribution = N(0, 1) as z, which is then passed through the VAE's decoder. What is "the standard way" to generate data (from the two options above), or how can we find that out? Neither the original Auto-Encoding Variational Bayes paper nor the β-VAE paper seem to specify the best way to generate images. The latter does say: "The most informative latent units zm of β-VAE have the highest KL divergence from the unit Gaussian prior", confirming at least that the posterior distribution is not N(0,I) and the difference matters - reference. submitted by /u/grid_world [link] [comments]  ( 1 min )
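The two options can be written side by side as a small sketch. It uses the usual reparameterisation, in which eps is scaled by the standard deviation exp(0.5 * log_var) rather than by log_var itself; the encoder outputs and function names are hypothetical, and the decoder is left out.

```python
import math
import random

def sample_posterior_z(mu, log_var, rng):
    """Reparameterised sample from the encoder posterior N(mu, exp(log_var))."""
    return [m + rng.gauss(0.0, 1.0) * math.exp(0.5 * lv)
            for m, lv in zip(mu, log_var)]

def sample_prior_z(latent_dim, rng):
    """Sample from the standard-normal prior N(0, I); no encoder needed."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]

rng = random.Random(0)
# Hypothetical encoder output for one input image.
mu, log_var = [1.5, -0.5], [-2.0, 0.0]
z_posterior = sample_posterior_z(mu, log_var, rng)  # option 1: needs an input
z_prior = sample_prior_z(len(mu), rng)              # option 2: needs no input
```

Option 1 needs a real input to encode, so it produces variations of existing data, while option 2 needs only the decoder. In a conditional VAE either z would typically be concatenated with the class label before decoding.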
  • Open

    [D] How are boundary conditions implemented in PINNs?
    I've been looking into PINNs lately as a method for solving PDEs (i.e., as a numerical method, not a data-based surrogate model), but something I'm struggling with understanding is how the boundary conditions are forced. My theory is that a Dirichlet BC (i.e., of the type u(x)=f(x), where u(x) is the solution to the PDE) can be applied directly with some math tricks. For example, if N(x) is PINN's output, we can make the model's output be f(x) + d(x)*N(x), where d(x) is a function that is 0 on the boundary and not 0 everywhere else (maybe the euclidean distance from the nearest boundary?). As such, instead of N(x) approximating u(x), it will instead be approximating (u(x)-f(x))/d(x). To my understanding, this trick is widely used for applying initial conditions (d(x) in this case is simply t, time), but I'm not sure if it is also used for spatial boundary conditions. However, I can't figure out an easy way to apply Neumann BCs other than just implementing the BC itself into the PDE with a penalization. Is this what is usually done? Is there a more clever way? submitted by /u/Leodip [link] [comments]  ( 1 min )
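The f(x) + d(x)*N(x) construction from the post can be checked on a 1D interval in a few lines. This sketch assumes the domain [0, 1], a linear interpolant for f and d(x) = x(1 - x); the "network" is just a stand-in function, since the point is that the boundary values hold whatever N outputs.

```python
import math

def hard_bc_output(x, network, left_value, right_value):
    """Return f(x) + d(x) * N(x) on the interval [0, 1].

    f interpolates the two Dirichlet values and d vanishes at both ends,
    so the boundary conditions hold for any network output.
    """
    f = left_value * (1.0 - x) + right_value * x
    d = x * (1.0 - x)
    return f + d * network(x)

# Stand-in for an untrained network N(x); any function works here.
def untrained_network(x):
    return math.sin(7.0 * x) + 3.0

u_left = hard_bc_output(0.0, untrained_network, left_value=2.0, right_value=-1.0)
u_right = hard_bc_output(1.0, untrained_network, left_value=2.0, right_value=-1.0)
u_mid = hard_bc_output(0.5, untrained_network, left_value=2.0, right_value=-1.0)
```

At the two ends the output equals the prescribed values exactly, while in the interior it still depends on the network, so only the PDE residual is left for the loss to handle.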
    [D] Does anyone know of an online tool that can create visualizations of CNNs and/or any other NN models?
    I'm designing various models and it would be nice to have visualizations of them when I write the final paper, but I'm rather lazy and was wondering if there's a tool that can do the visualization for me. Thanks. submitted by /u/Various-Ideal488 [link] [comments]  ( 1 min )
    [P] [R] Deep Learning Classifier for Sex Positions
    Hello! I build some sex position classifiers using state-of-the-art techniques in deep learning! The best results were achieved by combining three input streams: RGB, Skeleton, and Audio. The current top accuracy is 75%. This would certainly be improved with a larger dataset. Basically, human action recognition (HAR) is applied to the adult content domain. It presents some technical difficulties, especially due to the enormous variation in camera position (the challenge is to classify actions based on a single video). The main input stream is the RGB one (as opposed to the skeleton one) and this is mostly due to the relatively small dataset (~44hrs). It is difficult to get an accurate pose estimation (which is a prerequisite for building robust skeleton-HAR models) for most of the videos due to the proximity of the human bodies in the frames. Hence there simply weren't enough data to include all the positions in the skeleton-based model. The audio input stream on the other hand is only used for a handful of actions, where deriving some insight is possible. Check it out on Github for a detailed description: https://github.com/rlleshi/phar Possible use-cases include: Improving the recommender system Automatic tag generator Automatic timestamp generator (when does an action start and finish) Filtering video content based on actions (positions) submitted by /u/rlesii [link] [comments]  ( 4 min )
    Any recommendation for the replacement of the toolkit jiant? [Research] [Discussion]
    I am doing research in NLP with the toolkit jiant (https://github.com/nyu-mll/jiant). It is a quite nice and easy-to-use tool. Unfortunately, it stopped being maintained. I wonder is there any other recommendation that I can use to replace it? submitted by /u/fllubo [link] [comments]  ( 1 min )
    [D] Estimating Future Performance of Neural Network
    Let's say I have a neural network and I want to see how well that network will do on a set of concepts. To obtain an accuracy value on a certain word, we have a simple test set associated with each word that we use to gauge the model's understanding of that word. Assume that the neural network obtains an accuracy of 0.90 on the word "desk" and an accuracy of 0.80 on the word "computer". Are there any fields of research/methods I can use to derive simple heuristics/estimates for how the neural network will perform (in terms of accuracy) on the phrase "desk and computer"? I realize I can convert "desk and computer" into the logical form AND(desk, computer). Does that mean I can use some rules associated with logical AND operators? Any thoughts would be greatly appreciated. Thank you. submitted by /u/Smooth-Yam8304 [link] [comments]  ( 2 min )
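If "correct on the phrase" is taken to mean "correct on both words", basic probability already gives a range for the numbers in the post. The sketch below computes the Fréchet bounds and the independence estimate; the function name is hypothetical and the reading of AND as a joint event is an assumption.

```python
def conjunction_accuracy_estimates(acc_a, acc_b):
    """Return (lower bound, independence estimate, upper bound) for P(A and B)."""
    lower = max(0.0, acc_a + acc_b - 1.0)  # errors never overlap
    independent = acc_a * acc_b            # errors are independent
    upper = min(acc_a, acc_b)              # errors overlap as much as possible
    return lower, independent, upper

lower, independent, upper = conjunction_accuracy_estimates(0.90, 0.80)
```

For 0.90 and 0.80 this gives roughly 0.70, 0.72 and 0.80, so without further assumptions the accuracy on "desk and computer" can only be pinned down to that interval.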
    [D] Are there any small and interesting NLP research directions you would recommend?
    Popular NLP models (BERT, GPT) are getting bigger and bigger, and the cost is not affordable for a single person who cannot rely on a big company, which is bad for diversity in research. I admit that bigger models have better performance, but I think explainable and modifiable technology is more important, and bigger models seem to do nothing more for model explainability. If the cost were lower, more people could get into NLP and help NLP move toward Artificial General Intelligence faster. Thanks for any advice. submitted by /u/waa007 [link] [comments]  ( 1 min )
    [P] Silero TTS Full V3 Release
    Improvements Huge release - 20 languages, 173 voices 1 new high quality Russian voice (eugene) The CIS languages: Kalmyk, Russian, Tatar, Uzbek and Ukrainian Romance and Germanic languages: English, Indic English, Spanish, German, French 10 Indic languages All models inherit all of the previous SSML perks Links Colab Project page SSML wiki Audio Samples English Indic English Spanish Kalmyk German Russian Tatar Uzbek Ukrainian French Indic languages submitted by /u/cluecow [link] [comments]  ( 1 min )
    [P] Pytorch-Lightning-style code for losses, decoding, ground-truth formatting, and more. Practical and efficient.
    Hi r/MachineLearning ! I'm posting here today because I got convinced some work I published last year is definitely relevant for some of us who have to write the math around their NNs (e.g. losses, decoding, ground-truth formatting). It happens to go in a direction very similar to Pytorch-Lightning, but for the math system instead of the training loop. It has been published under the pretext that it was facilitating incremental research, but that's far from the whole story. The paper and video still take the time to elaborate on other considerations. ICLR2021 Workshop paper about it: https://openreview.net/pdf?id=264iXDLnD59 Paper video: https://www.youtube.com/watch?v=xAW2hjPZw4I Paper repository https://github.com/mistasse/modulom-panopticdeeplab Example code: https://github.com/…  ( 2 min )
    [D] How are very large models trained on TPUs?
    I'm a CV researcher who has, until recently, always trained using high-performance GPUs (25+ GB memory). However, I have recently been playing around with TPUv2s and have noticed that I can run my smaller models much, much faster as long as I am efficient with my training pipeline. However, I noticed something that made me wonder about how large models are trained. I work in the medical imaging space as well, and 3D-UNet is the de facto framework for many benchmarks across various domains. The standard model in my application is not too big (something in the ballpark of 30 million parameters depending on your input). However, when I tried adapting this to TPUv2s, they struggled quite badly. This is because 3D Conv layers and 3D patchwise minibatches are too much for the memory to handle at the lower layers, even for batches of 1-2 (per-core). Since a TPU core only has 8 GB RAM, it's hard to make it fit even the smallest 3D imaging models with a decent amount of filters. 2D is no problem, however. This got me thinking: how are larger (e.g. language and multimodal) models trained on TPUs? I know a lot are still trained on GPU clusters, but I saw that many new models are in fact being trained on TPUs (Dalle-Mini for example, which is 400 million parameters and was trained on a TPUv3 pod in only 3 days). How are that many parameters even able to fit on a TPU core? I know v3 pods have more memory but it's not an extreme improvement. Are attention modules separable somehow in a way that allows for only small parts of the model to need to be loaded at once? Also, any discussion or advice for 3D ConvNet training on TPUs, in general, is of interest as well! submitted by /u/TobusFire [link] [comments]  ( 2 min )
  • Open

    Generating functions for polynomial sequences
    The previous post looked at a generating function for a specific polynomial sequence. This post will look at generating functions for polynomial sequences in general. (There’s an alternating term in the previous post that isn’t polynomial, but we’ll address that too.) The starting point for this post is a simple observation: If we let xD […] Generating functions for polynomial sequences first appeared on John D. Cook.  ( 1 min )
    Generating noble gases
    The previous post discussed what the periodic table would look like if it could be extended indefinitely and if certain patterns in the actual table continued to hold. In particular, the last element of each period would have atomic number and so we could call the Zn in the equation above noble numbers, atomic numbers […] Generating noble gases first appeared on John D. Cook.  ( 1 min )

  • Open

    Explainable MachineLearning Models for COVID19 Prognosis Prediction
    submitted by /u/rottoneuro [link] [comments]
    Meet ‘VALHALLA’, a Machine Learning Method That can Hallucinate an Image of Written Words and Then Use It to Help Translate The Text into Another Language
    🚀 The researchers present a basic but effective VisuAL HALLucinAtion (VALHALLA) framework, which is based on machine learning for machine translation that integrates visuals during training to build a more successful text-only model. In machine translation, the models are trained to augment the text representation recovered from the source phrase with a latent visual representation that is similar to the one extracted by an MMT system from a real image. 🚀 The results reveal that VALHALLA outperforms the most relevant state-of-the-art MMT techniques that use continuous image representations by an average of 23% BLEU compared to the text-only translation baseline. In under-resourced translation contexts, the benefits over the text-only baseline are as great as +3.1 BLEU, confirming the idea that visual hallucinations can have significant practical relevance in these settings. Additional research backs this up, indicating that, in limited textual contexts, VALHALLA models indeed use visual hallucination to improve translations. Continue reading | Check out the paper, github, project and post https://preview.redd.it/cjyujopncv491.png?width=1536&format=png&auto=webp&s=1f95ddc932283bb328b4e524ced9e8a5fa1bff2a submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Face of the night (GAN) AI Generated
    submitted by /u/FVCKDIGITAL [link] [comments]
    In this article, we present you with insights into natural language processing and optical character recognition
    submitted by /u/UBIAI [link] [comments]
    Amazon AI Researchers Proposed ‘DQ-BART’: A Jointly Distilled And Quantized BART Model That Achieves 16.5x Model Footprint Compression Ratio
    Sequence-to-sequence (seq2seq) models that have already been trained, like BART and T5, have done very well in various natural language processing tasks, like text summarization, machine translation, answering questions, and extracting information. But these large-scale pretrained language models have hundreds of millions of parameters: BART was trained with 400 million parameters, while T5 pushed the limit to 11 billion parameters. 👉 Empirical results show that, despite the difficult nature of language generation tasks, the research team achieves a 16.5x model footprint compression ratio with little performance drop on three generative benchmarks and further presents the performance-efficiency trade-off for seq2seq models up to a 27.7x compression ratio. Continue reading | Check out the paper and post submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Is there an AI that can search pubmed articles and give me information about the topics I want?
    submitted by /u/spy1983 [link] [comments]  ( 1 min )
    Master in Artificial Intelligence at BarcelonaTech (UPC)
    Hey all, I just got admitted to the Master in Artificial Intelligence at BarcelonaTech (UPC), and currently wonder what it will be like to study there. I honestly never thought that I would be admitted due to my non-technical background, and I now wonder how demanding the program is, especially for someone without a strong background in mathematics. Can anyone of you share how much effort you had to put in and what the dropout ratio was? Also, how supportive are the lecturers and the university in general? Thanks in advance, I'm grateful for any help! Max submitted by /u/Jollifresh [link] [comments]  ( 1 min )
    What are AIs that I can use to edit funny videos or make funny stuff?
    Second question: Is there a website that categorizes all AIs so you can see what each AI was programmed for? submitted by /u/xXLisa28Xx [link] [comments]
    How can I make similar videos where I can give an AI guy a starting question/phrase to start the conversation?
    https://www.youtube.com/watch?v=WnzlbyTZsQY submitted by /u/xXNOdrugsForMEXx [link] [comments]
    Literary AI
    submitted by /u/estasfuera [link] [comments]
    What does it mean when an AI fails? A Reply to SlateStarCodex’s riff on Gary Marcus
    submitted by /u/estasfuera [link] [comments]  ( 1 min )
    THE END IS HERE | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]
    Short AI Demo
    This is just a short demo of a method I am working on to make Disco Diffusion videos in 10% of the normal time and with more coherence. https://www.youtube.com/watch?v=5dfHz9Rvjj4 submitted by /u/prfitofthesngularity [link] [comments]
    2 new videos
    ​ I posted 2 new videos today, one is part 4 of a tutorial series for disco diffusion and the other is a Music Video I made for one of my songs that has AI vocals and 98 images from my dailys and a 40 second AI video at the end that I made using several programs with a technique I am still working on. My dailys are often post edited and the best of my renders so they are not just random renders. ​ ​ https://www.youtube.com/watch?v=NPKM0eUpwC4&t ​ https://www.youtube.com/watch?v=motUk8UgPUE https://preview.redd.it/wf0x0pfwxp491.png?width=1280&format=png&auto=webp&s=5b3a2ff572b429d51fd243c04ce7e5545f0ca37e https://preview.redd.it/8s26bofwxp491.png?width=768&format=png&auto=webp&s=d99151a79ae7b1d4a64401da76821415621c6d99 submitted by /u/prfitofthesngularity [link] [comments]  ( 1 min )
    I learned how to get around DALL-E Mini traffic so you don't have to.
    submitted by /u/laul_pogan [link] [comments]
    MT. OLYMPOS MAJESTY | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]
    Kyler Key - Wonders (AI Generated Art)
    submitted by /u/Kyler_Key [link] [comments]
  • Open

    Stochastic Deep RL environment [D]
    What are common stochastic deep RL environments? Atari and mujoco both have deterministic transitions. Can someone point to papers/references I could look up to find what common benchmarks for stochastic deep RL are? submitted by /u/jhoveen1 [link] [comments]
    [D] L2 Regularization on Generator Output (GAN)
    Hello, I am developing a GAN-like model with the purpose of finding optimum noise distributions to sample from and introduce to a medium so that any classification model that takes samples from that medium will fail to train. Because the zero-sum game GANs play is similar to this game of generator-classifier fight, my hope is that generator output distribution will converge towards the optimum noise distribution so that the classification model will actually fail. The question is, is there a correct way to limit the output my GAN generator actually generates? My initial thought was to add an L2 norm of generator outputs to the loss function of the generator, kind of like how we do L2 norm regularization for model weights. But as I trained, I realized that changing the coefficient of this L2 norm term doesn't seem to affect the norm of the output generated by the generator. Is this idea fundamentally flawed? Or is there any other method you can suggest that might work better? Thank you submitted by /u/egesko [link] [comments]  ( 1 min )
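The idea in the post above can be sketched in plain Python. This is only an illustration under stated assumptions: `generator_loss`, `project_to_ball`, and the coefficient name `coeff` are hypothetical and do not come from the poster's code. It shows the soft penalty described in the post (adversarial loss plus a weighted mean squared L2 norm of the generated samples) next to a hard alternative that rescales each sample onto a norm ball, which bounds the output regardless of how the loss terms balance.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def generator_loss(adversarial_loss, generated_batch, coeff):
    """Soft constraint: adversarial loss plus coeff times the mean squared
    L2 norm of the generated samples (hypothetical helper)."""
    penalty = sum(sum(v * v for v in sample) for sample in generated_batch)
    penalty /= len(generated_batch)
    return adversarial_loss + coeff * penalty


def project_to_ball(sample, max_norm):
    """Hard constraint: rescale a sample so its L2 norm is at most max_norm."""
    norm = l2_norm(sample)
    if norm <= max_norm or norm == 0.0:
        return list(sample)
    scale = max_norm / norm
    return [v * scale for v in sample]


loss = generator_loss(1.0, [[3.0, 4.0]], coeff=0.1)
clipped = project_to_ball([3.0, 4.0], max_norm=1.0)
```

If the penalty term appears to have no effect, one thing worth checking is its magnitude relative to the adversarial term; a hard projection sidesteps that balance entirely, at the cost of changing the generator's output layer.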
    [D] How to predict on anonymous dataset?
    So, I have a dataset where both the train and test data have a huge chunk of columns with no descriptions. I have to predict labels (1/0) based on the train dataset. But as there is no description of the dataset, I am unable to understand the correlation between the target and the other variables. What should I do? submitted by /u/Hasan_Shanto [link] [comments]  ( 1 min )
    [D] Use of (machine learning + Game engines) for automatic 2D/3D content creation
    Hello Everyone! Game engines such as "unreal engine" and "unity3D" are able to create content that looks and behaves pretty realistically. Therefore, I was wondering if there are any use cases or examples of machine learning being used to create 2D/3D content automatically/efficiently with game engines, for example using machine learning + game engines to create product-specific advertisements automatically. Please feel free to share if you are aware of any relevant links or resources. Thanks! submitted by /u/Ok_Cardiologist8306 [link] [comments]  ( 2 min )
    [D] Third Party Model Validation
    Hi Everyone, I am working on a project to validate an XGBoost model developed by another team. Is there any guide or tutorial on how I could navigate through the project and validate the model? Should I be using synthetic data or request the team to provide unseen data? Any information would be helpful. Thank you submitted by /u/Professional-Ad-776 [link] [comments]  ( 1 min )
    [D] Has the algorithm from 'Testing the Manifold Hypothesis' been implemented by anyone?
    The paper I'm referring to is Testing the Manifold Hypothesis by Fefferman et al. In this paper, I believe they outline a hypothetical algorithm that tests whether a given dataset satisfies the manifold hypothesis for some specific class of manifolds. In this 10-year-old presentation, the authors said future work was to "make practical and test on real data." Now that 10 years have passed, has this algorithm been implemented? submitted by /u/wowAmaze [link] [comments]  ( 1 min )
  • Open

    How to Get Started on an Ontology Without Really Trying
    Ontology Hack – Make Use of Existing Enterprise Data Assets Instead of Starting from Scratch As an author of a (reasonably) popular book, I often get asked questions about semantics, ontology, and knowledge graph by people who have read the book or perhaps have heard me speak at a conference. I quite welcome these questions… The post How to Get Started on an Ontology Without Really Trying appeared first on Data Science Central.  ( 6 min )
    7 Completely Insane Predictions of Technology in 2022
    As 2022 starts, it’s time to ponder how our lives will change in the years ahead, thanks to the multitude of technologies available to us. Here’s a look at how the world of technology might change the way we live in the year 2022. Every year, technology advances at a breakneck pace, bringing new ideas… The post 7 Completely Insane Predictions of Technology in 2022 appeared first on Data Science Central.  ( 7 min )
  • Open

    Use AWS AI and ML services to foster accessibility and inclusion of people with a visual or communication impairment
    AWS offers a broad set of artificial intelligence (AI) and machine learning (ML) services, including a suite of pre-trained, ready-to-use services for developers with no prior ML experience. In this post, we demonstrate how to use such services to build an application that fosters the inclusion of people with a visual or communication impairment, which […]  ( 10 min )
  • Open

    The Ultimate 2022 Python Roadmap For Everyone With Resources!
    If you want to become a Web-Developer, Machine Learning and Deep Learning Engineer, Data Scientist, DevOps Engineer, and more using Python…  ( 8 min )
    How is AI Reshaping Our Future? An Apparent Opinion
    AI is going to change a lot of things you can imagine.  ( 13 min )
  • Open

    From Code to Clinic, Smart Hospital Tech Boosts Efficiency, Sustainability in Medicine
    NVIDIA is collaborating with clinical organizations across Europe to bring AI to the point of care, bolstering clinical pathways with efficiency gains and new data dimensions that can be included in medical decision-making processes. The University Hospital Essen, in northwestern Germany, is one such organization taking machine learning from the bits to the bedside — Read article > The post From Code to Clinic, Smart Hospital Tech Boosts Efficiency, Sustainability in Medicine appeared first on NVIDIA Blog.  ( 4 min )
  • Open

    Penn Engineers Develop a New Chip Using a Deep Neural Network of Optical Waveguides That Can Classify Nearly 2 Billion Images Per Second
    👉 Using a deep neural network of optical waveguides, a new chip developed by Penn engineers, smaller than a square centimeter, can detect and classify an image in less than a nanosecond, all without the need for a separate processor or memory unit. 👉 They achieved this by directly processing light received from the object of interest using an optical deep neural network implemented on a 9.3 square millimeter chip. The study, published in Nature, explains how the chip’s many optical neurons are linked together using optical wires or “waveguides” to construct a deep network of many “neuron layers” that resembles the human brain. Information flows across the network’s layers, with each step assisting in classifying the input image into one of the learned categories. The images classified by the chip in the study were hand-drawn, letter-like characters. Continue reading | Check out the paper and post submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
  • Open

    Revisiting End-to-End Speech-to-Text Translation From Scratch. (arXiv:2206.04571v1 [cs.CL])
    End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder using source transcripts via speech recognition or text translation tasks, without which translation performance drops substantially. However, transcripts are not always available, and how significant such pretraining is for E2E ST has rarely been studied in the literature. In this paper, we revisit this question and explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved. We reexamine several techniques proven beneficial to ST previously, and offer a set of best practices that biases a Transformer-based E2E ST system toward training from scratch. Besides, we propose parameterized distance penalty to facilitate the modeling of locality in the self-attention model for speech. On four benchmarks covering 23 languages, our experiments show that, without using any transcripts or pretraining, the proposed system reaches and even outperforms previous studies adopting pretraining, although the gap remains in (extremely) low-resource settings. Finally, we discuss neural acoustic feature modeling, where a neural model is designed to extract acoustic features from raw speech signals directly, with the goal to simplify inductive biases and add freedom to the model in describing speech. For the first time, we demonstrate its feasibility and show encouraging results on ST tasks.  ( 2 min )
    Temporal Logic Imitation: Learning Plan-Satisficing Motion Policies from Demonstrations. (arXiv:2206.04632v1 [cs.RO])
    Learning from demonstration (LfD) methods have shown promise for solving multi-step tasks; however, these approaches do not guarantee successful reproduction of the task given disturbances. In this work, we identify the roots of such a challenge as the failure of the learned continuous policy to satisfy the discrete plan implicit in the demonstration. By utilizing modes (rather than subgoals) as the discrete abstraction and motion policies with both mode invariance and goal reachability properties, we prove our learned continuous policy can simulate any discrete plan specified by a Linear Temporal Logic (LTL) formula. Consequently, the imitator is robust to both task- and motion-level disturbances and guaranteed to achieve task success. Project page: https://sites.google.com/view/ltl-ds  ( 2 min )
    Diagnosing Ensemble Few-Shot Classifiers. (arXiv:2206.04372v1 [cs.LG])
    The base learners and labeled samples (shots) in an ensemble few-shot classifier greatly affect the model performance. When the performance is not satisfactory, it is usually difficult to understand the underlying causes and make improvements. To tackle this issue, we propose a visual analysis method, FSLDiagnotor. Given a set of base learners and a collection of samples with a few shots, we consider two problems: 1) finding a subset of base learners that well predict the sample collections; and 2) replacing the low-quality shots with more representative ones to adequately represent the sample collections. We formulate both problems as sparse subset selection and develop two selection algorithms to recommend appropriate learners and shots, respectively. A matrix visualization and a scatterplot are combined to explain the recommended learners and shots in context and facilitate users in adjusting them. Based on the adjustment, the algorithm updates the recommendation results for another round of improvement. Two case studies are conducted to demonstrate that FSLDiagnotor helps build a few-shot classifier efficiently and increases the accuracy by 12% and 21%, respectively.  ( 2 min )
    Mitigating Modality Collapse in Multimodal VAEs via Impartial Optimization. (arXiv:2206.04496v1 [cs.LG])
    A number of variational autoencoders (VAEs) have recently emerged with the aim of modeling multimodal data, e.g., to jointly model images and their corresponding captions. Still, multimodal VAEs tend to focus solely on a subset of the modalities, e.g., by fitting the image while neglecting the caption. We refer to this limitation as modality collapse. In this work, we argue that this effect is a consequence of conflicting gradients during multimodal VAE training. We show how to detect the sub-graphs in the computational graphs where gradients conflict (impartiality blocks), as well as how to leverage existing gradient-conflict solutions from multitask learning to mitigate modality collapse. That is, to ensure impartial optimization across modalities. We apply our training framework to several multimodal VAE models, losses and datasets from the literature, and empirically show that our framework significantly improves the reconstruction performance, conditional generation, and coherence of the latent space across modalities.  ( 2 min )
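A minimal sketch of what "conflicting gradients" means in the abstract above, under assumptions: the abstract does not say which multitask gradient-conflict method is used, so the projection below follows the PCGrad-style rule from the multitask literature as one example, and the function names are hypothetical. Two per-modality gradients are treated as conflicting when their dot product is negative; the projection removes from one gradient its component along the other.

```python
def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def gradients_conflict(g1, g2):
    """Two gradients conflict when they point in opposing directions."""
    return dot(g1, g2) < 0.0


def project_away(g, other):
    """PCGrad-style projection: if g conflicts with other, subtract g's
    component along other; otherwise return g unchanged."""
    d = dot(g, other)
    if d >= 0.0:
        return list(g)
    denom = dot(other, other)
    return [x - (d / denom) * y for x, y in zip(g, other)]


g_image = [1.0, 0.0]     # hypothetical gradient from the image modality
g_caption = [-1.0, 1.0]  # hypothetical gradient from the caption modality
g_fixed = project_away(g_image, g_caption)
```

After projection, `g_fixed` is orthogonal to `g_caption`, so a step along it no longer works directly against the other modality's loss to first order.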
    There is no Accuracy-Interpretability Tradeoff in Reinforcement Learning for Mazes. (arXiv:2206.04266v1 [cs.LG])
    Interpretability is an essential building block for trustworthiness in reinforcement learning systems. However, interpretability might come at the cost of deteriorated performance, leading many researchers to build complex models. Our goal is to analyze the cost of interpretability. We show that in certain cases, one can achieve policy interpretability while maintaining its optimality. We focus on a classical problem from reinforcement learning: mazes with $k$ obstacles in $\mathbb{R}^d$. We prove the existence of a small decision tree with a linear function at each inner node and depth $O(\log k + 2^d)$ that represents an optimal policy. Note that for the interesting case of a constant $d$, we have $O(\log k)$ depth. Thus, in this setting, there is no accuracy-interpretability tradeoff. To prove this result, we use a new "compressing" technique that might be useful in additional settings.  ( 2 min )
    DiSparse: Disentangled Sparsification for Multitask Model Compression. (arXiv:2206.04662v1 [cs.CV])
    Despite the popularity of Model Compression and Multitask Learning, how to effectively compress a multitask model has been less thoroughly analyzed due to the challenging entanglement of tasks in the parameter space. In this paper, we propose DiSparse, a simple, effective, and first-of-its-kind multitask pruning and sparse training scheme. We consider each task independently by disentangling the importance measurement and take the unanimous decisions among all tasks when performing parameter pruning and selection. Our experimental results demonstrate superior performance on various configurations and settings compared to popular sparse training and pruning methods. Besides the effectiveness in compression, DiSparse also provides a powerful tool to the multitask learning community. Surprisingly, we even observed better performance than some dedicated multitask learning methods in several cases despite the high model sparsity enforced by DiSparse. We analyzed the pruning masks generated with DiSparse and observed strikingly similar sparse network architecture identified by each task even before the training starts. We also observe the existence of a "watershed" layer where the task relatedness sharply drops, implying no benefits in continued parameters sharing. Our code and models will be available at: https://github.com/SHI-Labs/DiSparse-Multitask-Model-Compression.  ( 2 min )
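One reading of "unanimous decisions among all tasks" can be sketched as follows. This is an assumption-laden illustration, not the released DiSparse code: `unanimous_keep_mask` is a hypothetical helper, and each task is assumed to have already produced its own keep/prune mask over the shared parameters. A shared parameter is pruned only when every task independently votes to prune it.

```python
def unanimous_keep_mask(task_keep_masks):
    """Combine per-task masks over shared parameters.

    task_keep_masks: list of equal-length lists of bools, one list per task,
    where True means that task wants to keep the parameter.
    Returns a keep mask in which a parameter is pruned (False) only if all
    tasks agree to prune it.
    """
    n_params = len(task_keep_masks[0])
    return [any(mask[i] for mask in task_keep_masks) for i in range(n_params)]


segmentation_mask = [True, False, False]  # hypothetical task A decision
depth_mask = [False, False, True]         # hypothetical task B decision
shared_mask = unanimous_keep_mask([segmentation_mask, depth_mask])
```

Here only the middle parameter is pruned, because it is the only one that both tasks consider unimportant.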
    ESCHER: Eschewing Importance Sampling in Games by Computing a History Value Function to Estimate Regret. (arXiv:2206.04122v1 [cs.GT])
    Recent techniques for approximating Nash equilibria in very large games leverage neural networks to learn approximately optimal policies (strategies). One promising line of research uses neural networks to approximate counterfactual regret minimization (CFR) or its modern variants. DREAM, the only current CFR-based neural method that is model free and therefore scalable to very large games, trains a neural network on an estimated regret target that can have extremely high variance due to an importance sampling term inherited from Monte Carlo CFR (MCCFR). In this paper we propose an unbiased model-free method that does not require any importance sampling. Our method, ESCHER, is principled and is guaranteed to converge to an approximate Nash equilibrium with high probability in the tabular case. We show that the variance of the estimated regret of a tabular version of ESCHER with an oracle value function is significantly lower than that of outcome sampling MCCFR and tabular DREAM with an oracle value function. We then show that a deep learning version of ESCHER outperforms the prior state of the art -- DREAM and neural fictitious self play (NFSP) -- and the difference becomes dramatic as game size increases.  ( 2 min )
    Denoising Diffusion Implicit Models. (arXiv:2010.02502v3 [cs.LG] UPDATED)
    Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples $10 \times$ to $50 \times$ faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.  ( 2 min )
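The sampling update behind the abstract above can be sketched on a single scalar. The sketch assumes the usual notation in which `alpha_t` and `alpha_prev` are cumulative noise-schedule products at the current and previous steps, and it takes the noise prediction `eps_pred` as an input rather than computing it with a network; the function name is hypothetical. With `sigma = 0` the step is deterministic, which is what allows large jumps between timesteps.

```python
import math


def ddim_step(x_t, eps_pred, alpha_t, alpha_prev, sigma=0.0, noise=0.0):
    """One DDIM-style reverse step on a scalar.

    x_t        : current noisy value
    eps_pred   : predicted noise at this step (would come from the model)
    alpha_t    : cumulative schedule product at the current step
    alpha_prev : cumulative schedule product at the target (earlier) step
    sigma      : stochasticity; 0.0 gives the deterministic update
    noise      : a standard normal draw, only used when sigma > 0
    """
    # Predicted clean sample implied by x_t and the noise estimate.
    x0_pred = (x_t - math.sqrt(1.0 - alpha_t) * eps_pred) / math.sqrt(alpha_t)
    # Coefficient of the direction pointing back toward x_t.
    dir_coeff = math.sqrt(max(1.0 - alpha_prev - sigma ** 2, 0.0))
    return math.sqrt(alpha_prev) * x0_pred + dir_coeff * eps_pred + sigma * noise


x0, eps = 2.0, 0.5
x_t = math.sqrt(0.5) * x0 + math.sqrt(0.5) * eps
x_prev = ddim_step(x_t, eps, alpha_t=0.5, alpha_prev=0.9)
```

When the noise prediction is exact, the deterministic step lands on the value that the forward process would assign at the earlier noise level, so skipping intermediate steps loses nothing in this idealized case.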
    Unsupervised Pre-Training on Patient Population Graphs for Patient-Level Predictions. (arXiv:2203.12616v2 [cs.LG] UPDATED)
    Pre-training has shown success in different areas of machine learning, such as Computer Vision (CV), Natural Language Processing (NLP) and medical imaging. However, it has not been fully explored for clinical data analysis. Even though an immense amount of Electronic Health Record (EHR) data is recorded, data and labels can be scarce if the data is collected in small hospitals or deals with rare diseases. In such scenarios, pre-training on a larger set of EHR data could improve the model performance. In this paper, we apply unsupervised pre-training to heterogeneous, multi-modal EHR data for patient outcome prediction. To model this data, we leverage graph deep learning over population graphs. We first design a network architecture based on graph transformer designed to handle various input feature types occurring in EHR data, like continuous, discrete, and time-series features, allowing better multi-modal data fusion. Further, we design pre-training methods based on masked imputation to pre-train our network before fine-tuning on different end tasks. Pre-training is done in a fully unsupervised fashion, which lays the groundwork for pre-training on large public datasets with different tasks and similar modalities in the future. We test our method on two medical datasets of patient records, TADPOLE and MIMIC-III, including imaging and non-imaging features and different prediction tasks. We find that our proposed graph based pre-training method helps in modeling the data at a population level and further improves performance on the fine tuning tasks in terms of AUC on average by 4.15% for MIMIC and 7.64% for TADPOLE.  ( 2 min )
    GCVAE: Generalized-Controllable Variational AutoEncoder. (arXiv:2206.04225v1 [stat.ML])
    Variational autoencoders (VAEs) have recently been used for unsupervised disentanglement learning of complex density distributions. Numerous variants exist to encourage disentanglement in latent space while improving reconstruction. However, none have simultaneously managed the trade-off between attaining extremely low reconstruction error and a high disentanglement score. We present a generalized framework to handle this challenge under constrained optimization and demonstrate that it outperforms state-of-the-art existing models as regards disentanglement while balancing reconstruction. We introduce three controllable Lagrangian hyperparameters to control reconstruction loss, KL divergence loss and correlation measure. We prove that maximizing information in the reconstruction network is equivalent to information maximization during amortized inference under reasonable assumptions and constraint relaxation.  ( 2 min )
    Balanced background and explanation data are needed in explaining deep learning models with SHAP: An empirical study on clinical decision making. (arXiv:2206.04050v1 [cs.LG])
    Objective: Shapley additive explanations (SHAP) is a popular post-hoc technique for explaining black box models. While the impact of data imbalance on predictive models has been extensively studied, it remains largely unknown with respect to SHAP-based model explanations. This study sought to investigate the effects of data imbalance on SHAP explanations for deep learning models, and to propose a strategy to mitigate these effects. Materials and Methods: We propose to adjust class distributions in the background and explanation data in SHAP when explaining black box models. Our data balancing strategy is to compose background data and explanation data with an equal distribution of classes. To evaluate the effects of data adjustment on model explanation, we propose to use the beeswarm plot as a qualitative tool to identify "abnormal" explanation artifacts, and quantitatively test the consistency between variable importance and prediction power. We demonstrated our proposed approach in an empirical study that predicted inpatient mortality using the Medical Information Mart for Intensive Care (MIMIC-III) data and a multilayer perceptron. Results: Using the data balancing strategy would allow us to reduce the number of the artifacts in the beeswarm plot, thus mitigating the negative effects of data imbalance. Additionally, with the balancing strategy, the top-ranked variables from the corresponding importance ranking demonstrated improved discrimination power. Discussion and Conclusion: Our findings suggest that balanced background and explanation data could help reduce the noise in explanation results induced by skewed data distribution and improve the reliability of variable importance ranking. Furthermore, these balancing procedures improve the potential of SHAP in identifying patients with abnormal characteristics in clinical applications.  ( 2 min )
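The balancing strategy described above (background and explanation data composed with an equal distribution of classes) can be sketched with the standard library alone. The helper name `balanced_sample` and its parameters are hypothetical, and the sketch stops at producing the balanced subset; passing it to a SHAP explainer is left out.

```python
import random
from collections import defaultdict


def balanced_sample(rows, labels, per_class, seed=0):
    """Draw the same number of rows from every class.

    rows      : list of feature rows (any objects)
    labels    : list of class labels aligned with rows
    per_class : number of rows to draw from each class
    Returns (sampled_rows, sampled_labels).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    sampled_rows, sampled_labels = [], []
    for label in sorted(by_class):
        members = by_class[label]
        if len(members) < per_class:
            raise ValueError(f"class {label!r} has only {len(members)} rows")
        sampled_rows.extend(rng.sample(members, per_class))
        sampled_labels.extend([label] * per_class)
    return sampled_rows, sampled_labels


# Toy imbalanced data: 10 positive rows out of 100.
rows = list(range(100))
labels = [1 if i < 10 else 0 for i in rows]
background, background_labels = balanced_sample(rows, labels, per_class=5)
```

The minority class caps how large a balanced subset can be without replacement, which is why the helper raises an error instead of silently returning a skewed sample.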
    Automatic Debiased Machine Learning for Dynamic Treatment Effects and General Nested Functionals. (arXiv:2203.13887v3 [econ.EM] UPDATED)
    We extend the idea of automated debiased machine learning to the dynamic treatment regime and more generally to nested functionals. We show that the multiply robust formula for the dynamic treatment regime with discrete treatments can be re-stated in terms of a recursive Riesz representer characterization of nested mean regressions. We then apply a recursive Riesz representer estimation learning algorithm that estimates de-biasing corrections without the need to characterize what the correction terms look like (for instance, products of inverse probability weighting terms), as is done in prior work on doubly robust estimation in the dynamic regime. Our approach defines a sequence of loss minimization problems, whose minimizers are the multipliers of the de-biasing correction, hence circumventing the need for solving auxiliary propensity models and directly optimizing for the mean squared error of the target de-biasing correction. We provide further applications of our approach to estimation of dynamic discrete choice models.  ( 2 min )
    Deep Surrogate Assisted Generation of Environments. (arXiv:2206.04199v1 [cs.AI])
    Recent progress in reinforcement learning (RL) has started producing generally capable agents that can solve a distribution of complex environments. These agents are typically tested on fixed, human-authored environments. On the other hand, quality diversity (QD) optimization has been proven to be an effective component of environment generation algorithms, which can generate collections of high-quality environments that are diverse in the resulting agent behaviors. However, these algorithms require potentially expensive simulations of agents on newly generated environments. We propose Deep Surrogate Assisted Generation of Environments (DSAGE), a sample-efficient QD environment generation algorithm that maintains a deep surrogate model for predicting agent behaviors in new environments. Results in two benchmark domains show that DSAGE significantly outperforms existing QD environment generation algorithms in discovering collections of environments that elicit diverse behaviors of a state-of-the-art RL agent and a planning agent.  ( 2 min )
    Choosing Answers in $\varepsilon$-Best-Answer Identification for Linear Bandits. (arXiv:2206.04456v1 [stat.ML])
    In pure-exploration problems, information is gathered sequentially to answer a question on the stochastic environment. While best-arm identification for linear bandits has been extensively studied in recent years, few works have been dedicated to identifying one arm that is $\varepsilon$-close to the best one (and not exactly the best one). In this problem with several correct answers, an identification algorithm should focus on one candidate among those answers and verify that it is correct. We demonstrate that picking the answer with highest mean does not allow an algorithm to reach asymptotic optimality in terms of expected sample complexity. Instead, a \textit{furthest answer} should be identified. Using that insight to choose the candidate answer carefully, we develop a simple procedure to adapt best-arm identification algorithms to tackle $\varepsilon$-best-answer identification in transductive linear stochastic bandits. Finally, we propose an asymptotically optimal algorithm for this setting, which is shown to achieve competitive empirical performance against existing modified best-arm identification algorithms.  ( 2 min )
    Factuality Enhanced Language Models for Open-Ended Text Generation. (arXiv:2206.04624v1 [cs.CL])
    Pretrained language models (LMs) are susceptible to generate text with nonfactual information. In this work, we measure and improve the factual accuracy of large-scale LMs for open-ended text generation. We design the FactualityPrompts test set and metrics to measure the factuality of LM generations. Based on that, we study the factual accuracy of LMs with parameter sizes ranging from 126M to 530B. Interestingly, we find that larger LMs are more factual than smaller ones, although a previous study suggests that larger LMs can be less truthful in terms of misconceptions. In addition, popular sampling algorithms (e.g., top-p) in open-ended text generation can harm the factuality due to the "uniform randomness" introduced at every sampling step. We propose the factual-nucleus sampling algorithm that dynamically adapts the randomness to improve the factuality of generation while maintaining quality. Furthermore, we analyze the inefficiencies of the standard training method in learning correct associations between entities from factual text corpus (e.g., Wikipedia). We propose a factuality-enhanced training method that uses TopicPrefix for better awareness of facts and sentence completion as the training objective, which can vastly reduce the factual errors.  ( 2 min )
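For readers unfamiliar with top-p sampling, the sketch below shows the standard nucleus filter plus one way the randomness could be adapted over a sentence. The decay schedule in `decayed_top_p` (its form and its `p`, `decay`, and `floor` parameters) is a hypothetical illustration of "dynamically adapts the randomness"; the abstract does not specify the schedule, so it should not be read as the paper's algorithm.

```python
def nucleus_filter(probs, top_p):
    """Standard top-p filter: keep the smallest set of highest-probability
    tokens whose total mass reaches top_p, then renormalize.
    Returns {token_index: renormalized_probability}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return {i: probs[i] / total for i in kept}


def decayed_top_p(step_in_sentence, p=0.9, decay=0.9, floor=0.3):
    """Hypothetical schedule: shrink top_p as a sentence progresses, never
    going below floor; the step counter would reset at each new sentence."""
    return max(floor, p * decay ** step_in_sentence)


probs = [0.5, 0.3, 0.15, 0.05]
early = nucleus_filter(probs, decayed_top_p(0))   # wide nucleus at sentence start
late = nucleus_filter(probs, decayed_top_p(20))   # narrow nucleus later on
```

A smaller `top_p` later in a sentence leaves fewer candidate tokens, which reduces the chance of sampling a low-probability continuation.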
    VideoINR: Learning Video Implicit Neural Representation for Continuous Space-Time Super-Resolution. (arXiv:2206.04647v1 [eess.IV])
    Videos typically record the streaming and continuous visual data as discrete consecutive frames. Since the storage cost is expensive for videos of high fidelity, most of them are stored in a relatively low resolution and frame rate. Recent works of Space-Time Video Super-Resolution (STVSR) are developed to incorporate temporal interpolation and spatial super-resolution in a unified framework. However, most of them only support a fixed up-sampling scale, which limits their flexibility and applications. In this work, instead of following the discrete representations, we propose Video Implicit Neural Representation (VideoINR), and we show its applications for STVSR. The learned implicit neural representation can be decoded to videos of arbitrary spatial resolution and frame rate. We show that VideoINR achieves competitive performances with state-of-the-art STVSR methods on common up-sampling scales and significantly outperforms prior works on continuous and out-of-training-distribution scales. Our project page is at this http URL .
    Reinforced Inverse Scattering. (arXiv:2206.04186v1 [cs.LG])
    Inverse wave scattering aims at determining the properties of an object using data on how the object scatters incoming waves. In order to collect information, sensors are put in different locations to send and receive waves from each other. The choice of sensor positions and incident wave frequencies determines the reconstruction quality of scatterer properties. This paper introduces reinforcement learning to develop precision imaging that decides sensor positions and wave frequencies adaptive to different scatterers in an intelligent way, thus obtaining a significant improvement in reconstruction quality with limited imaging resources. Extensive numerical results will be provided to demonstrate the superiority of the proposed method over existing methods.
    Learning Invariant Representations with Missing Data. (arXiv:2112.00881v2 [cs.LG] UPDATED)
    Spurious correlations allow flexible models to predict well during training but poorly on related test distributions. Recent work has shown that models that satisfy particular independencies involving correlation-inducing nuisance variables have guarantees on their test performance. Enforcing such independencies requires nuisances to be observed during training. However, nuisances, such as demographics or image background labels, are often missing. Enforcing independence on just the observed data does not imply independence on the entire population. Here we derive maximum mean discrepancy (MMD) estimators used for invariance objectives under missing nuisances. On simulations and clinical data, optimizing through these estimates achieves test performance similar to using estimators that make use of the full data.
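    As a point of reference for the MMD quantity named above, the following is a minimal sketch of the standard biased (V-statistic) estimate of squared MMD with a Gaussian kernel. It assumes fully observed samples and does not implement the paper's estimators for missing nuisances; the function names, bandwidth, and toy samples are illustrative assumptions.

```python
import math

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two fully observed samples."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )

# Toy one-dimensional samples (hypothetical data).
sample = [[0.0], [0.5], [1.0]]
shifted_sample = [[3.0], [3.5], [4.0]]
```

    Comparing `mmd_squared(sample, sample)` with `mmd_squared(sample, shifted_sample)` shows the estimate vanishing for identical samples and growing as the two samples move apart.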
    Towards Understanding Graph Neural Networks: An Algorithm Unrolling Perspective. (arXiv:2206.04471v1 [cs.LG])
    The graph neural network (GNN) has demonstrated its superior performance in various applications. The working mechanism behind it, however, remains mysterious. GNN models are designed to learn effective representations for graph-structured data, which intrinsically coincides with the principle of graph signal denoising (GSD). Algorithm unrolling, a "learning to optimize" technique, has gained increasing attention due to its prospects in building efficient and interpretable neural network architectures. In this paper, we introduce a class of unrolled networks built based on truncated optimization algorithms (e.g., gradient descent and proximal gradient descent) for GSD problems. They are shown to be tightly connected to many popular GNN models in that the forward propagations in these GNNs are in fact unrolled networks serving specific GSDs. Besides, the training process of a GNN model can be seen as solving a bilevel optimization problem with a GSD problem at the lower level. Such a connection brings a fresh view of GNNs, as we could try to understand their practical capabilities from their GSD counterparts, and it can also motivate designing new GNN models. Based on the algorithm unrolling perspective, an expressive model named UGDGNN, i.e., unrolled gradient descent GNN, is further proposed which inherits appealing theoretical properties. Extensive numerical simulations on seven benchmark datasets demonstrate that UGDGNN can achieve superior or competitive performance over the state-of-the-art models.
    HideNseek: Federated Lottery Ticket via Server-side Pruning and Sign Supermask. (arXiv:2206.04385v1 [cs.LG])
    Federated learning alleviates the privacy risk in distributed learning by transmitting only the local model updates to the central server. However, it faces challenges including statistical heterogeneity of clients' datasets and resource constraints of client devices, which severely impact the training performance and user experience. Prior works have tackled these challenges by combining personalization with model compression schemes including quantization and pruning. However, the pruning is data-dependent and thus must be done on the client side, which incurs considerable computation cost. Moreover, the pruning normally trains a binary supermask $\in \{0, 1\}$, which significantly limits the model capacity while yielding no computational benefit. Consequently, the training requires high computation cost and a long time to converge, without a commensurate gain in model performance. In this work, we propose HideNseek, which employs one-shot data-agnostic pruning at initialization to get a subnetwork based on weights' synaptic saliency. Each client then optimizes a sign supermask $\in \{-1, +1\}$ multiplied by the unpruned weights to allow faster convergence with the same compression rates as the state of the art. Empirical results from three datasets demonstrate that, compared to the state of the art, HideNseek improves inference accuracy by up to 40.6\% while reducing the communication cost and training time by up to 39.7\% and 46.8\% respectively.
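    The interaction between the pruning mask and the sign supermask described above can be sketched as follows. This is only an illustration of the masking arithmetic, assuming frozen weights, a binary keep mask from pruning, and a per-weight sign in $\{-1, +1\}$; the function name and the toy values are hypothetical, and no training or saliency computation is shown.

```python
def effective_weights(frozen_weights, keep_mask, sign_mask):
    """Combine frozen weights with a pruning mask and a sign supermask.

    keep_mask entries are 0 (pruned) or 1 (kept); sign_mask entries are -1 or +1.
    """
    result = []
    for weight, keep, sign in zip(frozen_weights, keep_mask, sign_mask):
        if sign not in (-1, 1):
            raise ValueError("sign mask entries must be -1 or +1")
        # Pruned weights contribute nothing; kept weights may have their sign flipped.
        result.append(sign * weight if keep else 0.0)
    return result

# Toy example (hypothetical values): the second weight is pruned, the first is flipped.
weights = effective_weights([0.5, -0.2, 0.3], [1, 0, 1], [-1, 1, 1])
```

    Under these assumptions only the signs would be learned on the client, while the magnitudes and the pruning pattern stay fixed.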
    A Relational Intervention Approach for Unsupervised Dynamics Generalization in Model-Based Reinforcement Learning. (arXiv:2206.04551v1 [cs.LG])
    The generalization of model-based reinforcement learning (MBRL) methods to environments with unseen transition dynamics is an important yet challenging problem. Existing methods try to extract environment-specified information $Z$ from past transition segments to make the dynamics prediction model generalizable to different dynamics. However, because environments are not labelled, the extracted information inevitably contains redundant information unrelated to the dynamics in transition segments and thus fails to maintain a crucial property of $Z$: $Z$ should be similar in the same environment and dissimilar in different ones. As a result, the learned dynamics prediction function will deviate from the true one, which undermines the generalization ability. To tackle this problem, we introduce an interventional prediction module to estimate the probability of two estimated $\hat{z}_i, \hat{z}_j$ belonging to the same environment. Furthermore, by utilizing the invariance of $Z$ within a single environment, a relational head is proposed to enforce the similarity between $\hat{Z}$ from the same environment. As a result, the redundant information will be reduced in $\hat{Z}$. We empirically show that $\hat{Z}$ estimated by our method contains less redundant information than that of previous methods, and that such $\hat{Z}$ can significantly reduce dynamics prediction errors and improve the zero-shot performance of model-based RL methods on new environments with unseen dynamics. The code for this method is available at https://github.com/CR-Gjx/RIA.
    Distillation Decision Tree. (arXiv:2206.04661v1 [stat.ME])
    Black-box machine learning models are criticized as lacking interpretability, although they tend to have good prediction accuracy. Knowledge Distillation (KD) is an emerging tool to interpret the black-box model by distilling its knowledge into a transparent model. With well-known advantages in interpretation, the decision tree is a competitive candidate for the transparent model. However, theoretical or empirical understanding of the decision tree generated from the KD process is limited. In this paper, we name this kind of decision tree the distillation decision tree (DDT) and lay the theoretical foundations for tree structure stability, which determines the validity of DDT's interpretation. We prove that the structure of DDT can become stable (converge) under some mild assumptions. Meanwhile, we develop algorithms for stabilizing the induction of DDT, propose parallel strategies for improving the algorithms' computational efficiency, and introduce a marginal principal component analysis method for overcoming the curse of dimensionality in sampling. Simulated and real data studies justify our theoretical results, validate the efficacy of the algorithms, and demonstrate that DDT can strike a good balance between a model's prediction accuracy and interpretability.
    Responsible and Regulatory Conform Machine Learning for Medicine: A Survey of Challenges and Solutions. (arXiv:2107.09546v2 [cs.LG] UPDATED)
    Machine learning is expected to fuel significant improvements in medical care. To ensure that fundamental principles such as beneficence, respect for human autonomy, prevention of harm, justice, privacy, and transparency are respected, medical machine learning systems must be developed responsibly. Many high-level declarations of ethical principles have been put forth for this purpose, but there is a severe lack of technical guidelines explicating the practical consequences for medical machine learning. Similarly, there is currently considerable uncertainty regarding the exact regulatory requirements placed upon medical machine learning systems. This survey provides an overview of the technical and procedural challenges involved in creating medical machine learning systems responsibly and in conformity with existing regulations, as well as possible solutions to address these challenges. First, a brief review of existing regulations affecting medical machine learning is provided, showing that properties such as safety, robustness, reliability, privacy, security, transparency, explainability, and nondiscrimination are all demanded already by existing law and regulations - albeit, in many cases, to an uncertain degree. Next, the key technical obstacles to achieving these desirable properties are discussed, as well as important techniques to overcome these obstacles in the medical context. We notice that distribution shift, spurious correlations, model underspecification, uncertainty quantification, and data scarcity represent severe challenges in the medical context. Promising solution approaches include the use of large and representative datasets and federated learning as a means to that end, the careful exploitation of domain knowledge, the use of inherently transparent models, comprehensive out-of-distribution model testing and verification, as well as algorithmic impact assessments.
    RecoMed: A Knowledge-Aware Recommender System for Hypertension Medications. (arXiv:2201.05461v2 [cs.IR] UPDATED)
    Background and Objective: High medicine diversity has always been a significant challenge for prescription, causing confusion or doubt in physicians' decision-making process. This paper aims to develop a medicine recommender system called RecoMed to aid the physician in the prescription process of hypertension by providing information about what medications have been prescribed by other doctors and figuring out what other medicines can be recommended in addition to the one in question. Methods: There are two steps to the developed method: First, association rule mining algorithms are employed to find medicine association rules. The second step entails graph mining and clustering to present an enriched recommendation via ATC code, which itself comprises several steps. First, the initial graph is constructed from historical prescription data. Then, data pruning is performed in the second step, after which the medicines with a high repetition rate are removed at the discretion of a general medical practitioner. Next, the medicines are matched to a well-known medicine classification system called the ATC code to provide an enriched recommendation. And finally, the DBSCAN and Louvain algorithms cluster medicines in the final step. Results: A list of recommended medicines is provided as the system's output, and physicians can choose one or more of the medicines based on the patient's clinical symptoms. Only the medicines of class 2, related to high blood pressure medications, are used to assess the system's performance. The results obtained from this system have been reviewed and confirmed by an expert in this field.
    Russian Texts Detoxification with Levenshtein Editing. (arXiv:2204.13638v2 [cs.CL] UPDATED)
    Text detoxification is a style transfer task of creating neutral versions of toxic texts. In this paper, we use the concept of text editing to build a two-step tagging-based detoxification model using a parallel corpus of Russian texts. With this model, we achieved the best style transfer accuracy among all models in the RUSSE Detox shared task, surpassing larger sequence-to-sequence models.
    Regret Bounds for Information-Directed Reinforcement Learning. (arXiv:2206.04640v1 [cs.LG])
    Information-directed sampling (IDS) has revealed its potential as a data-efficient algorithm for reinforcement learning (RL). However, theoretical understanding of IDS for Markov Decision Processes (MDPs) is still limited. We develop novel information-theoretic tools to bound the information ratio and cumulative information gain about the learning target. Our theoretical results shed light on the importance of choosing the learning target such that the practitioners can balance the computation and regret bounds. As a consequence, we derive prior-free Bayesian regret bounds for vanilla-IDS which learns the whole environment under tabular finite-horizon MDPs. In addition, we propose a computationally-efficient regularized-IDS that maximizes an additive form rather than the ratio form and show that it enjoys the same regret bound as vanilla-IDS. With the aid of rate-distortion theory, we improve the regret bound by learning a surrogate, less informative environment. Furthermore, we extend our analysis to linear MDPs and prove similar regret bounds for Thompson sampling as a by-product.
    DORA: Exploring outlier representations in Deep Neural Networks. (arXiv:2206.04530v1 [cs.LG])
    Deep Neural Networks (DNNs) draw their power from the representations they learn. In recent years, however, researchers have found that DNNs, while being incredibly effective in learning complex abstractions, also tend to be infected with artifacts, such as biases, Clever Hanses (CH), or Backdoors, due to spurious correlations inherent in the training data. So far, existing methods for uncovering such artifactual and malicious behavior in trained models focus on finding artifacts in the input data, which requires both the availability of a data set and human intervention. In this paper, we introduce DORA (Data-agnOstic Representation Analysis): the first automatic data-agnostic method for the detection of potentially infected representations in Deep Neural Networks. We further show that contaminated representations found by DORA can be used to detect infected samples in any given dataset. We qualitatively and quantitatively evaluate the performance of our proposed method in both controlled toy scenarios and real-world settings, where we demonstrate the benefit of DORA in safety-critical applications.
    Model Degradation Hinders Deep Graph Neural Networks. (arXiv:2206.04361v1 [cs.LG])
    Graph Neural Networks (GNNs) have achieved great success in various graph mining tasks. However, drastic performance degradation is always observed when a GNN is stacked with many layers. As a result, most GNNs only have shallow architectures, which limits their expressive power and exploitation of deep neighborhoods. Most recent studies attribute the performance degradation of deep GNNs to the over-smoothing issue. In this paper, we disentangle the conventional graph convolution operation into two independent operations: Propagation (P) and Transformation (T). Following this, the depth of a GNN can be split into the propagation depth ($D_p$) and the transformation depth ($D_t$). Through extensive experiments, we find that the major cause for the performance degradation of deep GNNs is the model degradation issue caused by large $D_t$ rather than the over-smoothing issue mainly caused by large $D_p$. Further, we present Adaptive Initial Residual (AIR), a plug-and-play module compatible with all kinds of GNN architectures, to alleviate the model degradation issue and the over-smoothing issue simultaneously. Experimental results on six real-world datasets demonstrate that GNNs equipped with AIR outperform most GNNs with shallow architectures owing to the benefits of both large $D_p$ and $D_t$, while the time costs associated with AIR can be ignored.
    Understanding the unstable convergence of gradient descent. (arXiv:2204.01050v2 [math.OC] UPDATED)
    Most existing analyses of (stochastic) gradient descent rely on the condition that for $L$-smooth costs, the step size is less than $2/L$. However, many works have observed that in machine learning applications step sizes often do not fulfill this condition, yet (stochastic) gradient descent still converges, albeit in an unstable manner. We investigate this unstable convergence phenomenon from first principles, and discuss key causes behind it. We also identify its main characteristics, and how they interrelate based on both theory and experiments, offering a principled view toward understanding the phenomenon.  ( 2 min )
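    A minimal sketch of the classical $2/L$ threshold mentioned above, assuming a one-dimensional quadratic cost $f(x) = (L/2) x^2$ (an illustrative choice, not taken from the paper). On this cost each update multiplies the iterate by $(1 - \eta L)$, so the iterates shrink only when the step size $\eta$ is below $2/L$. The function name and constants are hypothetical, and the example does not reproduce the unstable-convergence regime the paper studies.

```python
def gradient_descent(step_size, smoothness=4.0, x0=1.0, num_steps=200):
    """Run gradient descent on f(x) = 0.5 * smoothness * x**2; return the final iterate."""
    x = x0
    for _ in range(num_steps):
        grad = smoothness * x          # f'(x) for the quadratic cost
        x = x - step_size * grad       # equivalent to x *= (1 - step_size * smoothness)
    return x

L = 4.0
below_threshold = gradient_descent(step_size=1.9 / L, smoothness=L)  # contraction factor -0.9
above_threshold = gradient_descent(step_size=2.1 / L, smoothness=L)  # expansion factor -1.1
```

    With the smaller step size the final iterate is close to the minimizer at zero, while the larger one blows up; on non-quadratic costs the behavior above $2/L$ can be far less clear-cut, which is the setting the abstract refers to.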
    Robust Inverse Framework using Knowledge-guided Self-Supervised Learning: An application to Hydrology. (arXiv:2109.06429v2 [cs.LG] UPDATED)
    Machine Learning is beginning to provide state-of-the-art performance in a range of environmental applications such as streamflow prediction in a hydrologic basin. However, building accurate broad-scale models for streamflow remains challenging in practice due to the variability in the dominant hydrologic processes, which are best captured by sets of process-related basin characteristics. Existing basin characteristics suffer from noise and uncertainty, among many other things, which adversely impact model performance. To tackle the above challenges, in this paper, we propose a novel Knowledge-guided Self-Supervised Learning (KGSSL) inverse framework to extract system characteristics from driver and response data. This first-of-its-kind framework achieves robust performance even when characteristics are corrupted. We show that KGSSL achieves state-of-the-art results for streamflow modeling for CAMELS (Catchment Attributes and MEteorology for Large-sample Studies) which is a widely used hydrology benchmark dataset. Specifically, KGSSL outperforms other methods by up to 16 \% in reconstructing characteristics. Furthermore, we show that KGSSL is relatively more robust to distortion than baseline methods, and outperforms the baseline model by 35\% when plugging in KGSSL inferred characteristics.  ( 2 min )
    Graph Attention MLP with Reliable Label Utilization. (arXiv:2108.10097v3 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have recently achieved state-of-the-art performance in many graph-based applications. Despite the high expressive power, they typically need to perform an expensive recursive neighborhood expansion in multiple training epochs and face a scalability issue. Moreover, most of them are inflexible since they are restricted to fixed-hop neighborhoods and insensitive to actual receptive field demands for different nodes. We circumvent these limitations by introducing a scalable and flexible Graph Attention Multilayer Perceptron (GAMLP). With the separation of the non-linear transformation and feature propagation, GAMLP significantly improves the scalability and efficiency by performing the propagation procedure in a pre-compute manner. With three principled receptive field attention mechanisms, each node in GAMLP is flexible and adaptive in leveraging the propagated features over different sizes of receptive field. We conduct extensive evaluations on three large open graph benchmarks (e.g., ogbn-papers100M, ogbn-products and ogbn-mag), demonstrating that GAMLP not only achieves state-of-the-art performance, but also provides high scalability and efficiency.  ( 2 min )
    Study of Feature Importance for Quantum Machine Learning Models. (arXiv:2202.11204v4 [quant-ph] UPDATED)
    Predictor importance is a crucial part of data preprocessing pipelines in classical and quantum machine learning (QML). This work presents the first study of its kind in which feature importance for QML models has been explored and contrasted against their classical machine learning (CML) equivalents. We developed a hybrid quantum-classical architecture where QML models are trained and feature importance values are calculated from classical algorithms on a real-world dataset. This architecture has been implemented on ESPN Fantasy Football data using Qiskit statevector simulators and IBM quantum hardware such as the IBMQ Mumbai and IBMQ Montreal systems. Even though we are in the Noisy Intermediate-Scale Quantum (NISQ) era, the physical quantum computing results are promising. To accommodate the current scale of quantum hardware, we created data tiering, model aggregation, and novel validation methods. Notably, the feature importance magnitudes from the quantum models had a much higher variation when contrasted with classical models. We show that equivalent QML and CML models are complementary through diversity measurements. The diversity between QML and CML demonstrates that both approaches can contribute to a solution in different ways. Within this paper we focus on Quantum Support Vector Classifiers (QSVC), Variational Quantum Circuit (VQC), and their classical counterparts. The ESPN and IBM fantasy football Trade Assistant combines advanced statistical analysis with the natural language processing of Watson Discovery to serve up personalized trade recommendations that are fair. Here, the valuation data of each player has been considered, and this work can be extended to calculate the feature importance of other QML models such as Quantum Boltzmann machines.  ( 2 min )
    A Psychological Theory of Explainability. (arXiv:2205.08452v2 [cs.AI] UPDATED)
    The goal of explainable Artificial Intelligence (XAI) is to generate human-interpretable explanations, but there are no computationally precise theories of how humans interpret AI-generated explanations. The lack of theory means that validation of XAI must be done empirically, on a case-by-case basis, which prevents systematic theory-building in XAI. We propose a psychological theory of how humans draw conclusions from saliency maps, the most common form of XAI explanation, which for the first time allows for precise prediction of explainee inference conditioned on explanation. Our theory posits that, absent explanation, humans expect the AI to make similar decisions to themselves, and that they interpret an explanation by comparison to the explanations they themselves would give. Comparison is formalized via Shepard's universal law of generalization in a similarity space, a classic theory from cognitive science. A pre-registered user study on AI image classifications with saliency map explanations demonstrates that our theory quantitatively matches participants' predictions of the AI.  ( 2 min )
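    Shepard's universal law of generalization, referenced above, states that perceived similarity decays exponentially with distance in a psychological similarity space. The sketch below illustrates that decay on toy vectors; the Euclidean distance, the `sensitivity` constant, and the flattened "saliency map" vectors are assumptions made for illustration and are not the paper's fitted model.

```python
import math

def shepard_similarity(a, b, sensitivity=1.0):
    """Similarity exp(-sensitivity * distance) between two points in a similarity space."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return math.exp(-sensitivity * distance)

# Hypothetical flattened saliency maps: the explanation a person would give,
# and two candidate AI explanations at different distances from it.
own_explanation = [0.9, 0.1, 0.0]
close_explanation = [0.8, 0.2, 0.0]
far_explanation = [0.0, 0.1, 0.9]

close_score = shepard_similarity(own_explanation, close_explanation)
far_score = shepard_similarity(own_explanation, far_explanation)
```

    Identical explanations receive similarity 1, and the score falls toward 0 as the AI's explanation departs from the one the observer would have given.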
    Optimal SQ Lower Bounds for Robustly Learning Discrete Product Distributions and Ising Models. (arXiv:2206.04589v1 [cs.DS])
    We establish optimal Statistical Query (SQ) lower bounds for robustly learning certain families of discrete high-dimensional distributions. In particular, we show that no efficient SQ algorithm with access to an $\epsilon$-corrupted binary product distribution can learn its mean within $\ell_2$-error $o(\epsilon \sqrt{\log(1/\epsilon)})$. Similarly, we show that no efficient SQ algorithm with access to an $\epsilon$-corrupted ferromagnetic high-temperature Ising model can learn the model to total variation distance $o(\epsilon \log(1/\epsilon))$. Our SQ lower bounds match the error guarantees of known algorithms for these problems, providing evidence that current upper bounds for these tasks are best possible. At the technical level, we develop a generic SQ lower bound for discrete high-dimensional distributions starting from low dimensional moment matching constructions that we believe will find other applications. Additionally, we introduce new ideas to analyze these moment-matching constructions for discrete univariate distributions.  ( 2 min )
    Strategic Instrumental Variable Regression: Recovering Causal Relationships From Strategic Responses. (arXiv:2107.05762v3 [cs.LG] UPDATED)
    In settings where Machine Learning (ML) algorithms automate or inform consequential decisions about people, individual decision subjects are often incentivized to strategically modify their observable attributes to receive more favorable predictions. As a result, the distribution the assessment rule is trained on may differ from the one it operates on in deployment. While such distribution shifts, in general, can hinder accurate predictions, our work identifies a unique opportunity associated with shifts due to strategic responses: We show that we can use strategic responses effectively to recover causal relationships between the observable features and outcomes we wish to predict, even under the presence of unobserved confounding variables. Specifically, our work establishes a novel connection between strategic responses to ML models and instrumental variable (IV) regression by observing that the sequence of deployed models can be viewed as an instrument that affects agents' observable features but does not directly influence their outcomes. We show that our causal recovery method can be utilized to improve decision-making across several important criteria: individual fairness, agent outcomes, and predictive risk. In particular, we show that if decision subjects differ in their ability to modify non-causal attributes, any decision rule deviating from the causal coefficients can lead to (potentially unbounded) individual-level unfairness.  ( 2 min )
    Overcoming the Spectral Bias of Neural Value Approximation. (arXiv:2206.04672v1 [cs.LG])
    Value approximation using deep neural networks is at the heart of off-policy deep reinforcement learning, and is often the primary module that provides learning signals to the rest of the algorithm. While multi-layer perceptron networks are universal function approximators, recent works in neural kernel regression suggest the presence of a spectral bias, where fitting high-frequency components of the value function requires exponentially more gradient update steps than the low-frequency ones. In this work, we re-examine off-policy reinforcement learning through the lens of kernel regression and propose to overcome such bias via a composite neural tangent kernel. With just a single-line change, our approach, the Fourier feature networks (FFN), produces state-of-the-art performance on challenging continuous control domains with only a fraction of the compute. Faster convergence and better off-policy stability also make it possible to remove the target network without suffering catastrophic divergences, which further reduces TD(0)'s estimation bias on a few tasks.  ( 2 min )
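    To make the idea of a Fourier feature input layer concrete, here is a minimal sketch of a random Fourier feature encoding that maps an input vector to sines and cosines of random projections before it would be passed to an MLP. It is a generic illustration, not the paper's architecture: the Gaussian frequency matrix, the `scale` bandwidth, the factor of $2\pi$, and all names are assumptions.

```python
import math
import random

def make_fourier_features(input_dim, num_frequencies, scale=1.0, seed=0):
    """Return an encoder x -> [sin(2*pi*Bx), cos(2*pi*Bx)] with a fixed random matrix B."""
    rng = random.Random(seed)
    # Hypothetical frequency matrix B; larger `scale` emphasizes higher frequencies.
    frequencies = [
        [rng.gauss(0.0, scale) for _ in range(input_dim)]
        for _ in range(num_frequencies)
    ]

    def encode(x):
        projections = [
            2.0 * math.pi * sum(b * xi for b, xi in zip(row, x))
            for row in frequencies
        ]
        return [math.sin(p) for p in projections] + [math.cos(p) for p in projections]

    return encode

encode = make_fourier_features(input_dim=3, num_frequencies=4)
features = encode([0.1, -0.2, 0.3])  # 4 sines followed by 4 cosines
```

    The encoded vector would replace the raw state (or state-action) input of the value network; how the frequencies are chosen or trained is a design decision this sketch leaves open.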
    Learning to generalize Dispatching rules on the Job Shop Scheduling. (arXiv:2206.04423v1 [cs.LG])
    This paper introduces a Reinforcement Learning approach to better generalize heuristic dispatching rules on the Job-shop Scheduling Problem (JSP). Current models on the JSP do not focus on generalization, although, as we show in this work, this is key to learning better heuristics on the problem. A well-known technique to improve generalization is to learn on increasingly complex instances using Curriculum Learning (CL). However, as many works in the literature indicate, this technique might suffer from catastrophic forgetting when transferring the learned skills between different problem sizes. To address this issue, we introduce a novel Adversarial Curriculum Learning (ACL) strategy, which dynamically adjusts the difficulty level during the learning process to revisit the worst-performing instances. This work also presents a deep learning model to solve the JSP, which is equivariant w.r.t. the job definition and size-agnostic. Conducted experiments on Taillard's and Demirkol's instances show that the presented approach significantly improves the current state-of-the-art models on the JSP. It reduces the average optimality gap from 19.35\% to 10.46\% on Taillard's instances and from 38.43\% to 18.85\% on Demirkol's instances. Our implementation is available online.  ( 2 min )
    Multi-modal Attention Network for Stock Movements Prediction. (arXiv:2112.13593v3 [cs.LG] UPDATED)
    Stock prices move as piece-wise trending fluctuation rather than a purely random walk. Traditionally, the prediction of future stock movements is based on the historical trading record. Nowadays, with the development of social media, many active participants in the market choose to publicize their strategies, which provides a window to glimpse over the whole market's attitude towards future movements by extracting the semantics behind social media. However, social media contains conflicting information and cannot replace historical records completely. In this work, we propose a multi-modality attention network to reduce conflicts and integrate semantic and numeric features to predict future stock movements comprehensively. Specifically, we first extract semantic information from social media and estimate their credibility based on posters' identity and public reputation. Then we incorporate the semantic from online posts and numeric features from historical records to make the trading strategy. Experimental results show that our approach outperforms previous methods by a significant margin in both prediction accuracy (61.20\%) and trading profits (9.13\%). It demonstrates that our method improves the performance of stock movements prediction and informs future research on multi-modality fusion towards stock prediction.  ( 2 min )
    AttX: Attentive Cross-Connections for Fusion of Wearable Signals in Emotion Recognition. (arXiv:2206.04625v1 [cs.LG])
    We propose cross-modal attentive connections, a new dynamic and effective technique for multimodal representation learning from wearable data. Our solution can be integrated into any stage of the pipeline, i.e., after any convolutional layer or block, to create intermediate connections between individual streams responsible for processing each modality. Additionally, our method benefits from two properties. First, it can share information uni-directionally (from one modality to the other) or bi-directionally. Second, it can be integrated into multiple stages at the same time to further allow network gradients to be exchanged in several touch-points. We perform extensive experiments on three public multimodal wearable datasets, WESAD, SWELL-KW, and CASE, and demonstrate that our method can effectively regulate and share information between different modalities to learn better representations. Our experiments further demonstrate that once integrated into simple CNN-based multimodal solutions (2, 3, or 4 modalities), our method can result in superior or competitive performance to state-of-the-art and outperform a variety of baseline uni-modal and classical multimodal methods.  ( 2 min )
    Neo-GNNs: Neighborhood Overlap-aware Graph Neural Networks for Link Prediction. (arXiv:2206.04216v1 [cs.LG])
    Graph Neural Networks (GNNs) have been widely applied to various fields for learning over graph-structured data. They have shown significant improvements over traditional heuristic methods in various tasks such as node classification and graph classification. However, since GNNs heavily rely on smoothed node features rather than graph structure, they often show poorer performance than simple heuristic methods in link prediction, where structural information, e.g., overlapped neighborhoods, degrees, and shortest paths, is crucial. To address this limitation, we propose Neighborhood Overlap-aware Graph Neural Networks (Neo-GNNs) that learn useful structural features from an adjacency matrix and estimate overlapped neighborhoods for link prediction. Our Neo-GNNs generalize neighborhood overlap-based heuristic methods and handle overlapped multi-hop neighborhoods. Our extensive experiments on Open Graph Benchmark datasets (OGB) demonstrate that Neo-GNNs consistently achieve state-of-the-art performance in link prediction. Our code is publicly available at https://github.com/seongjunyun/Neo_GNNs.
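    The neighborhood-overlap heuristics that the abstract contrasts with GNNs can be illustrated with the simplest one, the common-neighbors score, which ranks a candidate link by how many neighbors its two endpoints share. The sketch below is that classical heuristic only, not the Neo-GNN model; the function name and the toy graph are hypothetical.

```python
def common_neighbor_score(adjacency, u, v):
    """Number of nodes adjacent to both u and v (the common-neighbors heuristic)."""
    return len(adjacency[u] & adjacency[v])

# Toy undirected graph stored as node -> set of neighbors (hypothetical data).
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
    "e": set(),
}

score_b_d = common_neighbor_score(graph, "b", "d")  # b and d share neighbors a and c
score_a_e = common_neighbor_score(graph, "a", "e")  # e is isolated
```

    A higher score suggests a more likely link; the unconnected pair `b`, `d` shares two neighbors, whereas any pair involving the isolated node `e` scores zero.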
    Socially Compliant Navigation Dataset (SCAND): A Large-Scale Dataset of Demonstrations for Social Navigation. (arXiv:2203.15041v2 [cs.RO] UPDATED)
    Social navigation is the capability of an autonomous agent, such as a robot, to navigate in a 'socially compliant' manner in the presence of other intelligent agents such as humans. With the emergence of autonomously navigating mobile robots in human-populated environments (e.g., domestic service robots in homes and restaurants and food delivery robots on public sidewalks), incorporating socially compliant navigation behaviors on these robots becomes critical to ensuring safe and comfortable human-robot coexistence. To address this challenge, imitation learning is a promising framework, since it is easier for humans to demonstrate the task of social navigation than to formulate reward functions that accurately capture the complex multi-objective setting of social navigation. The application of imitation learning and inverse reinforcement learning to social navigation for mobile robots, however, is currently hindered by a lack of large-scale datasets that capture socially compliant robot navigation demonstrations in the wild. To fill this gap, we introduce the Socially CompliAnt Navigation Dataset (SCAND), a large-scale, first-person-view dataset of socially compliant navigation demonstrations. Our dataset contains 8.7 hours, 138 trajectories, and 25 miles of socially compliant, human-teleoperated driving demonstrations that comprise multi-modal data streams including 3D lidar, joystick commands, odometry, and visual and inertial information, collected on two morphologically different mobile robots, a Boston Dynamics Spot and a Clearpath Jackal, by four different human demonstrators in both indoor and outdoor environments. We additionally perform preliminary analysis and validation through real-world robot experiments and show that navigation policies learned by imitation learning on SCAND generate socially compliant behaviors.
    Learning in Distributed Contextual Linear Bandits Without Sharing the Context. (arXiv:2206.04180v1 [cs.LG])
    Contextual linear bandits is a rich and theoretically important model that has many practical applications. Recently, this setup gained a lot of interest in applications over wireless where communication constraints can be a performance bottleneck, especially when the contexts come from a large $d$-dimensional space. In this paper, we consider a distributed memoryless contextual linear bandit learning problem, where the agents who observe the contexts and take actions are geographically separated from the learner who performs the learning while not seeing the contexts. We assume that contexts are generated from a distribution and propose a method that uses $\approx 5d$ bits per context for the case of unknown context distribution and $0$ bits per context if the context distribution is known, while achieving nearly the same regret bound as if the contexts were directly observable. The former bound improves upon existing bounds by a $\log(T)$ factor, where $T$ is the length of the horizon, while the latter achieves information theoretical tightness.
    Generative Flow Networks for Discrete Probabilistic Modeling. (arXiv:2202.01361v2 [cs.LG] UPDATED)
    We present energy-based generative flow networks (EB-GFN), a novel probabilistic modeling algorithm for high-dimensional discrete data. Building upon the theory of generative flow networks (GFlowNets), we model the generation process by a stochastic data construction policy and thus amortize expensive MCMC exploration into a fixed number of actions sampled from a GFlowNet. We show how GFlowNets can approximately perform large-block Gibbs sampling to mix between modes. We propose a framework to jointly train a GFlowNet with an energy function, so that the GFlowNet learns to sample from the energy distribution, while the energy learns with an approximate MLE objective with negative samples from the GFlowNet. We demonstrate EB-GFN's effectiveness on various probabilistic modeling tasks. Code is publicly available at https://github.com/zdhNarsil/EB_GFN.
    What Makes Transfer Learning Work For Medical Images: Feature Reuse & Other Factors. (arXiv:2203.01825v2 [cs.LG] UPDATED)
    Transfer learning is a standard technique to transfer knowledge from one domain to another. For applications in medical imaging, transfer from ImageNet has become the de-facto approach, despite differences in the tasks and image characteristics between the domains. However, it is unclear what factors determine whether - and to what extent - transfer learning to the medical domain is useful. The long-standing assumption that features from the source domain get reused has recently been called into question. Through a series of experiments on several medical image benchmark datasets, we explore the relationship between transfer learning, data size, the capacity and inductive bias of the model, as well as the distance between the source and target domain. Our findings suggest that transfer learning is beneficial in most cases, and we characterize the important role feature reuse plays in its success.
    Quick survey of graph-based fraud detection methods. (arXiv:1910.11299v3 [cs.LG] CROSS LISTED)
    In general, anomaly detection is the problem of distinguishing between normal data samples with well-defined patterns or signatures and those that do not conform to the expected profiles. Financial transactions, customer reviews, and social media posts are all characterized by relational information. In these networks, fraudulent behaviour may appear as a distinctive graph edge, such as a spam message, a node, or a larger subgraph structure, such as when a group of clients engages in money laundering schemes. Most commonly, these networks are represented as attributed graphs, with numerical features complementing relational information. We present a survey on anomaly detection techniques used for fraud detection that exploit both the graph structure underlying the data and the contextual information contained in the attributes.
    Robust Matrix Completion with Heavy-tailed Noise. (arXiv:2206.04276v1 [math.ST])
    This paper studies low-rank matrix completion in the presence of heavy-tailed and possibly asymmetric noise, where we aim to estimate an underlying low-rank matrix given a set of highly incomplete noisy entries. Though the matrix completion problem has attracted much attention in the past decade, there is still a lack of theoretical understanding when the observations are contaminated by heavy-tailed noise. Prior theory falls short of explaining the empirical results and is unable to capture the optimal dependence of the estimation error on the noise level. In this paper, we adopt an adaptive Huber loss to accommodate heavy-tailed noise, which is robust against large and possibly asymmetric errors when the parameter in the loss function is carefully designed to balance the Huberization biases and robustness to outliers. Then, we propose an efficient nonconvex algorithm via a balanced low-rank Burer-Monteiro matrix factorization and gradient descent with robust spectral initialization. We prove that under merely a bounded second-moment condition on the error distributions, rather than the sub-Gaussian assumption, the Euclidean error of the iterates generated by the proposed algorithm decreases geometrically fast until achieving a minimax-optimal statistical estimation error, which has the same order as that in the sub-Gaussian case. The key technique behind this significant advancement is a powerful leave-one-out analysis framework. The theoretical results are corroborated by our simulation studies.
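    As a reminder of what the Huber loss in this abstract does, here is a minimal sketch of the standard Huber loss and its derivative on a scalar residual. The threshold name `tau` is an assumption, and the paper's adaptive choice of that parameter is not reproduced here.

```python
def huber(residual, tau):
    """Standard Huber loss: quadratic for |r| <= tau, linear beyond."""
    a = abs(residual)
    if a <= tau:
        return 0.5 * a * a
    return tau * a - 0.5 * tau * tau

def huber_grad(residual, tau):
    """Derivative of the Huber loss: the residual clipped to [-tau, tau]."""
    return max(-tau, min(tau, residual))

small = huber(0.5, 1.0)            # quadratic regime
large = huber(3.0, 1.0)            # linear regime
clipped = huber_grad(-10.0, 2.0)   # a large outlier has bounded influence
```

    Because the derivative is clipped, a single large error moves a gradient step by at most `tau`, which is the source of the robustness; a smaller `tau` gives more robustness at the cost of more bias, the trade-off the abstract refers to.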
    Hilbert Curve Projection Distance for Distribution Comparison. (arXiv:2205.15059v2 [cs.LG] UPDATED)
    Distribution comparison plays a central role in many machine learning tasks like data classification and generative modeling. In this study, we propose a novel metric, called Hilbert curve projection (HCP) distance, to measure the distance between two probability distributions with high robustness and low complexity. In particular, we first project two high-dimensional probability densities using Hilbert curve to obtain a coupling between them, and then calculate the transport distance between these two densities in the original space, according to the coupling. We show that HCP distance is a proper metric and is well-defined for absolutely continuous probability measures. Furthermore, we demonstrate that the empirical HCP distance converges to its population counterpart at a rate of no more than $O(n^{-1/2d})$ under regularity conditions. To suppress the curse-of-dimensionality, we also develop two variants of the HCP distance using (learnable) subspace projections. Experiments on both synthetic and real-world data show that our HCP distance works as an effective surrogate of the Wasserstein distance with low complexity and overcomes the drawbacks of the sliced Wasserstein distance.
    Contextual Information-Directed Sampling. (arXiv:2205.10895v2 [cs.LG] UPDATED)
    Information-directed sampling (IDS) has recently demonstrated its potential as a data-efficient reinforcement learning algorithm. However, it is still unclear what is the right form of information ratio to optimize when contextual information is available. We investigate the IDS design through two contextual bandit problems: contextual bandits with graph feedback and sparse linear contextual bandits. We provably demonstrate the advantage of contextual IDS over conditional IDS and emphasize the importance of considering the context distribution. The main message is that an intelligent agent should invest more on the actions that are beneficial for the future unseen contexts while the conditional IDS can be myopic. We further propose a computationally-efficient version of contextual IDS based on Actor-Critic and evaluate it empirically on a neural network contextual bandit.
    Improved Differential Privacy for SGD via Optimal Private Linear Operators on Adaptive Streams. (arXiv:2202.08312v2 [cs.LG] UPDATED)
    Motivated by recent applications requiring differential privacy over adaptive streams, we investigate the question of optimal instantiations of the matrix mechanism in this setting. We prove fundamental theoretical results on the applicability of matrix factorizations to adaptive streams, and provide a parameter-free fixed-point algorithm for computing optimal factorizations. We instantiate this framework with respect to concrete matrices which arise naturally in machine learning, and train user-level differentially private models with the resulting optimal mechanisms, yielding significant improvements in a notable problem in federated learning with user-level differential privacy.
    Privacy-Aware Compression for Federated Data Analysis. (arXiv:2203.08134v2 [cs.LG] UPDATED)
    Federated data analytics is a framework for distributed data analysis where a server compiles noisy responses from a group of distributed low-bandwidth user devices to estimate aggregate statistics. Two major challenges in this framework are privacy, since user data is often sensitive, and compression, since the user devices have low network bandwidth. Prior work has addressed these challenges separately by combining standard compression algorithms with known privacy mechanisms. In this work, we take a holistic look at the problem and design a family of privacy-aware compression mechanisms that work for any given communication budget. We first propose a mechanism for transmitting a single real number that has optimal variance under certain conditions. We then show how to extend it to metric differential privacy for location privacy use-cases, as well as vectors, for application to federated learning. Our experiments illustrate that our mechanism can lead to better utility vs. compression trade-offs for the same privacy loss in a number of settings.
    Evaluating State of the Art, Forecasting Ensembles- and Meta-learning Strategies for Model Fusion. (arXiv:2203.03279v2 [cs.LG] UPDATED)
    Techniques of hybridisation and ensemble learning are popular model fusion techniques for improving the predictive power of forecasting methods. With limited research that investigates combining these two promising approaches, this paper focuses on the utility of the Exponential-Smoothing-Recurrent Neural Network (ES-RNN) in the pool of base models for different ensembles. We compare against some state of the art ensembling techniques and arithmetic model averaging as a benchmark. We experiment with the M4 forecasting data set of 100,000 time-series, and the results show that the Feature-based Forecast Model Averaging (FFORMA), on average, is the best technique for late data fusion with the ES-RNN. However, considering the M4's Daily subset of data, stacking was the only successful ensemble at dealing with the case where all base model performances are similar. Our experimental results indicate that we attain state of the art forecasting results compared to N-BEATS as a benchmark. We conclude that model averaging is a more robust ensemble than model selection and stacking strategies. Further, the results show that gradient boosting is superior for implementing ensemble learning strategies.
    Markovian Interference in Experiments. (arXiv:2206.02371v2 [cs.LG] UPDATED)
    We consider experiments in dynamical systems where interventions on some experimental units impact other units through a limiting constraint (such as a limited inventory). Despite outsize practical importance, the best estimators for this `Markovian' interference problem are largely heuristic in nature, and their bias is not well understood. We formalize the problem of inference in such experiments as one of policy evaluation. Off-policy estimators, while unbiased, apparently incur a large penalty in variance relative to state-of-the-art heuristics. We introduce an on-policy estimator: the Differences-In-Q's (DQ) estimator. We show that the DQ estimator can in general have exponentially smaller variance than off-policy evaluation. At the same time, its bias is second order in the impact of the intervention. This yields a striking bias-variance tradeoff so that the DQ estimator effectively dominates state-of-the-art alternatives. From a theoretical perspective, we introduce three separate novel techniques that are of independent interest in the theory of Reinforcement Learning (RL). Our empirical evaluation includes a set of experiments on a city-scale ride-hailing simulator.
    Time Delay Estimation of Traffic Congestion Propagation based on Transfer Entropy. (arXiv:2108.06717v2 [stat.ML] UPDATED)
    Considering how congestion will propagate in the near future, understanding traffic congestion propagation has become crucial in GPS navigation systems for providing users with a more accurate estimated time of arrival (ETA). However, providing the exact ETA during congestion is a challenge owing to the complex propagation process between roads and high uncertainty regarding the future behavior of the process. Recent studies have focused on finding frequent congestion propagation patterns and determining the propagation probabilities. By contrast, this study proposes a novel time delay estimation method for traffic congestion propagation between roads using lag-specific transfer entropy (TE). Nonlinear normalization with a sliding window is used to effectively reveal the causal relationship between the source and target time series in calculating the TE. Moreover, Markov bootstrap techniques were adopted to quantify the uncertainty in the time delay estimator. To the best of our knowledge, the time delay estimation method presented in this article is the first to determine the time delay between roads for any congestion propagation pattern. The proposed method was validated using simulated data as well as real user trajectory data obtained from a major GPS navigation system applied in South Korea.
    Vector Optimization with Stochastic Bandit Feedback. (arXiv:2110.12311v3 [cs.LG] UPDATED)
    We introduce vector optimization problems with stochastic bandit feedback, which extend the best arm identification problem to vector-valued rewards. We consider $K$ designs with multi-dimensional mean reward vectors, which are partially ordered according to a polyhedral ordering cone $C$. This generalizes the concept of the Pareto set in multi-objective optimization and allows different sets of preferences of decision-makers to be encoded by $C$. Unlike prior work, we define approximations of the Pareto set based on direction-free covering and gap notions. We study an ($\epsilon,\delta$)-PAC Pareto set identification problem where an evaluation of each design yields a noisy observation of the mean reward vector. In order to characterize the difficulty of learning the Pareto set, we introduce the concept of ordering complexity, i.e., geometric conditions on the deviations of empirical reward vectors from their mean under which the Pareto front can be approximated accurately. We show how to compute the ordering complexity of any polyhedral ordering cone. We provide gap-dependent and worst-case lower bounds on the sample complexity and show that in the worst case the sample complexity scales with the square of the ordering complexity. Furthermore, we investigate the sample complexity of the naïve elimination algorithm and prove that it nearly matches the worst-case sample complexity. Finally, we run experiments to verify our theoretical results and illustrate how $C$ and the sampling budget affect the Pareto set, the returned ($\epsilon,\delta$)-PAC Pareto set, and the success of identification.
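    The Pareto set that this abstract generalizes can be sketched for the familiar special case where the ordering cone is the positive orthant, i.e., componentwise comparison. The code below assumes known mean vectors and ignores the noisy sampling studied in the paper; the function names and data are illustrative assumptions.

```python
def dominates(a, b):
    """a dominates b if a >= b in every coordinate and a > b in at least one.
    This is the ordering induced by the positive orthant cone."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_set(means):
    """Return the indices of designs not dominated by any other design."""
    return [
        i
        for i, m in enumerate(means)
        if not any(dominates(o, m) for j, o in enumerate(means) if j != i)
    ]

# Hypothetical two-dimensional mean reward vectors for four designs.
means = [(1.0, 3.0), (2.0, 2.0), (1.5, 1.5), (3.0, 1.0)]
front = pareto_set(means)  # design 2 is dominated by design 1
```

    For a general cone $C$, the componentwise test in `dominates` would be replaced by a check that the difference of the two vectors lies in $C$.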
    TubeDETR: Spatio-Temporal Video Grounding with Transformers. (arXiv:2203.16434v2 [cs.CV] UPDATED)
    We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query. This is a challenging task that requires the joint and efficient modeling of temporal, spatial and multi-modal interactions. To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection. Our model notably includes: (i) an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames and (ii) a space-time decoder that jointly performs spatio-temporal localization. We demonstrate the advantage of our proposed components through an extensive ablation study. We also evaluate our full approach on the spatio-temporal video grounding task and demonstrate improvements over the state of the art on the challenging VidSTG and HC-STVG benchmarks. Code and trained models are publicly available at https://antoyang.github.io/tubedetr.html.
    Multi-task Self-distillation for Graph-based Semi-Supervised Learning. (arXiv:2112.01174v2 [cs.LG] UPDATED)
    Graph convolutional networks have made great progress in graph-based semi-supervised learning. Existing methods mainly assume that nodes connected by graph edges are prone to have similar attributes and labels, so that the features smoothed by local graph structures can reveal the class similarities. However, there often exist mismatches between graph structures and labels in many real-world scenarios, where the structures may propagate misleading features or labels that eventually affect the model performance. In this paper, we propose a multi-task self-distillation framework that injects self-supervised learning and self-distillation into graph convolutional networks to separately address the mismatch problem from the structure side and the label side. First, we formulate a self-supervision pipeline based on pre-text tasks to capture different levels of similarities in graphs. The feature extraction process is encouraged to capture more complex proximity by jointly optimizing the pre-text task and the target task. Consequently, the local feature aggregations are improved from the structure side. Second, self-distillation uses soft labels of the model itself as additional supervision, which has similar effects as label smoothing. The knowledge from the classification pipeline and the self-supervision pipeline is collectively distilled to improve the generalization ability of the model from the label side. Experiment results show that the proposed method obtains remarkable performance gains under several classic graph convolutional architectures.
    Objective-Based Hierarchical Clustering of Deep Embedding Vectors. (arXiv:2012.08466v2 [cs.LG] UPDATED)
    We initiate a comprehensive experimental study of objective-based hierarchical clustering methods on massive datasets consisting of deep embedding vectors from computer vision and NLP applications. This includes a large variety of image embedding (ImageNet, ImageNetV2, NaBirds), word embedding (Twitter, Wikipedia), and sentence embedding (SST-2) vectors from several popular recent models (e.g. ResNet, ResNext, Inception V3, SBERT). Our study includes datasets with up to $4.5$ million entries with embedding dimensions up to $2048$. In order to address the challenge of scaling up hierarchical clustering to such large datasets we propose a new practical hierarchical clustering algorithm B++&C. It gives a 5%/20% improvement on average for the popular Moseley-Wang (MW) / Cohen-Addad et al. (CKMM) objectives (normalized) compared to a wide range of classic methods and recent heuristics. We also introduce a theoretical algorithm B2SAT&C which achieves a $0.74$-approximation for the CKMM objective in polynomial time. This is the first substantial improvement over the trivial $2/3$-approximation achieved by a random binary tree. Prior to this work, the best poly-time approximation of $\approx 2/3 + 0.0004$ was due to Charikar et al. (SODA'19).
    Globally Optimal Algorithms for Fixed-Budget Best Arm Identification. (arXiv:2206.04646v1 [stat.ML])
    We consider the fixed-budget best arm identification problem where the goal is to find the arm with the largest mean using a fixed number of samples. It is known that the probability of misidentifying the best arm is exponentially small in the number of rounds. However, the rate (exponent) of this probability has so far been characterized only to a limited extent. In this paper, we characterize the optimal rate as a result of global optimization over all possible parameters. We introduce two rates, $R^{\mathrm{go}}$ and $R^{\mathrm{go}}_{\infty}$, corresponding to lower bounds on the misidentification probability, each of which is associated with a proposed algorithm. The rate $R^{\mathrm{go}}$ is associated with $R^{\mathrm{go}}$-tracking, which can be efficiently implemented by a neural network and is shown to outperform existing algorithms. However, this rate requires a nontrivial condition to be achievable. To deal with this issue, we introduce the second rate $R^{\mathrm{go}}_\infty$. We show that this rate is indeed achievable by introducing a conceptual algorithm called delayed optimal tracking (DOT).
    A Critical Review on the Use (and Misuse) of Differential Privacy in Machine Learning. (arXiv:2206.04621v1 [cs.CR])
    We review the use of differential privacy (DP) for privacy protection in machine learning (ML). We show that, driven by the aim of preserving the accuracy of the learned models, DP-based ML implementations are so loose that they do not offer the ex ante privacy guarantees of DP. Instead, what they deliver is basically noise addition similar to the traditional (and often criticized) statistical disclosure control approach. Due to the lack of formal privacy guarantees, the actual level of privacy offered must be experimentally assessed ex post, which is done very seldom. In this respect, we present empirical results showing that standard anti-overfitting techniques in ML can achieve a better utility/privacy/efficiency trade-off than DP.
    Fast Hierarchical Games for Image Explanations. (arXiv:2104.06164v2 [cs.CV] UPDATED)
    As modern complex neural networks keep breaking records and solving harder problems, their predictions also become less and less intelligible. The current lack of interpretability often undermines the deployment of accurate machine learning tools in sensitive settings. In this work, we present a model-agnostic explanation method for image classification based on a hierarchical extension of Shapley coefficients--Hierarchical Shap (h-Shap)--that resolves some of the limitations of current approaches. Unlike other Shapley-based explanation methods, h-Shap is scalable and can be computed without the need of approximation. Under certain distributional assumptions, such as those common in multiple instance learning, h-Shap retrieves the exact Shapley coefficients with an exponential improvement in computational complexity. We compare our hierarchical approach with popular Shapley-based and non-Shapley-based methods on a synthetic dataset, a medical imaging scenario, and a general computer vision problem, showing that h-Shap outperforms the state of the art in both accuracy and runtime. Code and experiments are made publicly available.
    On the Parameter Combinations That Matter and on Those That do Not. (arXiv:2110.06717v2 [cs.LG] UPDATED)
    We present a data-driven approach to characterizing nonidentifiability of a model's parameters and illustrate it through dynamic as well as steady kinetic models. By employing Diffusion Maps and their extensions, we discover the minimal combinations of parameters required to characterize the output behavior of a chemical system: a set of effective parameters for the model. Furthermore, we introduce and use a Conformal Autoencoder Neural Network technique, as well as a kernel-based Jointly Smooth Function technique, to disentangle the redundant parameter combinations that do not affect the output behavior from the ones that do. We discuss the interpretability of our data-driven effective parameters, and demonstrate the utility of the approach both for behavior prediction and parameter estimation. In the latter task, it becomes important to describe level sets in parameter space that are consistent with a particular output behavior. We validate our approach on a model of multisite phosphorylation, where a reduced set of effective parameters (nonlinear combinations of the physical ones) has previously been established analytically.
    Multivariate feature ranking of gene expression data. (arXiv:2111.02357v4 [cs.LG] UPDATED)
    Gene expression datasets are usually of high dimensionality and therefore require efficient and effective methods for identifying the relative importance of their attributes. Due to the huge size of the search space of possible solutions, feature selection methods based on attribute subset evaluation tend not to be applicable, so in these scenarios feature ranking methods are used. Most of the feature ranking methods described in the literature are univariate, so they do not detect interactions between factors. In this paper we propose two new multivariate feature ranking methods based on pairwise correlation and pairwise consistency, which we have applied to three gene expression classification problems. We statistically prove that the proposed methods outperform the state of the art feature ranking methods Clustering Variation, Chi Squared, Correlation, Information Gain, ReliefF and Significance, as well as feature selection methods of attribute subset evaluation based on correlation and consistency with multi-objective evolutionary search strategy.
    The Interpolation Phase Transition in Neural Networks: Memorization and Generalization under Lazy Training. (arXiv:2007.12826v3 [stat.ML] UPDATED)
    Modern neural networks are often operated in a strongly overparametrized regime: they comprise so many parameters that they can interpolate the training set, even if actual labels are replaced by purely random ones. Despite this, they achieve good prediction error on unseen data: interpolating the training set does not lead to a large generalization error. Further, overparametrization appears to be beneficial in that it simplifies the optimization landscape. Here we study these phenomena in the context of two-layer neural networks in the neural tangent (NT) regime. We consider a simple data model, with isotropic covariate vectors in $d$ dimensions, and $N$ hidden neurons. We assume that both the sample size $n$ and the dimension $d$ are large, and they are polynomially related. Our first main result is a characterization of the eigenstructure of the empirical NT kernel in the overparametrized regime $Nd\gg n$. This characterization implies as a corollary that the minimum eigenvalue of the empirical NT kernel is bounded away from zero as soon as $Nd\gg n$, and therefore the network can exactly interpolate arbitrary labels in the same regime. Our second main result is a characterization of the generalization error of NT ridge regression including, as a special case, min-$\ell_2$ norm interpolation. We prove that, as soon as $Nd\gg n$, the test error is well approximated by that of kernel ridge regression with respect to the infinite-width kernel. The latter is in turn well approximated by the error of polynomial ridge regression, whereby the regularization parameter is increased by a `self-induced' term related to the high-degree components of the activation function. The polynomial degree depends on the sample size and the dimension (in particular on $\log n/\log d$).
    On Margins and Generalisation for Voting Classifiers. (arXiv:2206.04607v1 [cs.LG])
    We study the generalisation properties of majority voting on finite ensembles of classifiers, proving margin-based generalisation bounds via the PAC-Bayes theory. These provide state-of-the-art guarantees on a number of classification tasks. Our central results leverage the Dirichlet posteriors studied recently by Zantedeschi et al. [2021] for training voting classifiers; in contrast to that work our bounds apply to non-randomised votes via the use of margins. Our contributions add perspective to the debate on the "margins theory" proposed by Schapire et al. [1998] for the generalisation of ensemble classifiers.
    Conformal Off-Policy Prediction in Contextual Bandits. (arXiv:2206.04405v1 [stat.ML])
    Most off-policy evaluation methods for contextual bandits have focused on the expected outcome of a policy, which is estimated via methods that at best provide only asymptotic guarantees. However, in many applications, the expectation may not be the best measure of performance as it does not capture the variability of the outcome. In addition, particularly in safety-critical settings, stronger guarantees than asymptotic correctness may be required. To address these limitations, we consider a novel application of conformal prediction to contextual bandits. Given data collected under a behavioral policy, we propose conformal off-policy prediction (COPP), which can output reliable predictive intervals for the outcome under a new target policy. We provide theoretical finite-sample guarantees without making any additional assumptions beyond the standard contextual bandit setup, and empirically demonstrate the utility of COPP compared with existing methods on synthetic and real-world data.
    FogAdapt: Self-Supervised Domain Adaptation for Semantic Segmentation of Foggy Images. (arXiv:2201.02588v3 [cs.CV] UPDATED)
    This paper presents FogAdapt, a novel approach for domain adaptation of semantic segmentation for dense foggy scenes. Although significant research has been directed to reduce the domain shift in semantic segmentation, adaptation to scenes with adverse weather conditions remains an open question. Large variations in the visibility of the scene due to weather conditions, such as fog, smog, and haze, exacerbate the domain shift, thus making unsupervised adaptation in such scenarios challenging. We propose a self-entropy and multi-scale information augmented self-supervised domain adaptation method (FogAdapt) to minimize the domain shift in foggy scenes segmentation. Supported by the empirical evidence that an increase in fog density results in high self-entropy for segmentation probabilities, we introduce a self-entropy based loss function to guide the adaptation method. Furthermore, inferences obtained at different image scales are combined and weighted by the uncertainty to generate scale-invariant pseudo-labels for the target domain. These scale-invariant pseudo-labels are robust to visibility and scale variations. We evaluate the proposed model on real clear-weather scenes to real foggy scenes adaptation and synthetic non-foggy images to real foggy scenes adaptation scenarios. Our experiments demonstrate that FogAdapt significantly outperforms the current state-of-the-art in semantic segmentation of foggy images. Specifically, by considering the standard settings compared to state-of-the-art (SOTA) methods, FogAdapt gains 3.8% on Foggy Zurich, 6.0% on Foggy Driving-dense, and 3.6% on Foggy Driving in mIoU when adapted from Cityscapes to Foggy Zurich.
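    The self-entropy quantity that this abstract ties to fog density can be illustrated on a single pixel's class probabilities. This is only a sketch of the Shannon entropy of a softmax output, not FogAdapt's loss function; the logits are made-up values.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def self_entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical three-class logits for a single pixel.
confident = self_entropy(softmax([8.0, 0.0, 0.0]))  # peaked prediction
uncertain = self_entropy(softmax([0.0, 0.0, 0.0]))  # uniform prediction
```

    The uniform prediction attains the maximum entropy for three classes, while the peaked one is close to zero, so penalizing this quantity pushes a model toward confident predictions on the target domain.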
    Accurate Node Feature Estimation with Structured Variational Graph Autoencoder. (arXiv:2206.04516v1 [cs.LG])
    Given a graph with partial observations of node features, how can we estimate the missing features accurately? Feature estimation is a crucial problem for analyzing real-world graphs whose features are commonly missing during the data collection process. Accurate estimation not only provides diverse information of nodes but also supports the inference of graph neural networks that require the full observation of node features. However, designing an effective approach for estimating high-dimensional features is challenging, since it requires an estimator to have large representation power, increasing the risk of overfitting. In this work, we propose SVGA (Structured Variational Graph Autoencoder), an accurate method for feature estimation. SVGA applies strong regularization to the distribution of latent variables by structured variational inference, which models the prior of variables as Gaussian Markov random field based on the graph structure. As a result, SVGA combines the advantages of probabilistic inference and graph neural networks, achieving state-of-the-art performance in real datasets.
    Clustering with Queries under Semi-Random Noise. (arXiv:2206.04583v1 [cs.LG])
    The seminal paper by Mazumdar and Saha \cite{MS17a} introduced an extensive line of work on clustering with noisy queries. Yet, despite significant progress on the problem, the proposed methods depend crucially on knowing the exact probabilities of errors of the underlying fully-random oracle. In this work, we develop robust learning methods that tolerate general semi-random noise obtaining qualitatively the same guarantees as the best possible methods in the fully-random model. More specifically, given a set of $n$ points with an unknown underlying partition, we are allowed to query pairs of points $u,v$ to check if they are in the same cluster, but with probability $p$, the answer may be adversarially chosen. We show that information theoretically $O\left(\frac{nk \log n} {(1-2p)^2}\right)$ queries suffice to learn any cluster of sufficiently large size. Our main result is a computationally efficient algorithm that can identify large clusters with $O\left(\frac{nk \log n} {(1-2p)^2}\right) + \text{poly}\left(\log n, k, \frac{1}{1-2p} \right)$ queries, matching the guarantees of the best known algorithms in the fully-random model. As a corollary of our approach, we develop the first parameter-free algorithm for the fully-random model, answering an open question by \cite{MS17a}.
    Contrastive Regularization for Semi-Supervised Learning. (arXiv:2201.06247v2 [cs.LG] UPDATED)
    Consistency regularization on label predictions has become a fundamental technique in semi-supervised learning, but it still requires a large number of training iterations for high performance. In this study, we show through analysis that consistency regularization restricts the propagation of labeling information due to the exclusion of samples with unconfident pseudo-labels in the model updates. Then, we propose contrastive regularization to improve both efficiency and accuracy of the consistency regularization by well-clustered features of unlabeled data. Specifically, after strongly augmented samples are assigned to clusters by their pseudo-labels, our contrastive regularization updates the model so that the features with confident pseudo-labels aggregate the features in the same cluster, while pushing away features in different clusters. As a result, the information of confident pseudo-labels can be effectively propagated into more unlabeled samples during training by the well-clustered features. On benchmarks of semi-supervised learning tasks, our contrastive regularization improves the previous consistency-based methods and achieves state-of-the-art results, especially with fewer training iterations. Our method also shows robust performance on open-set semi-supervised learning where unlabeled data includes out-of-distribution samples.
    Variational Physics Informed Neural Networks: the role of quadratures and test functions. (arXiv:2109.02035v2 [math.NA] UPDATED)
    In this work we analyze how quadrature rules of different precisions and piecewise polynomial test functions of different degrees affect the convergence rate of Variational Physics Informed Neural Networks (VPINN) with respect to mesh refinement, while solving elliptic boundary-value problems. Using a Petrov-Galerkin framework relying on an inf-sup condition, we derive an a priori error estimate in the energy norm between the exact solution and a suitable high-order piecewise interpolant of a computed neural network. Numerical experiments confirm the theoretical predictions and highlight the importance of the inf-sup condition. Our results suggest, somewhat counterintuitively, that for smooth solutions the best strategy to achieve a high decay rate of the error consists in choosing test functions of the lowest polynomial degree, while using quadrature formulas of suitably high precision.
    ECLAD: Extracting Concepts with Local Aggregated Descriptors. (arXiv:2206.04531v1 [cs.CV])
    Convolutional neural networks are being increasingly used in critical systems, where ensuring their robustness and alignment is crucial. In this context, the field of explainable artificial intelligence has proposed the generation of high-level explanations through concept extraction. These methods detect whether a concept is present in an image, but are incapable of locating where. What is more, a fair comparison of approaches is difficult, as proper validation procedures are missing. To fill these gaps, we propose a novel method for automatic concept extraction and localization based on representations obtained through the pixel-wise aggregations of activation maps of CNNs. Further, we introduce a process for the validation of concept-extraction techniques based on synthetic datasets with pixel-wise annotations of their main components, reducing human intervention. Through extensive experimentation on both synthetic and real-world datasets, our method achieves better performance in comparison to state-of-the-art alternatives.
    Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval. (arXiv:2201.12431v2 [cs.CL] UPDATED)
    Retrieval-based language models (R-LM) model the probability of natural language text by combining a standard language model (LM) with examples retrieved from an external datastore at test time. While effective, a major bottleneck of using these models in practice is the computationally costly datastore search, which can be performed as frequently as every time step. In this paper, we present RetoMaton - retrieval automaton - which approximates the datastore search, based on (1) saving pointers between consecutive datastore entries, and (2) clustering of entries into "states". This effectively results in a weighted finite automaton built on top of the datastore, instead of representing the datastore as a flat list. The creation of the automaton is unsupervised, and a RetoMaton can be constructed from any text collection: either the original training corpus or from another domain. Traversing this automaton at inference time, in parallel to the LM inference, reduces its perplexity by up to 1.85, or alternatively saves up to 83% of the nearest neighbor searches over $k$NN-LM (Khandelwal et al., 2020) without hurting perplexity. Our code and trained models are available at https://github.com/neulab/retomaton .
    Explicit Regularization in Overparametrized Models via Noise Injection. (arXiv:2206.04613v1 [cs.LG])
    Injecting noise within gradient descent has several desirable features. In this paper, we explore noise injection before computing a gradient step, which is known to have smoothing and regularizing properties. We show that small perturbations induce explicit regularization for simple finite-dimensional models based on the l1-norm, group l1-norms, or nuclear norms. When applied to overparametrized neural networks with large widths, we show that the same perturbations do not work due to variance explosion resulting from overparametrization. However, we also show that independent layer-wise perturbations make it possible to avoid the exploding variance term, and explicit regularizers can then be obtained. We empirically show that the small perturbations lead to better generalization performance than vanilla (stochastic) gradient descent training, with minor adjustments to the training procedure.
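    The mechanics of "noise injection before computing a gradient step" can be sketched on a one-dimensional toy problem: the gradient is evaluated at a randomly perturbed copy of the parameter, and the update is applied to the unperturbed one. The quadratic loss, step size, and noise scale below are illustrative assumptions, not settings from the paper, and a scalar quadratic is too simple to exhibit the regularization effects the abstract describes.

```python
import random

def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def perturbed_gradient_descent(w0, lr, sigma, steps, seed=0):
    """Gradient descent where each gradient is taken at w + sigma * noise."""
    rng = random.Random(seed)
    w = w0
    for _ in range(steps):
        noisy_w = w + sigma * rng.gauss(0.0, 1.0)  # perturb before the gradient
        w -= lr * grad(noisy_w)                    # update the clean parameter
    return w

w_plain = perturbed_gradient_descent(0.0, lr=0.1, sigma=0.0, steps=500)
w_noisy = perturbed_gradient_descent(0.0, lr=0.1, sigma=0.1, steps=500)
print(w_plain, w_noisy)
```

    With sigma set to zero the routine reduces to ordinary gradient descent, which makes it easy to compare the two regimes on the same problem.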
    Probability flow solution of the Fokker-Planck equation. (arXiv:2206.04642v1 [cs.LG])
    The method of choice for integrating the time-dependent Fokker-Planck equation in high-dimension is to generate samples from the solution via integration of the associated stochastic differential equation. Here, we introduce an alternative scheme based on integrating an ordinary differential equation that describes the flow of probability. Unlike the stochastic dynamics, this equation deterministically pushes samples from the initial density onto samples from the solution at any later time. The method has the advantage of giving direct access to quantities that are challenging to estimate only given samples from the solution, such as the probability current, the density itself, and its entropy. The probability flow equation depends on the gradient of the logarithm of the solution (its "score"), and so is a-priori unknown. To resolve this dependence, we model the score with a deep neural network that is learned on-the-fly by propagating a set of particles according to the instantaneous probability current. Our approach is based on recent advances in score-based diffusion for generative modeling, with the important difference that the training procedure is self-contained and does not require samples from the target density to be available beforehand. To demonstrate the validity of the approach, we consider several examples from the physics of interacting particle systems; we find that the method scales well to high-dimensional systems, and accurately matches available analytical solutions and moments computed via Monte-Carlo.
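    For orientation, the relation between the stochastic dynamics and the probability flow can be written out in a simplified setting. The drift $b_t$ and the constant scalar diffusion coefficient $D$ are notational assumptions made here for illustration; the paper may treat a more general diffusion term.

```latex
% SDE whose density \rho_t solves the Fokker-Planck equation (constant D assumed)
dX_t = b_t(X_t)\,dt + \sqrt{2D}\,dW_t
% Fokker-Planck equation, rewritten as a transport (continuity) equation
\partial_t \rho_t = -\nabla\cdot\big(b_t\,\rho_t\big) + D\,\Delta\rho_t
                  = -\nabla\cdot\big(v_t\,\rho_t\big),
\qquad v_t(x) = b_t(x) - D\,\nabla\log\rho_t(x)
% Probability flow ODE: deterministic transport of samples by the velocity v_t
\frac{dX_t}{dt} = v_t(X_t)
```

    The second equality uses $\rho_t\,\nabla\log\rho_t = \nabla\rho_t$, which is why the velocity field depends on the score $\nabla\log\rho_t$.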
    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. (arXiv:2206.04615v1 [cs.CL])
    Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
    Physics-aware Reduced-order Modeling of Transonic Flow via $\beta$-Variational Autoencoder. (arXiv:2205.00608v2 [physics.flu-dyn] UPDATED)
    Autoencoder-based reduced-order modeling (ROM) has recently attracted significant attention, owing to its ability to capture underlying nonlinear features. However, two critical drawbacks severely undermine its scalability to various physical applications: entangled and therefore uninterpretable latent variables (LVs) and the blindfold determination of latent space dimension. In this regard, this study proposes the physics-aware ROM using only interpretable and information-intensive LVs extracted by $\beta$-variational autoencoder, which are referred to as physics-aware LVs throughout this paper. To extract these LVs, their independence and information intensity are quantitatively scrutinized in a two-dimensional transonic flow benchmark problem. Then, the physical meanings of the physics-aware LVs are thoroughly investigated and we confirmed that with appropriate hyperparameter $\beta$, they actually correspond to the generating factors of the training dataset, Mach number and angle of attack. To the best of the authors' knowledge, our work is the first to practically confirm that $\beta$-variational autoencoder can automatically extract the physical generating factors in the field of applied physics. Finally, physics-aware ROM, which utilizes only physics-aware LVs, is compared with conventional ROMs, and its validity and efficiency are successfully verified.
    Transformer based Urdu Handwritten Text Optical Character Reader. (arXiv:2206.04575v1 [cs.CV])
    Extracting handwritten text is one of the most important components of digitizing information and making it available at large scale. Handwriting Optical Character Reader (OCR) is a research problem in computer vision and natural language processing, and a lot of work has been done for English, but unfortunately, very little work has been done for low-resourced languages such as Urdu. Urdu script is very difficult because of its cursive nature and the change in shape of characters based on their relative position; therefore, a need arises for a model that can understand complex features and generalize to every kind of handwriting style. In this work, we propose a transformer-based Urdu handwritten text extraction model. As transformers have been very successful in Natural Language Understanding tasks, we explore them further to understand complex Urdu handwriting.
    RoMA: a Method for Neural Network Robustness Measurement and Assessment. (arXiv:2110.11088v4 [cs.LG] UPDATED)
    Neural network models have become the leading solution for a large variety of tasks, such as classification, language processing, protein folding, and others. However, their reliability is heavily plagued by adversarial inputs: small input perturbations that cause the model to produce erroneous outputs. Adversarial inputs can occur naturally when the system's environment behaves randomly, even in the absence of a malicious adversary, and are a severe cause for concern when attempting to deploy neural networks within critical systems. In this paper, we present a new statistical method, called Robustness Measurement and Assessment (RoMA), which can measure the expected robustness of a neural network model. Specifically, RoMA determines the probability that a random input perturbation might cause misclassification. The method allows us to provide formal guarantees regarding the expected frequency of errors that a trained model will encounter after deployment. Our approach can be applied to large-scale, black-box neural networks, which is a significant advantage compared to recently proposed verification methods. We apply our approach in two ways: comparing the robustness of different models, and measuring how a model's robustness is affected by the magnitude of input perturbation. One interesting insight obtained through this work is that, in a classification network, different output labels can exhibit very different robustness levels. We term this phenomenon categorial robustness. Our ability to perform risk and robustness assessments on a categorial basis opens the door to risk mitigation, which may prove to be a significant step towards neural network certification in safety-critical applications.
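    The quantity RoMA targets, the probability that a random input perturbation causes misclassification, can be illustrated with a naive Monte Carlo estimate. This is only a sketch under stated assumptions: the two-feature threshold classifier, the uniform perturbation model, and all names are hypothetical, and the abstract does not say that RoMA's statistical procedure is plain sampling of this kind.

```python
import random

def classify(x):
    """Toy black-box classifier: label 1 if the two features sum to > 0."""
    return 1 if x[0] + x[1] > 0.0 else 0

def misclassification_rate(x, eps, num_samples=20000, seed=0):
    """Fraction of uniform perturbations in [-eps, eps]^d that flip the label."""
    rng = random.Random(seed)
    label = classify(x)
    errors = 0
    for _ in range(num_samples):
        perturbed = [xi + rng.uniform(-eps, eps) for xi in x]
        if classify(perturbed) != label:
            errors += 1
    return errors / num_samples

x0 = [0.05, 0.0]
for eps in (0.01, 0.1, 0.5):
    print(eps, misclassification_rate(x0, eps))
```

    Sweeping eps as in the loop mirrors the second use case in the abstract, measuring how robustness changes with the magnitude of the input perturbation.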
    A Simple Unified Approach to Testing High-Dimensional Conditional Independences for Categorical and Ordinal Data. (arXiv:2206.04356v1 [stat.ML])
    Conditional independence (CI) tests underlie many approaches to model testing and structure learning in causal inference. Most existing CI tests for categorical and ordinal data stratify the sample by the conditioning variables, perform simple independence tests in each stratum, and combine the results. Unfortunately, the statistical power of this approach degrades rapidly as the number of conditioning variables increases. Here we propose a simple unified CI test for ordinal and categorical data that maintains reasonable calibration and power in high dimensions. We show that our test outperforms existing baselines in model testing and structure learning for dense directed graphical models while being comparable for sparse models. Our approach could be attractive for causal model testing because it is easy to implement, can be used with non-parametric or parametric probability models, has the symmetry property, and has reasonable computational requirements.
    BigVGAN: A Universal Neural Vocoder with Large-Scale Training. (arXiv:2206.04658v1 [cs.SD])
    Despite recent progress in generative adversarial network (GAN)-based vocoders, where the model generates raw waveform conditioned on a mel spectrogram, it is still challenging to synthesize high-fidelity audio for numerous speakers across varied recording environments. In this work, we present BigVGAN, a universal vocoder that generalizes well under various unseen conditions in a zero-shot setting. We introduce periodic nonlinearities and anti-aliased representation into the generator, which brings the desired inductive bias for waveform synthesis and significantly improves audio quality. Based on our improved generator and the state-of-the-art discriminators, we train our GAN vocoder at the largest scale up to 112M parameters, which is unprecedented in the literature. In particular, we identify and address the training instabilities specific to such scale, while maintaining high-fidelity output without over-regularization. Our BigVGAN achieves the state-of-the-art zero-shot performance for various out-of-distribution scenarios, including new speakers, novel languages, singing voices, music and instrumental audio in unseen (even noisy) recording environments. We will release our code and model at: https://github.com/NVIDIA/BigVGAN
    Pragmatically Learning from Pedagogical Demonstrations in Multi-Goal Environments. (arXiv:2206.04546v1 [cs.LG])
    Learning from demonstration methods usually leverage close to optimal demonstrations to accelerate training. By contrast, when demonstrating a task, human teachers deviate from optimal demonstrations and pedagogically modify their behavior by giving demonstrations that best disambiguate the goal they want to demonstrate. Analogously, human learners excel at pragmatically inferring the intent of the teacher, facilitating communication between the two agents. These mechanisms are critical in the few demonstrations regime, where inferring the goal is more difficult. In this paper, we implement pedagogy and pragmatism mechanisms by leveraging a Bayesian model of goal inference from demonstrations. We highlight the benefits of this model in multi-goal teacher-learner setups with two artificial agents that learn with goal-conditioned Reinforcement Learning. We show that combining a pedagogical teacher and a pragmatic learner results in faster learning and reduced goal ambiguity over standard learning from demonstrations, especially in the few demonstrations regime.
    Simple lessons from complex learning: what a neural network model learns about cosmic structure formation. (arXiv:2206.04573v1 [astro-ph.CO])
    We train a neural network model to predict the full phase space evolution of cosmological N-body simulations. Its success implies that the neural network model is accurately approximating the Green's function expansion that relates the initial conditions of the simulations to its outcome at later times in the deeply nonlinear regime. We test the accuracy of this approximation by assessing its performance on well understood simple cases that have either known exact solutions or well understood expansions. These scenarios include spherical configurations, isolated plane waves, and two interacting plane waves: initial conditions that are very different from the Gaussian random fields used for training. We find our model generalizes well to these well understood scenarios, demonstrating that the networks have inferred general physical principles and learned the nonlinear mode couplings from the complex, random Gaussian training data. These tests also provide a useful diagnostic for finding the model's strengths and weaknesses, and identifying strategies for model improvement. We also test the model on initial conditions that contain only transverse modes, a family of modes that differ not only in their phases but also in their evolution from the longitudinal growing modes used in the training set. When the network encounters these initial conditions that are orthogonal to the training set, the model fails completely. In addition to these simple configurations, we evaluate the model's predictions for the density, displacement, and momentum power spectra with standard initial conditions for N-body simulations. We compare these summary statistics against N-body results and an approximate, fast simulation method called COLA. Our model achieves percent level accuracy at nonlinear scales of $k\sim 1\ \mathrm{Mpc}^{-1}\, h$, representing a significant improvement over COLA.
    Bounding Training Data Reconstruction in Private (Deep) Learning. (arXiv:2201.12383v3 [cs.LG] UPDATED)
    Differential privacy is widely accepted as the de facto method for preventing data leakage in ML, and conventional wisdom suggests that it offers strong protection against privacy attacks. However, existing semantic guarantees for DP focus on membership inference, which may overestimate the adversary's capabilities and is not applicable when membership status itself is non-sensitive. In this paper, we derive the first semantic guarantees for DP mechanisms against training data reconstruction attacks under a formal threat model. We show that two distinct privacy accounting methods -- Renyi differential privacy and Fisher information leakage -- both offer strong semantic protection against data reconstruction attacks.
    Network insensitivity to parameter noise via adversarial regularization. (arXiv:2106.05009v3 [cs.LG] UPDATED)
    Neuromorphic neural network processors, in the form of compute-in-memory crossbar arrays of memristors, or in the form of subthreshold analog and mixed-signal ASICs, promise enormous advantages in compute density and energy efficiency for NN-based ML tasks. However, these technologies are prone to computational non-idealities, due to process variation and intrinsic device physics. This degrades the task performance of networks deployed to the processor, by introducing parameter noise into the deployed model. While it is possible to calibrate each device, or train networks individually for each processor, these approaches are expensive and impractical for commercial deployment. Alternative methods are therefore needed to train networks that are inherently robust against parameter variation, as a consequence of network architecture and parameters. We present a new adversarial network optimisation algorithm that attacks network parameters during training, and promotes robust performance during inference in the face of parameter variation. Our approach introduces a regularization term penalising the susceptibility of a network to weight perturbation. We compare against previous approaches for producing parameter insensitivity such as dropout, weight smoothing and introducing parameter noise during training. We show that our approach produces models that are more robust to targeted parameter variation, and equally robust to random parameter variation. Our approach finds minima in flatter locations in the weight-loss landscape compared with other approaches, highlighting that the networks found by our technique are less sensitive to parameter perturbation. Our work provides an approach to deploy neural network architectures to inference devices that suffer from computational non-idealities, with minimal loss of performance. ...
    The CLEAR Benchmark: Continual LEArning on Real-World Imagery. (arXiv:2201.06289v3 [cs.CV] UPDATED)
    Continual learning (CL) is widely regarded as a crucial challenge for lifelong AI. However, existing CL benchmarks, e.g. Permuted-MNIST and Split-CIFAR, make use of artificial temporal variation and do not align with or generalize to the real world. In this paper, we introduce CLEAR, the first continual image classification benchmark dataset with a natural temporal evolution of visual concepts in the real world that spans a decade (2004-2014). We build CLEAR from existing large-scale image collections (YFCC100M) through a novel and scalable low-cost approach to visio-linguistic dataset curation. Our pipeline makes use of pretrained vision-language models (e.g. CLIP) to interactively build labeled datasets, which are further validated with crowd-sourcing to remove errors and even inappropriate images (hidden in original YFCC100M). The major strength of CLEAR over prior CL benchmarks is the smooth temporal evolution of visual concepts with real-world imagery, including both high-quality labeled data along with abundant unlabeled samples per time period for continual semi-supervised learning. We find that a simple unsupervised pre-training step can already boost state-of-the-art CL algorithms that only utilize fully-supervised data. Our analysis also reveals that mainstream CL evaluation protocols that train and test on iid data artificially inflate the performance of CL systems. To address this, we propose novel "streaming" protocols for CL that always test on the (near) future. Interestingly, streaming protocols (a) can simplify dataset curation since today's testset can be repurposed for tomorrow's trainset and (b) can produce more generalizable models with more accurate estimates of performance since all labeled data from each time-period is used for both training and testing (unlike classic iid train-test splits).
    An FPGA-based Solution for Convolution Operation Acceleration. (arXiv:2206.04520v1 [cs.AR])
    Hardware-based acceleration is an extensive attempt to facilitate many computationally-intensive mathematics operations. This paper proposes an FPGA-based architecture to accelerate the convolution operation - a complex and expensive computing step that appears in many Convolutional Neural Network models. We target the design to the standard convolution operation, intending to launch the product as an edge-AI solution. The project's purpose is to produce an FPGA IP core that can process a convolutional layer at a time. System developers can deploy the IP core with various FPGA families by using Verilog HDL as the primary design language for the architecture. The experimental results show that our single computing core synthesized on a simple edge computing FPGA board can offer 0.224 GOPS. When the board is fully utilized, 4.48 GOPS can be achieved.
    Field Level Neural Network Emulator for Cosmological N-body Simulations. (arXiv:2206.04594v1 [astro-ph.CO])
    We build a field level emulator for cosmic structure formation that is accurate in the nonlinear regime. Our emulator consists of two convolutional neural networks trained to output the nonlinear displacements and velocities of N-body simulation particles based on their linear inputs. Cosmology dependence is encoded in the form of style parameters at each layer of the neural network, enabling the emulator to effectively interpolate the outcomes of structure formation between different flat $\Lambda$CDM cosmologies over a wide range of background matter densities. The neural network architecture makes the model differentiable by construction, providing a powerful tool for fast field level inference. We test the accuracy of our method by considering several summary statistics, including the density power spectrum with and without redshift space distortions, the displacement power spectrum, the momentum power spectrum, the density bispectrum, halo abundances, and halo profiles with and without redshift space distortions. We compare these statistics from our emulator with the full N-body results, the COLA method, and a fiducial neural network with no cosmological dependence. We find our emulator gives accurate results down to scales of $k \sim 1\ \mathrm{Mpc}^{-1}\, h$, representing a considerable improvement over both COLA and the fiducial neural network. We also demonstrate that our emulator generalizes well to initial conditions containing primordial non-Gaussianity, without the need for any additional style parameters or retraining.
    Spatial Entropy Regularization for Vision Transformers. (arXiv:2206.04636v1 [cs.CV])
    Recent work has shown that the attention maps of Vision Transformers (VTs), when trained with self-supervision, can contain a semantic segmentation structure which does not spontaneously emerge when training is supervised. In this paper, we explicitly encourage the emergence of this spatial clustering as a form of training regularization, this way including a self-supervised pretext task into the standard supervised learning. In more detail, we propose a VT regularization method based on a spatial formulation of the information entropy. By minimizing the proposed spatial entropy, we explicitly ask the VT to produce spatially ordered attention maps, this way including an object-based prior during training. Using extensive experiments, we show that the proposed regularization approach is beneficial with different training scenarios, datasets, downstream tasks and VT architectures. The code will be available upon acceptance.
    Push--Pull with Device Sampling. (arXiv:2206.04113v1 [math.OC])
    We consider decentralized optimization problems in which a number of agents collaborate to minimize the average of their local functions by exchanging over an underlying communication graph. Specifically, we place ourselves in an asynchronous model where only a random portion of nodes perform computation at each iteration, while the information exchange can be conducted between all the nodes and in an asymmetric fashion. For this setting, we propose an algorithm that combines gradient tracking and variance reduction over the entire network. This enables each node to track the average of the gradients of the objective functions. Our theoretical analysis shows that the algorithm converges linearly, when the local objective functions are strongly convex, under mild connectivity conditions on the expected mixing matrices. In particular, our result does not require the mixing matrices to be doubly stochastic. In the experiments, we investigate a broadcast mechanism that transmits information from computing nodes to their neighbors, and confirm the linear convergence of our method on both synthetic and real-world datasets.
    Depression Recognition using Remote Photoplethysmography from Facial Videos. (arXiv:2206.04399v1 [cs.CV])
    Depression is a mental illness that may be harmful to an individual's health. The detection of mental health disorders in the early stages and a precise diagnosis are critical to avoid social, physiological, or psychological side effects. This work analyzes physiological signals to observe if different depressive states have a noticeable impact on the blood volume pulse (BVP) and the heart rate variability (HRV) response. Although typically, HRV features are calculated from biosignals obtained with contact-based sensors such as wearables, we propose instead a novel scheme that directly extracts them from facial videos, just based on visual information, removing the need for any contact-based device. Our solution is based on a pipeline that is able to extract complete remote photoplethysmography signals (rPPG) in a fully unsupervised manner. We use these rPPG signals to calculate over 60 statistical, geometrical, and physiological features that are further used to train several machine learning regressors to recognize different levels of depression. Experiments on two benchmark datasets indicate that this approach offers comparable results to other audiovisual modalities based on voice or facial expression, potentially complementing them. In addition, the results achieved for the proposed method show promising and solid performance that outperforms hand-engineered methods and is comparable to deep learning-based approaches.
    Explaining Clinical Decision Support Systems in Medical Imaging using Cycle-Consistent Activation Maximization. (arXiv:2010.05759v3 [eess.IV] UPDATED)
    Clinical decision support using deep neural networks has become a topic of steadily growing interest. While recent work has repeatedly demonstrated that deep learning offers major advantages for medical image classification over traditional methods, clinicians are often hesitant to adopt the technology because its underlying decision-making process is considered to be opaque and difficult to comprehend. In recent years, this has been addressed by a variety of approaches that have successfully contributed to providing deeper insight. Most notably, additive feature attribution methods are able to propagate decisions back into the input space by creating a saliency map which allows the practitioner to "see what the network sees." However, the quality of the generated maps can become poor and the images noisy if only limited data is available - a typical scenario in clinical contexts. We propose a novel decision explanation scheme based on CycleGAN activation maximization which generates high-quality visualizations of classifier decisions even in smaller data sets. We conducted a user study in which we evaluated our method on the LIDC dataset for lung lesion malignancy classification, the BreastMNIST dataset for ultrasound image breast cancer detection, as well as two subsets of the CIFAR-10 dataset for RGB image object recognition. Within this user study, our method clearly outperformed existing approaches on the medical imaging datasets and ranked second in the natural image setting. With our approach we make a significant contribution towards a better understanding of clinical decision support systems based on deep neural networks and thus aim to foster overall clinical acceptance.
    Alternating Mirror Descent for Constrained Min-Max Games. (arXiv:2206.04160v1 [cs.GT])
    In this paper we study two-player bilinear zero-sum games with constrained strategy spaces. An instance of natural occurrences of such constraints is when mixed strategies are used, which correspond to a probability simplex constraint. We propose and analyze the alternating mirror descent algorithm, in which each player takes turns to take action following the mirror descent algorithm for constrained optimization. We interpret alternating mirror descent as an alternating discretization of a skew-gradient flow in the dual space, and use tools from convex optimization and modified energy function to establish an $O(K^{-2/3})$ bound on its average regret after $K$ iterations. This quantitatively verifies the algorithm's better behavior than the simultaneous version of mirror descent algorithm, which is known to diverge and yields an $O(K^{-1/2})$ average regret bound. In the special case of an unconstrained setting, our results recover the behavior of alternating gradient descent algorithm for zero-sum games which was studied in (Bailey et al., COLT 2020).
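    The alternating scheme described above can be sketched for the probability-simplex case. This is a minimal illustration, assuming the entropy mirror map (so each step is a multiplicative-weights update) and a payoff matrix A for the game min_x max_y x^T A y. The function names, step size eta, and iteration count are illustrative choices, not values from the paper.

```python
import math


def entropic_step(p, grad, eta):
    """Mirror descent step on the simplex with the entropy mirror map."""
    w = [pi * math.exp(-eta * gi) for pi, gi in zip(p, grad)]
    total = sum(w)
    return [wi / total for wi in w]


def alternating_mirror_descent(A, x, y, eta=0.1, iters=2000):
    """Alternating play for min_x max_y x^T A y over probability simplices.

    Returns the time-averaged strategies of both players.
    """
    n, m = len(A), len(A[0])
    avg_x, avg_y = [0.0] * n, [0.0] * m
    for _ in range(iters):
        # Player x descends on x^T A y using the current y.
        grad_x = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
        x = entropic_step(x, grad_x, eta)
        # Player y ascends using the *updated* x (this is the alternation).
        grad_y = [sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]
        y = entropic_step(y, [-g for g in grad_y], eta)
        avg_x = [a + xi / iters for a, xi in zip(avg_x, x)]
        avg_y = [a + yj / iters for a, yj in zip(avg_y, y)]
    return avg_x, avg_y
```

    A simultaneous variant would compute grad_y from the old x instead; the only difference between the two is which iterate the second player responds to.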
    Unlearning Protected User Attributes in Recommendations with Adversarial Training. (arXiv:2206.04500v1 [cs.IR])
    Collaborative filtering algorithms capture underlying consumption patterns, including the ones specific to particular demographics or protected information of users, e.g. gender, race, and location. These encoded biases can influence the decision of a recommendation system (RS) towards further separation of the contents provided to various demographic subgroups, and raise privacy concerns regarding the disclosure of users' protected attributes. In this work, we investigate the possibility and challenges of removing specific protected information of users from the learned interaction representations of a RS algorithm, while maintaining its effectiveness. Specifically, we incorporate adversarial training into the state-of-the-art MultVAE architecture, resulting in a novel model, Adversarial Variational Auto-Encoder with Multinomial Likelihood (Adv-MultVAE), which aims at removing the implicit information of protected attributes while preserving recommendation performance. We conduct experiments on the MovieLens-1M and LFM-2b-DemoBias datasets, and evaluate the effectiveness of the bias mitigation method based on the inability of external attackers in revealing the users' gender information from the model. Comparing with baseline MultVAE, the results show that Adv-MultVAE, with marginal deterioration in performance (w.r.t. NDCG and recall), largely mitigates inherent biases in the model on both datasets.
    ScatterSample: Diversified Label Sampling for Data Efficient Graph Neural Network Learning. (arXiv:2206.04255v1 [cs.LG])
    What target labels are most effective for graph neural network (GNN) training? In some applications where GNNs excel, such as drug design or fraud detection, labeling new instances is expensive. We develop a data-efficient active sampling framework, ScatterSample, to train GNNs under an active learning setting. ScatterSample employs a sampling module termed DiverseUncertainty to collect instances with large uncertainty from different regions of the sample space for labeling. To ensure diversification of the selected nodes, DiverseUncertainty clusters the high uncertainty nodes and selects the representative nodes from each cluster. Our ScatterSample algorithm is further supported by rigorous theoretical analysis demonstrating its advantage compared to standard active sampling methods that aim to simply maximize the uncertainty and not diversify the samples. In particular, we show that ScatterSample is able to efficiently reduce the model uncertainty over the whole sample space. Our experiments on five datasets show that ScatterSample significantly outperforms the other GNN active learning baselines; specifically, it reduces the sampling cost by up to 50% while achieving the same test accuracy.
    Data-Efficient Brain Connectome Analysis via Multi-Task Meta-Learning. (arXiv:2206.04486v1 [cs.LG])
    Brain networks characterize complex connectivities among brain regions as graph structures, which provide a powerful means to study brain connectomes. In recent years, graph neural networks have emerged as a prevalent paradigm of learning with structured data. However, most brain network datasets are limited in sample sizes due to the relatively high cost of data acquisition, which hinders the deep learning models from sufficient training. Inspired by meta-learning that learns new concepts fast with limited training examples, this paper studies data-efficient training strategies for analyzing brain connectomes in a cross-dataset setting. Specifically, we propose to meta-train the model on datasets of large sample sizes and transfer the knowledge to small datasets. In addition, we also explore two brain-network-oriented designs, including atlas transformation and adaptive task reweighing. Compared to other pre-training strategies, our meta-learning-based approach achieves higher and stabler performance, which demonstrates the effectiveness of our proposed solutions. The framework is also able to derive new insights regarding the similarities among datasets and diseases in a data-driven fashion.
    Local Spatiotemporal Representation Learning for Longitudinally-consistent Neuroimage Analysis. (arXiv:2206.04281v1 [cs.CV])
    Recent self-supervised advances in medical computer vision exploit global and local anatomical self-similarity for pretraining prior to downstream tasks such as segmentation. However, current methods assume i.i.d. image acquisition, which is invalid in clinical study designs where follow-up longitudinal scans track subject-specific temporal changes. Further, existing self-supervised methods for medically-relevant image-to-image architectures exploit only spatial or temporal self-similarity and only do so via a loss applied at a single image-scale, with naive multi-scale spatiotemporal extensions collapsing to degenerate solutions. To these ends, this paper makes two contributions: (1) It presents a local and multi-scale spatiotemporal representation learning method for image-to-image architectures trained on longitudinal images. It exploits the spatiotemporal self-similarity of learned multi-scale intra-subject features for pretraining and develops several feature-wise regularizations that avoid collapsed identity representations; (2) During finetuning, it proposes a surprisingly simple self-supervised segmentation consistency regularization to exploit intra-subject correlation. Benchmarked in the one-shot segmentation setting, the proposed framework outperforms both well-tuned randomly-initialized baselines and current self-supervised techniques designed for both i.i.d. and longitudinal datasets. These improvements are demonstrated across both longitudinal neurodegenerative adult MRI and developing infant brain MRI and yield both higher performance and longitudinal consistency.
    GSmooth: Certified Robustness against Semantic Transformations via Generalized Randomized Smoothing. (arXiv:2206.04310v1 [cs.LG])
    Certified defenses such as randomized smoothing have shown promise towards building reliable machine learning systems against $\ell_p$-norm bounded attacks. However, existing methods are insufficient or unable to provably defend against semantic transformations, especially those without closed-form expressions (such as defocus blur and pixelate), which are more common in practice and often unrestricted. To fill up this gap, we propose generalized randomized smoothing (GSmooth), a unified theoretical framework for certifying robustness against general semantic transformations via a novel dimension augmentation strategy. Under the GSmooth framework, we present a scalable algorithm that uses a surrogate image-to-image network to approximate the complex transformation. The surrogate model provides a powerful tool for studying the properties of semantic transformations and certifying robustness. Experimental results on several datasets demonstrate the effectiveness of our approach for robustness certification against multiple kinds of semantic transformations and corruptions, which is not achievable by the alternative baselines.
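    As background, the prediction step of standard randomized smoothing with additive Gaussian noise can be sketched as a Monte Carlo majority vote. This is not GSmooth itself and it computes no certificate. The function smoothed_predict, its parameters, and the toy base classifier in the usage note are illustrative assumptions.

```python
import random
from collections import Counter


def smoothed_predict(base_classifier, x, sigma, n_samples=200, seed=0):
    """Majority vote of base_classifier over Gaussian perturbations of x.

    Returns the winning label and its empirical vote fraction.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        # Perturb every coordinate with independent N(0, sigma^2) noise.
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[base_classifier(noisy)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_samples
```

    For example, with a toy classifier that returns 1 when the first coordinate is positive, calling smoothed_predict on the point [2.0, 0.0] with sigma = 0.5 returns label 1, because the noise almost never pushes the first coordinate below zero.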
    Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem. (arXiv:2206.04119v1 [q-bio.BM])
    Construction of a scaffold structure that supports a desired motif, conferring protein function, shows promise for the design of vaccines and enzymes. But a general solution to this motif-scaffolding problem remains open. Current machine-learning techniques for scaffold design are either limited to unrealistically small scaffolds (up to length 20) or struggle to produce multiple diverse scaffolds. We propose to learn a distribution over diverse and longer protein backbone structures via an E(3)-equivariant graph neural network. We develop SMCDiff to efficiently sample scaffolds from this distribution conditioned on a given motif; our algorithm is the first to theoretically guarantee conditional samples from a diffusion model in the large-compute limit. We evaluate our designed backbones by how well they align with AlphaFold2-predicted structures. We show that our method can (1) sample scaffolds up to 80 residues and (2) achieve structurally diverse scaffolds for a fixed motif.
    Unsupervised Dictionary Learning for Anomaly Detection. (arXiv:2003.00293v2 [cs.LG] CROSS LISTED)
    We investigate the possibilities of employing dictionary learning to address the requirements of most anomaly detection applications, such as absence of supervision, online formulations, and low false positive rates. We present new results of our recent semi-supervised online algorithm, TODDLeR, on an anti-money laundering application. We also introduce a novel unsupervised method of using the performance of the learning algorithm as an indication of the nature of the samples.
    Wireless for Machine Learning. (arXiv:2008.13492v3 [eess.SP] UPDATED)
    As data generation increasingly takes place on devices without a wired connection, machine learning (ML) related traffic will be ubiquitous in wireless networks. Many studies have shown that traditional wireless protocols are highly inefficient or unsustainable to support ML, which creates the need for new wireless communication methods. In this survey, we give an exhaustive review of the state-of-the-art wireless methods that are specifically designed to support ML services over distributed datasets. Currently, there are two clear themes within the literature, analog over-the-air computation and digital radio resource management optimized for ML. This survey gives a comprehensive introduction to these methods, reviews the most important works, highlights open problems, and discusses application scenarios.
    What is a Good Metric to Study Generalization of Minimax Learners?. (arXiv:2206.04502v1 [stat.ML])
    Minimax optimization has served as the backbone of many machine learning (ML) problems. Although the convergence behavior of optimization algorithms has been extensively studied in minimax settings, their generalization guarantees in the stochastic setting, i.e., how the solution trained on empirical data performs on the unseen testing data, have been relatively underexplored. A fundamental question remains elusive: What is a good metric to study generalization of minimax learners? In this paper, we aim to answer this question by first showing that primal risk, a universal metric to study generalization in minimization, fails in simple examples of minimax problems. Furthermore, another popular metric, the primal-dual risk, also fails to characterize the generalization behavior for minimax problems with nonconvexity, due to non-existence of saddle points. We thus propose a new metric to study generalization of minimax learners: the primal gap, to circumvent these issues. Next, we derive generalization bounds for the primal gap in nonconvex-concave settings. As byproducts of our analysis, we also solve two open questions: establishing generalization bounds for primal risk and primal-dual risk in the strong sense, i.e., without strong concavity or assuming that the maximization and expectation can be interchanged, while either of these assumptions was needed in the literature. Finally, we leverage this new metric to compare the generalization behavior of two popular algorithms -- gradient descent-ascent (GDA) and gradient descent-max (GDMax) in stochastic minimax optimization.
    Uncovering bias in the PlantVillage dataset. (arXiv:2206.04374v1 [cs.CV])
    We report our investigation on the use of the popular PlantVillage dataset for training deep learning based plant disease detection models. We trained a machine learning model using only 8 pixels from the PlantVillage image backgrounds. The model achieved 49.0% accuracy on the held-out test set, well above the random guessing accuracy of 2.6%. This result indicates that the PlantVillage dataset contains noise correlated with the labels and deep learning models can easily exploit this bias to make predictions. Possible approaches to alleviate this problem are discussed.
    Boosting Fast Adversarial Training with Learnable Adversarial Initialization. (arXiv:2110.05007v2 [cs.CV] UPDATED)
    Adversarial training (AT) has been demonstrated to be effective in improving model robustness by leveraging adversarial examples for training. However, most AT methods incur high time and computational costs because they calculate gradients at multiple steps when generating adversarial examples. To boost training efficiency, the fast gradient sign method (FGSM) is adopted in fast AT methods, calculating the gradient only once. Unfortunately, the resulting robustness is far from satisfactory. One reason may be the initialization scheme. Existing fast AT generally uses a random sample-agnostic initialization, which aids efficiency yet hinders further robustness improvement. Up to now, the initialization in fast AT has not been extensively explored. In this paper, we boost fast AT with a sample-dependent adversarial initialization, i.e., an output from a generative network conditioned on a benign image and its gradient information from the target network. As the generative network and the target network are optimized jointly in the training phase, the former can adaptively generate an effective initialization with respect to the latter, which motivates gradually improved robustness. Experimental evaluations on four benchmark databases demonstrate the superiority of our proposed method over state-of-the-art fast AT methods, as well as comparable robustness to advanced multi-step AT methods. The code is released at https://github.com/jiaxiaojunQAQ/FGSM-SDI.
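    To make the role of the initialization concrete, here is a minimal sketch of a single FGSM step with an optional starting perturbation. It uses a toy logistic-regression model, not the paper's networks or its learned generator. The names fgsm and logistic_loss, the init argument, and the step size equal to eps are illustrative assumptions.

```python
import math


def logistic_loss(x, y, w, b):
    """Binary cross-entropy of a logistic model on one example (y in {0, 1})."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def fgsm(x, y, w, b, eps, init=None):
    """One signed-gradient step from x + init, clipped to the L-inf eps-ball."""
    delta = list(init) if init is not None else [0.0] * len(x)
    start = [xi + di for xi, di in zip(x, delta)]
    z = sum(wi * si for wi, si in zip(w, start)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    grad = [(p - y) * wi for wi in w]          # d(loss)/d(input) for this model
    sign = [(g > 0) - (g < 0) for g in grad]   # elementwise sign in {-1, 0, 1}
    # Take the step, then clip the total perturbation back into [-eps, eps].
    delta = [max(-eps, min(eps, di + eps * si)) for di, si in zip(delta, sign)]
    return [xi + di for xi, di in zip(x, delta)]
```

    Passing init=None gives plain FGSM from the clean input; a random or sample-dependent starting perturbation is supplied through init, which is the slot a learned initialization would occupy.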
    Multi-Mask Self-Supervised Learning for Physics-Guided Neural Networks in Highly Accelerated MRI. (arXiv:2008.06029v2 [eess.IV] UPDATED)
    Self-supervised learning has shown great promise due to its capability to train deep learning MRI reconstruction methods without fully-sampled data. Current self-supervised learning methods for physics-guided reconstruction networks split acquired undersampled data into two disjoint sets, where one is used for data consistency (DC) in the unrolled network and the other to define the training loss. In this study, we propose an improved self-supervised learning strategy that more efficiently uses the acquired data to train a physics-guided reconstruction network without a database of fully-sampled data. The proposed multi-mask self-supervised learning via data undersampling (SSDU) applies a hold-out masking operation on acquired measurements to split them into multiple pairs of disjoint sets for each training sample, using one set of each pair for DC units and the other for defining the loss, thereby more efficiently using the undersampled data. Multi-mask SSDU is applied on fully-sampled 3D knee and prospectively undersampled 3D brain MRI datasets, for various acceleration rates and patterns, and compared to CG-SENSE and single-mask SSDU DL-MRI, as well as supervised DL-MRI when fully-sampled data is available. Results on knee MRI show that the proposed multi-mask SSDU outperforms SSDU and performs closely with supervised DL-MRI. A clinical reader study further ranks the multi-mask SSDU higher than supervised DL-MRI in terms of SNR and aliasing artifacts. Results on brain MRI show that multi-mask SSDU achieves better reconstruction quality compared to SSDU. The reader study demonstrates that multi-mask SSDU at R=8 significantly improves reconstruction compared to single-mask SSDU at R=8, as well as CG-SENSE at R=2.
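    The hold-out masking described above can be sketched as repeated random partitions of the acquired sample indices. This is a simplified illustration assuming uniform random selection. The function multi_mask_split and its parameters num_masks and loss_ratio are hypothetical, and the selection distribution used in the paper may differ.

```python
import random


def multi_mask_split(sampled, num_masks=4, loss_ratio=0.4, seed=0):
    """Split acquired indices into num_masks (dc_set, loss_set) pairs.

    Within each pair the two sets are disjoint and together cover `sampled`.
    """
    rng = random.Random(seed)
    sampled = sorted(sampled)
    n_loss = max(1, int(round(loss_ratio * len(sampled))))
    pairs = []
    for _ in range(num_masks):
        loss_set = set(rng.sample(sampled, n_loss))  # held out to define the loss
        dc_set = set(sampled) - loss_set             # used in data-consistency units
        pairs.append((dc_set, loss_set))
    return pairs
```

    Single-mask splitting corresponds to num_masks=1; drawing several masks lets every acquired point contribute to data consistency in some pairs and to the loss in others.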
    Community-Level Anomaly Detection for Anti-Money Laundering. (arXiv:1910.11313v1 [cs.LG] CROSS LISTED)
    Anomaly detection in networks often boils down to identifying an underlying graph structure on which the abnormal occurrence rests on. Financial fraud schemes are one such example, where more or less intricate schemes are employed in order to elude transaction security protocols. We investigate the problem of learning graph structure representations using adaptations of dictionary learning aimed at encoding connectivity patterns. In particular, we adapt dictionary learning strategies to the specificity of network topologies and propose new methods that impose Laplacian structure on the dictionaries themselves. In one adaption we focus on classifying topologies by working directly on the graph Laplacian and cast the learning problem to accommodate its 2D structure. We tackle the same problem by learning dictionaries which consist of vectorized atomic Laplacians, and provide a block coordinate descent scheme to solve the new dictionary learning formulation. Imposing Laplacian structure on the dictionaries is also proposed in an adaptation of the Single Block Orthogonal learning method. Results on synthetic graph datasets comprising different graph topologies confirm the potential of dictionaries to directly represent graph structure information.
    TAG: Toward Accurate Social Media Content Tagging with a Concept Graph. (arXiv:2110.06892v3 [cs.LG] UPDATED)
    Although conceptualization has been widely studied in semantics and knowledge representation, it is still challenging to find the most accurate concept phrases to characterize the main idea of a text snippet on the fast-growing social media. This is partly attributed to the fact that most knowledge bases contain general terms of the world, such as trees and cars, which do not have the defining power or are not interesting enough to social media app users. Another reason is that the intricacy of natural language allows the use of tense, negation and grammar to change the logic or emphasis of language, thus conveying completely different meanings. In this paper, we present TAG, a high-quality concept matching dataset consisting of 10,000 labeled pairs of fine-grained concepts and web-styled natural language sentences, mined from the open-domain social media. The concepts we consider represent the trending interests of online users. Associated with TAG is a concept graph of these fine-grained concepts and entities to provide the structural context information. We evaluate a wide range of popular neural text matching models as well as pre-trained language models on TAG, and point out their insufficiency to tag social media content with the most appropriate concept. We further propose a novel graph-graph matching method that demonstrates superior abstraction and generalization performance by better utilizing both the structural context in the concept graph and logic interactions between semantic units in the sentence via syntactic dependency parsing. We open-source both the TAG dataset and the proposed methods to facilitate further research.
    Privacy Leakage in Text Classification: A Data Extraction Approach. (arXiv:2206.04591v1 [cs.CL])
    Recent work has demonstrated the successful extraction of training data from generative language models. However, it is not evident whether such extraction is feasible in text classification models since the training objective is to predict the class label as opposed to next-word prediction. This poses an interesting challenge and raises an important question regarding the privacy of training data in text classification settings. Therefore, we study the potential privacy leakage in the text classification domain by investigating the problem of unintended memorization of training data that is not pertinent to the learning task. We propose an algorithm to extract missing tokens of a partial text by exploiting the likelihood of the class label provided by the model. We test the effectiveness of our algorithm by inserting canaries into the training set and attempting to extract tokens in these canaries post-training. In our experiments, we demonstrate that successful extraction is possible to some extent. This can also be used as an auditing strategy to assess any potential unauthorized use of personal data without consent.
    ADG-Pose: Automated Dataset Generation for Real-World Human Pose Estimation. (arXiv:2202.00753v2 [cs.CV] UPDATED)
    Recent advancements in computer vision have seen a rise in the prominence of applications using neural networks to understand human poses. However, while accuracy has been steadily increasing on State-of-the-Art datasets, these datasets often do not address the challenges seen in real-world applications. These challenges are dealing with people distant from the camera, people in crowds, and heavily occluded people. As a result, many real-world applications have trained on data that does not reflect the data present in deployment, leading to significant underperformance. This article presents ADG-Pose, a method for automatically generating datasets for real-world human pose estimation. These datasets can be customized to determine person distances, crowdedness, and occlusion distributions. Models trained with our method are able to perform in the presence of these challenges where those trained on other datasets fail. Using ADG-Pose, end-to-end accuracy for real-world skeleton-based action recognition sees a 20% increase on scenes with moderate distance and occlusion levels, and a 4X increase on distant scenes where other models failed to perform better than random.
    SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization. (arXiv:2205.07547v2 [cs.LG] UPDATED)
    One noted issue of vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a fraction of the full capacity of the codebook, also known as codebook collapse. We hypothesize that the training scheme of VQ-VAE, which involves some carefully designed heuristics, underlies this issue. In this paper, we propose a new training scheme that extends the standard VAE via novel stochastic dequantization and quantization, called stochastically quantized variational autoencoder (SQ-VAE). In SQ-VAE, we observe a trend that the quantization is stochastic at the initial stage of the training but gradually converges toward a deterministic quantization, which we call self-annealing. Our experiments show that SQ-VAE improves codebook utilization without using common heuristics. Furthermore, we empirically show that SQ-VAE is superior to VAE and VQ-VAE in vision- and speech-related tasks.
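    Codebook collapse can be made concrete with a small sketch of deterministic nearest-neighbour quantization and a utilization measure. This illustrates the standard vector-quantization lookup only, not the stochastic quantization of SQ-VAE. The function names and the toy data in the usage note are illustrative.

```python
def quantize(vectors, codebook):
    """Return the index of the nearest codeword (squared L2) for each vector."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return [min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
            for v in vectors]


def codebook_utilization(indices, codebook_size):
    """Fraction of codewords that are selected at least once."""
    return len(set(indices)) / codebook_size
```

    With a three-entry codebook in which one codeword lies far from all the data, only two entries are ever selected and the utilization is 2/3; the unused codeword is the kind of wasted capacity the abstract refers to.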
    Graph Attention Multi-Layer Perceptron. (arXiv:2206.04355v1 [cs.LG])
    Graph neural networks (GNNs) have achieved great success in many graph-based applications. However, the enormous size and high sparsity level of graphs hinder their applications under industrial scenarios. Although some scalable GNNs are proposed for large-scale graphs, they adopt a fixed $K$-hop neighborhood for each node, thus facing the over-smoothing issue when adopting large propagation depths for nodes within sparse regions. To tackle the above issue, we propose a new GNN architecture -- Graph Attention Multi-Layer Perceptron (GAMLP), which can capture the underlying correlations between different scales of graph knowledge. We have deployed GAMLP in Tencent with the Angel platform, and we further evaluate GAMLP on both real-world datasets and large-scale industrial datasets. Extensive experiments on these 14 graph datasets demonstrate that GAMLP achieves state-of-the-art performance while enjoying high scalability and efficiency. Specifically, it outperforms GAT by 1.3\% regarding predictive accuracy on our large-scale Tencent Video dataset while achieving up to $50\times$ training speedup. Besides, it ranks top-1 on both the leaderboards of the largest homogeneous and heterogeneous graph (i.e., ogbn-papers100M and ogbn-mag) of Open Graph Benchmark.
    Training Two-Layer ReLU Networks with Gradient Descent is Inconsistent. (arXiv:2002.04861v3 [stat.ML] UPDATED)
    We prove that two-layer (Leaky)ReLU networks initialized by e.g. the widely used method proposed by He et al. (2015) and trained using gradient descent on a least-squares loss are not universally consistent. Specifically, we describe a large class of one-dimensional data-generating distributions for which, with high probability, gradient descent only finds a bad local minimum of the optimization landscape, since it is unable to move the biases far away from their initialization at zero. It turns out that in these cases, the found network essentially performs linear regression even if the target function is non-linear. We further provide numerical evidence that this happens in practical situations, for some multi-dimensional distributions and that stochastic gradient descent exhibits similar behavior. We also provide empirical results on how the choice of initialization and optimizer can influence this behavior.
    Generalization and Robustness Implications in Object-Centric Learning. (arXiv:2107.00637v3 [cs.LG] UPDATED)
    The idea behind object-centric representation learning is that natural scenes can better be modeled as compositions of objects and their relations as opposed to distributed representations. This inductive bias can be injected into neural networks to potentially improve systematic generalization and performance of downstream tasks in scenes with multiple objects. In this paper, we train state-of-the-art unsupervised models on five common multi-object datasets and evaluate segmentation metrics and downstream object property prediction. In addition, we study generalization and robustness by investigating the settings where either a single object is out of distribution -- e.g., having an unseen color, texture, or shape -- or global properties of the scene are altered -- e.g., by occlusions, cropping, or increasing the number of objects. From our experimental study, we find object-centric representations to be useful for downstream tasks and generally robust to most distribution shifts affecting objects. However, when the distribution shift affects the input in a less structured manner, robustness in terms of segmentation and downstream task performance may vary significantly across models and distribution shifts.
    It's a super deal -- train recurrent network on noisy data and get smooth prediction free. (arXiv:2206.04215v1 [cs.LG])
    Recent research demonstrates that prediction of time series by predictive recurrent neural networks based on noisy input generates a smooth anticipated trajectory. We examine the influence of the noise component in both the training data sets and the input sequences on network prediction quality. We propose and discuss an explanation of the observed noise compression in the predictive process. We also discuss the importance of this property of recurrent networks in the neuroscience context for the evolution of living organisms.
    Gradient Obfuscation Gives a False Sense of Security in Federated Learning. (arXiv:2206.04055v1 [cs.CR])
    Federated learning has been proposed as a privacy-preserving machine learning framework that enables multiple clients to collaborate without sharing raw data. However, client privacy protection is not guaranteed by design in this framework. Prior work has shown that the gradient sharing strategies in federated learning can be vulnerable to data reconstruction attacks. In practice, though, clients may not transmit raw gradients considering the high communication cost or due to privacy enhancement requirements. Empirical studies have demonstrated that gradient obfuscation, including intentional obfuscation via gradient noise injection and unintentional obfuscation via gradient compression, can provide more privacy protection against reconstruction attacks. In this work, we present a new data reconstruction attack framework targeting the image classification task in federated learning. We show that commonly adopted gradient postprocessing procedures, such as gradient quantization, gradient sparsification, and gradient perturbation, may give a false sense of security in federated learning. Contrary to prior studies, we argue that privacy enhancement should not be treated as a byproduct of gradient compression. Additionally, we design a new method under the proposed framework to reconstruct the image at the semantic level. We quantify the semantic privacy leakage and compare it with conventional measures based on image similarity scores. Our comparisons challenge the image data leakage evaluation schemes in the literature. The results emphasize the importance of revisiting and redesigning the privacy protection mechanisms for client data in existing federated learning algorithms.
    Receding Horizon Inverse Reinforcement Learning. (arXiv:2206.04477v1 [cs.LG])
    Inverse reinforcement learning (IRL) seeks to infer a cost function that explains the underlying goals and preferences of expert demonstrations. This paper presents receding horizon inverse reinforcement learning (RHIRL), a new IRL algorithm for high-dimensional, noisy, continuous systems with black-box dynamic models. RHIRL addresses two key challenges of IRL: scalability and robustness. To handle high-dimensional continuous systems, RHIRL matches the induced optimal trajectories with expert demonstrations locally in a receding horizon manner and 'stitches' together the local solutions to learn the cost; it thereby avoids the 'curse of dimensionality'. This contrasts sharply with earlier algorithms that match with expert demonstrations globally over the entire high-dimensional state space. To be robust against imperfect expert demonstrations and system control noise, RHIRL learns a state-dependent cost function 'disentangled' from system dynamics under mild conditions. Experiments on benchmark tasks show that RHIRL outperforms several leading IRL algorithms in most instances. We also prove that the cumulative error of RHIRL grows linearly with the task duration.
    Enhancement of Healthcare Data Transmission using the Levenberg-Marquardt Algorithm. (arXiv:2206.04240v1 [cs.LG])
    In healthcare systems, patients use wearable devices for remote data collection and real-time monitoring of health data and health conditions. The adoption of wearables significantly increases the volume of data that is collected and transmitted. Because the devices run on small batteries, their charge can be depleted quickly by the high processing requirements of data collection and transmission. Given the importance attached to medical data, it is imperative that all transmitted data adhere to strict integrity and availability requirements. Reducing the volume of healthcare data and the frequency of transmission by means of an inference algorithm improves device battery life. The difficulty is that the two transmission metrics, accuracy and efficiency, trade off against each other: increasing accuracy reduces efficiency. This paper demonstrates that machine learning can be used to analyze complex health data metrics such as the accuracy and efficiency of data transmission, and that the Levenberg-Marquardt algorithm (LMA) can overcome the trade-off by transmitting fewer samples while maintaining accuracy. The algorithm is tested on a standard heart rate dataset to compare the metrics. The results show that the LMA performed best, with an efficiency of 3.33 times for the reduced sample size and an accuracy of 79.17%; accuracy was similar across the 7 sampling cases adopted for testing, while efficiency improved. Compared with existing methods, the proposed methods use machine learning to improve both metrics significantly, achieving high efficiency without sacrificing one metric for the other.
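    The abstract does not say how the Levenberg-Marquardt algorithm is applied to the heart rate data, so the following is only a generic sketch of the damped least-squares update that the algorithm is known for, solving (J^T J + lambda I) delta = J^T r at each step. The exponential model, the synthetic data, and all function names are assumptions chosen for illustration, not the authors' setup.

```python
import math


def model(x, a, b):
    """Hypothetical two-parameter model y = a * exp(b * x)."""
    return a * math.exp(b * x)


def cost(xs, ys, a, b):
    """Sum of squared residuals."""
    return sum((y - model(x, a, b)) ** 2 for x, y in zip(xs, ys))


def levenberg_marquardt(xs, ys, a, b, lam=1e-3, iters=100):
    """Fit (a, b) by damped Gauss-Newton steps with adaptive damping."""
    for _ in range(iters):
        # Accumulate the 2x2 matrix J^T J and the vector J^T r.
        jtj_aa = jtj_ab = jtj_bb = g_a = g_b = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(b * x)
            r = y - a * e              # residual
            ja, jb = e, a * x * e      # d(model)/da, d(model)/db
            jtj_aa += ja * ja
            jtj_ab += ja * jb
            jtj_bb += jb * jb
            g_a += ja * r
            g_b += jb * r
        # Solve (J^T J + lam * I) delta = J^T r with the 2x2 inverse formula.
        m_aa, m_bb = jtj_aa + lam, jtj_bb + lam
        det = m_aa * m_bb - jtj_ab * jtj_ab
        da = (m_bb * g_a - jtj_ab * g_b) / det
        db = (m_aa * g_b - jtj_ab * g_a) / det
        if cost(xs, ys, a + da, b + db) < cost(xs, ys, a, b):
            a, b = a + da, b + db      # accept: move toward Gauss-Newton
            lam /= 10.0
        else:
            lam *= 10.0                # reject: move toward gradient descent
    return a, b


# Synthetic, noiseless data generated from a = 2.0, b = -1.5 (assumed values).
xs = [0.2 * i for i in range(10)]
ys = [model(x, 2.0, -1.5) for x in xs]
a_fit, b_fit = levenberg_marquardt(xs, ys, a=1.0, b=0.0)
```

    The damping factor is what separates this from plain Gauss-Newton: a rejected step raises it and shortens the next step, while an accepted step lowers it.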
    An Optimization Method-Assisted Ensemble Deep Reinforcement Learning Algorithm to Solve Unit Commitment Problems. (arXiv:2206.04249v1 [eess.SY])
    Unit commitment (UC) is a fundamental problem in the day-ahead electricity market, and it is critical to solve UC problems efficiently. Mathematical optimization techniques like dynamic programming, Lagrangian relaxation, and mixed-integer quadratic programming (MIQP) are commonly adopted for UC problems. However, the calculation time of these methods increases exponentially with the number of generators and energy resources, which is still the main bottleneck in industry. Recent advances in artificial intelligence have demonstrated the capability of reinforcement learning (RL) to solve UC problems. Unfortunately, the existing research on solving UC problems with RL suffers from the curse of dimensionality when the size of UC problems grows. To deal with these problems, we propose an optimization method-assisted ensemble deep reinforcement learning algorithm, where UC problems are formulated as a Markov Decision Process (MDP) and solved by multi-step deep Q-learning in an ensemble framework. The proposed algorithm establishes a candidate action set by solving tailored optimization problems to ensure a relatively high performance and the satisfaction of operational constraints. Numerical studies on IEEE 118 and 300-bus systems show that our algorithm outperforms the baseline RL algorithm and MIQP. Furthermore, the proposed algorithm shows strong generalization capacity under unforeseen operational conditions.
    Pseudo-Poincaré: A Unification Framework for Euclidean and Hyperbolic Graph Neural Networks. (arXiv:2206.04285v1 [cs.LG])
    Hyperbolic neural networks have recently gained significant attention due to their promising results on several graph problems including node classification and link prediction. The primary reason for this success is the effectiveness of the hyperbolic space in capturing the inherent hierarchy of graph datasets. However, they are limited in generalization and scalability, and they perform worse on non-hierarchical datasets. In this paper, we take a completely orthogonal perspective for modeling hyperbolic networks. We use the Poincaré disk to model the hyperbolic geometry and also treat the disk itself as if it were a tangent space at the origin. This enables us to replace non-scalable Möbius gyrovector operations with a Euclidean approximation, thus simplifying the entire hyperbolic model to a Euclidean model cascaded with a hyperbolic normalization function. Our approach does not adhere to Möbius math, yet it still works in the Riemannian manifold, hence we call it the Pseudo-Poincaré framework. We applied our non-linear hyperbolic normalization to the current state-of-the-art homogeneous and multi-relational graph networks and demonstrate significant improvements in performance compared to both Euclidean and hyperbolic counterparts. The primary impact of this work lies in its ability to capture hierarchical features in the Euclidean space; it can thus replace hyperbolic networks without loss in performance metrics while simultaneously leveraging the strengths of Euclidean networks, such as interpretability and efficient execution of various model components.
    Unsupervised Knowledge Adaptation for Passenger Demand Forecasting. (arXiv:2206.04053v1 [cs.LG])
    Considering the multimodal nature of transport systems and potential cross-modal correlations, there is a growing trend of enhancing demand forecasting accuracy by learning from multimodal data. These multimodal forecasting models can improve accuracy but are less practical when different parts of multimodal datasets are owned by different institutions that cannot directly share data with one another. While such institutions may not be able to share their data directly, they may share forecasting models trained on their data, provided such models cannot be used to identify the exact information in their datasets. This study proposes an Unsupervised Knowledge Adaptation Demand Forecasting framework to forecast the demand of the target mode by utilizing a pre-trained model based on data of another mode, which does not require direct data sharing of the source mode. The proposed framework utilizes the potential shared patterns among multiple transport modes to improve forecasting performance while avoiding the direct sharing of data among different institutions. Specifically, a pre-trained forecasting model is first learned based on the data of a source mode, which can capture and memorize the source travel patterns. Then, the demand data of the target dataset is encoded into an individual knowledge part and a sharing knowledge part, from which travel patterns are extracted by an individual extraction network and a sharing extraction network, respectively. The unsupervised knowledge adaptation strategy is utilized to form the sharing features for further forecasting by making the pre-trained network and the sharing extraction network analogous. Our findings illustrate that unsupervised knowledge adaptation by sharing the pre-trained model with the target mode can improve the forecasting performance without depending on direct data sharing.
    A General Framework For Proving The Equivariant Strong Lottery Ticket Hypothesis. (arXiv:2206.04270v1 [cs.LG])
    The Strong Lottery Ticket Hypothesis (SLTH) stipulates the existence of a subnetwork within a sufficiently overparameterized (dense) neural network that -- when initialized randomly and without any training -- achieves the accuracy of a fully trained target network. Recent work by \citet{da2022proving} demonstrates that the SLTH can also be extended to translation equivariant networks -- i.e. CNNs -- with the same level of overparametrization as needed for SLTs in dense networks. However, modern neural networks are capable of incorporating more than just translation symmetry, and developing general equivariant architectures such as rotation and permutation has been a powerful design principle. In this paper, we generalize the SLTH to functions that preserve the action of the group $G$ -- i.e. $G$-equivariant network -- and prove, with high probability, that one can prune a randomly initialized overparametrized $G$-equivariant network to a $G$-equivariant subnetwork that approximates another fully trained $G$-equivariant network of fixed width and depth. We further prove that our prescribed overparametrization scheme is also optimal as a function of the error tolerance. We develop our theory for a large range of groups, including important ones such as subgroups of the Euclidean group $\text{E}(n)$ and subgroups of the symmetric group $G \leq \mathcal{S}_n$ -- allowing us to find SLTs for MLPs, CNNs, $\text{E}(2)$-steerable CNNs, and permutation equivariant networks as specific instantiations of our unified framework which completely extends prior work. Empirically, we verify our theory by pruning overparametrized $\text{E}(2)$-steerable CNNs and message passing GNNs to match the performance of trained target networks within a given error tolerance.
    N-ACT: An Interpretable Deep Learning Model for Automatic Cell Type and Salient Gene Identification. (arXiv:2206.04047v1 [q-bio.GN])
    Single-cell RNA sequencing (scRNAseq) is rapidly advancing our understanding of cellular composition within complex tissues and organisms. A major limitation in most scRNAseq analysis pipelines is the reliance on manual annotations to determine cell identities, which are time consuming, subjective, and require expertise. Given the surge in cell sequencing, supervised methods, especially deep learning models, have been developed for automatic cell type identification (ACTI), which achieve high accuracy and scalability. However, all existing deep learning frameworks for ACTI lack interpretability and are used as "black-box" models. We present N-ACT (Neural-Attention for Cell Type identification): the first-of-its-kind interpretable deep neural network for ACTI utilizing neural-attention to detect salient genes for use in cell-type identification. We compare N-ACT to conventional annotation methods on two previously manually annotated data sets, demonstrating that N-ACT accurately identifies marker genes and cell types in an unsupervised manner, while performing comparably on multiple data sets to the current state-of-the-art model in traditional supervised ACTI.
    What-Is and How-To for Fairness in Machine Learning: A Survey, Reflection, and Perspective. (arXiv:2206.04101v1 [cs.LG])
    Algorithmic fairness has attracted increasing attention in the machine learning community. Various definitions are proposed in the literature, but the differences and connections among them are not clearly addressed. In this paper, we review and reflect on various fairness notions previously proposed in machine learning literature, and make an attempt to draw connections to arguments in moral and political philosophy, especially theories of justice. We also consider fairness inquiries from a dynamic perspective, and further consider the long-term impact that is induced by current prediction and decision. In light of the differences in the characterized fairness, we present a flowchart that encompasses implicit assumptions and expected outcomes of different types of fairness inquiries on the data generating process, on the predicted outcome, and on the induced impact, respectively. This paper demonstrates the importance of matching the mission (which kind of fairness one would like to enforce) and the means (which spectrum of fairness analysis is of interest, what is the appropriate analyzing scheme) to fulfill the intended purpose.
    Hidden Markov Models with Momentum. (arXiv:2206.04057v1 [cs.LG])
    Momentum is a popular technique for improving convergence rates during gradient descent. In this research, we experiment with adding momentum to the Baum-Welch expectation-maximization algorithm for training Hidden Markov Models. We compare discrete Hidden Markov Models trained with and without momentum on English text and malware opcode data. The effectiveness of momentum is determined by measuring the changes in model score and classification accuracy due to momentum. Our extensive experiments indicate that adding momentum to Baum-Welch can reduce the number of iterations required for initial convergence during HMM training, particularly in cases where the model is slow to converge. However, momentum does not seem to improve the final model performance at a high number of iterations.
    Uplifting Bandits. (arXiv:2206.04091v1 [stat.ML])
    We introduce a multi-armed bandit model where the reward is a sum of multiple random variables, and each action only alters the distributions of some of them. After each action, the agent observes the realizations of all the variables. This model is motivated by marketing campaigns and recommender systems, where the variables represent outcomes on individual customers, such as clicks. We propose UCB-style algorithms that estimate the uplifts of the actions over a baseline. We study multiple variants of the problem, including when the baseline and affected variables are unknown, and prove sublinear regret bounds for all of these. We also provide lower bounds that justify the necessity of our modeling assumptions. Experiments on synthetic and real-world datasets show the benefit of methods that estimate the uplifts over policies that do not use this structure.
    On Transfer Learning in Functional Linear Regression. (arXiv:2206.04277v1 [stat.ML])
    This work studies the problem of transfer learning under the functional linear model framework, which aims to improve the fit of the target model by leveraging the knowledge from related source models. We measure the relatedness between target and source models using Reproducing Kernel Hilbert Spaces, allowing the type of knowledge being transferred to be interpreted by the structure of the spaces. Two algorithms are proposed: one transfers knowledge when the index of transferable sources is known, while the other one utilizes aggregation to achieve knowledge transfer without prior information about the sources. Furthermore, we establish the optimal convergence rates for excess risk, making the statistical gain via transfer learning mathematically provable. The effectiveness of the proposed algorithms is demonstrated on synthetic data as well as real financial data.
    Individually Fair Learning with One-Sided Feedback. (arXiv:2206.04475v1 [cs.LG])
    We consider an online learning problem with one-sided feedback, in which the learner is able to observe the true label only for positively predicted instances. On each round, $k$ instances arrive and receive classification outcomes according to a randomized policy deployed by the learner, whose goal is to maximize accuracy while deploying individually fair policies. We first extend the framework of Bechavod et al. (2020), which relies on the existence of a human fairness auditor for detecting fairness violations, to instead incorporate feedback from dynamically-selected panels of multiple, possibly inconsistent, auditors. We then construct an efficient reduction from our problem of online learning with one-sided feedback and a panel reporting fairness violations to the contextual combinatorial semi-bandit problem (Cesa-Bianchi & Lugosi, 2009; György et al., 2007). Finally, we show how to leverage the guarantees of two algorithms in the contextual combinatorial semi-bandit setting: Exp2 (Bubeck et al., 2012) and the oracle-efficient Context-Semi-Bandit-FTPL (Syrgkanis et al., 2016), to provide multi-criteria no-regret guarantees simultaneously for accuracy and fairness. Our results eliminate two potential sources of bias from prior work: the "hidden outcomes" that are not available to an algorithm operating in the full information setting, and human biases that might be present in any single human auditor, but can be mitigated by selecting a well-chosen panel.
    Ensembling Framework for Texture Extraction Techniques for Classification. (arXiv:2206.04158v1 [cs.CV])
    In the past few years, texture-based classification problems have proven their significance in many domains, from industrial inspection to health-related applications. New techniques and CNN-based architectures have been developed in recent years to solve texture-based classification problems. The limitation of these approaches is that none of them claims to be the best suited for all types of textures. Each technique has its advantage over a specific texture type. To address this issue, we propose a framework that combines existing techniques to extract texture features and achieves better results than any of them individually. The proposed framework works well on most texture types, and new techniques can also be added to it to achieve better results than existing ones. We also present SOTA results on the FMD and KTH datasets, obtained by combining three existing techniques using the proposed framework.
    Learning to Break Deep Perceptual Hashing: The Use Case NeuralHash. (arXiv:2111.06628v4 [cs.LG] UPDATED)
    Apple recently revealed its deep perceptual hashing system NeuralHash to detect child sexual abuse material (CSAM) on user devices before files are uploaded to its iCloud service. Public criticism quickly arose regarding the protection of user privacy and the system's reliability. In this paper, we present the first comprehensive empirical analysis of deep perceptual hashing based on NeuralHash. Specifically, we show that current deep perceptual hashing may not be robust. An adversary can manipulate the hash values by applying slight changes in images, either induced by gradient-based approaches or simply by performing standard image transformations, forcing or preventing hash collisions. Such attacks permit malicious actors easily to exploit the detection system: from hiding abusive material to framing innocent users, everything is possible. Moreover, using the hash values, inferences can still be made about the data stored on user devices. In our view, based on our results, deep perceptual hashing in its current form is generally not ready for robust client-side scanning and should not be used from a privacy perspective.
    Neural Prompt Search. (arXiv:2206.04673v1 [cs.CV])
    The size of vision models has grown exponentially over the last few years, especially after the emergence of Vision Transformer. This has motivated the development of parameter-efficient tuning methods, such as learning adapter layers or visual prompt tokens, which allow a tiny portion of model parameters to be trained whereas the vast majority obtained from pre-training are frozen. However, designing a proper tuning method is non-trivial: one might need to try out a lengthy list of design choices, not to mention that each downstream dataset often requires custom designs. In this paper, we view the existing parameter-efficient tuning methods as "prompt modules" and propose Neural prOmpt seArcH (NOAH), a novel approach that learns, for large vision models, the optimal design of prompt modules through a neural architecture search algorithm, specifically for each downstream dataset. By conducting extensive experiments on over 20 vision datasets, we demonstrate that NOAH (i) is superior to individual prompt modules, (ii) has a good few-shot learning ability, and (iii) is domain-generalizable. The code and models are available at https://github.com/Davidzhangyuanhan/NOAH.
    CCP: Correlated Clustering and Projection for Dimensionality Reduction. (arXiv:2206.04189v1 [stat.ML])
    Most dimensionality reduction methods employ frequency domain representations obtained from matrix diagonalization and may not be efficient for large datasets with relatively high intrinsic dimensions. To address this challenge, Correlated Clustering and Projection (CCP) offers a novel data domain strategy that does not need to solve any matrix. CCP partitions high-dimensional features into correlated clusters and then projects correlated features in each cluster into a one-dimensional representation based on sample correlations. Residue-Similarity (R-S) scores and indexes, the shape of data in Riemannian manifolds, and algebraic topology-based persistent Laplacian are introduced for visualization and analysis. Proposed methods are validated with benchmark datasets associated with various machine learning algorithms.
    Original or Translated? A Causal Analysis of the Impact of Translationese on Machine Translation Performance. (arXiv:2205.02293v3 [cs.CL] UPDATED)
    Human-translated text displays distinct features from naturally written text in the same language. This phenomenon, known as translationese, has been argued to confound machine translation (MT) evaluation. Yet, we find that existing work on translationese neglects some important factors and that its conclusions are mostly correlational rather than causal. In this work, we collect CausalMT, a dataset where the MT training data are also labeled with the human translation directions. We inspect two critical factors, the train-test direction match (whether the human translation directions in the training and test sets are aligned), and data-model direction match (whether the model learns in the same direction as the human translation direction in the dataset). We show that these two factors have a large causal effect on the MT performance, in addition to the test-model direction mismatch highlighted by existing work on the impact of translationese. In light of our findings, we provide a set of suggestions for MT training and evaluation. Our code and data are at https://github.com/EdisonNi-hku/CausalMT
    Benefits of Overparameterized Convolutional Residual Networks: Function Approximation under Smoothness Constraint. (arXiv:2206.04569v1 [stat.ML])
    Overparameterized neural networks enjoy great representation power on complex data, and more importantly yield sufficiently smooth output, which is crucial to their generalization and robustness. Most existing function approximation theories suggest that with sufficiently many parameters, neural networks can well approximate certain classes of functions in terms of the function value. The neural networks themselves, however, can be highly nonsmooth. To bridge this gap, we take convolutional residual networks (ConvResNets) as an example, and prove that large ConvResNets can not only approximate a target function in terms of function value, but also exhibit sufficient first-order smoothness. Moreover, we extend our theory to approximating functions supported on a low-dimensional manifold. Our theory partially justifies the benefits of using deep and wide networks in practice. Numerical experiments on adversarial robust image classification are provided to support our theory.  ( 2 min )
    Beyond Time-Average Convergence: Near-Optimal Uncoupled Online Learning via Clairvoyant Multiplicative Weights Update. (arXiv:2111.14737v3 [cs.GT] UPDATED)
    In this paper, we provide a novel and simple algorithm, Clairvoyant Multiplicative Weights Updates (CMWU), for regret minimization in general games. CMWU effectively corresponds to the standard MWU algorithm but where all agents, when updating their mixed strategies, use the payoff profiles based on tomorrow's behavior, i.e. the agents are clairvoyant. CMWU achieves constant regret of $\ln(m)/\eta$ in all normal-form games with $m$ actions and fixed step-sizes $\eta$. Although CMWU encodes in its definition a fixed point computation, which in principle could result in dynamics that are neither computationally efficient nor uncoupled, we show that both of these issues can be largely circumvented. Specifically, as long as the step-size $\eta$ is upper bounded by $\frac{1}{(n-1)V}$, where $n$ is the number of agents and $[0,V]$ is the payoff range, then the CMWU updates can be computed linearly fast via a contraction map. This implementation results in an uncoupled online learning dynamic that admits a $o(\log T)$-sparse sub-sequence where each agent experiences at most $O(nV\log m)$ regret. This implies that the CMWU dynamics converge with rate $O(nV \log m \cdot W(T) / T)$ to a Coarse Correlated Equilibrium, where $W(T)$ is the inverse of the function $g(t):=t\cdot 2^t$. The latter improves on the current state-of-the-art convergence rate of uncoupled online learning dynamics.
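    The clairvoyant update described above can be sketched for a two-player game as follows. Each round solves the fixed point "my new strategy is an MWU step against the opponent's new strategy" by repeated substitution, which the abstract states is a contraction when the step-size is at most $\frac{1}{(n-1)V}$. The payoff matrices, the step-size 0.25, the iteration count, and all function names are illustrative assumptions; this is a reading of the abstract, not the authors' implementation.

```python
import math


def mwu_update(prev, payoff, eta):
    """One multiplicative-weights step from strategy `prev` given payoffs."""
    weights = [p * math.exp(eta * u) for p, u in zip(prev, payoff)]
    total = sum(weights)
    return [w / total for w in weights]


def row_payoffs(A, y):
    """Expected payoff of each row action against column strategy y."""
    return [sum(a * q for a, q in zip(row, y)) for row in A]


def col_payoffs(B, x):
    """Expected payoff of each column action against row strategy x."""
    return [sum(B[i][j] * x[i] for i in range(len(x))) for j in range(len(B[0]))]


def clairvoyant_step(A, B, x_prev, y_prev, eta, iters=200):
    """Approximate the fixed point x = MWU(x_prev, A y), y = MWU(y_prev, B^T x)."""
    x, y = list(x_prev), list(y_prev)
    for _ in range(iters):
        # Both new strategies are computed from the previous iterate.
        x, y = (mwu_update(x_prev, row_payoffs(A, y), eta),
                mwu_update(y_prev, col_payoffs(B, x), eta))
    return x, y


def run_cmwu(A, B, eta, rounds):
    """Play `rounds` clairvoyant rounds starting from uniform strategies."""
    x = [1.0 / len(A)] * len(A)
    y = [1.0 / len(A[0])] * len(A[0])
    history = []
    for _ in range(rounds):
        x, y = clairvoyant_step(A, B, x, y, eta)
        history.append((x, y))
    return history


def row_regret(A, history):
    """Regret of the row player against the best fixed action in hindsight."""
    realized = sum(sum(p * u for p, u in zip(x, row_payoffs(A, y)))
                   for x, y in history)
    per_action = [sum(row_payoffs(A, y)[i] for _, y in history)
                  for i in range(len(A))]
    return max(per_action) - realized


# Hypothetical 3x3 game with payoffs in [0, 1], so n = 2 and V = 1.
A = [[0.5, 0.0, 1.0], [1.0, 0.5, 0.0], [0.2, 1.0, 0.5]]
B = [[1.0 - a for a in row] for row in A]
eta = 0.25
history = run_cmwu(A, B, eta, rounds=30)
regret = row_regret(A, history)
```

    In this sketch the row player's regret stays below $\ln(3)/\eta$ regardless of the number of rounds, which is the constant-regret property the abstract states.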
    Adversarial Noises Are Linearly Separable for (Nearly) Random Neural Networks. (arXiv:2206.04316v1 [cs.LG])
    Adversarial examples, which are usually generated for specific inputs with a specific model, are ubiquitous for neural networks. In this paper we unveil a surprising property of adversarial noises when they are put together, i.e., adversarial noises crafted by one-step gradient methods are linearly separable if equipped with the corresponding labels. We theoretically prove this property for a two-layer network with randomly initialized entries and the neural tangent kernel setup where the parameters are not far from initialization. The proof idea is to show the label information can be efficiently backpropagated to the input while keeping the linear separability. Our theory and experimental evidence further show that the linear classifier trained with the adversarial noises of the training data can well classify the adversarial noises of the test data, indicating that adversarial noises actually inject a distributional perturbation to the original data distribution. Furthermore, we empirically demonstrate that the adversarial noises may become less linearly separable when the above conditions are compromised while they are still much easier to classify than original features.  ( 2 min )
    Unveiling Transformers with LEGO: a synthetic reasoning task. (arXiv:2206.04301v1 [cs.LG])
    We propose a synthetic task, LEGO (Learning Equality and Group Operations), that encapsulates the problem of following a chain of reasoning, and we study how the transformer architecture learns this task. We pay special attention to data effects such as pretraining (on seemingly unrelated NLP tasks) and dataset composition (e.g., differing chain length at training and test time), as well as architectural variants such as weight-tied layers or adding convolutional components. We study how the trained models eventually succeed at the task, and in particular, we are able to understand (to some extent) some of the attention heads as well as how the information flows in the network. Based on these observations, we hypothesize that pretraining helps here merely because it provides a smart initialization, not because of some deep knowledge stored in the network. We also observe that in some data regimes the trained transformer finds "shortcut" solutions to follow the chain of reasoning, which impedes the model's ability to generalize to simple variants of the main task; moreover, we find that one can prevent such shortcuts with appropriate architecture modification or careful data preparation. Motivated by our findings, we begin to explore the task of learning to execute C programs, where a convolutional modification to transformers, namely adding convolutional structures in the key/query/value maps, shows an encouraging edge.  ( 2 min )
    MEDIC: A Multi-Task Learning Dataset for Disaster Image Classification. (arXiv:2108.12828v4 [cs.CV] UPDATED)
    Recent research in disaster informatics demonstrates a practical and important use case of artificial intelligence: saving human lives and reducing suffering during natural disasters based on social media content (text and images). While notable progress has been made using texts, research on exploiting the images remains relatively under-explored. To advance image-based approaches, we propose MEDIC (available at: https://crisisnlp.qcri.org/medic/index.html), which is the largest social media image classification dataset for humanitarian response, consisting of 71,198 images to address four different tasks in a multi-task learning setup. It is the first dataset of its kind to combine social media images, disaster response, and multi-task learning research. An important property of this dataset is its high potential to facilitate research on multi-task learning, which has recently received much interest from the machine learning community and has shown remarkable results in terms of memory, inference speed, performance, and generalization capability. Therefore, the proposed dataset is an important resource for advancing image-based disaster management and multi-task machine learning research. We experiment with different deep learning architectures and report promising results, which are above the majority baselines for all tasks. Along with the dataset, we also release all relevant scripts (https://github.com/firojalam/medic).
    ExpressivE: A Spatio-Functional Embedding For Knowledge Graph Completion. (arXiv:2206.04192v1 [cs.LG])
    Knowledge graphs are inherently incomplete. Therefore substantial research has been directed towards knowledge graph completion (KGC), i.e., predicting missing triples from the information represented in the knowledge graph (KG). Embedding models have yielded promising results for KGC, yet any current KGC embedding model is incapable of: (1) fully capturing vital inference patterns (e.g., composition), (2) capturing prominent logical rules jointly (e.g., hierarchy and composition), and (3) providing an intuitive interpretation of captured patterns. In this work, we propose ExpressivE, a fully expressive spatio-functional embedding model that solves all these challenges simultaneously. ExpressivE embeds pairs of entities as points and relations as hyper-parallelograms in the virtual triple space $\mathbb{R}^{2d}$. This model design allows ExpressivE not only to capture a rich set of inference patterns jointly but additionally to display any supported inference pattern through the spatial relation of hyper-parallelograms, offering an intuitive and consistent geometric interpretation of ExpressivE embeddings and their captured patterns. Experimental results on standard KGC benchmarks reveal that ExpressivE is competitive with state-of-the-art models and even significantly outperforms them on WN18RR.  ( 2 min )
    Neonatal EEG graded for severity of background abnormalities in hypoxic-ischaemic encephalopathy. (arXiv:2206.04420v1 [physics.med-ph])
    This report describes a set of neonatal electroencephalogram (EEG) recordings graded according to the severity of abnormalities in the background pattern. The dataset consists of 169 hours of multichannel EEG from 53 neonates recorded in a neonatal intensive care unit. All neonates received a diagnosis of hypoxic-ischaemic encephalopathy (HIE), the most common cause of brain injury in full term infants. For each neonate, multiple 1-hour epochs of good quality EEG were selected and then graded for background abnormalities. The grading system assesses EEG attributes such as amplitude and frequency, continuity, sleep-wake cycling, symmetry and synchrony, and abnormal waveforms. Background severity was then categorised into 4 grades: normal or mildly abnormal, moderately abnormal, severely abnormal, and inactive EEG. The data can be used as a reference set of multi-channel EEG for neonates with HIE, for EEG training purposes, or for developing and evaluating automated grading algorithms.  ( 2 min )
    Estimation in Rotationally Invariant Generalized Linear Models via Approximate Message Passing. (arXiv:2112.04330v2 [stat.ML] UPDATED)
    We consider the problem of signal estimation in generalized linear models defined via rotationally invariant design matrices. Since these matrices can have an arbitrary spectral distribution, this model is well suited for capturing complex correlation structures which often arise in applications. We propose a novel family of approximate message passing (AMP) algorithms for signal estimation, and rigorously characterize their performance in the high-dimensional limit via a state evolution recursion. Our rotationally invariant AMP has complexity of the same order as the existing AMP derived under the restrictive assumption of a Gaussian design; our algorithm also recovers this existing AMP as a special case. Numerical results showcase a performance close to Vector AMP (which is conjectured to be Bayes-optimal in some settings), but obtained with a much lower complexity, as the proposed algorithm does not require a computationally expensive singular value decomposition.
    Redundancy in Deep Linear Neural Networks. (arXiv:2206.04490v1 [cs.LG])
    Conventional wisdom states that deep linear neural networks benefit from expressiveness and optimization advantages over a single linear layer. This paper suggests that, in practice, the training process of deep linear fully-connected networks using conventional optimizers is convex in the same manner as a single linear fully-connected layer. This paper aims to explain this claim and demonstrate it. Even though convolutional networks are not aligned with this description, this work aims to attain a new conceptual understanding of fully-connected linear networks that might shed light on the possible constraints of convolutional settings and non-linear architectures.
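    As a small illustration of the representational side of this claim (not of the paper's training-dynamics argument), the sketch below checks that two stacked linear layers compute the same function as one linear layer whose weight matrix is their product. The weights `W1`, `W2` and the input `x` are made-up values chosen only for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def matvec(a, x):
    """Apply a matrix (list of rows) to a vector (list of floats)."""
    return [sum(row[k] * x[k] for k in range(len(x))) for row in a]


# Hypothetical weights of a bias-free linear network with shape 2 -> 3 -> 2.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]   # first layer, 3x2
W2 = [[0.5, 1.0, -2.0], [1.5, 0.0, 1.0]]     # second layer, 2x3

# The equivalent single layer is the matrix product W2 @ W1 (2x2).
W = matmul(W2, W1)

x = [0.25, -4.0]
deep_out = matvec(W2, matvec(W1, x))   # pass through both layers
single_out = matvec(W, x)              # pass through the collapsed layer
```

    Both outputs agree up to floating-point error, which is why any difference between deep and shallow linear networks has to come from how they are optimized rather than from what they can represent.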
    Trajectory-dependent Generalization Bounds for Deep Neural Networks via Fractional Brownian Motion. (arXiv:2206.04359v1 [cs.LG])
    Despite being tremendously overparameterized, it is appreciated that deep neural networks trained by stochastic gradient descent (SGD) generalize surprisingly well. Based on the Rademacher complexity of a pre-specified hypothesis set, different norm-based generalization bounds have been developed to explain this phenomenon. However, recent studies suggest these bounds might be problematic as they increase with the training set size, which is contrary to empirical evidence. In this study, we argue that the hypothesis set SGD explores is trajectory-dependent and thus may provide a tighter bound over its Rademacher complexity. To this end, we characterize the SGD recursion via a stochastic differential equation by assuming the incurred stochastic gradient noise follows the fractional Brownian motion. We then identify the Rademacher complexity in terms of the covering numbers and relate it to the Hausdorff dimension of the optimization trajectory. By invoking the hypothesis set stability, we derive a novel generalization bound for deep neural networks. Extensive experiments demonstrate that it predicts well the generalization gap over several common experimental interventions. We further show that the Hurst parameter of the fractional Brownian motion is more informative than existing generalization indicators such as the power-law index and the upper Blumenthal-Getoor index.
    Evaluating Aleatoric Uncertainty via Conditional Generative Models. (arXiv:2206.04287v1 [cs.LG])
    Aleatoric uncertainty quantification seeks distributional knowledge of random responses, which is important for reliability analysis and robustness improvement in machine learning applications. Previous research on aleatoric uncertainty estimation mainly targets closed-form conditional densities or variances, which requires strong restrictions on the data distribution or dimensionality. To overcome these restrictions, we study conditional generative models for aleatoric uncertainty estimation. We introduce two metrics to measure the discrepancy between two conditional distributions that suit these models. Both metrics can be easily and unbiasedly computed via Monte Carlo simulation of the conditional generative models, thus facilitating their evaluation and training. We demonstrate numerically how our metrics provide correct measurements of conditional distributional discrepancies and can be used to train conditional models competitive against existing benchmarks.  ( 2 min )
    Early Transferability of Adversarial Examples in Deep Neural Networks. (arXiv:2206.04472v1 [cs.LG])
    This paper will describe and analyze a new phenomenon that was not known before, which we call "Early Transferability". Its essence is that the adversarial perturbations transfer among different networks even at extremely early stages in their training. In fact, one can initialize two networks with two different independent choices of random weights and measure the angle between their adversarial perturbations after each step of the training. What we discovered was that these two adversarial directions started to align with each other already after the first few training steps (which typically use only a small fraction of the available training data), even though the accuracy of the two networks hadn't started to improve from their initial bad values due to the early stage of the training. The purpose of this paper is to present this phenomenon experimentally and propose plausible explanations for some of its properties.  ( 2 min )
    CFA: Coupled-hypersphere-based Feature Adaptation for Target-Oriented Anomaly Localization. (arXiv:2206.04325v1 [cs.CV])
    For a long time, anomaly localization has been widely used in industries. Previous studies focused on approximating the distribution of normal features without adaptation to a target dataset. However, since anomaly localization should precisely discriminate normal and abnormal features, the absence of adaptation may cause the normality of abnormal features to be overestimated. Thus, we propose Coupled-hypersphere-based Feature Adaptation (CFA), which accomplishes sophisticated anomaly localization using features adapted to the target dataset. CFA consists of (1) a learnable patch descriptor that learns and embeds target-oriented features and (2) a scalable memory bank independent of the size of the target dataset. CFA also adopts transfer learning to increase the normal feature density, so that abnormal features can be clearly distinguished, by applying the patch descriptor and memory bank to a pre-trained CNN. The proposed method outperforms the previous methods quantitatively and qualitatively. For example, it provides an AUROC score of 99.5% in anomaly detection and 98.5% in anomaly localization on the MVTec AD benchmark. In addition, this paper points out the negative effects of biased features of pre-trained CNNs and emphasizes the importance of the adaptation to the target dataset. The code is publicly available at https://github.com/sungwool/CFA_for_anomaly_localization.  ( 2 min )
    Xplique: A Deep Learning Explainability Toolbox. (arXiv:2206.04394v1 [cs.LG])
    Today's most advanced machine-learning models are hardly scrutable. The key challenge for explainability methods is to assist researchers in opening up these black boxes, by revealing the strategy that led to a given decision, by characterizing their internal states or by studying the underlying data representation. To address this challenge, we have developed Xplique: a software library for explainability which includes representative explainability methods as well as associated evaluation metrics. It interfaces with one of the most popular learning libraries, Tensorflow, as well as other libraries including PyTorch, scikit-learn and Theano. The code is licensed under the MIT license and is freely available at github.com/deel-ai/xplique.  ( 2 min )
    Value Memory Graph: A Graph-Structured World Model for Offline Reinforcement Learning. (arXiv:2206.04384v1 [cs.LG])
    World models in model-based reinforcement learning usually face unrealistic long-time-horizon prediction issues due to compounding errors as the prediction errors accumulate over timesteps. Recent works in graph-structured world models improve the long-horizon reasoning ability via building a graph to represent the environment, but they are designed in a goal-conditioned setting and cannot guide the agent to maximize episode returns in a traditional reinforcement learning setting without externally given target states. To overcome this limitation, we design a graph-structured world model in offline reinforcement learning by building a directed-graph-based Markov decision process (MDP) with rewards allocated to each directed edge as an abstraction of the original continuous environment. As our world model has small and finite state/action spaces compared to the original environment, value iteration can be easily applied here to estimate state values on the graph and figure out the best future. Unlike previous graph-structured world models that require externally provided targets, our world model, dubbed Value Memory Graph (VMG), can provide the desired targets with high values by itself. VMG can be used to guide low-level goal-conditioned policies that are trained via supervised learning to maximize episode returns. Experiments on the D4RL benchmark show that VMG can outperform state-of-the-art methods in several tasks where long horizon reasoning ability is crucial. Code will be made publicly available.  ( 2 min )
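    To make the value-iteration step concrete, here is a minimal sketch of value iteration on a small directed graph with a reward on each edge. It is not the authors' implementation: the graph `toy_graph`, the discount `gamma`, and the helper names are assumptions made for illustration, and transitions are taken to be deterministic.

```python
def value_iteration(edges, gamma=0.9, tol=1e-9, max_iters=10_000):
    """Estimate state values on a directed graph with edge rewards.

    edges maps each node to a list of (next_node, reward) pairs.
    Nodes without outgoing edges are treated as terminal (value 0).
    """
    nodes = set(edges)
    for outgoing in edges.values():
        for nxt, _ in outgoing:
            nodes.add(nxt)
    values = {node: 0.0 for node in nodes}

    for _ in range(max_iters):
        delta = 0.0
        for node in nodes:
            outgoing = edges.get(node, [])
            if not outgoing:
                continue  # terminal node keeps value 0
            # Bellman optimality backup over the outgoing edges.
            best = max(r + gamma * values[nxt] for nxt, r in outgoing)
            delta = max(delta, abs(best - values[node]))
            values[node] = best
        if delta < tol:
            break
    return values


def greedy_next(edges, values, node, gamma=0.9):
    """Pick the successor that maximizes reward plus discounted value."""
    return max(edges[node], key=lambda e: e[1] + gamma * values[e[0]])[0]


# Hypothetical four-node graph: the edge a->c pays off immediately,
# but a->b leads to the larger reward on b->d.
toy_graph = {
    "a": [("b", 0.0), ("c", 1.0)],
    "b": [("d", 5.0)],
    "c": [("d", 0.0)],
    "d": [],
}
values = value_iteration(toy_graph)
target = greedy_next(toy_graph, values, "a")
```

    In this toy graph the greedy successor of `a` is `b`, even though the edge to `c` has the higher immediate reward, which is the kind of high-value target a low-level goal-conditioned policy could then be asked to reach.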
    Learning to generate imaginary tasks for improving generalization in meta-learning. (arXiv:2206.04335v1 [cs.LG])
    The success of meta-learning on existing benchmarks is predicated on the assumption that the distribution of meta-training tasks covers meta-testing tasks. Frequent violation of the assumption in applications with either insufficient tasks or a very narrow meta-training task distribution leads to memorization or learner overfitting. Recent solutions have pursued augmentation of meta-training tasks, while it is still an open question to generate both correct and sufficiently imaginary tasks. In this paper, we seek an approach that up-samples meta-training tasks from the task representation via a task up-sampling network. Besides, the resulting approach named Adversarial Task Up-sampling (ATU) suffices to generate tasks that can maximally contribute to the latest meta-learner by maximizing an adversarial loss. On few-shot sine regression and image classification datasets, we empirically validate the marked improvement of ATU over state-of-the-art task augmentation strategies in the meta-testing performance and also the quality of up-sampled tasks.  ( 2 min )
    SDQ: Stochastic Differentiable Quantization with Mixed Precision. (arXiv:2206.04459v1 [cs.LG])
    In order to deploy deep models in a computationally efficient manner, model quantization approaches have been frequently used. In addition, as new hardware supports mixed bitwidth arithmetic operations, recent research on mixed precision quantization (MPQ) has begun to fully leverage the capacity of representation by searching optimized bitwidths for different layers and modules in a network. However, previous studies mainly search the MPQ strategy in a costly scheme using reinforcement learning, neural architecture search, etc., or simply utilize partial prior knowledge for bitwidth assignment, which might be biased and sub-optimal. In this work, we present a novel Stochastic Differentiable Quantization (SDQ) method that can automatically learn the MPQ strategy in a more flexible and globally-optimized space with smoother gradient approximation. In particular, Differentiable Bitwidth Parameters (DBPs) are employed as the probability factors in stochastic quantization between adjacent bitwidth choices. After the optimal MPQ strategy is acquired, we further train our network with entropy-aware bin regularization and knowledge distillation. We extensively evaluate our method for several networks on different hardware (GPUs and FPGA) and datasets. SDQ outperforms all state-of-the-art mixed or single precision quantization with a lower bitwidth and is even better than the full-precision counterparts across various ResNet and MobileNet families, demonstrating the effectiveness and superiority of our method.  ( 2 min )
    Discriminative and Generative Learning for Linear Estimation of Random Signals [Lecture Notes]. (arXiv:2206.04432v1 [eess.SP])
    Inference tasks in signal processing are often characterized by the availability of reliable statistical modeling with some missing instance-specific parameters. One conventional approach uses data to estimate these missing parameters and then infers based on the estimated model. Alternatively, data can also be leveraged to directly learn the inference mapping end-to-end. These approaches for combining partially-known statistical models and data in inference are related to the notions of generative and discriminative models used in the machine learning literature, typically considered in the context of classifiers. The goal of this lecture note is to introduce the concepts of generative and discriminative learning for inference with a partially-known statistical model. While machine learning systems often lack the interpretability of traditional signal processing methods, we focus on a simple setting where one can interpret and compare the approaches in a tractable manner that is accessible and relevant to signal processing readers. In particular, we exemplify the approaches for the task of Bayesian signal estimation in a jointly Gaussian setting with the mean-squared error (MSE) objective, i.e., a linear estimation setting.  ( 2 min )
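    The two approaches can be contrasted on a toy scalar problem. The sketch below is an illustration under assumed conditions, not the lecture note's own example: the signal is `x ~ N(0, 1)`, the observation is `y = h*x + w` with known noise variance and unknown gain `h`, and the estimator is linear, `x_hat = g*y`. The generative route estimates `h` and plugs it into the closed-form gain `h / (h^2 + noise_var)`; the discriminative route fits `g` directly by least squares.

```python
import random


def simulate(n, h, noise_var, seed=0):
    """Draw n pairs (x, y) with x ~ N(0, 1) and y = h*x + w, w ~ N(0, noise_var)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [h * x + rng.gauss(0.0, noise_var ** 0.5) for x in xs]
    return xs, ys


def generative_gain(xs, ys, noise_var):
    """Estimate the missing model parameter h, then use the model-based gain."""
    h_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return h_hat / (h_hat ** 2 + noise_var)


def discriminative_gain(xs, ys):
    """Learn the linear estimator directly by minimizing the empirical MSE."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(y * y for y in ys)


# Assumed ground truth for the demo: h = 2, noise variance 1.
true_h, noise_var = 2.0, 1.0
xs, ys = simulate(20_000, true_h, noise_var)

g_generative = generative_gain(xs, ys, noise_var)
g_discriminative = discriminative_gain(xs, ys)
g_optimal = true_h / (true_h ** 2 + noise_var)   # MMSE gain, 0.4 here
```

    With enough data both gains approach `g_optimal`. The generative estimator relies on the assumed noise variance being correct, while the discriminative one uses no model knowledge at all.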
    On the Generalization and Adaption Performance of Causal Models. (arXiv:2206.04620v1 [cs.LG])
    Learning models that offer robust out-of-distribution generalization and fast adaptation is a key challenge in modern machine learning. Modelling causal structure into neural networks holds the promise to accomplish robust zero and few-shot adaptation. Recent advances in differentiable causal discovery have proposed to factorize the data generating process into a set of modules, i.e. one module for the conditional distribution of every variable where only causal parents are used as predictors. Such a modular decomposition of knowledge enables adaptation to distribution shifts by only updating a subset of parameters. In this work, we systematically study the generalization and adaption performance of such modular neural causal models by comparing them to monolithic models and structured models where the set of predictors is not constrained to causal parents. Our analysis shows that the modular neural causal models outperform other models on both zero and few-shot adaptation in low data regimes and offer robust generalization. We also found that the effects are more significant for sparser graphs as compared to denser graphs.  ( 2 min )
    Towards Safe Reinforcement Learning via Constraining Conditional Value-at-Risk. (arXiv:2206.04436v1 [cs.LG])
    Though deep reinforcement learning (DRL) has obtained substantial success, it may encounter catastrophic failures due to the intrinsic uncertainty of both transition and observation. Most of the existing methods for safe reinforcement learning can only handle transition disturbance or observation disturbance since these two kinds of disturbance affect different parts of the agent; besides, the popular worst-case return may lead to overly pessimistic policies. To address these issues, we first theoretically prove that the performance degradation under transition disturbance and observation disturbance depends on a novel metric of Value Function Range (VFR), which corresponds to the gap in the value function between the best state and the worst state. Based on the analysis, we adopt conditional value-at-risk (CVaR) as an assessment of risk and propose a novel reinforcement learning algorithm of CVaR-Proximal-Policy-Optimization (CPPO) which formalizes the risk-sensitive constrained optimization problem by keeping its CVaR under a given threshold. Experimental results show that CPPO achieves a higher cumulative reward and is more robust against both observation and transition disturbances on a series of continuous control tasks in MuJoCo.  ( 2 min )
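    For readers unfamiliar with CVaR, the sketch below computes a simple empirical version from a batch of episode returns: the mean of the worst `alpha` fraction. This is one common sample estimator, not the CPPO algorithm itself; the function name, the tail convention (lower tail of returns), and the sample data are assumptions made for the example.

```python
import math


def empirical_cvar(returns, alpha=0.1):
    """Average of the worst ceil(alpha * n) returns (lower-tail CVaR)."""
    if not returns:
        raise ValueError("returns must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    # Round before ceil so products like 0.1 * 30 do not overshoot.
    k = max(1, math.ceil(round(alpha * len(returns), 9)))
    worst = sorted(returns)[:k]
    return sum(worst) / k


# Made-up episode returns with two catastrophic episodes.
episode_returns = [10.0, 12.0, -5.0, 8.0, 11.0, -20.0, 9.0, 13.0, 7.0, 10.0]

mean_return = empirical_cvar(episode_returns, alpha=1.0)  # plain average
cvar_20 = empirical_cvar(episode_returns, alpha=0.2)      # mean of the worst 20%
```

    Here the average return looks acceptable while the 20% CVaR is strongly negative, which is the gap a CVaR constraint is meant to expose: a policy can score well on average and still fail badly in its worst episodes.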
    A general approximation lower bound in $L^p$ norm, with applications to feed-forward neural networks. (arXiv:2206.04360v1 [cs.LG])
    We study the fundamental limits to the expressive power of neural networks. Given two sets $F$, $G$ of real-valued functions, we first prove a general lower bound on how well functions in $F$ can be approximated in $L^p(\mu)$ norm by functions in $G$, for any $p \geq 1$ and any probability measure $\mu$. The lower bound depends on the packing number of $F$, the range of $F$, and the fat-shattering dimension of $G$. We then instantiate this bound to the case where $G$ corresponds to a piecewise-polynomial feed-forward neural network, and describe in detail the application to two sets $F$: Hölder balls and multivariate monotonic functions. Besides matching (known or new) upper bounds up to log factors, our lower bounds shed some light on the similarities or differences between approximation in $L^p$ norm or in sup norm, solving an open question by DeVore et al. (2021). Our proof strategy differs from the sup norm case and uses a key probability result of Mendelson (2002).
    ReDAL: Region-based and Diversity-aware Active Learning for Point Cloud Semantic Segmentation. (arXiv:2107.11769v3 [cs.CV] UPDATED)
    Despite the success of deep learning on supervised point cloud semantic segmentation, obtaining large-scale point-by-point manual annotations is still a significant challenge. To reduce the huge annotation burden, we propose Region-based and Diversity-aware Active Learning (ReDAL), a general framework for many deep learning approaches, aiming to automatically select only informative and diverse sub-scene regions for label acquisition. Observing that only a small portion of annotated regions is sufficient for 3D scene understanding with deep learning, we use softmax entropy, color discontinuity, and structural complexity to measure the information of sub-scene regions. A diversity-aware selection algorithm is also developed to avoid redundant annotations resulting from selecting informative but similar regions in a querying batch. Extensive experiments show that our method significantly outperforms previous active learning strategies, and we achieve 90% of the performance of fully supervised learning, while less than 15% and 5% annotations are required on S3DIS and SemanticKITTI datasets, respectively. Our code is publicly available at https://github.com/tsunghan-wu/ReDAL.  ( 2 min )
    Convolutional Dictionary Learning by End-To-End Training of Iterative Neural Networks. (arXiv:2206.04447v1 [eess.IV])
    Sparsity-based methods have a long history in the field of signal processing and have been successfully applied to various image reconstruction problems. The involved sparsifying transformations or dictionaries are typically either pre-trained using a model which reflects the assumed properties of the signals or adaptively learned during the reconstruction - yielding so-called blind Compressed Sensing approaches. However, by doing so, the transforms are never explicitly trained in conjunction with the physical model which generates the signals. In addition, properly choosing the involved regularization parameters remains a challenging task. Another recently emerged training paradigm for regularization methods is to use iterative neural networks (INNs) - also known as unrolled networks - which contain the physical model. In this work, we construct an INN which can be used as a supervised and physics-informed online convolutional dictionary learning algorithm. We evaluated the proposed approach by applying it to a realistic large-scale dynamic MR reconstruction problem and compared it to several other recently published works. We show that the proposed INN improves over two conventional model-agnostic training methods and also yields competitive results compared to a deep INN. Further, it does not require choosing the regularization parameters and - in contrast to deep INNs - each network component is entirely interpretable.  ( 2 min )
    OptWedge: Cognitive Optimized Guidance toward Off-screen POIs. (arXiv:2206.04293v1 [cs.HC])
    Guiding users toward off-screen points of interest (POIs) is a practical way of providing additional information to users of small-screen devices, such as smart devices and head-mounted displays. Popular previous methods involve displaying a primitive figure referred to as Wedge on the screen, from which users estimate the off-screen POI at the invisible vertex. Because they utilize a cognitive process referred to as amodal completion, where users can imagine the entire figure even when a part of it is occluded, localization accuracy is influenced by bias and individual differences. To improve the accuracy, we propose to optimize the figure using a cognitive cost that accounts for these influences. We also design two types of optimizations with different parameters: unbiased OptWedge (UOW) and biased OptWedge (BOW). Experimental results indicate that OptWedge achieves more accurate guidance at close distances than the heuristic approach.
    Unsupervised Learning of the Total Variation Flow. (arXiv:2206.04406v1 [cs.CV])
    The total variation (TV) flow generates a scale-space representation of an image based on the TV functional. This gradient flow observes desirable features for images such as sharp edges and enables spectral, scale, and texture analysis. The standard numerical approach for TV flow requires solving multiple non-smooth optimisation problems. Even with state-of-the-art convex optimisation techniques, this is often prohibitively expensive and strongly motivates the use of alternative, faster approaches. Inspired by and extending the framework of physics-informed neural networks (PINNs), we propose the TVflowNET, a neural network approach to compute the solution of the TV flow given an initial image and a time instance. We significantly speed up the computation time by more than one order of magnitude and show that the TVflowNET approximates the TV flow solution with high fidelity. This is a preliminary report, more details are to follow.  ( 2 min )
    Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer. (arXiv:2206.04452v1 [cs.CV])
    Although autoregressive models have achieved promising results on image generation, their unidirectional generation process prevents the resultant images from fully reflecting global contexts. To address the issue, we propose an effective image generation framework of Draft-and-Revise with Contextual RQ-Transformer to consider global contexts during the generation process. As a generalized VQ-VAE, RQ-VAE first represents a high-resolution image as a sequence of discrete code stacks. After code stacks in the sequence are randomly masked, Contextual RQ-Transformer is trained to infill the masked code stacks based on the unmasked contexts of the image. Then, Contextual RQ-Transformer uses our two-phase decoding, Draft-and-Revise, and generates an image, while exploiting the global contexts of the image during the generation process. Specifically, in the draft phase, our model first focuses on generating diverse images despite rather low quality. Then, in the revise phase, the model iteratively improves the quality of images, while preserving the global contexts of generated images. In experiments, our method achieves state-of-the-art results on conditional image generation. We also validate that the Draft-and-Revise decoding can achieve high performance by effectively controlling the quality-diversity trade-off in image generation.  ( 2 min )
    Regret Analysis of Certainty Equivalence Policies in Continuous-Time Linear-Quadratic Systems. (arXiv:2206.04434v1 [cs.LG])
    This work studies theoretical performance guarantees of a ubiquitous reinforcement learning policy for controlling the canonical model of stochastic linear-quadratic system. We show that randomized certainty equivalent policy addresses the exploration-exploitation dilemma for minimizing quadratic costs in linear dynamical systems that evolve according to stochastic differential equations. More precisely, we establish square-root of time regret bounds, indicating that randomized certainty equivalent policy learns optimal control actions fast from a single state trajectory. Further, linear scaling of the regret with the number of parameters is shown. The presented analysis introduces novel and useful technical approaches, and sheds light on fundamental challenges of continuous-time reinforcement learning.  ( 2 min )
    Multi-class Classification with Fuzzy-feature Observations: Theory and Algorithms. (arXiv:2206.04311v1 [cs.LG])
    The theoretical analysis of multi-class classification has proved that the existing multi-class classification methods can train a classifier with high classification accuracy on the test set, when the instances are precise in the training and test sets with same distribution and enough instances can be collected in the training set. However, one limitation with multi-class classification has not been solved: how to improve the classification accuracy of multi-class classification problems when only imprecise observations are available. Hence, in this paper, we propose a novel framework to address a new realistic problem called multi-class classification with imprecise observations (MCIMO), where we need to train a classifier with fuzzy-feature observations. Firstly, we give the theoretical analysis of the MCIMO problem based on fuzzy Rademacher complexity. Then, two practical algorithms based on support vector machine and neural networks are constructed to solve the proposed new problem. Experiments on both synthetic and real-world datasets verify the rationality of our theoretical analysis and the efficacy of the proposed algorithms.
    PyDTS: A Python Package for Discrete-Time Survival (Regularized) Regression with Competing Risks. (arXiv:2204.05731v2 [stat.ML] UPDATED)
    Time-to-event analysis (survival analysis) is used when the outcome or the response of interest is the time until a pre-specified event occurs. Time-to-event data are sometimes discrete either because time itself is discrete or due to grouping of failure times into intervals or rounding off measurements. In addition, the failure of an individual could be one of several distinct failure types; known as competing risks (events) data. This work focuses on discrete-time regression with competing events. We emphasize the main difference between the continuous and discrete settings with competing events, develop a new estimation procedure, and present PyDTS, an open source Python package which implements our estimation procedure and other tools for discrete-time-survival analysis with competing risks.
    Meet You Halfway: Explaining Deep Learning Mysteries. (arXiv:2206.04463v1 [cs.LG])
    Deep neural networks perform exceptionally well on various learning tasks with state-of-the-art results. While these models are highly expressive and achieve impressively accurate solutions with excellent generalization abilities, they are susceptible to minor perturbations. Samples that suffer such perturbations are known as "adversarial examples". Even though deep learning is an extensively researched field, many questions about the nature of deep learning models remain unanswered. In this paper, we introduce a new conceptual framework accompanied by a formal description that aims to shed light on the network's behavior and interpret the behind-the-scenes of the learning process. Our framework provides an explanation for inherent questions concerning deep learning. In particular, we clarify: (1) Why do neural networks acquire generalization abilities? (2) Why do adversarial examples transfer between different models? We provide a comprehensive set of experiments that support this new framework, as well as its underlying theory.
    Exploring Predictive States via Cantor Embeddings and Wasserstein Distance. (arXiv:2206.04198v1 [cond-mat.stat-mech])
    Predictive states for stochastic processes are a nonparametric and interpretable construct with relevance across a multitude of modeling paradigms. Recent progress on the self-supervised reconstruction of predictive states from time-series data focused on the use of reproducing kernel Hilbert spaces. Here, we examine how Wasserstein distances may be used to detect predictive equivalences in symbolic data. We compute Wasserstein distances between distributions over sequences ("predictions"), using a finite-dimensional embedding of sequences based on the Cantor set for the underlying geometry. We show that exploratory data analysis using the resulting geometry via hierarchical clustering and dimension reduction provides insight into the temporal structure of processes ranging from the relatively simple (e.g., finite-state hidden Markov models) to the very complex (e.g., infinite-state indexed grammars).
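    The sketch below illustrates the general idea for binary sequences only and should not be read as the paper's exact construction. It assumes the classical ternary Cantor map, which sends a binary sequence to a point of the unit interval so that sequences agreeing on a longer prefix land closer together, and it uses the one-dimensional Wasserstein-1 distance between two equal-size samples, which reduces to the mean absolute difference of the sorted values. The function names and sample futures are hypothetical.

```python
def cantor_embed(bits):
    """Map a binary sequence b_0, b_1, ... to sum_i 2*b_i / 3**(i+1)."""
    return sum(2 * b / 3 ** (i + 1) for i, b in enumerate(bits))


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size samples on the line."""
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles equal-size samples")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Hypothetical samples of length-3 futures observed after two different pasts.
futures_after_past_a = [(0, 0, 1), (0, 1, 0)]
futures_after_past_b = [(1, 0, 0), (1, 1, 0)]

points_a = [cantor_embed(f) for f in futures_after_past_a]
points_b = [cantor_embed(f) for f in futures_after_past_b]

# A large distance suggests the two pasts predict different futures.
distance = wasserstein_1d(points_a, points_b)
```

    Pasts whose prediction distributions are at (near) zero distance from each other would be candidates for merging into the same predictive state, for example by hierarchical clustering on the pairwise distances.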
    On Gradient Descent Convergence beyond the Edge of Stability. (arXiv:2206.04172v1 [cs.LG])
    Gradient Descent (GD) is a powerful workhorse of modern machine learning thanks to its scalability and efficiency in high-dimensional spaces. Its ability to find local minimisers is only guaranteed for losses with Lipschitz gradients, where it can be seen as a 'bona-fide' discretisation of an underlying gradient flow. Yet, many ML setups involving overparametrised models do not fall into this problem class, which has motivated research beyond the so-called "Edge of Stability", where the step-size crosses the admissibility threshold inversely proportional to the Lipschitz constant above. Perhaps surprisingly, GD has been empirically observed to still converge regardless of local instability. In this work, we study a local condition for such an unstable convergence around a local minimum in a low dimensional setting. We then leverage these insights to establish global convergence of a two-layer single-neuron ReLU student network aligning with the teacher neuron with a large learning rate beyond the Edge of Stability under population loss. Moreover, while the difference of norms of the two layers is preserved by gradient flow, we show that GD above the edge of stability induces a balancing effect, leading to the same norms across the layers.
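    The admissibility threshold mentioned above is easiest to see on a one-dimensional quadratic. For f(x) = (L/2) x^2, a GD step multiplies x by (1 - step_size * L), so the iterates shrink exactly when step_size < 2/L. The sketch below demonstrates only this classical threshold, not the paper's ReLU student-teacher analysis; the curvature value and step sizes are arbitrary choices for the demo.

```python
def gd_on_quadratic(curvature, step_size, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2 and return the final x."""
    x = x0
    for _ in range(steps):
        grad = curvature * x          # f'(x) = curvature * x
        x -= step_size * grad         # x <- (1 - step_size * curvature) * x
    return x


L = 4.0                 # curvature, i.e. Lipschitz constant of the gradient
threshold = 2.0 / L     # classical stability threshold, 0.5 here

below = gd_on_quadratic(L, step_size=0.4)   # per-step factor -0.6: contracts
above = gd_on_quadratic(L, step_size=0.6)   # per-step factor -1.4: blows up
```

    On a pure quadratic, crossing the threshold always diverges, so any convergence beyond the Edge of Stability has to rely on the loss not being quadratic, which is the regime the abstract studies.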
    TreeFlow: Going beyond Tree-based Gaussian Probabilistic Regression. (arXiv:2206.04140v1 [cs.LG])
    Tree-based ensembles are known for their outstanding performance on classification and regression problems characterized by feature vectors represented by mixed-type variables from various ranges and domains. However, for regression problems, they are primarily designed to provide deterministic responses or to model the uncertainty of the output with a Gaussian distribution. In this work, we introduce TreeFlow, a tree-based approach that combines the benefits of using tree ensembles with the capability of modeling flexible probability distributions using normalizing flows. The main idea of the solution is to use a tree-based model as a feature extractor and combine it with a conditional variant of normalizing flow. Consequently, our approach is capable of modeling complex distributions for the regression outputs. We evaluate the proposed method on challenging regression benchmarks with varying volume, feature characteristics, and target dimensionality. We obtain SOTA results on datasets with non-Gaussian target distributions and competitive results on Gaussian ones compared to tree-based regression baselines.
    VN-Transformer: Rotation-Equivariant Attention for Vector Neurons. (arXiv:2206.04176v1 [cs.CV])
    Rotation equivariance is a desirable property in many practical applications such as motion forecasting and 3D perception, where it can offer benefits like sample efficiency, better generalization, and robustness to input perturbations. Vector Neurons (VN) is a recently developed framework offering a simple yet effective approach for deriving rotation-equivariant analogs of standard machine learning operations by extending one-dimensional scalar neurons to three-dimensional "vector neurons." We introduce a novel "VN-Transformer" architecture to address several shortcomings of the current VN models. Our contributions are: $(i)$ we derive a rotation-equivariant attention mechanism which eliminates the need for the heavy feature preprocessing required by the original Vector Neurons models; $(ii)$ we extend the VN framework to support non-spatial attributes, expanding the applicability of these models to real-world datasets; $(iii)$ we derive a rotation-equivariant mechanism for multi-scale reduction of point-cloud resolution, greatly speeding up inference and training; $(iv)$ we show that small tradeoffs in equivariance ($\epsilon$-approximate equivariance) can be used to obtain large improvements in numerical stability and training robustness on accelerated hardware, and we bound the propagation of equivariance violations in our models. Finally, we apply our VN-Transformer to 3D shape classification and motion forecasting with compelling results.  ( 2 min )
    Analytical Composition of Differential Privacy via the Edgeworth Accountant. (arXiv:2206.04236v1 [cs.CR])
    Many modern machine learning algorithms are composed of simple private algorithms; thus, an increasingly important problem is to efficiently compute the overall privacy loss under composition. In this study, we introduce the Edgeworth Accountant, an analytical approach to composing differential privacy guarantees of private algorithms. The Edgeworth Accountant starts by losslessly tracking the privacy loss under composition using the $f$-differential privacy framework, which allows us to express the privacy guarantees using privacy-loss log-likelihood ratios (PLLRs). As the name suggests, this accountant next uses the Edgeworth expansion to upper and lower bound the probability distribution of the sum of the PLLRs. Moreover, by relying on a technique for approximating complex distributions using simple ones, we demonstrate that the Edgeworth Accountant can be applied to the composition of any noise-addition mechanism. Owing to certain appealing features of the Edgeworth expansion, the $(\epsilon, \delta)$-differential privacy bounds offered by this accountant are non-asymptotic, with essentially no extra computational cost, as opposed to prior approaches, wherein the running times increase with the number of compositions. Finally, we demonstrate that our upper and lower $(\epsilon, \delta)$-differential privacy bounds are tight in federated analytics and certain regimes of training private deep learning models.  ( 2 min )
    Deep Hierarchical Planning from Pixels. (arXiv:2206.04114v1 [cs.AI])
    Intelligent agents need to select long sequences of actions to solve complex tasks. While humans easily break down tasks into subgoals and reach them through millions of muscle commands, current artificial intelligence is limited to tasks with horizons of a few hundred decisions, despite large compute budgets. Research on hierarchical reinforcement learning aims to overcome this limitation but has proven to be challenging: current methods rely on manually specified goal spaces or subtasks, and no general solution exists. We introduce Director, a practical method for learning hierarchical behaviors directly from pixels by planning inside the latent space of a learned world model. The high-level policy maximizes task and exploration rewards by selecting latent goals and the low-level policy learns to achieve the goals. Despite operating in latent space, the decisions are interpretable because the world model can decode goals into images for visualization. Director outperforms exploration methods on tasks with sparse rewards, including 3D maze traversal with a quadruped robot from an egocentric camera and proprioception, without access to the global position or top-down view that was used by prior work. Director also learns successful behaviors across a wide range of environments, including visual control, Atari games, and DMLab levels.  ( 2 min )
    CASS: Cross Architectural Self-Supervision for Medical Image Analysis. (arXiv:2206.04170v1 [cs.CV])
    Recent advances in Deep Learning and Computer Vision have alleviated many of the bottlenecks, allowing algorithms to be label-free with better performance. Specifically, Transformers provide a global perspective of the image, which Convolutional Neural Networks (CNN) lack by design. Here we present Cross Architectural Self-Supervision (CASS), a novel self-supervised learning approach which leverages Transformers and CNNs simultaneously, while also being computationally accessible to general practitioners via easily available cloud services. Compared to existing state-of-the-art self-supervised learning approaches, we empirically show that CASS-trained CNNs and Transformers gained an average of 8.5% with 100% labelled data, 7.3% with 10% labelled data, and 11.5% with 1% labelled data, across three diverse datasets. Notably, one of the employed datasets included histopathology slides of an autoimmune disease, a topic that is underrepresented in medical imaging and has minimal data. In addition, our findings reveal that CASS is twice as efficient as other state-of-the-art methods in terms of training time.  ( 2 min )
    Sample-Efficient Reinforcement Learning in the Presence of Exogenous Information. (arXiv:2206.04282v1 [cs.LG])
    In real-world reinforcement learning applications the learner's observation space is ubiquitously high-dimensional with both relevant and irrelevant information about the task at hand. Learning from high-dimensional observations has been the subject of extensive investigation in supervised learning and statistics (e.g., via sparsity), but analogous issues in reinforcement learning are not well understood, even in finite state/action (tabular) domains. We introduce a new problem setting for reinforcement learning, the Exogenous Markov Decision Process (ExoMDP), in which the state space admits an (unknown) factorization into a small controllable (or, endogenous) component and a large irrelevant (or, exogenous) component; the exogenous component is independent of the learner's actions, but evolves in an arbitrary, temporally correlated fashion. We provide a new algorithm, ExoRL, which learns a near-optimal policy with sample complexity polynomial in the size of the endogenous component and nearly independent of the size of the exogenous component, thereby offering a doubly-exponential improvement over off-the-shelf algorithms. Our results highlight for the first time that sample-efficient reinforcement learning is possible in the presence of exogenous information, and provide a simple, user-friendly benchmark for investigation going forward.  ( 2 min )
    Words are all you need? Capturing human sensory similarity with textual descriptors. (arXiv:2206.04105v1 [cs.CL])
    Recent advances in multimodal training use textual descriptions to significantly enhance machine understanding of images and videos. Yet, it remains unclear to what extent language can fully capture sensory experiences across different modalities. A well-established approach for characterizing sensory experiences relies on similarity judgments, namely, the degree to which people perceive two distinct stimuli as similar. We explore the relation between human similarity judgments and language in a series of large-scale behavioral studies ($N=1,823$ participants) across three modalities (images, audio, and video) and two types of text descriptors: simple word tags and free-text captions. In doing so, we introduce a novel adaptive pipeline for tag mining that is both efficient and domain-general. We show that our prediction pipeline based on text descriptors exhibits excellent performance, and we compare it against a comprehensive array of 611 baseline models based on vision-, audio-, and video-processing architectures. We further show that the degree to which textual descriptors and models predict human similarity varies across and within modalities. Taken together, these studies illustrate the value of integrating machine learning and cognitive science approaches to better understand the similarities and differences between human and machine representations. We present an interactive visualization at https://words-are-all-you-need.s3.amazonaws.com/index.html for exploring the similarity between stimuli as experienced by humans and different methods reported in the paper.  ( 2 min )
    Simplifying Polylogarithms with Machine Learning. (arXiv:2206.04115v1 [cs.LG])
    Polylogarithmic functions, such as the logarithm or dilogarithm, satisfy a number of algebraic identities. For the logarithm, all the identities follow from the product rule. For the dilogarithm and higher-weight classical polylogarithms, the identities can involve five functions or more. In many calculations relevant to particle physics, complicated combinations of polylogarithms often arise from Feynman integrals. Although the initial expressions resulting from the integration usually simplify, it is often difficult to know which identities to apply and in what order. To address this bottleneck, we explore to what extent machine learning methods can help. We consider both a reinforcement learning approach, where the identities are analogous to moves in a game, and a transformer network approach, where the problem is viewed analogously to a language-translation task. While both methods are effective, the transformer network appears more powerful and holds promise for practical use in symbolic manipulation tasks in mathematical physics.  ( 2 min )
    A Comprehensive Survey of Graph-based Deep Learning Approaches for Anomaly Detection in Complex Distributed Systems. (arXiv:2206.04149v1 [cs.LG])
    Anomaly detection is an important problem for complex distributed systems consisting of hardware and software components. A thorough understanding of the requirements and challenges of anomaly detection for such systems is pivotal to the security of a system, especially for real-world deployment. While there have been many diverse research areas and application domains that deal with the problem, few have attempted to provide an in-depth look at such systems. Most anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. In this survey, we explore the significant potential of graph-based algorithms to identify and mitigate different types of anomalies in complex distributed heterogeneous systems. Our main focus is to provide an in-depth look at graphs when applied on heterogeneous computing devices spread across complex distributed systems. This study analyzes, compares, and contrasts the state-of-the-art research articles in the field. First, we describe the characteristics of real-world distributed systems and the specific challenges of anomaly detection in such complex networks, such as data and evaluation, the nature of the anomalies, and real-world requirements. Later, we discuss why graphs can be leveraged in such systems and the benefits of utilizing graphs. Then we delve into the state-of-the-art approaches and highlight their strengths and weaknesses. Finally, we evaluate and compare these approaches and point out the areas for possible improvements.  ( 2 min )
    Likelihood-free Model Choice for Simulator-based Models with the Jensen-Shannon Divergence. (arXiv:2206.04110v1 [stat.ME])
    Choice of appropriate structure and parametric dimension of a model in the light of data has a rich history in statistical research, where the first seminal approaches were developed in the 1970s, such as Akaike's and Schwarz's model scoring criteria that were inspired by information theory and embodied the rationale called Occam's razor. After those pioneering works, model choice was quickly established as its own field of research, gaining considerable attention in both computer science and statistics. However, to date, there have been limited attempts to derive scoring criteria for simulator-based models lacking a likelihood expression. Bayes factors have been considered for such models, but arguments have been put forward both for and against their use and around issues related to their consistency. Here we use the asymptotic properties of the Jensen-Shannon divergence (JSD) to derive a consistent model scoring criterion for the likelihood-free setting called JSD-Razor. Relationships of JSD-Razor with established scoring criteria for the likelihood-based approach are analyzed and we demonstrate the favorable properties of our criterion using both synthetic and real modeling examples.  ( 2 min )
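For readers unfamiliar with the divergence named in the abstract above, the following is a minimal sketch of the standard Jensen-Shannon divergence between two discrete distributions. It is only the textbook definition, not the JSD-Razor criterion itself; the function names and the use of base-2 logarithms are assumptions made for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits for discrete distributions given as equal-length lists.

    Terms with p_i == 0 contribute nothing, following the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    # Hypothetical example distributions over three outcomes.
    print(jensen_shannon_divergence([0.7, 0.2, 0.1], [0.1, 0.2, 0.7]))
```

Unlike the Kullback-Leibler divergence, this quantity is symmetric and stays finite even when the two distributions have different supports, because each is compared against their mixture.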
  • Open

    Generative Flow Networks for Discrete Probabilistic Modeling. (arXiv:2202.01361v2 [cs.LG] UPDATED)
    We present energy-based generative flow networks (EB-GFN), a novel probabilistic modeling algorithm for high-dimensional discrete data. Building upon the theory of generative flow networks (GFlowNets), we model the generation process by a stochastic data construction policy and thus amortize expensive MCMC exploration into a fixed number of actions sampled from a GFlowNet. We show how GFlowNets can approximately perform large-block Gibbs sampling to mix between modes. We propose a framework to jointly train a GFlowNet with an energy function, so that the GFlowNet learns to sample from the energy distribution, while the energy learns with an approximate MLE objective with negative samples from the GFlowNet. We demonstrate EB-GFN's effectiveness on various probabilistic modeling tasks. Code is publicly available at https://github.com/zdhNarsil/EB_GFN.  ( 2 min )
    Contextual Information-Directed Sampling. (arXiv:2205.10895v2 [cs.LG] UPDATED)
    Information-directed sampling (IDS) has recently demonstrated its potential as a data-efficient reinforcement learning algorithm. However, it is still unclear what the right form of information ratio to optimize is when contextual information is available. We investigate the IDS design through two contextual bandit problems: contextual bandits with graph feedback and sparse linear contextual bandits. We provably demonstrate the advantage of contextual IDS over conditional IDS and emphasize the importance of considering the context distribution. The main message is that an intelligent agent should invest more in the actions that are beneficial for the future unseen contexts while the conditional IDS can be myopic. We further propose a computationally-efficient version of contextual IDS based on Actor-Critic and evaluate it empirically on a neural network contextual bandit.  ( 2 min )
    Estimation in Rotationally Invariant Generalized Linear Models via Approximate Message Passing. (arXiv:2112.04330v2 [stat.ML] UPDATED)
    We consider the problem of signal estimation in generalized linear models defined via rotationally invariant design matrices. Since these matrices can have an arbitrary spectral distribution, this model is well suited for capturing complex correlation structures which often arise in applications. We propose a novel family of approximate message passing (AMP) algorithms for signal estimation, and rigorously characterize their performance in the high-dimensional limit via a state evolution recursion. Our rotationally invariant AMP has complexity of the same order as the existing AMP derived under the restrictive assumption of a Gaussian design; our algorithm also recovers this existing AMP as a special case. Numerical results showcase a performance close to Vector AMP (which is conjectured to be Bayes-optimal in some settings), but obtained with a much lower complexity, as the proposed algorithm does not require a computationally expensive singular value decomposition.  ( 2 min )
    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. (arXiv:2206.04615v1 [cs.CL])
    Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
    Globally Optimal Algorithms for Fixed-Budget Best Arm Identification. (arXiv:2206.04646v1 [stat.ML])
    We consider the fixed-budget best arm identification problem where the goal is to find the arm of the largest mean with a fixed number of samples. It is known that the probability of misidentifying the best arm is exponentially small in the number of rounds. However, limited characterizations have been discussed on the rate (exponent) of this value. In this paper, we characterize the optimal rate as a result of global optimization over all possible parameters. We introduce two rates, $R^{\mathrm{go}}$ and $R^{\mathrm{go}}_{\infty}$, corresponding to lower bounds on the misidentification probability, each of which is associated with a proposed algorithm. The rate $R^{\mathrm{go}}$ is associated with $R^{\mathrm{go}}$-tracking, which can be efficiently implemented by a neural network and is shown to outperform existing algorithms. However, this rate requires a nontrivial condition to be achievable. To deal with this issue, we introduce the second rate $R^{\mathrm{go}}_\infty$. We show that this rate is indeed achievable by introducing a conceptual algorithm called delayed optimal tracking (DOT).
    Explicit Regularization in Overparametrized Models via Noise Injection. (arXiv:2206.04613v1 [cs.LG])
    Injecting noise within gradient descent has several desirable features. In this paper, we explore noise injection before computing a gradient step, which is known to have smoothing and regularizing properties. We show that small perturbations induce explicit regularization for simple finite-dimensional models based on the l1-norm, group l1-norms, or nuclear norms. When applied to overparametrized neural networks with large widths, we show that the same perturbations do not work due to variance explosion resulting from overparametrization. However, we also show that independent layer-wise perturbations allow us to avoid the exploding variance term, and explicit regularizers can then be obtained. We empirically show that the small perturbations lead to better generalization performance than vanilla (stochastic) gradient descent training, with minor adjustments to the training procedure.
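The phrase "noise injection before computing a gradient step" can be illustrated with a small sketch. The version below assumes the gradient is evaluated at a Gaussian-perturbed copy of the parameters while the update is applied to the unperturbed parameters; the function name, the isotropic Gaussian noise, and the fixed step size are illustrative assumptions, not details taken from the abstract.

```python
import numpy as np


def noisy_gradient_descent(grad, w0, step, sigma, n_steps, seed=0):
    """Gradient descent where each gradient is taken at w + sigma * eps.

    grad   : callable returning the gradient of the loss at a parameter vector
    sigma  : noise scale (sigma = 0 recovers plain gradient descent)
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(n_steps):
        eps = rng.standard_normal(w.shape)   # fresh perturbation each step
        w = w - step * grad(w + sigma * eps)  # gradient at the perturbed point
    return w


if __name__ == "__main__":
    # Hypothetical toy loss (w - 3)^2 with gradient 2 * (w - 3).
    grad = lambda w: 2.0 * (w - 3.0)
    print(noisy_gradient_descent(grad, [0.0], step=0.1, sigma=0.1, n_steps=200))
```

On a quadratic loss like this toy example the perturbation only adds jitter around the minimizer; the regularization effects described in the abstract arise for models that are nonlinear in their parameters.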
    Learning Invariant Representations with Missing Data. (arXiv:2112.00881v2 [cs.LG] UPDATED)
    Spurious correlations allow flexible models to predict well during training but poorly on related test distributions. Recent work has shown that models that satisfy particular independencies involving correlation-inducing nuisance variables have guarantees on their test performance. Enforcing such independencies requires nuisances to be observed during training. However, nuisances, such as demographics or image background labels, are often missing. Enforcing independence on just the observed data does not imply independence on the entire population. Here we derive MMD estimators used for invariance objectives under missing nuisances. On simulations and clinical data, optimizing through these estimates achieves test performance similar to using estimators that make use of the full data.
    Hilbert Curve Projection Distance for Distribution Comparison. (arXiv:2205.15059v2 [cs.LG] UPDATED)
    Distribution comparison plays a central role in many machine learning tasks like data classification and generative modeling. In this study, we propose a novel metric, called Hilbert curve projection (HCP) distance, to measure the distance between two probability distributions with high robustness and low complexity. In particular, we first project two high-dimensional probability densities using a Hilbert curve to obtain a coupling between them, and then calculate the transport distance between these two densities in the original space, according to the coupling. We show that HCP distance is a proper metric and is well-defined for absolutely continuous probability measures. Furthermore, we demonstrate that the empirical HCP distance converges to its population counterpart at a rate of no more than $O(n^{-1/2d})$ under regularity conditions. To suppress the curse of dimensionality, we also develop two variants of the HCP distance using (learnable) subspace projections. Experiments on both synthetic and real-world data show that our HCP distance works as an effective surrogate of the Wasserstein distance with low complexity and overcomes the drawbacks of the sliced Wasserstein distance.
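The two-step idea in the abstract above (order points along a Hilbert curve to get a coupling, then measure transport cost in the original space) can be sketched for two equally sized 2-D samples in the unit square. This is a simplified empirical illustration under stated assumptions: the grid quantization, the rank-based pairing, and the names `hilbert_index` and `hcp_distance_2d` are all hypothetical and are not taken from the paper's implementation.

```python
import numpy as np


def hilbert_index(order, x, y):
    """Position of integer grid cell (x, y) along a Hilbert curve on a 2^order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def hcp_distance_2d(a, b, order=10, p=2):
    """Pair points of a and b by Hilbert-curve rank, then average ||a_i - b_i||^p."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape or a.ndim != 2 or a.shape[1] != 2:
        raise ValueError("expected two arrays of identical shape (n, 2)")
    n = 1 << order

    def curve_keys(points):
        cells = np.clip((points * n).astype(int), 0, n - 1)
        return np.array([hilbert_index(order, int(cx), int(cy)) for cx, cy in cells])

    ia = np.argsort(curve_keys(a), kind="stable")
    ib = np.argsort(curve_keys(b), kind="stable")
    # Cost is measured in the original 2-D space, not along the curve.
    costs = np.linalg.norm(a[ia] - b[ib], axis=1) ** p
    return float(np.mean(costs) ** (1.0 / p))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.random((500, 2)) * 0.5
    print(hcp_distance_2d(a, a + 0.5))
```

Because the pairing is just one admissible coupling, the value returned by this sketch can never fall below the optimal-transport (Wasserstein) cost between the same two samples.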
    Training Two-Layer ReLU Networks with Gradient Descent is Inconsistent. (arXiv:2002.04861v3 [stat.ML] UPDATED)
    We prove that two-layer (Leaky)ReLU networks initialized by e.g. the widely used method proposed by He et al. (2015) and trained using gradient descent on a least-squares loss are not universally consistent. Specifically, we describe a large class of one-dimensional data-generating distributions for which, with high probability, gradient descent only finds a bad local minimum of the optimization landscape, since it is unable to move the biases far away from their initialization at zero. It turns out that in these cases, the found network essentially performs linear regression even if the target function is non-linear. We further provide numerical evidence that this happens in practical situations, for some multi-dimensional distributions and that stochastic gradient descent exhibits similar behavior. We also provide empirical results on how the choice of initialization and optimizer can influence this behavior.
    Objective-Based Hierarchical Clustering of Deep Embedding Vectors. (arXiv:2012.08466v2 [cs.LG] UPDATED)
    We initiate a comprehensive experimental study of objective-based hierarchical clustering methods on massive datasets consisting of deep embedding vectors from computer vision and NLP applications. This includes a large variety of image embedding (ImageNet, ImageNetV2, NaBirds), word embedding (Twitter, Wikipedia), and sentence embedding (SST-2) vectors from several popular recent models (e.g. ResNet, ResNext, Inception V3, SBERT). Our study includes datasets with up to $4.5$ million entries with embedding dimensions up to $2048$. In order to address the challenge of scaling up hierarchical clustering to such large datasets we propose a new practical hierarchical clustering algorithm B++&C. It gives a 5%/20% improvement on average for the popular Moseley-Wang (MW) / Cohen-Addad et al. (CKMM) objectives (normalized) compared to a wide range of classic methods and recent heuristics. We also introduce a theoretical algorithm B2SAT&C which achieves a $0.74$-approximation for the CKMM objective in polynomial time. This is the first substantial improvement over the trivial $2/3$-approximation achieved by a random binary tree. Prior to this work, the best poly-time approximation of $\approx 2/3 + 0.0004$ was due to Charikar et al. (SODA'19).
    Cooperative learning for multi-view analysis. (arXiv:2112.12337v5 [stat.ME] UPDATED)
    We propose a new method for supervised learning with multiple sets of features ("views"). The multi-view problem is especially important in biology and medicine, where "-omics" data such as genomics, proteomics and radiomics are measured on a common set of samples. Cooperative learning combines the usual squared error loss of predictions with an "agreement" penalty to encourage the predictions from different data views to agree. By varying the weight of the agreement penalty, we get a continuum of solutions that include the well-known early and late fusion approaches. Cooperative learning chooses the degree of agreement (or fusion) in an adaptive manner, using a validation set or cross-validation to estimate test set prediction error. One version of our fitting procedure is modular, where one can choose different fitting mechanisms (e.g. lasso, random forests, boosting, neural networks) appropriate for different data views. In the setting of cooperative regularized linear regression, the method combines the lasso penalty with the agreement penalty, yielding feature sparsity. The method can be especially powerful when the different data views share some underlying relationship in their signals that can be exploited to boost the signals. We show that cooperative learning achieves higher predictive accuracy on simulated data and real multiomics examples of labor onset prediction and breast ductal carcinoma in situ and invasive breast cancer classification. Leveraging aligned signals and allowing flexible fitting mechanisms for different modalities, cooperative learning offers a powerful approach to multiomics data fusion.
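A minimal sketch of the idea for two views and plain (unpenalized) linear models follows. It assumes the objective is the squared error of the summed predictions plus `rho` times the squared disagreement between the two views' predictions, and solves it as one augmented least-squares problem; the function name and this particular formulation are illustrative assumptions, and the lasso penalty mentioned in the abstract is omitted.

```python
import numpy as np


def cooperative_least_squares(X, Z, y, rho):
    """Minimize ||y - X bx - Z bz||^2 + rho * ||X bx - Z bz||^2 over (bx, bz).

    The agreement penalty is folded into extra rows of an augmented design
    matrix, so an ordinary least-squares solver can be reused.
    """
    n, px = X.shape
    s = np.sqrt(rho)
    design = np.vstack([np.hstack([X, Z]),
                        np.hstack([-s * X, s * Z])])
    target = np.concatenate([y, np.zeros(n)])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[:px], coef[px:]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, Z = rng.standard_normal((50, 3)), rng.standard_normal((50, 2))
    y = X @ np.array([1.0, 0.0, -1.0]) + Z @ np.array([0.5, 2.0])
    for rho in (0.0, 1.0, 100.0):
        bx, bz = cooperative_least_squares(X, Z, y, rho)
        print(rho, np.linalg.norm(X @ bx - Z @ bz))
```

With `rho = 0` this reduces to ordinary least squares on the concatenated features (early fusion), and increasing `rho` pushes the two views' predictions toward each other.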
    Contrastive Regularization for Semi-Supervised Learning. (arXiv:2201.06247v2 [cs.LG] UPDATED)
    Consistency regularization on label predictions has become a fundamental technique in semi-supervised learning, but it still requires a large number of training iterations for high performance. In this study, we show that consistency regularization restricts the propagation of labeling information due to the exclusion of samples with unconfident pseudo-labels in the model updates. Then, we propose contrastive regularization to improve both the efficiency and accuracy of consistency regularization by using well-clustered features of unlabeled data. Specifically, after strongly augmented samples are assigned to clusters by their pseudo-labels, our contrastive regularization updates the model so that the features with confident pseudo-labels aggregate the features in the same cluster, while pushing away features in different clusters. As a result, the information of confident pseudo-labels can be effectively propagated into more unlabeled samples during training by the well-clustered features. On benchmarks of semi-supervised learning tasks, our contrastive regularization improves the previous consistency-based methods and achieves state-of-the-art results, especially with fewer training iterations. Our method also shows robust performance on open-set semi-supervised learning where unlabeled data includes out-of-distribution samples.
    Time Delay Estimation of Traffic Congestion Propagation based on Transfer Entropy. (arXiv:2108.06717v2 [stat.ML] UPDATED)
    Considering how congestion will propagate in the near future, understanding traffic congestion propagation has become crucial in GPS navigation systems for providing users with a more accurate estimated time of arrival (ETA). However, providing the exact ETA during congestion is a challenge owing to the complex propagation process between roads and high uncertainty regarding the future behavior of the process. Recent studies have focused on finding frequent congestion propagation patterns and determining the propagation probabilities. By contrast, this study proposes a novel time delay estimation method for traffic congestion propagation between roads using lag-specific transfer entropy (TE). Nonlinear normalization with a sliding window is used to effectively reveal the causal relationship between the source and target time series in calculating the TE. Moreover, Markov bootstrap techniques were adopted to quantify the uncertainty in the time delay estimator. To the best of our knowledge, the time delay estimation method presented in this article is the first to determine the time delay between roads for any congestion propagation pattern. The proposed method was validated using simulated data as well as real user trajectory data obtained from a major GPS navigation system applied in South Korea.
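To make the notion of lag-specific transfer entropy concrete, here is a minimal plug-in estimator for discrete-valued series, with the delay taken as the lag that maximizes it. The exact TE definition used here (conditioning on one past value of the target), the function names, and the argmax rule are assumptions for illustration; the nonlinear normalization and Markov bootstrap steps described in the abstract are not reproduced.

```python
import math
import random
from collections import Counter


def transfer_entropy(x, y, lag):
    """Plug-in estimate (in bits) of TE from x to y at a given lag.

    Uses sum p(y_t, y_{t-1}, x_{t-lag}) * log2[ p(y_t | y_{t-1}, x_{t-lag}) / p(y_t | y_{t-1}) ].
    """
    start = max(lag, 1)
    triples = [(y[t], y[t - 1], x[t - lag]) for t in range(start, len(y))]
    n = len(triples)
    c_full = Counter(triples)
    c_target = Counter((a, b) for a, b, _ in triples)   # (y_t, y_{t-1})
    c_cond = Counter((b, c) for _, b, c in triples)     # (y_{t-1}, x_{t-lag})
    c_past = Counter(b for _, b, _ in triples)          # y_{t-1}
    te = 0.0
    for (a, b, c), k in c_full.items():
        with_source = k / c_cond[(b, c)]
        without_source = c_target[(a, b)] / c_past[b]
        te += (k / n) * math.log2(with_source / without_source)
    return te


def estimate_delay(x, y, max_lag):
    """Return the lag in 1..max_lag with the largest estimated transfer entropy."""
    return max(range(1, max_lag + 1), key=lambda lag: transfer_entropy(x, y, lag))


if __name__ == "__main__":
    random.seed(0)
    x = [random.randint(0, 1) for _ in range(2000)]
    y = [0, 0, 0] + x[:-3]  # hypothetical target: the source delayed by 3 steps
    print(estimate_delay(x, y, max_lag=6))
```

Plug-in estimates like this are biased upward for short series, which is one reason an uncertainty quantification step such as bootstrapping is useful in practice.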
    Regret Bounds for Information-Directed Reinforcement Learning. (arXiv:2206.04640v1 [cs.LG])
    Information-directed sampling (IDS) has revealed its potential as a data-efficient algorithm for reinforcement learning (RL). However, theoretical understanding of IDS for Markov Decision Processes (MDPs) is still limited. We develop novel information-theoretic tools to bound the information ratio and cumulative information gain about the learning target. Our theoretical results shed light on the importance of choosing the learning target such that the practitioners can balance the computation and regret bounds. As a consequence, we derive prior-free Bayesian regret bounds for vanilla-IDS which learns the whole environment under tabular finite-horizon MDPs. In addition, we propose a computationally-efficient regularized-IDS that maximizes an additive form rather than the ratio form and show that it enjoys the same regret bound as vanilla-IDS. With the aid of rate-distortion theory, we improve the regret bound by learning a surrogate, less informative environment. Furthermore, we extend our analysis to linear MDPs and prove similar regret bounds for Thompson sampling as a by-product.
    DORA: Exploring outlier representations in Deep Neural Networks. (arXiv:2206.04530v1 [cs.LG])
    Deep Neural Networks (DNNs) draw their power from the representations they learn. In recent years, however, researchers have found that DNNs, while being incredibly effective in learning complex abstractions, also tend to be infected with artifacts, such as biases, Clever Hanses (CH), or Backdoors, due to spurious correlations inherent in the training data. So far, existing methods for uncovering such artifactual and malicious behavior in trained models focus on finding artifacts in the input data, which requires both the availability of a dataset and human intervention. In this paper, we introduce DORA (Data-agnOstic Representation Analysis): the first automatic data-agnostic method for the detection of potentially infected representations in Deep Neural Networks. We further show that contaminated representations found by DORA can be used to detect infected samples in any given dataset. We qualitatively and quantitatively evaluate the performance of our proposed method in both controlled toy scenarios and real-world settings, where we demonstrate the benefit of DORA in safety-critical applications.
    Automatic Debiased Machine Learning for Dynamic Treatment Effects and General Nested Functionals. (arXiv:2203.13887v3 [econ.EM] UPDATED)
    We extend the idea of automated debiased machine learning to the dynamic treatment regime and more generally to nested functionals. We show that the multiply robust formula for the dynamic treatment regime with discrete treatments can be re-stated in terms of a recursive Riesz representer characterization of nested mean regressions. We then apply a recursive Riesz representer estimation learning algorithm that estimates de-biasing corrections without the need to characterize what the correction terms look like, such as, for instance, products of inverse probability weighting terms, as is done in prior work on doubly robust estimation in the dynamic regime. Our approach defines a sequence of loss minimization problems, whose minimizers are the multipliers of the de-biasing correction, hence circumventing the need for solving auxiliary propensity models and directly optimizing for the mean squared error of the target de-biasing correction. We provide further applications of our approach to estimation of dynamic discrete choice models.
    Markovian Interference in Experiments. (arXiv:2206.02371v2 [cs.LG] UPDATED)
    We consider experiments in dynamical systems where interventions on some experimental units impact other units through a limiting constraint (such as a limited inventory). Despite outsize practical importance, the best estimators for this `Markovian' interference problem are largely heuristic in nature, and their bias is not well understood. We formalize the problem of inference in such experiments as one of policy evaluation. Off-policy estimators, while unbiased, apparently incur a large penalty in variance relative to state-of-the-art heuristics. We introduce an on-policy estimator: the Differences-In-Q's (DQ) estimator. We show that the DQ estimator can in general have exponentially smaller variance than off-policy evaluation. At the same time, its bias is second order in the impact of the intervention. This yields a striking bias-variance tradeoff so that the DQ estimator effectively dominates state-of-the-art alternatives. From a theoretical perspective, we introduce three separate novel techniques that are of independent interest in the theory of Reinforcement Learning (RL). Our empirical evaluation includes a set of experiments on a city-scale ride-hailing simulator.
    PyDTS: A Python Package for Discrete-Time Survival (Regularized) Regression with Competing Risks. (arXiv:2204.05731v2 [stat.ML] UPDATED)
    Time-to-event analysis (survival analysis) is used when the outcome or the response of interest is the time until a pre-specified event occurs. Time-to-event data are sometimes discrete, either because time itself is discrete or because failure times are grouped into intervals or measurements are rounded. In addition, the failure of an individual could be one of several distinct failure types, known as competing risks (events) data. This work focuses on discrete-time regression with competing events. We emphasize the main difference between the continuous and discrete settings with competing events, develop a new estimation procedure, and present PyDTS, an open source Python package which implements our estimation procedure and other tools for discrete-time survival analysis with competing risks.
    On the Generalization and Adaption Performance of Causal Models. (arXiv:2206.04620v1 [cs.LG])
    Learning models that offer robust out-of-distribution generalization and fast adaptation is a key challenge in modern machine learning. Modelling causal structure into neural networks holds the promise to accomplish robust zero and few-shot adaptation. Recent advances in differentiable causal discovery have proposed to factorize the data generating process into a set of modules, i.e. one module for the conditional distribution of every variable where only causal parents are used as predictors. Such a modular decomposition of knowledge enables adaptation to distribution shifts by only updating a subset of parameters. In this work, we systematically study the generalization and adaption performance of such modular neural causal models by comparing them to monolithic models and structured models where the set of predictors is not constrained to causal parents. Our analysis shows that the modular neural causal models outperform other models on both zero and few-shot adaptation in low data regimes and offer robust generalization. We also found that the effects are more significant for sparser graphs as compared to denser graphs.  ( 2 min )
    Vector Optimization with Stochastic Bandit Feedback. (arXiv:2110.12311v3 [cs.LG] UPDATED)
    We introduce vector optimization problems with stochastic bandit feedback, which extend the best arm identification problem to vector-valued rewards. We consider $K$ designs with multi-dimensional mean reward vectors, which are partially ordered according to a polyhedral ordering cone $C$. This generalizes the concept of the Pareto set in multi-objective optimization and allows different sets of preferences of decision-makers to be encoded by $C$. Unlike prior work, we define approximations of the Pareto set based on direction-free covering and gap notions. We study an ($\epsilon,\delta$)-PAC Pareto set identification problem where an evaluation of each design yields a noisy observation of the mean reward vector. In order to characterize the difficulty of learning the Pareto set, we introduce the concept of ordering complexity, i.e., geometric conditions on the deviations of empirical reward vectors from their mean under which the Pareto front can be approximated accurately. We show how to compute the ordering complexity of any polyhedral ordering cone. We provide gap-dependent and worst-case lower bounds on the sample complexity and show that in the worst case the sample complexity scales with the square of ordering complexity. Furthermore, we investigate the sample complexity of the naïve elimination algorithm and prove that it nearly matches the worst-case sample complexity. Finally, we run experiments to verify our theoretical results and illustrate how $C$ and sampling budget affect the Pareto set, the returned ($\epsilon,\delta$)-PAC Pareto set and the success of identification.  ( 2 min )
    Individually Fair Learning with One-Sided Feedback. (arXiv:2206.04475v1 [cs.LG])
    We consider an online learning problem with one-sided feedback, in which the learner is able to observe the true label only for positively predicted instances. On each round, $k$ instances arrive and receive classification outcomes according to a randomized policy deployed by the learner, whose goal is to maximize accuracy while deploying individually fair policies. We first extend the framework of Bechavod et al. (2020), which relies on the existence of a human fairness auditor for detecting fairness violations, to instead incorporate feedback from dynamically-selected panels of multiple, possibly inconsistent, auditors. We then construct an efficient reduction from our problem of online learning with one-sided feedback and a panel reporting fairness violations to the contextual combinatorial semi-bandit problem (Cesa-Bianchi & Lugosi, 2009, György et al., 2007). Finally, we show how to leverage the guarantees of two algorithms in the contextual combinatorial semi-bandit setting: Exp2 (Bubeck et al., 2012) and the oracle-efficient Context-Semi-Bandit-FTPL (Syrgkanis et al., 2016), to provide multi-criteria no regret guarantees simultaneously for accuracy and fairness. Our results eliminate two potential sources of bias from prior work: the "hidden outcomes" that are not available to an algorithm operating in the full information setting, and human biases that might be present in any single human auditor, but can be mitigated by selecting a well chosen panel.  ( 2 min )
    The Interpolation Phase Transition in Neural Networks: Memorization and Generalization under Lazy Training. (arXiv:2007.12826v3 [stat.ML] UPDATED)
    Modern neural networks are often operated in a strongly overparametrized regime: they comprise so many parameters that they can interpolate the training set, even if actual labels are replaced by purely random ones. Despite this, they achieve good prediction error on unseen data: interpolating the training set does not lead to a large generalization error. Further, overparametrization appears to be beneficial in that it simplifies the optimization landscape. Here we study these phenomena in the context of two-layer neural networks in the neural tangent (NT) regime. We consider a simple data model, with isotropic covariate vectors in $d$ dimensions, and $N$ hidden neurons. We assume that both the sample size $n$ and the dimension $d$ are large, and they are polynomially related. Our first main result is a characterization of the eigenstructure of the empirical NT kernel in the overparametrized regime $Nd\gg n$. This characterization implies as a corollary that the minimum eigenvalue of the empirical NT kernel is bounded away from zero as soon as $Nd\gg n$, and therefore the network can exactly interpolate arbitrary labels in the same regime. Our second main result is a characterization of the generalization error of NT ridge regression including, as a special case, min-$\ell_2$ norm interpolation. We prove that, as soon as $Nd\gg n$, the test error is well approximated by that of kernel ridge regression with respect to the infinite-width kernel. The latter is in turn well approximated by the error of polynomial ridge regression, whereby the regularization parameter is increased by a `self-induced' term related to the high-degree components of the activation function. The polynomial degree depends on the sample size and the dimension (in particular on $\log n/\log d$).  ( 3 min )
    Generalization and Robustness Implications in Object-Centric Learning. (arXiv:2107.00637v3 [cs.LG] UPDATED)
    The idea behind object-centric representation learning is that natural scenes can better be modeled as compositions of objects and their relations as opposed to distributed representations. This inductive bias can be injected into neural networks to potentially improve systematic generalization and performance of downstream tasks in scenes with multiple objects. In this paper, we train state-of-the-art unsupervised models on five common multi-object datasets and evaluate segmentation metrics and downstream object property prediction. In addition, we study generalization and robustness by investigating the settings where either a single object is out of distribution -- e.g., having an unseen color, texture, or shape -- or global properties of the scene are altered -- e.g., by occlusions, cropping, or increasing the number of objects. From our experimental study, we find object-centric representations to be useful for downstream tasks and generally robust to most distribution shifts affecting objects. However, when the distribution shift affects the input in a less structured manner, robustness in terms of segmentation and downstream task performance may vary significantly across models and distribution shifts.  ( 2 min )
    Benefits of Overparameterized Convolutional Residual Networks: Function Approximation under Smoothness Constraint. (arXiv:2206.04569v1 [stat.ML])
    Overparameterized neural networks enjoy great representation power on complex data, and more importantly yield sufficiently smooth output, which is crucial to their generalization and robustness. Most existing function approximation theories suggest that with sufficiently many parameters, neural networks can well approximate certain classes of functions in terms of the function value. The neural networks themselves, however, can be highly nonsmooth. To bridge this gap, we take convolutional residual networks (ConvResNets) as an example, and prove that large ConvResNets can not only approximate a target function in terms of function value, but also exhibit sufficient first-order smoothness. Moreover, we extend our theory to approximating functions supported on a low-dimensional manifold. Our theory partially justifies the benefits of using deep and wide networks in practice. Numerical experiments on adversarially robust image classification are provided to support our theory.  ( 2 min )
    Optimal SQ Lower Bounds for Robustly Learning Discrete Product Distributions and Ising Models. (arXiv:2206.04589v1 [cs.DS])
    We establish optimal Statistical Query (SQ) lower bounds for robustly learning certain families of discrete high-dimensional distributions. In particular, we show that no efficient SQ algorithm with access to an $\epsilon$-corrupted binary product distribution can learn its mean within $\ell_2$-error $o(\epsilon \sqrt{\log(1/\epsilon)})$. Similarly, we show that no efficient SQ algorithm with access to an $\epsilon$-corrupted ferromagnetic high-temperature Ising model can learn the model to total variation distance $o(\epsilon \log(1/\epsilon))$. Our SQ lower bounds match the error guarantees of known algorithms for these problems, providing evidence that current upper bounds for these tasks are best possible. At the technical level, we develop a generic SQ lower bound for discrete high-dimensional distributions starting from low dimensional moment matching constructions that we believe will find other applications. Additionally, we introduce new ideas to analyze these moment-matching constructions for discrete univariate distributions.  ( 2 min )
    On Margins and Generalisation for Voting Classifiers. (arXiv:2206.04607v1 [cs.LG])
    We study the generalisation properties of majority voting on finite ensembles of classifiers, proving margin-based generalisation bounds via the PAC-Bayes theory. These provide state-of-the-art guarantees on a number of classification tasks. Our central results leverage the Dirichlet posteriors studied recently by Zantedeschi et al. [2021] for training voting classifiers; in contrast to that work our bounds apply to non-randomised votes via the use of margins. Our contributions add perspective to the debate on the "margins theory" proposed by Schapire et al. [1998] for the generalisation of ensemble classifiers.  ( 2 min )
    A Spectral Representation of Kernel Stein Discrepancy with Application to Goodness-of-Fit Tests for Measures on Infinite Dimensional Hilbert Spaces. (arXiv:2206.04552v1 [math.ST])
    Kernel Stein discrepancy (KSD) is a widely used kernel-based non-parametric measure of discrepancy between probability measures. It is often employed in the scenario where a user has a collection of samples from a candidate probability measure and wishes to compare them against a specified target probability measure. A useful property of KSD is that it may be calculated with samples from only the candidate measure and without knowledge of the normalising constant of the target measure. KSD has been employed in a range of settings including goodness-of-fit testing, parametric inference, MCMC output assessment and generative modelling. Two main issues with current KSD methodology are (i) the lack of applicability beyond the finite dimensional Euclidean setting and (ii) a lack of clarity on what influences KSD performance. This paper provides a novel spectral representation of KSD which remedies both of these, making KSD applicable to Hilbert-valued data and revealing the impact of kernel and Stein operator choice on the KSD. We demonstrate the efficacy of the proposed methodology by performing goodness-of-fit tests for various Gaussian and non-Gaussian functional models in a number of synthetic data experiments.  ( 2 min )
    Overcoming the Spectral Bias of Neural Value Approximation. (arXiv:2206.04672v1 [cs.LG])
    Value approximation using deep neural networks is at the heart of off-policy deep reinforcement learning, and is often the primary module that provides learning signals to the rest of the algorithm. While multi-layer perceptron networks are universal function approximators, recent works in neural kernel regression suggest the presence of a spectral bias, where fitting high-frequency components of the value function requires exponentially more gradient update steps than the low-frequency ones. In this work, we re-examine off-policy reinforcement learning through the lens of kernel regression and propose to overcome such bias via a composite neural tangent kernel. With just a single line-change, our approach, Fourier feature networks (FFN), produces state-of-the-art performance on challenging continuous control domains with only a fraction of the compute. Faster convergence and better off-policy stability also make it possible to remove the target network without suffering catastrophic divergences, which further reduces TD(0)'s estimation bias on a few tasks.  ( 2 min )
    What is a Good Metric to Study Generalization of Minimax Learners?. (arXiv:2206.04502v1 [stat.ML])
    Minimax optimization has served as the backbone of many machine learning (ML) problems. Although the convergence behavior of optimization algorithms has been extensively studied in minimax settings, their generalization guarantees in the stochastic setting, i.e., how the solution trained on empirical data performs on the unseen testing data, have been relatively underexplored. A fundamental question remains elusive: What is a good metric to study generalization of minimax learners? In this paper, we aim to answer this question by first showing that primal risk, a universal metric to study generalization in minimization, fails in simple examples of minimax problems. Furthermore, another popular metric, the primal-dual risk, also fails to characterize the generalization behavior for minimax problems with nonconvexity, due to non-existence of saddle points. We thus propose a new metric to study generalization of minimax learners: the primal gap, to circumvent these issues. Next, we derive generalization bounds for the primal gap in nonconvex-concave settings. As byproducts of our analysis, we also solve two open questions: establishing generalization bounds for primal risk and primal-dual risk in the strong sense, i.e., without strong concavity or assuming that the maximization and expectation can be interchanged, while either of these assumptions was needed in the literature. Finally, we leverage this new metric to compare the generalization behavior of two popular algorithms -- gradient descent-ascent (GDA) and gradient descent-max (GDMax) in stochastic minimax optimization.  ( 2 min )
    Deep Hierarchical Planning from Pixels. (arXiv:2206.04114v1 [cs.AI])
    Intelligent agents need to select long sequences of actions to solve complex tasks. While humans easily break down tasks into subgoals and reach them through millions of muscle commands, current artificial intelligence is limited to tasks with horizons of a few hundred decisions, despite large compute budgets. Research on hierarchical reinforcement learning aims to overcome this limitation but has proven challenging: current methods rely on manually specified goal spaces or subtasks, and no general solution exists. We introduce Director, a practical method for learning hierarchical behaviors directly from pixels by planning inside the latent space of a learned world model. The high-level policy maximizes task and exploration rewards by selecting latent goals and the low-level policy learns to achieve the goals. Despite operating in latent space, the decisions are interpretable because the world model can decode goals into images for visualization. Director outperforms exploration methods on tasks with sparse rewards, including 3D maze traversal with a quadruped robot from an egocentric camera and proprioception, without access to the global position or top-down view that was used by prior work. Director also learns successful behaviors across a wide range of environments, including visual control, Atari games, and DMLab levels.  ( 2 min )
    A Simple Unified Approach to Testing High-Dimensional Conditional Independences for Categorical and Ordinal Data. (arXiv:2206.04356v1 [stat.ML])
    Conditional independence (CI) tests underlie many approaches to model testing and structure learning in causal inference. Most existing CI tests for categorical and ordinal data stratify the sample by the conditioning variables, perform simple independence tests in each stratum, and combine the results. Unfortunately, the statistical power of this approach degrades rapidly as the number of conditioning variables increases. Here we propose a simple unified CI test for ordinal and categorical data that maintains reasonable calibration and power in high dimensions. We show that our test outperforms existing baselines in model testing and structure learning for dense directed graphical models while being comparable for sparse models. Our approach could be attractive for causal model testing because it is easy to implement, can be used with non-parametric or parametric probability models, has the symmetry property, and has reasonable computational requirements.  ( 2 min )
    On Transfer Learning in Functional Linear Regression. (arXiv:2206.04277v1 [stat.ML])
    This work studies the problem of transfer learning under the functional linear model framework, which aims to improve the fit of the target model by leveraging the knowledge from related source models. We measure the relatedness between target and source models using Reproducing Kernel Hilbert Spaces, allowing the type of knowledge being transferred to be interpreted by the structure of the spaces. Two algorithms are proposed: one transfers knowledge when the index of transferable sources is known, while the other one utilizes aggregation to achieve knowledge transfer without prior information about the sources. Furthermore, we establish the optimal convergence rates for excess risk, making the statistical gain via transfer learning mathematically provable. The effectiveness of the proposed algorithms is demonstrated on synthetic data as well as real financial data.  ( 2 min )
    Choosing Answers in $\varepsilon$-Best-Answer Identification for Linear Bandits. (arXiv:2206.04456v1 [stat.ML])
    In pure-exploration problems, information is gathered sequentially to answer a question on the stochastic environment. While best-arm identification for linear bandits has been extensively studied in recent years, few works have been dedicated to identifying one arm that is $\varepsilon$-close to the best one (and not exactly the best one). In this problem with several correct answers, an identification algorithm should focus on one candidate among those answers and verify that it is correct. We demonstrate that picking the answer with highest mean does not allow an algorithm to reach asymptotic optimality in terms of expected sample complexity. Instead, a "furthest answer" should be identified. Using that insight to choose the candidate answer carefully, we develop a simple procedure to adapt best-arm identification algorithms to tackle $\varepsilon$-best-answer identification in transductive linear stochastic bandits. Finally, we propose an asymptotically optimal algorithm for this setting, which is shown to achieve competitive empirical performance against existing modified best-arm identification algorithms.  ( 2 min )
    Uplifting Bandits. (arXiv:2206.04091v1 [stat.ML])
    We introduce a multi-armed bandit model where the reward is a sum of multiple random variables, and each action only alters the distributions of some of them. After each action, the agent observes the realizations of all the variables. This model is motivated by marketing campaigns and recommender systems, where the variables represent outcomes on individual customers, such as clicks. We propose UCB-style algorithms that estimate the uplifts of the actions over a baseline. We study multiple variants of the problem, including when the baseline and affected variables are unknown, and prove sublinear regret bounds for all of these. We also provide lower bounds that justify the necessity of our modeling assumptions. Experiments on synthetic and real-world datasets show the benefit of methods that estimate the uplifts over policies that do not use this structure.  ( 2 min )
    Regret Analysis of Certainty Equivalence Policies in Continuous-Time Linear-Quadratic Systems. (arXiv:2206.04434v1 [cs.LG])
    This work studies theoretical performance guarantees of a ubiquitous reinforcement learning policy for controlling the canonical model of stochastic linear-quadratic systems. We show that a randomized certainty equivalent policy addresses the exploration-exploitation dilemma for minimizing quadratic costs in linear dynamical systems that evolve according to stochastic differential equations. More precisely, we establish square-root-of-time regret bounds, indicating that the randomized certainty equivalent policy learns optimal control actions fast from a single state trajectory. Further, linear scaling of the regret with the number of parameters is shown. The presented analysis introduces novel and useful technical approaches, and sheds light on fundamental challenges of continuous-time reinforcement learning.  ( 2 min )
    A general approximation lower bound in $L^p$ norm, with applications to feed-forward neural networks. (arXiv:2206.04360v1 [cs.LG])
    We study the fundamental limits to the expressive power of neural networks. Given two sets $F$, $G$ of real-valued functions, we first prove a general lower bound on how well functions in $F$ can be approximated in $L^p(\mu)$ norm by functions in $G$, for any $p \geq 1$ and any probability measure $\mu$. The lower bound depends on the packing number of $F$, the range of $F$, and the fat-shattering dimension of $G$. We then instantiate this bound to the case where $G$ corresponds to a piecewise-polynomial feed-forward neural network, and describe in detail the application to two sets $F$: Hölder balls and multivariate monotonic functions. Besides matching (known or new) upper bounds up to log factors, our lower bounds shed some light on the similarities or differences between approximation in $L^p$ norm or in sup norm, solving an open question by DeVore et al. (2021). Our proof strategy differs from the sup norm case and uses a key probability result of Mendelson (2002).  ( 2 min )
    Adversarial Noises Are Linearly Separable for (Nearly) Random Neural Networks. (arXiv:2206.04316v1 [cs.LG])
    Adversarial examples, which are usually generated for specific inputs with a specific model, are ubiquitous for neural networks. In this paper we unveil a surprising property of adversarial noises when they are put together, i.e., adversarial noises crafted by one-step gradient methods are linearly separable if equipped with the corresponding labels. We theoretically prove this property for a two-layer network with randomly initialized entries and the neural tangent kernel setup where the parameters are not far from initialization. The proof idea is to show the label information can be efficiently backpropagated to the input while keeping the linear separability. Our theory and experimental evidence further show that the linear classifier trained with the adversarial noises of the training data can well classify the adversarial noises of the test data, indicating that adversarial noises actually inject a distributional perturbation to the original data distribution. Furthermore, we empirically demonstrate that the adversarial noises may become less linearly separable when the above conditions are compromised while they are still much easier to classify than original features.  ( 2 min )
    Evaluating Aleatoric Uncertainty via Conditional Generative Models. (arXiv:2206.04287v1 [cs.LG])
    Aleatoric uncertainty quantification seeks distributional knowledge of random responses, which is important for reliability analysis and robustness improvement in machine learning applications. Previous research on aleatoric uncertainty estimation mainly targets closed-form conditional densities or variances, which requires strong restrictions on the data distribution or dimensionality. To overcome these restrictions, we study conditional generative models for aleatoric uncertainty estimation. We introduce two metrics to measure the discrepancy between two conditional distributions that suit these models. Both metrics can be easily and unbiasedly computed via Monte Carlo simulation of the conditional generative models, thus facilitating their evaluation and training. We demonstrate numerically how our metrics provide correct measurements of conditional distributional discrepancies and can be used to train conditional models competitive against existing benchmarks.  ( 2 min )
    Exploring Predictive States via Cantor Embeddings and Wasserstein Distance. (arXiv:2206.04198v1 [cond-mat.stat-mech])
    Predictive states for stochastic processes are a nonparametric and interpretable construct with relevance across a multitude of modeling paradigms. Recent progress on the self-supervised reconstruction of predictive states from time-series data focused on the use of reproducing kernel Hilbert spaces. Here, we examine how Wasserstein distances may be used to detect predictive equivalences in symbolic data. We compute Wasserstein distances between distributions over sequences ("predictions"), using a finite-dimensional embedding of sequences based on the Cantor set for the underlying geometry. We show that exploratory data analysis using the resulting geometry via hierarchical clustering and dimension reduction provides insight into the temporal structure of processes ranging from the relatively simple (e.g., finite-state hidden Markov models) to the very complex (e.g., infinite-state indexed grammars).  ( 2 min )
    On Gradient Descent Convergence beyond the Edge of Stability. (arXiv:2206.04172v1 [cs.LG])
    Gradient Descent (GD) is a powerful workhorse of modern machine learning thanks to its scalability and efficiency in high-dimensional spaces. Its ability to find local minimisers is only guaranteed for losses with Lipschitz gradients, where it can be seen as a 'bona-fide' discretisation of an underlying gradient flow. Yet, many ML setups involving overparametrised models do not fall into this problem class, which has motivated research beyond the so-called "Edge of Stability", where the step-size crosses the admissibility threshold inversely proportional to the Lipschitz constant above. Perhaps surprisingly, GD has been empirically observed to still converge regardless of local instability. In this work, we study a local condition for such an unstable convergence around a local minimum in a low dimensional setting. We then leverage these insights to establish global convergence of a two-layer single-neuron ReLU student network aligning with the teacher neuron with a large learning rate beyond the Edge of Stability under population loss. Moreover, while the difference of norms of the two layers is preserved by gradient flow, we show that GD above the edge of stability induces a balancing effect, leading to the same norms across the layers.  ( 2 min )
    CCP: Correlated Clustering and Projection for Dimensionality Reduction. (arXiv:2206.04189v1 [stat.ML])
    Most dimensionality reduction methods employ frequency domain representations obtained from matrix diagonalization and may not be efficient for large datasets with relatively high intrinsic dimensions. To address this challenge, Correlated Clustering and Projection (CCP) offers a novel data domain strategy that does not need to solve any matrix. CCP partitions high-dimensional features into correlated clusters and then projects correlated features in each cluster into a one-dimensional representation based on sample correlations. Residue-Similarity (R-S) scores and indexes, the shape of data in Riemannian manifolds, and algebraic topology-based persistent Laplacian are introduced for visualization and analysis. Proposed methods are validated with benchmark datasets associated with various machine learning algorithms.  ( 2 min )
    Robust Matrix Completion with Heavy-tailed Noise. (arXiv:2206.04276v1 [math.ST])
    This paper studies low-rank matrix completion in the presence of heavy-tailed and possibly asymmetric noise, where we aim to estimate an underlying low-rank matrix given a set of highly incomplete noisy entries. Though the matrix completion problem has attracted much attention in the past decade, there is still a lack of theoretical understanding when the observations are contaminated by heavy-tailed noise. Prior theory falls short of explaining the empirical results and is unable to capture the optimal dependence of the estimation error on the noise level. In this paper, we adopt an adaptive Huber loss to accommodate heavy-tailed noise, which is robust against large and possibly asymmetric errors when the parameter in the loss function is carefully designed to balance the Huberization biases and robustness to outliers. Then, we propose an efficient nonconvex algorithm via a balanced low-rank Burer-Monteiro matrix factorization and gradient descent with robust spectral initialization. We prove that under merely a bounded second moment condition on the error distributions, rather than the sub-Gaussian assumption, the Euclidean error of the iterates generated by the proposed algorithm decreases geometrically fast until achieving a minimax-optimal statistical estimation error, which has the same order as that in the sub-Gaussian case. The key technique behind this significant advancement is a powerful leave-one-out analysis framework. The theoretical results are corroborated by our simulation studies.  ( 2 min )
    Conformal Off-Policy Prediction in Contextual Bandits. (arXiv:2206.04405v1 [stat.ML])
    Most off-policy evaluation methods for contextual bandits have focused on the expected outcome of a policy, which is estimated via methods that at best provide only asymptotic guarantees. However, in many applications, the expectation may not be the best measure of performance as it does not capture the variability of the outcome. In addition, particularly in safety-critical settings, stronger guarantees than asymptotic correctness may be required. To address these limitations, we consider a novel application of conformal prediction to contextual bandits. Given data collected under a behavioral policy, we propose \emph{conformal off-policy prediction} (COPP), which can output reliable predictive intervals for the outcome under a new target policy. We provide theoretical finite-sample guarantees without making any additional assumptions beyond the standard contextual bandit setup, and empirically demonstrate the utility of COPP compared with existing methods on synthetic and real-world data.  ( 2 min )
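    As background for the abstract above, the following is a rough sketch of ordinary split conformal prediction for a regression output. It is not the COPP method itself, which additionally has to account for the shift between the behavioral and target policies; the function names and the use of absolute residuals as scores are assumptions made for illustration.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this coverage level.
        return math.inf
    return sorted(scores)[k - 1]


def predict_interval(prediction, calibration_residuals, alpha=0.1):
    """Symmetric interval around a point prediction from held-out residuals."""
    q = conformal_quantile([abs(r) for r in calibration_residuals], alpha)
    return prediction - q, prediction + q
```

    Under exchangeability of calibration and test points, an interval built this way covers the true outcome with probability at least 1 - alpha in finite samples, which is the kind of guarantee the abstract contrasts with asymptotic ones.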
    Words are all you need? Capturing human sensory similarity with textual descriptors. (arXiv:2206.04105v1 [cs.CL])
    Recent advances in multimodal training use textual descriptions to significantly enhance machine understanding of images and videos. Yet, it remains unclear to what extent language can fully capture sensory experiences across different modalities. A well-established approach for characterizing sensory experiences relies on similarity judgments, namely, the degree to which people perceive two distinct stimuli as similar. We explore the relation between human similarity judgments and language in a series of large-scale behavioral studies ($N=1,823$ participants) across three modalities (images, audio, and video) and two types of text descriptors: simple word tags and free-text captions. In doing so, we introduce a novel adaptive pipeline for tag mining that is both efficient and domain-general. We show that our prediction pipeline based on text descriptors exhibits excellent performance, and we compare it against a comprehensive array of 611 baseline models based on vision-, audio-, and video-processing architectures. We further show that the degree to which textual descriptors and models predict human similarity varies across and within modalities. Taken together, these studies illustrate the value of integrating machine learning and cognitive science approaches to better understand the similarities and differences between human and machine representations. We present an interactive visualization at https://words-are-all-you-need.s3.amazonaws.com/index.html for exploring the similarity between stimuli as experienced by humans and different methods reported in the paper.  ( 2 min )
    GCVAE: Generalized-Controllable Variational AutoEncoder. (arXiv:2206.04225v1 [stat.ML])
    Variational autoencoders (VAEs) have recently been used for unsupervised disentanglement learning of complex density distributions. Numerous variants exist to encourage disentanglement in latent space while improving reconstruction. However, none have simultaneously managed the trade-off between attaining extremely low reconstruction error and a high disentanglement score. We present a generalized framework to handle this challenge under constrained optimization and demonstrate that it outperforms state-of-the-art existing models as regards disentanglement while balancing reconstruction. We introduce three controllable Lagrangian hyperparameters to control reconstruction loss, KL divergence loss and correlation measure. We prove that maximizing information in the reconstruction network is equivalent to information maximization during amortized inference under reasonable assumptions and constraint relaxation.  ( 2 min )
    ESCHER: Eschewing Importance Sampling in Games by Computing a History Value Function to Estimate Regret. (arXiv:2206.04122v1 [cs.GT])
    Recent techniques for approximating Nash equilibria in very large games leverage neural networks to learn approximately optimal policies (strategies). One promising line of research uses neural networks to approximate counterfactual regret minimization (CFR) or its modern variants. DREAM, the only current CFR-based neural method that is model free and therefore scalable to very large games, trains a neural network on an estimated regret target that can have extremely high variance due to an importance sampling term inherited from Monte Carlo CFR (MCCFR). In this paper we propose an unbiased model-free method that does not require any importance sampling. Our method, ESCHER, is principled and is guaranteed to converge to an approximate Nash equilibrium with high probability in the tabular case. We show that the variance of the estimated regret of a tabular version of ESCHER with an oracle value function is significantly lower than that of outcome sampling MCCFR and tabular DREAM with an oracle value function. We then show that a deep learning version of ESCHER outperforms the prior state of the art -- DREAM and neural fictitious self play (NFSP) -- and the difference becomes dramatic as game size increases.  ( 2 min )
    Analytical Composition of Differential Privacy via the Edgeworth Accountant. (arXiv:2206.04236v1 [cs.CR])
    Many modern machine learning algorithms are composed of simple private algorithms; thus, an increasingly important problem is to efficiently compute the overall privacy loss under composition. In this study, we introduce the Edgeworth Accountant, an analytical approach to composing differential privacy guarantees of private algorithms. The Edgeworth Accountant starts by losslessly tracking the privacy loss under composition using the $f$-differential privacy framework, which allows us to express the privacy guarantees using privacy-loss log-likelihood ratios (PLLRs). As the name suggests, this accountant next uses the Edgeworth expansion to upper and lower bound the probability distribution of the sum of the PLLRs. Moreover, by relying on a technique for approximating complex distributions using simple ones, we demonstrate that the Edgeworth Accountant can be applied to the composition of any noise-addition mechanism. Owing to certain appealing features of the Edgeworth expansion, the $(\epsilon, \delta)$-differential privacy bounds offered by this accountant are non-asymptotic, with essentially no extra computational cost, as opposed to prior approaches, wherein the running times increase with the number of compositions. Finally, we demonstrate that our upper and lower $(\epsilon, \delta)$-differential privacy bounds are tight in federated analytics and certain regimes of training private deep learning models.  ( 2 min )
    Applying separative non-negative matrix factorization to extra-financial data. (arXiv:2206.04350v1 [q-fin.CP])
    We present here an original application of the non-negative matrix factorization (NMF) method, for the case of extra-financial data. These data are subject to high correlations between co-variables, as well as between observations. NMF provides a much more relevant clustering of co-variables and observations than a simple principal component analysis (PCA). In addition, we show that an initial data separation step before applying NMF further improves the quality of the clustering.  ( 2 min )
    Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem. (arXiv:2206.04119v1 [q-bio.BM])
    Construction of a scaffold structure that supports a desired motif, conferring protein function, shows promise for the design of vaccines and enzymes. But a general solution to this motif-scaffolding problem remains open. Current machine-learning techniques for scaffold design are either limited to unrealistically small scaffolds (up to length 20) or struggle to produce multiple diverse scaffolds. We propose to learn a distribution over diverse and longer protein backbone structures via an E(3)-equivariant graph neural network. We develop SMCDiff to efficiently sample scaffolds from this distribution conditioned on a given motif; our algorithm is the first to theoretically guarantee conditional samples from a diffusion model in the large-compute limit. We evaluate our designed backbones by how well they align with AlphaFold2-predicted structures. We show that our method can (1) sample scaffolds up to 80 residues and (2) achieve structurally diverse scaffolds for a fixed motif.  ( 2 min )
  • Open

    How much do reward engineers make?
    The biggest influence I had on the performance of a method was through the reward, and how and what components are weighted. In fact, this has had a bigger impact than fiddling with hyperparameters that couldn't be autotuned. It's the most intrinsic bias I've found to be effective at meeting time/compute constraints without compromising performance. In other words, whether a method worked or not depended on its reward. What's the demand for reward engineers? submitted by /u/XecutionStyle [link] [comments]  ( 1 min )

  • Open

    How service providers can use natural language processing to gain insights from customer tickets with Amazon Comprehend
    Today, customers can raise support tickets through multiple channels like – web, mobile, chat-bots, emails, or phone calls. When a support ticket is raised by a customer, it is processed and assigned to a category based on the information provided in the ticket. It is then routed to the support group for resolution according to […]  ( 14 min )
    Incremental training with Amazon SageMaker JumpStart
    In December 2020, AWS announced the general availability of Amazon SageMaker JumpStart, a capability of Amazon SageMaker that helps you quickly and easily get started with machine learning (ML). SageMaker JumpStart provides one-click fine-tuning and deployment of a wide variety of pre-trained models across popular ML tasks, as well as a selection of end-to-end solutions […]  ( 9 min )
    How eMagazines utilizes Amazon Polly to voice articles for school-aged kids
    This is a guest post by Andrew Degenholtz, CEO and Founder of eMagazines, the parent company of ReadAlong.ai. eMagazines’ technology seamlessly transforms print products into premium digital and audio experiences. Leveraging Amazon technology, ReadAlong.ai offers a simple, turn-key way for publishers to add audio to their websites with a single line of code. eMagazines supports […]  ( 7 min )
    Weekly forecasts can now start on Sunday with Amazon Forecast
    We are excited to announce that in Amazon Forecast, you can now start your forecast horizon at custom starting points, including on Sundays for weekly forecasts. This allows you to more closely align demand planning forecasts to local business practices and operational requirements. Forecast is a fully managed service that uses statistical and machine learning […]  ( 6 min )
    Continuously monitor predictor accuracy with Amazon Forecast
    We’re excited to announce that you can now automatically monitor the accuracy of your Amazon Forecast predictors over time. As new data is provided, Forecast automatically computes predictor accuracy metrics, providing you with more information to decide whether to keep using, retrain, or create new predictors. Monitoring predictor quality and identifying deterioration in accuracy over […]  ( 9 min )
    Unified data preparation and model training with Amazon SageMaker Data Wrangler and Amazon SageMaker Autopilot
    Data fuels machine learning (ML); the quality of data has a direct impact on the quality of ML models. Therefore, improving data quality and employing the right feature engineering techniques are critical to creating accurate ML models. ML practitioners often tediously iterate on feature engineering, choice of algorithms, and other aspects of ML in search […]  ( 10 min )
  • Open

    [R] Decentralized Training of Foundation Models in Heterogeneous Environments
    Paper: https://arxiv.org/abs/2206.01288 Abstract: Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often involving tens of thousands of GPUs running continuously for months. These models are typically trained in specialized clusters featuring fast, homogeneous interconnects and using carefully designed software systems that support both data parallelism and model/pipeline parallelism. Such dedicated clusters can be costly and difficult to obtain. Can we instead leverage the much greater amount of decentralized, heterogeneous, and lower-bandwidth interconnected compute? Previous works examining the heterogeneous, decentralized setting focus on relatively small models that can be trained in a purely data parallel manner. State-of-the-art schemes for model …  ( 1 min )
    [R] Extreme Compression for Pre-trained Transformers Made Simple and Efficient - Microsoft 2022
    Paper: https://arxiv.org/abs/2206.01859 Abstract: Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constraint devices. However, to preserve the accuracy for such aggressive compression schemes, cutting-edge methods usually introduce complicated compression pipelines, e.g., multi-stage expensive knowledge distillation with extensive hyperparameter tuning. Also, they oftentimes focus less on smaller transformer models that have already been heavily compressed via knowledge distillation and lack a systematic study to show the effectiveness of their methods. In this paper, we perform a very comprehensive systematic study to measure the impact of many key hyperparameters and training strategies from previous works. As a result, we find out that previous baselines for ultra-low bit precision quantization are significantly under-trained. Based on our study, we propose a simple yet effective compression pipeline for extreme compression, named XTC. XTC demonstrates that (1) we can skip the pre-training knowledge distillation to obtain a 5-layer BERT while achieving better performance than previous state-of-the-art methods, e.g., the 6-layer TinyBERT; (2) extreme quantization plus layer reduction is able to reduce the model size by 50x, resulting in new state-of-the-art results on GLUE tasks. ​ https://preview.redd.it/kgbjncheeo491.jpg?width=1187&format=pjpg&auto=webp&s=ffa0963f0c0dd9a2ab9163d5ec6dc8d43584ece0 https://preview.redd.it/7ioxmuqeeo491.jpg?width=577&format=pjpg&auto=webp&s=e2d5eec7274bcfe2c9bb66cb0b0256af5d4594a6 https://preview.redd.it/y5wcth7feo491.jpg?width=1151&format=pjpg&auto=webp&s=bb96b98862987cd2900d7d0906259190c3ad154b submitted by /u/Singularian2501 [link] [comments]  ( 1 min )
    New Insights on Infant Word Lear[N]ing - Implications for optimizing machine learning and second language learning
    https://community.chatwithastrid.com/aprendiendo-espanol-76iwwk5y/post/new-insight-on-how-babies-learn-words-M3dUgc6rrRb83ZD submitted by /u/InstrumentalAsylum [link] [comments]  ( 1 min )
    [D] Request for moderators
    If you frequently visit r/ml throughout the day, have a good understanding of the field, and a history of constructive comments/posts, then we need your help as a moderator. Please apply by sending us a modmail with the following info: Your role (engineer, student, researcher, self-taught, etc) and years of experience in ML Amount of time available to spend on the sub (you must check the sub quite regularly throughout the day) Your time zone We’re specifically looking for friendly people that have at least a year or two of experience in ML who understand the current research and industry landscape and who have been on r/ml long enough to understand what the community expects in terms of moderation. Thanks! submitted by /u/dojoteef [link] [comments]  ( 1 min )
    [R] Blazingly Fast Computer Vision Training with the Mosaic ResNet and Composer
    Hey all! MosaicML is excited to release the Mosaic ResNet, which trains to a 76.6% classification accuracy in 27 minutes, 7x faster than NVIDIA's ResNet baseline, using only vanilla PyTorch. These recipes modify the training algorithm; the network architecture is the same ResNet you’ve known and loved since 2015 (with updated anti-aliasing pooling via Blurpool). See all of the details in our blog post! The figure below summarizes our three training recipes (exact recipes available here). You can check out the complete results of the hundreds of training runs we conducted to create these recipes using Explorer, our tool for evaluating the efficiency of training algorithms. Comparison between best MosaicML ResNet-50 Recipe for a given Time & Accuracy (i.e. the Pareto frontier) to different baselines. Data collected on the MosaicML Cloud (8x NVIDIA A100). These results push on the interplay between algorithmic science and systems engineering, providing segmented cases for research like FFCV Dataloaders, Sharpness-Aware Minimization, and novel, MosaicML algorithms such as ColOut. MosaicML's release of "training recipes", which permit a user to trade off between accuracy and runtime. Want to verify our results? Want to beat ours? Or just want to speed up your own model training? Head over to our GitHub repo, https://github.com/mosaicml/composer, which enables this research, and star it ⭐️ to keep up with the latest updates! And stay tuned for a much deeper dive on all the details, a comprehensive write-up on the science and engineering of this work, next week! https://preview.redd.it/falrstlytn491.png?width=1498&format=png&auto=webp&s=e7eb5413816b18e2f13efb89681c5e451a41aa64 submitted by /u/moinnadeem [link] [comments]  ( 6 min )
    [D] G. Hinton's ML-driven explanation of the role of sleep - inquiry about further sources.
    In a recent episode of Pieter Abbeel's "The Robot Brains" podcast, G. Hinton explains a fascinating hypothesis about the role of sleep in our lives ("sleep is the process of forgetting negative examples in human contrastive learning framework"). However, he does it in a very general way. Does anybody know where I could read more about that? Academic papers etc.? Reference: https://youtu.be/2EDP4v-9TUA submitted by /u/dtransposed [link] [comments]  ( 2 min )
    [R] More ethical machine learning using model cards at Wikimedia
    Abstract, 10 minute video, and transcript from May 2022 Apply(conf): First proposed by Mitchell et al. in 2018, model cards are a form of transparent reporting of machine learning models, their uses, and performance for public audiences. As part of a broader effort to strengthen our ethical approaches to machine learning at Wikimedia, we started implementing model cards for every model hosted by the Foundation. This talk is a description of our process, motivation, and lessons learned along the way. https://www.youtube.com/watch?v=t4GMq7MC7Js https://www.tecton.ai/apply/session-video-archive/more-ethical-machine-learning-using-model-card-at-wikimedia/ submitted by /u/Competitive_Travel16 [link] [comments]  ( 1 min )
    [P] Set-up Yolo-V5 distributed data-parallel multi GPU on AWS and Kubeflow
    Hi all, I've been trying to set up DDP multi-node training on AWS for a week and am finally able to make it work. I didn't find any resources on this, so I thought I would write a blog post and share it. Please provide feedback and let me know if this is helpful for you: https://medium.com/@sachinchandra/running-yolo-v5-with-ddp-on-aws-8a4f07a77cf submitted by /u/scb_11 [link] [comments]  ( 1 min )
    [R] Can machine learning make side-channel attacks even stronger?
    Twitter thread: https://twitter.com/jackcook36/status/1534920169369309184 Paper: https://jackcook.github.io/bigger-fish/paper.pdf Key findings: Machine learning can be used to identify activity on your computer from traces recorded in JavaScript that measure CPU instruction throughput over time We found this type of attack exploits signals from system interrupts, which operating systems use to interact with hardware devices When a core processes interrupts, it pauses the execution of an attacker, creating a signal that can be exploited Our loop-counting attack can correctly identify one of 100 websites being opened 96.6% of the time in Chrome on Linux We identified a randomized timer mitigation that reduces our attack’s accuracy to near chance Please let me know if you have any feedback or questions! submitted by /u/jackcook [link] [comments]  ( 1 min )
    [D] Benchmark Object Detection Hyperparameters
    I want to conduct benchmark experiments: Faster R-CNN vs YOLOv3 vs YOLOv4 vs YOLOv5. For that reason, I want to fix the hyperparameters: optimizer, learning rate, weight decay and learning rate scheduler. For optimizer, due to different frameworks, I have to go with ADAM (b0=0.9, b1=0.999, eps=1e-7). What parameters should I choose for weight decay and learning rate scheduler, given that different models converge at different epochs/steps? Should I go with cosine decay, manual step (with 0.1 decay at 80 and 90% of total epochs/steps), or something else? Note: different frameworks have different "default" hyperparameters, maybe I should stick to the defaults? submitted by /u/giakou4 [link] [comments]  ( 1 min )
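    The two schedules named in the question can be written framework-independently as functions of training progress, which may help when models need different numbers of steps. This is only a sketch: the function names, the `min_lr` floor, and expressing milestones as fractions of the total are assumptions, not settings from any of the detectors' repositories.

```python
import math


def cosine_decay(step, total_steps, base_lr, min_lr=0.0):
    """Cosine schedule: base_lr at step 0, min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def step_decay(step, total_steps, base_lr, milestones=(0.8, 0.9), gamma=0.1):
    """Multiply the rate by gamma at each milestone (fractions of total_steps)."""
    progress = step / total_steps
    drops = sum(1 for m in milestones if progress >= m)
    return base_lr * gamma ** drops
```

    Because both functions depend only on `step / total_steps`, the same schedule shape can be applied to each model even when the total step counts differ.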
    [D] Use conversational AI based on GPT-J/GPT-NeoX in Discord
    Hello all, It is very easy to build a chatbot in a Discord server thanks to great AI models like GPT-3, GPT-J, and GPT-NeoX. In this article, I'm showing you how to code your own conversational bot in Node.js by using GPT-J and GPT-NeoX through the NLP Cloud API: https://nlpcloud.io/build-gpt-j-gpt-neox-discord-chatbot-with-nlpcloud.html As you might know, these AI models are "stateless", meaning that they can't remember the chat history. So I am showing how to handle this by automatically re-sending the chat history in each request, and by truncating the history when it is too long. If you have questions please don't hesitate to ask. I hope it will be useful! Julien submitted by /u/juliensalinas [link] [comments]  ( 1 min )
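    The history handling described in the post (re-send the conversation with every request, drop the oldest turns when it gets too long) might look roughly like the sketch below. It is written in Python rather than the article's Node.js, does not call the NLP Cloud API, and uses a character budget as a crude stand-in for a token limit; `build_prompt`, the "User:"/"Bot:" labels, and `max_chars` are all hypothetical names for illustration.

```python
def build_prompt(history, user_message, max_chars=2000):
    """Build a prompt from past (user_text, bot_text) pairs, oldest first.

    Returns the prompt string and the list of exchanges that were kept.
    """
    def render(turns):
        lines = []
        for user_text, bot_text in turns:
            lines.append(f"User: {user_text}")
            lines.append(f"Bot: {bot_text}")
        lines.append(f"User: {user_message}")
        lines.append("Bot:")
        return "\n".join(lines)

    kept = list(history)
    # Drop the oldest exchanges until the prompt fits the budget.
    while kept and len(render(kept)) > max_chars:
        kept.pop(0)
    return render(kept), kept
```

    After the model replies, the caller would append `(user_message, reply)` to the kept history and reuse it for the next request.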
    [P] Virtual Background project (feat. The Rock with Alpacas) with PyTorch Implementation
    ​ The Rock with Alpaca Hey Guys! Recently, I worked on a side project that generates virtual background (like the one in Zoom) with semantic segmentation. I used BiSeNet as a base model. My goal was to implement everything from scratch without using any fancy libraries and it works pretty well! You can test it on either a single image or a real-time webcam. Feel free to leave comments for any feedback! ​ Project GitHub Repo: Link BiSeNet detailed review: my blog If you want to see other research paper implementations check out my repo! submitted by /u/JasonTheCoders [link] [comments]  ( 1 min )
    [Discussion] Fine tune model for long context
    How to train GPT or BERT for large context where context length is more than 1024 tokens? Truncating the context is not an option as the complete context is important. One approach that I can think of is breaking/dividing the context into multiple chunks. What are my other options? submitted by /u/Expert-Departure-236 [link] [comments]  ( 1 min )
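    The chunking approach mentioned in the question is often done with a sliding window so that neighbouring chunks share some tokens. A minimal sketch, assuming the input is already a list of token ids; `chunk_tokens` and the default `overlap` of 128 are illustrative choices, not a recommendation from the post.

```python
def chunk_tokens(tokens, max_len=1024, overlap=128):
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that text near a
    boundary appears with context on both sides.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    stride = max_len - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += stride
    return chunks
```

    Per-chunk outputs then have to be combined in some task-specific way (for example pooling chunk representations), which this sketch leaves open.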
    [R] The Annotated Diffusion Model
    From the Hugging Face post: https://huggingface.co/blog/annotated-diffusion A great new article has joined the Annotated series: The Annotated Transformer http://nlp.seas.harvard.edu/2018/04/03/attention.html, The Annotated GPT-2 https://amaarora.github.io/2020/02/18/annotatedGPT2.html submitted by /u/ghosthamlet [link] [comments]  ( 2 min )
  • Open

    Face of the night (GAN) AI Generated
    submitted by /u/FVCKDIGITAL [link] [comments]
    Brave Heart (GAN) AI Generated
    submitted by /u/FVCKDIGITAL [link] [comments]
    AI Dream 56 - Post-Apocalyptic WarDepression by AI
    submitted by /u/LordPewPew777 [link] [comments]
    AI to create similar videos based on input
    Hi guys! Does someone know if there's an app or something similar to create similar videos based on (multiple) video inputs? Wishes! submitted by /u/kreismeis [link] [comments]
    Researchers Built a Neural Network That Not Only Solves but Explains and Generates University Math Problems by Program Synthesis and Few-Shot Learning at Human Level
    👉 They created a pre-trained neural network on the text and finetuned the code to answer mathematics course problems, explain solutions, and produce new questions on a human level. It automatically synthesizes programs and runs them to answer course problems with 81 percent automated accuracy utilizing few-shot learning and OpenAI’s Codex transformer. 👉 They also curated a new dataset of questions from MIT’s most famous mathematics courses. The neural network answers questions from the MATH dataset (including questions on Prealgebra, Algebra, Counting, and Probability, Intermediate Algebra, Number Theory, and Precalculus), which is the current standard of advanced mathematics issues meant to examine mathematical thinking. Continue reading | Check out the paper and github ​ https://preview.redd.it/3pq9vu2fnm491.png?width=1198&format=png&auto=webp&s=ab9425a5130d52a1c0d9cfc525bac546eccfec57 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    A.I. Coding Overview
    submitted by /u/a1a3a5a7a9 [link] [comments]
    List of (free) GAN / generative AI apps and playgrounds
    submitted by /u/nathan_thinks [link] [comments]  ( 1 min )
    I’ve created a fashion blogger bot you can freely chat with
    Hey guys, I’ve just made a language model that behaves like an average person (I hope). She perceives herself as a fashion blogger. I’ve made her capable of chatting about fashion and music topics. I would be over the moon if you could test it and share your feedback in the comments regarding its ability to support open-domain dialogue. Here is the bot -- just tap submitted by /u/GBalchidi [link] [comments]  ( 1 min )
    CYBERHOLIC BLISS | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]
    The last line killed me (I hope this is within the parameters, it's GPT-3)
    ​ https://preview.redd.it/dndbky83jl491.png?width=2402&format=png&auto=webp&s=063f10ffa4f18b215d2ff87aa9c3f0623ada32b9 submitted by /u/thatgerhard [link] [comments]
    looking for ai that generates sexual content
    Hi, I am looking for some kind of software that can make random generated sexual pictures. NOT of existing people (like the "undressing" apps). The sexual content must be completely artifically generated. Thanks submitted by /u/insert_username--- [link] [comments]
    Build a Discord chatbot based on GPT-J/GPT-NeoX
    Hello all, It is very easy to build a chatbot in a Discord server thanks to great AI models like GPT-3, GPT-J, and GPT-NeoX. In this article, I'm showing you how to code your own conversational bot in Node.js by using GPT-J and GPT-NeoX through the NLP Cloud API: https://nlpcloud.io/build-gpt-j-gpt-neox-discord-chatbot-with-nlpcloud.html As you might know, these AI models are "stateless", meaning that they can't remember the chat history. So I am showing how to handle this by automatically re-sending the chat history in each request, and by truncating the history when it is too long. If you have questions please don't hesitate to ask. I hope it will be useful! Julien submitted by /u/juliensalinas [link] [comments]  ( 1 min )
    DISCO DIFFUSION 3D AI ART ANIMATION | ANGEL OF DEATH, AZRAEL
    submitted by /u/Available_Tadpole829 [link] [comments]
    A Samurai Story, DISCO DIFFUSION V5.2 3D animation (using both image and text prompts) OC
    submitted by /u/crabmansboxturtle [link] [comments]  ( 1 min )
    DISCO DIFFUSION 3D AI ART ANIMATION | EXTRATERRESTRIAL ESCAPADE
    submitted by /u/Available_Tadpole829 [link] [comments]
    Thor in Battle - Neural-Art Parody / [4K] Creative Experiment w/ GPT-3, VQGAN+CLIP
    submitted by /u/MLInsights [link] [comments]
  • Open

    LIMoE: Learning Multiple Modalities with One Sparse Mixture of Experts Model
    Posted by Basil Mustafa, Research Software Engineer and Carlos Riquelme, Research Scientist, Google Research, Brain team Sparse models stand out among the most promising approaches for the future of deep learning. Instead of every part of a model processing every input (“dense” modeling), sparse models employing conditional computation learn to route individual inputs to different “experts” in a potentially huge network. This has many benefits. First, model size can increase while keeping computational cost constant — an effective and environmentally friendlier way to scale models, which is often key to high performance. Sparsity also naturally compartmentalizes neural networks. Dense models that learn many different tasks simultaneously (multitask) or sequentially (continual learning) of…  ( 9 min )
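    To make the routing idea in the teaser above concrete, here is a toy top-1 router in plain Python. It only illustrates conditional computation in general; it is not LIMoE's architecture, and `route_top1`, the linear router, and the list-of-callables experts are assumptions for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(token, router_weights, experts):
    """Send one input vector to the single expert with the highest gate.

    router_weights has one row per expert; experts is a list of callables.
    Returns the chosen expert index and its output scaled by the gate value.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    gates = softmax(logits)
    best = max(range(len(gates)), key=gates.__getitem__)
    output = experts[best](token)
    return best, [gates[best] * v for v in output]
```

    Only the selected expert runs for a given input, which is why adding experts grows the parameter count without growing per-input compute in the same proportion.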
  • Open

    Introduction to the Python Deep Learning Library TensorFlow
    TensorFlow is a Python library for fast numerical computing created and released by Google. It is a foundation library that can be used to create Deep Learning models directly or by using wrapper libraries that simplify the process built on top of TensorFlow. In this post you will discover the TensorFlow library for Deep Learning. […] The post Introduction to the Python Deep Learning Library TensorFlow appeared first on Machine Learning Mastery.  ( 11 min )
  • Open

    Techniques for Training Large Neural Networks
    submitted by /u/nickb [link] [comments]
    What impacts the speed of prediction for ANNs?
    I am currently building several ANNs to approximate lengthy PDE calculations. I am curious as to how one can minimise prediction time when it comes to hyperparameter optimization. Is it best to minimise the number of weight parameters in the model? (I know this benefits storage) or is it best to minimise the number of layers? Any help would be appreciated, cheers! submitted by /u/Algo-G-H [link] [comments]  ( 1 min )
  • Open

    Student-powered machine learning
    Recent MEng graduates reflect on their application-focused research as affiliates of the MIT-IBM Watson AI Lab.  ( 7 min )
  • Open

    Techniques for Training Large Neural Networks
    Large neural networks are at the core of many recent advances in AI, but training them is a difficult engineering and research challenge which requires orchestrating a cluster of GPUs to perform a single synchronized calculation. As cluster and model sizes have grown, machine learning practitioners have developed an increasing  ( 6 min )
  • Open

    Smartgrids and Reinforcement Learning
    Hi every1ne, Are you interested in #smartgrids with #reinforcementlearning? Here is a little sharing ✋ #LittleBigCity, an outstanding new open-source project for smartgrids, has recently emerged. Inspired by #CityLearn, which focuses solely on the customer side of the smartgrid, LittleBigCode and Paul-Adrien Nicole created #LittleBigCity by taking this limitation into consideration. It is a new open-source simulator that generates a two-sided smartgrid: the CityLearn-inspired consumer side and the producer side from LittleBigCity. With Streamlit, they have also added a way to view the smartgrid's changes in real time 🥳 We welcome pull requests on both the simulator and reinforcement learning sides. Feel free to drop by and share the information with your network 🌎 #smartgrid for the future 🥇 Gitlab LINK: https://gitlab.com/littlebigcode/public/littlebigcity City learn authors: José Ramón Vázquez Canteli & Zoltan Nagy LittleBigCity authors: Johan Jublanc Paul-Adrien Nicole submitted by /u/SimonSoftEng [link] [comments]  ( 1 min )
    Schmidhuber notes 25th anniversary of LSTM
    submitted by /u/gwern [link] [comments]  ( 1 min )
    RL topics for MS research.
    I was wondering what are the research areas to explore for a master thesis work. I'm thinking about research problems that are on the implementation side rather than on the theoretical side of RL. Goal-conditioned RL and autotelic agents are some of the interesting areas to explore. In terms of implementation, what are the areas to look for as a thesis work? submitted by /u/thisisdespaleo [link] [comments]  ( 1 min )
  • Open

    How to Write a Thank You Letter for a Scholarship with the Help Of AI
    It’s no secret that writing a thank you letter can be difficult. You want to express your gratitude, but you also don’t want to sound too…  ( 4 min )
    Conversational AI at Ludicrous Speed
    The Problem  ( 7 min )
  • Open

    Out of This World: ‘Mass Effect Legendary Edition’ and ‘It Takes Two’ Lead GFN Thursday Updates
    Some may call this GFN Thursday legendary as Mass Effect Legendary Edition and It Takes Two join the GeForce NOW library. Both games expand the available number of Electronic Arts games streaming from our GeForce cloud servers, and are part of 10 new additions this week. Adventure Awaits In The Cloud Relive the saga of Read article > The post Out of This World: ‘Mass Effect Legendary Edition’ and ‘It Takes Two’ Lead GFN Thursday Updates appeared first on NVIDIA Blog.  ( 2 min )
  • Open

    What do we learn? Debunking the Myth of Unsupervised Outlier Detection. (arXiv:2206.03698v1 [cs.CV])
    Even though auto-encoders (AEs) have the desirable property of learning compact representations without labels and have been widely applied to out-of-distribution (OoD) detection, they are generally still poorly understood and are used incorrectly in detecting outliers where the normal and abnormal distributions are strongly overlapping. In general, the learned manifold is assumed to contain key information that is only important for describing samples within the training distribution, and that the reconstruction of outliers leads to high residual errors. However, recent work suggests that AEs are likely to be even better at reconstructing some types of OoD samples. In this work, we challenge this assumption and investigate what auto-encoders actually learn when they are posed to solve two different tasks. First, we propose two metrics based on the Fréchet inception distance (FID) and confidence scores of a trained classifier to assess whether AEs can learn the training distribution and reliably recognize samples from other domains. Second, we investigate whether AEs are able to synthesize normal images from samples with abnormal regions, on a more challenging lung pathology detection task. We have found that state-of-the-art (SOTA) AEs are either unable to constrain the latent manifold and allow reconstruction of abnormal patterns, or they are failing to accurately restore the inputs from their latent distribution, resulting in blurred or misaligned reconstructions. We propose novel deformable auto-encoders (MorphAEus) to learn perceptually aware global image priors and locally adapt their morphometry based on estimated dense deformation fields. We demonstrate superior performance over unsupervised methods in detecting OoD and pathology.  ( 2 min )
    Machine learning-based patient selection in an emergency department. (arXiv:2206.03752v1 [cs.LG])
    The performance of Emergency Departments (EDs) is of great importance for any health care system, as they serve as the entry point for many patients. However, among other factors, the variability of patient acuity levels and corresponding treatment requirements of patients visiting EDs imposes significant challenges on decision makers. Balancing waiting times of patients to be first seen by a physician with the overall length of stay over all acuity levels is crucial to maintain an acceptable level of operational performance for all patients. To address those requirements when assigning idle resources to patients, several methods have been proposed in the past, including the Accumulated Priority Queuing (APQ) method. The APQ method linearly assigns priority scores to patients with respect to their time in the system and acuity level. Hence, selection decisions are based on a simple system representation that is used as an input for a selection function. This paper investigates the potential of a Machine Learning (ML) based patient selection method. It assumes that for a large set of training data, including a multitude of different system states, (near) optimal assignments can be computed by a (heuristic) optimizer, with respect to a chosen performance metric, and aims to imitate such optimal behavior when applied to new situations. Thereby, it incorporates a comprehensive state representation of the system and a complex non-linear selection function. The motivation for the proposed approach is that high quality selection decisions may depend on a variety of factors describing the current state of the ED, not limited to waiting times, which can be captured and utilized by the ML model. Results show that the proposed method significantly outperforms the APQ method for a majority of evaluated settings.  ( 2 min )
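    The APQ baseline described in this abstract can be illustrated with a minimal sketch: each waiting patient accrues priority linearly in time at a rate that depends on acuity, and the patient with the highest accumulated score is selected next. The `Patient` class, the three-level acuity scale, and the rate values below are hypothetical choices made for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    name: str
    acuity: int          # hypothetical scale: 1 = most urgent
    arrival_time: float  # minutes since the start of the shift


# Hypothetical accumulation rates (priority points per minute) per acuity level.
RATES = {1: 5.0, 2: 2.0, 3: 1.0}


def accumulated_priority(patient: Patient, now: float) -> float:
    """Priority grows linearly with time in the system, scaled by acuity."""
    return RATES[patient.acuity] * (now - patient.arrival_time)


def select_next(waiting: list, now: float) -> Patient:
    """Select the waiting patient with the highest accumulated priority."""
    return max(waiting, key=lambda p: accumulated_priority(p, now))


waiting = [
    Patient("A", acuity=3, arrival_time=0.0),
    Patient("B", acuity=2, arrival_time=20.0),
    Patient("C", acuity=1, arrival_time=50.0),
]
chosen = select_next(waiting, now=60.0)
```

    With these assumed rates, a less urgent patient who has waited long enough can overtake a more urgent recent arrival, which is the trade-off between waiting time and acuity that the abstract refers to.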
    Latent Boundary-guided Adversarial Training. (arXiv:2206.03717v1 [cs.LG])
    Deep Neural Networks (DNNs) have recently achieved great success in many classification tasks. Unfortunately, they are vulnerable to adversarial attacks that generate adversarial examples with a small perturbation to fool DNN models, especially in model sharing scenarios. Adversarial training has proved to be the most effective strategy, injecting adversarial examples into model training to improve the robustness of DNN models to adversarial attacks. However, adversarial training based on the existing adversarial examples fails to generalize well to standard, unperturbed test data. To achieve a better trade-off between standard accuracy and adversarial robustness, we propose a novel adversarial training framework called LAtent bounDary-guided aDvErsarial tRaining (LADDER) that adversarially trains DNN models on latent boundary-guided adversarial examples. As opposed to most of the existing methods that generate adversarial examples in the input space, LADDER generates a myriad of high-quality adversarial examples through adding perturbations to latent features. The perturbations are made along the normal of the decision boundary constructed by an SVM with an attention mechanism. We analyze the merits of our generated boundary-guided adversarial examples from a boundary field perspective and visualization view. Extensive experiments and detailed analysis on MNIST, SVHN, CelebA, and CIFAR-10 validate the effectiveness of LADDER in achieving a better trade-off between standard accuracy and adversarial robustness as compared with vanilla DNNs and competitive baselines.  ( 2 min )
    Two Ways of Understanding Social Dynamics: Analyzing the Predictability of Emergent of Objects in Reddit r/place Dependent on Locality in Space and Time. (arXiv:2206.03563v1 [physics.soc-ph])
    Lately, studying social dynamics in interacting agents has been boosted by the power of computer models, which bring the richness of qualitative work, while offering the precision, transparency, extensiveness, and replicability of statistical and mathematical approaches. A particular set of phenomena for the study of social dynamics is Web collaborative platforms. A dataset of interest is r/place, a collaborative social experiment held in 2017 on Reddit, which consisted of a shared online canvas of 1000 pixels by 1000 pixels co-edited by over a million recorded users over 72 hours. In this paper, we designed and compared two methods to analyze the dynamics of this experiment. Our first method consisted in approximating the set of 2D cellular-automata-like rules used to generate the canvas images and how these rules change over time. The second method consisted in a convolutional neural network (CNN) that learned an approximation to the generative rules in order to generate the complex outcomes of the canvas. Our results indicate varying context-size dependencies for the predictability of different objects in r/place in time and space. They also indicate a surprising peak in difficulty to statistically infer behavioral rules towards the middle of the social experiment, while user interactions did not drop until before the end. The combination of our two approaches, one rule-based and the other statistical CNN-based, shows the ability to highlight diverse aspects of analyzing social dynamics.  ( 2 min )
    Learning Interpretable Decision Rule Sets: A Submodular Optimization Approach. (arXiv:2206.03718v1 [cs.LG])
    Rule sets are highly interpretable logical models in which the predicates for decision are expressed in disjunctive normal form (DNF, OR-of-ANDs), or, equivalently, the overall model comprises an unordered collection of if-then decision rules. In this paper, we consider a submodular optimization based approach for learning rule sets. The learning problem is framed as a subset selection task in which a subset of all possible rules needs to be selected to form an accurate and interpretable rule set. We employ an objective function that exhibits submodularity and thus is amenable to submodular optimization techniques. To overcome the difficulty arising from dealing with the exponential-sized ground set of rules, the subproblem of searching for a rule is cast as another subset selection task that asks for a subset of features. We show it is possible to write the induced objective function for the subproblem as a difference of two submodular (DS) functions to make it approximately solvable by DS optimization algorithms. Overall, the proposed approach is simple, scalable, and likely to benefit from further research on submodular optimization. Experiments on real datasets demonstrate the effectiveness of our method.  ( 2 min )
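    To make the DNF (OR-of-ANDs) structure in this abstract concrete, here is a minimal sketch of how a rule set classifies an example: a rule fires only if all of its conditions hold, and the rule set predicts positive if any rule fires. The feature names and rules are hypothetical, and the sketch covers only prediction, not the submodular learning procedure the paper proposes.

```python
# A rule is a conjunction (AND) of feature conditions, stored as {feature: required_value}.
# A rule set is a disjunction (OR) of such rules. Features and rules are hypothetical.
RULE_SET = [
    {"age_over_60": True, "smoker": True},
    {"high_bp": True},
]


def rule_fires(rule: dict, x: dict) -> bool:
    """A rule fires only if every one of its conditions matches the example."""
    return all(x.get(feature) == value for feature, value in rule.items())


def predict(rule_set: list, x: dict) -> bool:
    """The rule set predicts positive if at least one rule fires."""
    return any(rule_fires(rule, x) for rule in rule_set)
```

    Because the rules are unordered and each one is a short conjunction, a positive prediction can be explained by pointing at the single rule that fired.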
    How does overparametrization affect performance on minority groups?. (arXiv:2206.03515v1 [cs.LG])
    The benefits of overparameterization for the overall performance of modern machine learning (ML) models are well known. However, the effect of overparameterization at a more granular level of data subgroups is less understood. Recent empirical studies demonstrate encouraging results: (i) when groups are not known, overparameterized models trained with empirical risk minimization (ERM) perform better on minority groups; (ii) when groups are known, ERM on data subsampled to equalize group sizes yields state-of-the-art worst-group-accuracy in the overparameterized regime. In this paper, we complement these empirical studies with a theoretical investigation of the risk of overparameterized random feature models on minority groups. In a setting in which the regression functions for the majority and minority groups are different, we show that overparameterization always improves minority group performance.  ( 2 min )
    Joint Adversarial Learning for Cross-domain Fair Classification. (arXiv:2206.03656v1 [cs.LG])
    Modern machine learning (ML) models are becoming increasingly popular and are widely used in decision-making systems. However, studies have shown critical issues of ML discrimination and unfairness, which hinder their adoption in high-stakes applications. Recent research on fair classifiers has drawn significant attention to developing effective algorithms that achieve both fairness and good classification performance. Despite the great success of these fairness-aware machine learning models, most of the existing models require sensitive attributes to preprocess the data, regularize the model learning or postprocess the prediction to have fair predictions. However, sensitive attributes are often incomplete or even unavailable due to privacy, legal or regulation restrictions. Though we lack the sensitive attribute for training a fair model in the target domain, there might exist a similar domain that has sensitive attributes. Thus, it is important to exploit auxiliary information from the similar domain to help improve fair classification in the target domain. Therefore, in this paper, we study a novel problem of exploring domain adaptation for fair classification. We propose a new framework that can simultaneously estimate the sensitive attributes while learning a fair classifier in the target domain. Extensive experiments on real-world datasets illustrate the effectiveness of the proposed model for fair classification, even when no sensitive attributes are available in the target domain.  ( 2 min )
    $p$-Sparsified Sketches for Fast Multiple Output Kernel Methods. (arXiv:2206.03827v1 [stat.ML])
    Kernel methods are learning algorithms that enjoy solid theoretical foundations while suffering from important computational limitations. Sketching, which consists in looking for solutions within a subspace of reduced dimension, is a widely studied approach to alleviate this numerical burden. However, fast sketching strategies, such as non-adaptive subsampling, significantly degrade the guarantees of the algorithms, while theoretically-accurate sketches, such as the Gaussian one, turn out to remain relatively slow in practice. In this paper, we introduce the $p$-sparsified sketches, which combine the benefits of both approaches to achieve a good tradeoff between statistical accuracy and computational efficiency. To support our method, we derive excess risk bounds for both single and multiple output problems, with generic Lipschitz losses, providing new guarantees for a wide range of applications, from robust regression to multiple quantile regression. We also provide empirical evidence of the superiority of our sketches over recent SOTA approaches.  ( 2 min )
    Disentangled Ontology Embedding for Zero-shot Learning. (arXiv:2206.03739v1 [cs.AI])
    Knowledge Graph (KG) and its variant of ontology have been widely used for knowledge representation, and have been shown to be quite effective in augmenting Zero-shot Learning (ZSL). However, existing ZSL methods that utilize KGs all neglect the intrinsic complexity of inter-class relationships represented in KGs. One typical feature is that a class is often related to other classes in different semantic aspects. In this paper, we focus on ontologies for augmenting ZSL, and propose to learn disentangled ontology embeddings guided by ontology properties to capture and utilize more fine-grained class relationships in different aspects. We also contribute a new ZSL framework named DOZSL, which contains two new ZSL solutions based on generative models and graph propagation models, respectively, for effectively utilizing the disentangled ontology embeddings. Extensive evaluations have been conducted on five benchmarks across zero-shot image classification (ZS-IMGC) and zero-shot KG completion (ZS-KGC). DOZSL often achieves better performance than the state-of-the-art, and its components have been verified by ablation studies and case studies. Our codes and datasets are available at https://github.com/zjukg/DOZSL.  ( 2 min )
    Neural Network Compression via Effective Filter Analysis and Hierarchical Pruning. (arXiv:2206.03596v1 [cs.LG])
    Network compression is crucial to making deep networks more efficient, faster, and generalizable to low-end hardware. Current network compression methods have two open problems: first, a theoretical framework to estimate the maximum compression rate is lacking; second, some layers may get over-pruned, resulting in a significant drop in network performance. To solve these two problems, this study proposes a gradient-matrix singularity analysis-based method to estimate the maximum network redundancy. Guided by that maximum rate, a novel and efficient hierarchical network pruning algorithm is developed to maximally condense the neuronal network structure without sacrificing network performance. Substantial experiments are performed to demonstrate the efficacy of the new method for pruning several advanced convolutional neural network (CNN) architectures. Compared to existing pruning methods, the proposed pruning algorithm achieved state-of-the-art performance. At the same or similar compression ratio, the new method provided the highest network prediction accuracy as compared to other methods.  ( 2 min )
    Alternately Optimized Graph Neural Networks. (arXiv:2206.03638v1 [cs.LG])
    Graph Neural Networks (GNNs) have demonstrated powerful representation capability in numerous graph-based tasks. Specifically, the decoupled structures of GNNs such as APPNP have become popular due to their simplicity and performance advantages. However, the end-to-end training of these GNNs makes them inefficient in computation and memory consumption. In order to deal with these limitations, in this work, we propose an alternating optimization framework for graph neural networks that does not require end-to-end training. Extensive experiments under different settings demonstrate that the performance of the proposed algorithm is comparable to existing state-of-the-art algorithms but has significantly better computation and memory efficiency. Additionally, we show that our framework can be used to enhance existing decoupled GNNs.  ( 2 min )
    Spam Detection Using BERT. (arXiv:2206.02443v2 [cs.CR] UPDATED)
    Emails and SMSs are the most popular tools in today's communications, and as the number of email and SMS users increases, the amount of spam also increases. Spam is any kind of unwanted, unsolicited digital communication that gets sent out in bulk; spam emails and SMSs cause major resource wastage by unnecessarily flooding network links. Although most spam originates with advertisers looking to push their products, some is much more malicious in intent, such as emails that aim to trick victims into giving up sensitive information like website logins or credit card information; this type of cybercrime is known as phishing. To counter spam, much research and effort has gone into building spam detectors that are able to classify messages and emails as spam or ham. In this research we build a spam detector using the pre-trained BERT model, which classifies emails and messages by understanding their context, and we train our spam detector using multiple corpora: the SMS collection corpus, Enron corpus, SpamAssassin corpus, Ling-Spam corpus and SMS spam collection corpus. Our spam detector's performance was 98.62%, 97.83%, 99.13% and 99.28% respectively. Keywords: Spam Detector, BERT, Machine learning, NLP, Transformer, Enron Corpus, SpamAssassin Corpus, SMS Spam Detection Corpus, Ling-Spam Corpus.  ( 2 min )
    Meta-Learning Transferable Parameterized Skills. (arXiv:2206.03597v1 [cs.LG])
    We propose a novel parameterized skill-learning algorithm that aims to learn transferable parameterized skills and synthesize them into a new action space that supports efficient learning in long-horizon tasks. We first propose novel learning objectives -- trajectory-centric diversity and smoothness -- that allow an agent to meta-learn reusable parameterized skills. Our agent can use these learned skills to construct a temporally-extended parameterized-action Markov decision process, for which we propose a hierarchical actor-critic algorithm that aims to efficiently learn a high-level control policy with the learned skills. We empirically demonstrate that the proposed algorithms enable an agent to solve a complicated long-horizon obstacle-course environment.  ( 2 min )
    Random and Adversarial Bit Error Robustness: Energy-Efficient and Secure DNN Accelerators. (arXiv:2104.08323v2 [cs.LG] UPDATED)
    Deep neural network (DNN) accelerators have received considerable attention in recent years due to the potential to save energy compared to mainstream hardware. Low-voltage operation of DNN accelerators further reduces energy consumption but causes bit-level failures in the memory storing the quantized weights. Furthermore, DNN accelerators are vulnerable to adversarial attacks on voltage controllers or individual bits. In this paper, we show that a combination of robust fixed-point quantization, weight clipping, as well as random bit error training (RandBET) or adversarial bit error training (AdvBET) improves robustness against random or adversarial bit errors in quantized DNN weights significantly. This leads not only to high energy savings for low-voltage operation as well as low-precision quantization, but also improves security of DNN accelerators. In contrast to related work, our approach generalizes across operating voltages and accelerators and does not require hardware changes. Moreover, we present a novel adversarial bit error attack and are able to obtain robustness against both targeted and untargeted bit-level attacks. Without losing more than 0.8%/2% in test accuracy, we can reduce energy consumption on CIFAR10 by 20%/30% for 8/4-bit quantization. Allowing up to 320 adversarial bit errors, we reduce test error from above 90% (chance level) to 26.22%.  ( 2 min )
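    The random bit error model mentioned in this abstract can be sketched in a few lines: every bit of every quantized weight is flipped independently with some probability `p`. The function name, the unsigned 8-bit representation, and the use of Python's `random` module are assumptions made for illustration; this is not the paper's RandBET training procedure, only the kind of perturbation such training would expose a network to.

```python
import random


def inject_bit_errors(weights, p, bits=8, rng=None):
    """Flip each bit of each unsigned `bits`-bit integer weight independently with probability p.

    `weights` is a list of ints in [0, 2**bits - 1]; a new list is returned.
    """
    rng = rng or random.Random()
    perturbed = []
    for w in weights:
        mask = 0
        for b in range(bits):
            if rng.random() < p:   # this bit is hit by an error
                mask |= 1 << b
        perturbed.append(w ^ mask)  # XOR flips exactly the selected bits
    return perturbed


# Example: perturb a few hypothetical 8-bit weights with a 1% bit error rate.
noisy = inject_bit_errors([0, 255, 17, 128], p=0.01, rng=random.Random(0))
```

    A flip in a high-order bit changes the weight far more than a flip in a low-order bit, which is one reason the abstract pairs bit error training with weight clipping and robust quantization.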
    A generative recommender system with GMM prior for cancer drug generation and sensitivity prediction. (arXiv:2206.03555v1 [cs.LG])
    The recent emergence of high-throughput drug screening assays sparked an intensive development of machine learning methods, including models for prediction of sensitivity of cancer cell lines to anti-cancer drugs, as well as methods for generation of potential drug candidates. However, the concept of generating compounds with specific properties while simultaneously modeling their efficacy against cancer cell lines has not been comprehensively explored. To address this need, we present VADEERS, a Variational Autoencoder-based Drug Efficacy Estimation Recommender System. The generation of compounds is performed by a novel variational autoencoder with a semi-supervised Gaussian Mixture Model (GMM) prior. The prior defines a clustering in the latent space, where the clusters are associated with specific drug properties. In addition, VADEERS is equipped with a cell line autoencoder and a sensitivity prediction network. The model combines data for SMILES string representations of anti-cancer drugs, their inhibition profiles against a panel of protein kinases, cell line biological features and measurements of the sensitivity of the cell lines to the drugs. The evaluated variants of VADEERS achieve a high r=0.87 Pearson correlation between true and predicted drug sensitivity estimates. We train the GMM prior in such a way that the clusters in the latent space correspond to a pre-computed clustering of the drugs by their inhibitory profiles. We show that the learned latent representations and new generated data points accurately reflect the given clustering. In summary, VADEERS offers a comprehensive model of drug and cell line properties and the relationships between them, as well as a guided generation of novel compounds.  ( 2 min )
    Selective Network Linearization for Efficient Private Inference. (arXiv:2202.02340v2 [cs.CR] UPDATED)
    Private inference (PI) enables inference directly on cryptographically secure data. While promising to address many privacy issues, it has seen limited use due to extreme runtimes. Unlike plaintext inference, where latency is dominated by FLOPs, in PI non-linear functions (namely ReLU) are the bottleneck. Thus, practical PI demands novel ReLU-aware optimizations. To reduce PI latency we propose a gradient-based algorithm that selectively linearizes ReLUs while maintaining prediction accuracy. We evaluate our algorithm on several standard PI benchmarks. The results demonstrate up to $4.25\%$ more accuracy (iso-ReLU count at 50K) or $2.2\times$ less latency (iso-accuracy at 70%) than the current state of the art and advance the Pareto frontier across the latency-accuracy space. To complement empirical results, we present a "no free lunch" theorem that sheds light on how and when network linearization is possible while maintaining prediction accuracy. Public code is available at https://github.com/NYU-DICE-Lab/selective_network_linearization.
    An Iterative Labeling Method for Annotating Fisheries Imagery. (arXiv:2204.12934v2 [cs.LG] UPDATED)
    In this paper, we present a methodology for fisheries-related data that allows us to converge on a labeled image dataset by iterating over the dataset with multiple training and production loops that can exploit crowdsourcing interfaces. We present our algorithm and its results on two separate sets of image data collected using the Seabed autonomous underwater vehicle. The first dataset comprises 2,026 completely unlabeled images, while the second consists of 21,968 images that were point annotated by experts. Our results indicate that training with a small subset and iterating on that to build a larger set of labeled data allows us to converge to a fully annotated dataset with a small number of iterations. Even in the case of a dataset labeled by experts, a single iteration of the methodology improves the labels by discovering additional complicated examples of labels associated with fish that overlap, are very small, or are obscured by the contrast limitations associated with underwater imagery.
    Beyond Value: CHECKLIST for Testing Inferences in Planning-Based RL. (arXiv:2206.02039v2 [cs.AI] UPDATED)
    Reinforcement learning (RL) agents are commonly evaluated via their expected value over a distribution of test scenarios. Unfortunately, this evaluation approach provides limited evidence for post-deployment generalization beyond the test distribution. In this paper, we address this limitation by extending the recent CheckList testing methodology from natural language processing to planning-based RL. Specifically, we consider testing RL agents that make decisions via online tree search using a learned transition model and value function. The key idea is to improve the assessment of future performance via a CheckList approach for exploring and assessing the agent's inferences during tree search. The approach provides the user with an interface and general query-rule mechanism for identifying potential inference flaws and validating expected inference invariances. We present a user study involving knowledgeable AI researchers using the approach to evaluate an agent trained to play a complex real-time strategy game. The results show the approach is effective in allowing users to identify previously-unknown flaws in the agent's reasoning. In addition, our analysis provides insight into how AI experts use this type of testing approach, which may help improve future instantiations.
    Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models. (arXiv:2205.15223v2 [cs.CL] UPDATED)
    Pre-trained masked language models successfully perform few-shot learning by formulating downstream tasks as text infilling. However, as a strong alternative in full-shot settings, discriminative pre-trained models like ELECTRA do not fit into the paradigm. In this work, we adapt prompt-based few-shot learning to ELECTRA and show that it outperforms masked language models in a wide range of tasks. ELECTRA is pre-trained to distinguish if a token is generated or original. We naturally extend that to prompt-based few-shot learning by training to score the originality of the target options without introducing new parameters. Our method can be easily adapted to tasks involving multi-token predictions without extra computation overhead. Analysis shows that ELECTRA learns distributions that align better with downstream tasks.
    Towards Individual Grevy's Zebra Identification via Deep 3D Fitting and Metric Learning. (arXiv:2206.02261v2 [cs.CV] UPDATED)
    This paper combines deep learning techniques for species detection, 3D model fitting, and metric learning in one pipeline to perform individual animal identification from photographs by exploiting unique coat patterns. This is the first work to attempt this and, compared to traditional 2D bounding box or segmentation based CNN identification pipelines, the approach provides effective and explicit view-point normalisation and allows for a straightforward visualisation of the learned biometric population space. Note that due to the use of metric learning the pipeline is also readily applicable to open set and zero shot re-identification scenarios. We apply the proposed approach to individual Grevy's zebra (Equus grevyi) identification and show in a small study on the SMALST dataset that the use of 3D model fitting can indeed benefit performance. In particular, back-projected textures from 3D fitted models improve identification accuracy from 48.0% to 56.8% compared to 2D bounding box approaches for the dataset. Whilst the study is far too small to accurately estimate the full performance potential achievable in larger-scale real-world application settings and in comparisons against polished tools, our work lays the conceptual and practical foundations for a next step in animal biometrics towards deep metric learning driven, fully 3D-aware animal identification in open population settings. We publish network weights and relevant facilitating source code with this paper for full reproducibility and as inspiration for further research.
    Experience report of physics-informed neural networks in fluid simulations: pitfalls and frustration. (arXiv:2205.14249v2 [physics.flu-dyn] UPDATED)
    The deep learning boom motivates researchers and practitioners of computational fluid dynamics to integrate the two areas. The PINN (physics-informed neural network) method is one such attempt. While most reports in the literature show positive outcomes of applying the PINN method, our experiments with it stifled such optimism. This work presents our not-so-successful story of using PINN to solve two fundamental flow problems: 2D Taylor-Green vortex at $Re = 100$ and 2D cylinder flow at $Re = 200$. The PINN method solved the 2D Taylor-Green vortex problem with acceptable results, and we used this flow as an accuracy and performance benchmark. About 32 hours of training were required for the PINN method's accuracy to match the accuracy of a $16 \times 16$ finite-difference simulation, which took less than 20 seconds. The 2D cylinder flow, on the other hand, did not even result in a physical solution. The PINN method behaved like a steady-flow solver and did not capture the vortex shedding phenomenon. By sharing our experience, we would like to emphasize that the PINN method is still a work-in-progress. More work is needed to make PINN feasible for real-world problems.
    Poisoning Deep Learning Based Recommender Model in Federated Learning Scenarios. (arXiv:2204.13594v2 [cs.IR] UPDATED)
    Various attack methods against recommender systems have been proposed in the past years, and the security issues of recommender systems have drawn considerable attention. Traditional attacks attempt to make target items recommended to as many users as possible by poisoning the training data. Benefiting from the feature of protecting users' private data, federated recommendation can effectively defend against such attacks. Therefore, quite a few works have devoted themselves to developing federated recommender systems. To show that current federated recommendation is still vulnerable, in this work we design attack approaches targeting deep learning based recommender models in federated learning scenarios. Specifically, our attacks generate poisoned gradients for manipulated malicious users to upload based on two strategies (i.e., random approximation and hard user mining). Extensive experiments show that our well-designed attacks can effectively poison the target models, and the attack effectiveness sets a new state of the art.
    Stop Oversampling for Class Imbalance Learning: A Critical Review. (arXiv:2202.03579v2 [cs.LG] UPDATED)
    For the last two decades, oversampling has been employed to overcome the challenge of learning from imbalanced datasets. Many approaches to solving this challenge have been offered in the literature. Oversampling, on the other hand, is a concern. That is, models trained on fictitious data may fail spectacularly when applied to real-world problems. The fundamental difficulty with oversampling approaches is that, given a real-life population, the synthesized samples may not truly belong to the minority class. As a result, training a classifier on these samples while pretending they represent the minority class may result in incorrect predictions when the model is used in the real world. We analyzed a large number of oversampling methods in this paper and devised a new oversampling evaluation system based on hiding a number of majority examples and comparing them to those generated by the oversampling process. Based on our evaluation system, we ranked all these methods based on their incorrectly generated examples for comparison. Our experiments using more than 70 oversampling methods and three imbalanced real-world datasets reveal that all oversampling methods studied generate minority samples that are most likely to be majority. Given data and methods in hand, we argue that oversampling in its current forms and methodologies is unreliable for learning from class imbalanced data and should be avoided in real-world applications.
    What's in the Black Box? The False Negative Mechanisms Inside Object Detectors. (arXiv:2203.07662v2 [cs.CV] UPDATED)
    In object detection, false negatives arise when a detector fails to detect a target object. To understand why object detectors produce false negatives, we identify five 'false negative mechanisms', where each mechanism describes how a specific component inside the detector architecture failed. Focusing on two-stage and one-stage anchor-box object detector architectures, we introduce a framework for quantifying these false negative mechanisms. Using this framework, we investigate why Faster R-CNN and RetinaNet fail to detect objects in benchmark vision datasets and robotics datasets. We show that a detector's false negative mechanisms differ significantly between computer vision benchmark datasets and robotics deployment scenarios. This has implications for the translation of object detectors developed for benchmark datasets to robotics applications.
    Label Cleaning Multiple Instance Learning: Refining Coarse Annotations on Single Whole-Slide Images. (arXiv:2109.10778v2 [cs.CV] UPDATED)
    Annotating cancerous regions in whole-slide images (WSIs) of pathology samples plays a critical role in clinical diagnosis, biomedical research, and machine learning algorithms development. However, generating exhaustive and accurate annotations is labor-intensive, challenging, and costly. Drawing only coarse and approximate annotations is a much easier task, less costly, and it alleviates pathologists' workload. In this paper, we study the problem of refining these approximate annotations in digital pathology to obtain more accurate ones. Some previous works have explored obtaining machine learning models from these inaccurate annotations, but few of them tackle the refinement problem where the mislabeled regions should be explicitly identified and corrected, and all of them require a -- often very large -- number of training samples. We present a method, named Label Cleaning Multiple Instance Learning (LC-MIL), to refine coarse annotations on a single WSI without the need of external training data. Patches cropped from a WSI with inaccurate labels are processed jointly within a multiple instance learning framework, mitigating their impact on the predictive model and refining the segmentation. Our experiments on a heterogeneous WSI set with breast cancer lymph node metastasis, liver cancer, and colorectal cancer samples show that LC-MIL significantly refines the coarse annotations, outperforming state-of-the-art alternatives, even while learning from a single slide. Moreover, we demonstrate how real annotations drawn by pathologists can be efficiently refined and improved by the proposed approach. All these results demonstrate that LC-MIL is a promising, light-weight tool to provide fine-grained annotations from coarsely annotated pathology sets.
    STable: Table Generation Framework for Encoder-Decoder Models. (arXiv:2206.04045v1 [cs.CL])
    The output structure of database-like tables, consisting of values structured in horizontal rows and vertical columns identifiable by name, can cover a wide range of NLP tasks. Following this observation, we propose a framework for text-to-table neural models applicable to problems such as extraction of line items, joint entity and relation extraction, or knowledge base population. The permutation-based decoder of our proposal is a generalized sequential method that comprehends information from all cells in the table. The training maximizes the expected log-likelihood for a table's content across all random permutations of the factorization order. During the content inference, we exploit the model's ability to generate cells in any order by searching over possible orderings to maximize the model's confidence and avoid substantial error accumulation, which other sequential models are prone to. Experiments demonstrate a high practical value of the framework, which establishes state-of-the-art results on several challenging datasets, outperforming previous solutions by up to 15%.
    Inverse Contextual Bandits: Learning How Behavior Evolves over Time. (arXiv:2107.06317v3 [cs.LG] UPDATED)
    Understanding a decision-maker's priorities by observing their behavior is critical for transparency and accountability in decision processes, such as in healthcare. Though conventional approaches to policy learning almost invariably assume stationarity in behavior, this is hardly true in practice: Medical practice is constantly evolving as clinical professionals fine-tune their knowledge over time. For instance, as the medical community's understanding of organ transplantations has progressed over the years, a pertinent question is: How have actual organ allocation policies been evolving? To give an answer, we desire a policy learning method that provides interpretable representations of decision-making, in particular capturing an agent's non-stationary knowledge of the world, as well as operating in an offline manner. First, we model the evolving behavior of decision-makers in terms of contextual bandits, and formalize the problem of Inverse Contextual Bandits (ICB). Second, we propose two concrete algorithms as solutions, learning parametric and nonparametric representations of an agent's behavior. Finally, using both real and simulated data for liver transplantations, we illustrate the applicability and explainability of our method, as well as benchmarking and validating its accuracy.
    Geometry of Linear Convolutional Networks. (arXiv:2108.01538v2 [cs.LG] UPDATED)
    We study the family of functions that are represented by a linear convolutional neural network (LCN). These functions form a semi-algebraic subset of the set of linear maps from input space to output space. In contrast, the families of functions represented by fully-connected linear networks form algebraic sets. We observe that the functions represented by LCNs can be identified with polynomials that admit certain factorizations, and we use this perspective to describe the impact of the network's architecture on the geometry of the resulting function space. We further study the optimization of an objective function over an LCN, analyzing critical points in function space and in parameter space, and describing dynamical invariants for gradient descent. Overall, our theory predicts that the optimized parameters of an LCN will often correspond to repeated filters across layers, or filters that can be decomposed as repeated filters. We also conduct numerical and symbolic experiments that illustrate our results and present an in-depth analysis of the landscape for small architectures.
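    The link between linear convolutional networks and polynomial factorizations can be illustrated with a minimal sketch. It assumes single-channel, stride-1 layers only (the paper's setting may be more general), and the helper names `convolve` and `compose_filters` are hypothetical. Stacking such layers is equivalent to one convolution whose filter is the product of the layer filters viewed as polynomials.

```python
def convolve(f, g):
    """Full 1D convolution of two filters given as coefficient lists.

    This is the same operation as multiplying the polynomials whose
    coefficients are f and g.
    """
    out = [0.0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out


def compose_filters(filters):
    """End-to-end filter of stacked stride-1, single-channel linear conv layers."""
    result = [1.0]  # identity filter (the constant polynomial 1)
    for f in filters:
        result = convolve(result, f)
    return result


# Two layers with filters (1 + x) and (1 - x) compose to 1 - x^2.
end_to_end = compose_filters([[1.0, 1.0], [1.0, -1.0]])
print(end_to_end)  # [1.0, 0.0, -1.0]
```

    Read in reverse, an end-to-end filter is representable by a given architecture only if its polynomial factors into pieces of the right degrees, which is the kind of constraint the abstract refers to.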
    Predicting Census Survey Response Rates via Interpretable Nonparametric Additive Models with Structured Interactions. (arXiv:2108.11328v2 [stat.ML] UPDATED)
    Accurate and interpretable prediction of survey response rates is important from an operational standpoint. The US Census Bureau's well-known ROAM application uses principled statistical models trained on the US Census Planning Database data to identify hard-to-survey areas. An earlier crowdsourcing competition revealed that an ensemble of regression trees led to the best performance in predicting survey response rates; however, the corresponding models could not be adopted for the intended application due to limited interpretability. In this paper, we present new interpretable statistical methods to predict, with high accuracy, response rates in surveys. We study sparse nonparametric additive models with pairwise interactions via $\ell_0$-regularization, as well as hierarchically structured variants that provide enhanced interpretability. Despite strong methodological underpinnings, such models can be computationally challenging -- we present new scalable algorithms for learning these models. We also establish novel non-asymptotic error bounds for the proposed estimators. Experiments based on the US Census Planning Database demonstrate that our methods lead to high-quality predictive models that permit actionable interpretability for different segments of the population. Interestingly, our methods provide significant gains in interpretability without losing in predictive performance to state-of-the-art black-box machine learning methods based on gradient boosting and feedforward neural networks. Our code implementation in python is available at https://github.com/ShibalIbrahim/Additive-Models-with-Structured-Interactions.
    FedSEAL: Semi-Supervised Federated Learning with Self-Ensemble Learning and Negative Learning. (arXiv:2110.07829v2 [cs.LG] UPDATED)
    Federated learning (FL), a popular decentralized and privacy-preserving machine learning (ML) framework, has received extensive research attention in recent years. The majority of existing works focus on supervised learning (SL) problems where it is assumed that clients carry labeled datasets while the server has no data. However, in realistic scenarios, clients are often unable to label their data due to the lack of expertise and motivation while the server may host a small amount of labeled data. How to reasonably utilize the server's labeled data and the clients' unlabeled data is thus of paramount practical importance. In this paper, we propose a new FL algorithm, called FedSEAL, to solve this Semi-Supervised Federated Learning (SSFL) problem. Our algorithm utilizes self-ensemble learning and complementary negative learning to enhance both the accuracy and the efficiency of clients' unsupervised learning on unlabeled data, and orchestrates the model training on both the server side and the clients' side. Our experimental results on Fashion-MNIST and CIFAR10 datasets in the SSFL setting validate the effectiveness of our method, which outperforms the state-of-the-art SSFL methods by a large margin.
    Dissipative Deep Neural Dynamical Systems. (arXiv:2011.13492v3 [cs.LG] UPDATED)
    In this paper, we provide sufficient conditions for dissipativity and local asymptotic stability of discrete-time dynamical systems parametrized by deep neural networks. We leverage the representation of neural networks as pointwise affine maps, thus exposing their local linear operators and making them accessible to classical system analytic and design methods. This allows us to "crack open the black box" of the neural dynamical system's behavior by evaluating their dissipativity, and estimating their stationary points and state-space partitioning. We relate the norms of these local linear operators to the energy stored in the dissipative system with supply rates represented by their aggregate bias terms. Empirically, we analyze the variance in dynamical behavior and eigenvalue spectra of these local linear operators with varying weight factorizations, activation functions, bias terms, and depths.
    SelfCF: A Simple Framework for Self-supervised Collaborative Filtering. (arXiv:2107.03019v2 [cs.IR] UPDATED)
    Collaborative filtering (CF) is widely used to learn informative latent representations of users and items from observed interactions. Existing CF-based methods commonly adopt negative sampling to discriminate different items. Training with negative sampling on large datasets is computationally expensive. Further, negative items should be carefully sampled under the defined distribution, in order to avoid selecting an observed positive item in the training dataset. Unavoidably, some negative items sampled from the training dataset could be positive in the test set. In this paper, we propose a self-supervised collaborative filtering framework (SelfCF) that is specially designed for recommender scenarios with implicit feedback. The proposed SelfCF framework simplifies the Siamese networks and can be easily applied to existing deep-learning based CF models, which we refer to as backbone networks. The main idea of SelfCF is to augment the output embeddings generated by backbone networks, because it is infeasible to augment raw input of user/item ids. We propose and study three output perturbation techniques that can be applied to different types of backbone networks including both traditional CF models and graph-based models. The framework enables learning informative representations of users and items without negative samples, and is agnostic to the encapsulated backbones. We conduct comprehensive experiments on four datasets to show that our framework may achieve even better recommendation accuracy than the encapsulated supervised counterpart with a 2$\times$--4$\times$ faster training speed. We also show that SelfCF can boost the accuracy by up to 17.79\% on average, compared with a self-supervised framework BUIR.
    Attribution of Predictive Uncertainties in Classification Models. (arXiv:2107.08756v3 [cs.LG] UPDATED)
    Predictive uncertainties in classification tasks are often a consequence of model inadequacy or insufficient training data. In popular applications, such as image processing, we are often required to scrutinise these uncertainties by meaningfully attributing them to input features. This helps to improve interpretability assessments. However, there exist few effective frameworks for this purpose. Vanilla forms of popular methods for the provision of saliency masks, such as SHAP or integrated gradients, adapt poorly to target measures of uncertainty. Thus, state-of-the-art tools instead proceed by creating counterfactual or adversarial feature vectors, and assign attributions by direct comparison to original images. In this paper, we present a novel framework that combines path integrals, counterfactual explanations and generative models, in order to procure attributions that contain few observable artefacts or noise. We evidence that this outperforms existing alternatives through quantitative evaluations with popular benchmarking methods and data sets of varying complexity.
    To remove or not remove Mobile Apps? A data-driven predictive model approach. (arXiv:2206.03905v1 [cs.CR])
    Mobile app stores are the key distributors of mobile applications. They regularly apply vetting processes to the deployed apps. Yet, some of these vetting processes might be inadequate or applied late. The late removal of applications might have unpleasant consequences for developers and users alike. Thus, in this work we propose a data-driven predictive approach that determines whether the respective app will be removed or accepted. It also indicates the features' relevance that help the stakeholders in the interpretation. In turn, our approach can support developers in improving their apps and users in downloading the ones that are less likely to be removed. We focus on the Google App store and we compile a new data set of 870,515 applications, 56% of which have actually been removed from the market. Our proposed approach is a bootstrap aggregating of multiple XGBoost machine learning classifiers. We propose two models: user-centered using 47 features, and developer-centered using 37 features, the ones only available before deployment. We achieve the following Areas Under the ROC Curves (AUCs) on the test set: user-centered = 0.792, developer-centered = 0.762.
    Resolving the Human Subjects Status of Machine Learning's Crowdworkers. (arXiv:2206.04039v1 [cs.CY])
    In recent years, machine learning (ML) has come to rely more heavily on crowdworkers, both for building bigger datasets and for addressing research questions requiring human interaction or judgment. Owing to the diverse tasks performed by crowdworkers, and the myriad ways the resulting datasets are used, it can be difficult to determine when these individuals are best thought of as workers, versus as human subjects. These difficulties are compounded by conflicting policies, with some institutions and researchers treating all ML crowdwork as human subjects research, and other institutions holding that ML crowdworkers rarely constitute human subjects. Additionally, few ML papers involving crowdwork mention IRB oversight, raising the prospect that many might not be in compliance with ethical and regulatory requirements. In this paper, we focus on research in natural language processing to investigate the appropriate designation of crowdsourcing studies and the unique challenges that ML research poses for research oversight. Crucially, under the U.S. Common Rule, these judgments hinge on determinations of "aboutness", both whom (or what) the collected data is about and whom (or what) the analysis is about. We highlight two challenges posed by ML: (1) the same set of workers can serve multiple roles and provide many sorts of information; and (2) compared to the life sciences and social sciences, ML research tends to embrace a dynamic workflow, where research questions are seldom stated ex ante and data sharing opens the door for future studies to ask questions about different targets from the original study. In particular, our analysis exposes a potential loophole in the Common Rule, where researchers can elude research ethics oversight by splitting data collection and analysis into distinct studies. We offer several policy recommendations to address these concerns.
    Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens. (arXiv:2108.11193v2 [cs.CL] UPDATED)
    Standard pretrained language models operate on sequences of subword tokens without direct access to the characters that compose each token's string representation. We probe the embedding layer of pretrained language models and show that models learn the internal character composition of whole word and subword tokens to a surprising extent, without ever seeing the characters coupled with the tokens. Our results show that the embedding layer of RoBERTa holds enough information to accurately spell up to a third of the vocabulary and reach high average character ngram overlap on all token types. We further test whether enriching subword models with additional character information can improve language modeling, and observe that this method has a near-identical learning curve as training without spelling-based enrichment. Overall, our results suggest that language modeling objectives incentivize the model to implicitly learn some notion of spelling, and that explicitly teaching the model how to spell does not appear to enhance its performance on such tasks.
    An Information-Theoretic Framework for Supervised Learning. (arXiv:2203.00246v5 [cs.LG] UPDATED)
    Each year, deep learning demonstrates new and improved empirical results with deeper and wider neural networks. Meanwhile, with existing theoretical frameworks, it is difficult to analyze networks deeper than two layers without resorting to counting parameters or encountering sample complexity bounds that are exponential in depth. Perhaps it may be fruitful to try to analyze modern machine learning under a different lens. In this paper, we propose a novel information-theoretic framework with its own notions of regret and sample complexity for analyzing the data requirements of machine learning. With our framework, we first work through some classical examples such as scalar estimation and linear regression to build intuition and introduce general techniques. Then, we use the framework to study the sample complexity of learning from data generated by deep sign neural networks, deep ReLU neural networks, and deep networks that are infinitely wide but have a bounded sum of weights. For sign neural networks, we recover sample-complexity bounds that follow from VC-dimension based arguments. For the latter two neural network environments, we establish new results that suggest that the sample complexity of learning under these data generating processes is at most linear and quadratic, respectively, in network depth.
    Model-Based Reinforcement Learning Is Minimax-Optimal for Offline Zero-Sum Markov Games. (arXiv:2206.04044v1 [cs.LG])
    This paper makes progress towards learning Nash equilibria in two-player zero-sum Markov games from offline data. Specifically, consider a $\gamma$-discounted infinite-horizon Markov game with $S$ states, where the max-player has $A$ actions and the min-player has $B$ actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds -- called VI-LCB-Game -- that provably finds an $\varepsilon$-approximate Nash equilibrium with a sample complexity no larger than $\frac{C_{\mathsf{clipped}}^{\star}S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}$ (up to some log factor). Here, $C_{\mathsf{clipped}}^{\star}$ is some unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-\`a-vis the target data), and the target accuracy $\varepsilon$ can be any value within $\big(0,\frac{1}{1-\gamma}\big]$. Our sample complexity bound strengthens prior art by a factor of $\min\{A,B\}$, achieving minimax optimality for the entire $\varepsilon$-range. An appealing feature of our result lies in algorithmic simplicity, which reveals the unnecessity of variance reduction and sample splitting in achieving sample optimality.
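    To get a feel for how the stated bound scales, the expression can be evaluated directly. The sketch below is only arithmetic on the formula quoted in the abstract: it drops the logarithmic factor and any absolute constant, and the function name `vi_lcb_game_sample_bound` is a hypothetical helper, not part of the authors' code.

```python
def vi_lcb_game_sample_bound(S, A, B, gamma, eps, c_clipped):
    """Evaluate C* S (A + B) / ((1 - gamma)^3 eps^2), ignoring log factors.

    S: number of states; A, B: actions of the max- and min-player;
    gamma: discount factor in (0, 1); eps: target accuracy in (0, 1/(1-gamma)];
    c_clipped: the clipped concentrability coefficient (problem dependent).
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    if not 0.0 < eps <= 1.0 / (1.0 - gamma):
        raise ValueError("eps must lie in (0, 1/(1-gamma)]")
    return c_clipped * S * (A + B) / ((1.0 - gamma) ** 3 * eps ** 2)


# Example: 10 states, 3 and 2 actions, gamma = 0.5, eps = 1, coefficient 1.
print(vi_lcb_game_sample_bound(10, 3, 2, 0.5, 1.0, 1.0))  # 400.0
```

    Note that the bound depends on the action counts only through the sum A + B, and that halving eps quadruples it.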
    Structure-Aware Transformer for Graph Representation Learning. (arXiv:2202.03036v2 [stat.ML] UPDATED)
    The Transformer architecture has gained growing attention in graph representation learning recently, as it naturally overcomes several limitations of graph neural networks (GNNs) by avoiding their strict structural inductive biases and instead only encoding the graph structure via positional encoding. Here, we show that the node representations generated by the Transformer with positional encoding do not necessarily capture structural similarity between them. To address this issue, we propose the Structure-Aware Transformer, a class of simple and flexible graph Transformers built upon a new self-attention mechanism. This new self-attention incorporates structural information into the original self-attention by extracting a subgraph representation rooted at each node before computing the attention. We propose several methods for automatically generating the subgraph representation and show theoretically that the resulting representations are at least as expressive as the subgraph representations. Empirically, our method achieves state-of-the-art performance on five graph prediction benchmarks. Our structure-aware framework can leverage any existing GNN to extract the subgraph representation, and we show that it systematically improves performance relative to the base GNN model, successfully combining the advantages of GNNs and Transformers. Our code is available at https://github.com/BorgwardtLab/SAT .
    Scalable Joint Learning of Wireless Multiple-Access Policies and their Signaling. (arXiv:2206.03844v1 [cs.IT])
    In this paper, we apply a multi-agent reinforcement learning (MARL) framework allowing the base station (BS) and the user equipments (UEs) to jointly learn a channel access policy and its signaling in a wireless multiple access scenario. In this framework, the BS and UEs are reinforcement learning (RL) agents that need to cooperate in order to deliver data. The comparison with contention-free and contention-based baselines shows that our framework achieves superior performance in terms of goodput even in high traffic situations while maintaining a low collision rate. The scalability of the proposed method is also studied, since it is a major problem in MARL, and this paper provides first results toward addressing it.
    Few-shot Prompting Toward Controllable Response Generation. (arXiv:2206.03931v1 [cs.CL])
    Much literature has shown that prompt-based learning is an efficient method to make use of the large pre-trained language model. Recent works also exhibit the possibility of steering a chatbot's output by plugging in an appropriate prompt. Gradient-based methods are often used to perturb the prompts. However, some language models are not even available to the public. In this work, we first explored the combination of prompting and reinforcement learning (RL) to steer models' generation without accessing any of the models' parameters. Second, to reduce the training effort and enhance the generalizability to the unseen task, we apply multi-task learning to make the model learn to generalize to new tasks better. The experiment results show that our proposed method can successfully control several state-of-the-art (SOTA) dialogue models without accessing their parameters. Furthermore, the model demonstrates the strong ability to quickly adapt to an unseen task in fewer steps than the baseline model.
    A Study of Continual Learning Methods for Q-Learning. (arXiv:2206.03934v1 [cs.LG])
    We present an empirical study on the use of continual learning (CL) methods in a reinforcement learning (RL) scenario, which, to the best of our knowledge, has not been described before. CL is a very active recent research topic concerned with machine learning under non-stationary data distributions. Although this naturally applies to RL, the use of dedicated CL methods is still uncommon. This may be due to the fact that CL methods often assume a decomposition of CL problems into disjoint sub-tasks of stationary distribution, that the onset of these sub-tasks is known, and that sub-tasks are non-contradictory. In this study, we perform an empirical comparison of selected CL methods in a RL problem where a physically simulated robot must follow a racetrack by vision. In order to make CL methods applicable, we restrict the RL setting and introduce non-conflicting subtasks of known onset, which are however not disjoint and whose distribution, from the learner's point of view, is still non-stationary. Our results show that dedicated CL methods can significantly improve learning when compared to the baseline technique of "experience replay".
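    For readers unfamiliar with the "experience replay" baseline named above, a minimal sketch of a fixed-capacity replay buffer follows. This is a generic illustration, not the study's implementation; the class name `ReplayBuffer` and the tuple layout of a transition are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest item when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, transition):
        """Store one transition, e.g. (state, action, reward, next_state)."""
        self.buffer.append(transition)

    def sample(self, batch_size):
        """Draw up to batch_size distinct stored transitions."""
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buf.add((step, 0, 1.0, step + 1))  # hypothetical transitions
batch = buf.sample(2)
```

    Because old transitions are replayed alongside new ones, the learner sees a mixture of past and current data, which is why replay acts as a simple safeguard against forgetting under non-stationary distributions.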
    Sharp-MAML: Sharpness-Aware Model-Agnostic Meta Learning. (arXiv:2206.03996v1 [cs.LG])
    Model-agnostic meta learning (MAML) is currently one of the dominating approaches for few-shot meta-learning. Despite its effectiveness, the optimization of MAML can be challenging due to the innate bilevel problem structure. Specifically, the loss landscape of MAML is much more complex with possibly more saddle points and local minimizers than its empirical risk minimization counterpart. To address this challenge, we leverage the recently invented sharpness-aware minimization and develop a sharpness-aware MAML approach that we term Sharp-MAML. We empirically demonstrate that Sharp-MAML and its computation-efficient variant can outperform popular existing MAML baselines (e.g., $+12\%$ accuracy on Mini-Imagenet). We complement the empirical study with the convergence rate analysis and the generalization bound of Sharp-MAML. To the best of our knowledge, this is the first empirical and theoretical study on sharpness-aware minimization in the context of bilevel learning. The code is available at https://github.com/mominabbass/Sharp-MAML.
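    As background, one step of plain sharpness-aware minimization (SAM) can be sketched as follows. This is the single-level SAM update on a toy quadratic, not the bilevel Sharp-MAML procedure; the names `sam_step` and `quad_grad`, the step size, and the radius `rho` are assumptions chosen for illustration.

```python
import math


def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update: ascend to a nearby worst-case point, then descend.

    w: list of parameters; grad_fn: maps parameters to the loss gradient.
    """
    g = grad_fn(w)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point
    # Perturb the weights by rho along the normalized gradient direction.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Use the gradient at the perturbed point to update the original weights.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def quad_grad(w):
    """Gradient of the toy loss 0.5 * ||w||^2."""
    return list(w)


w = [1.0, 0.0]
for _ in range(3):
    w = sam_step(w, quad_grad)
```

    Each step costs two gradient evaluations, which is presumably the overhead the computation-efficient variant mentioned in the abstract aims to reduce.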
    Learning in games from a stochastic approximation viewpoint. (arXiv:2206.03922v1 [cs.GT])
    We develop a unified stochastic approximation framework for analyzing the long-run behavior of multi-agent online learning in games. Our framework is based on a "primal-dual", mirrored Robbins-Monro (MRM) template which encompasses a wide array of popular game-theoretic learning algorithms (gradient methods, their optimistic variants, the EXP3 algorithm for learning with payoff-based feedback in finite games, etc.). In addition to providing an integrated view of these algorithms, the proposed MRM blueprint allows us to obtain a broad range of new convergence results, both asymptotic and in finite time, in both continuous and finite games.
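    Of the algorithms listed, EXP3 is easy to show in isolation. The sketch below is the standard single-agent bandit form of EXP3 with exploration rate `gamma`, not the mirrored Robbins-Monro template itself; the function name `exp3`, the reward function, and all parameter values are assumptions for illustration.

```python
import math
import random


def exp3(reward_fn, n_arms, n_rounds, gamma=0.1, seed=0):
    """Run EXP3 and return the final arm-selection probabilities.

    reward_fn(arm) must return a reward in [0, 1]; only the pulled arm's
    reward is observed (payoff-based feedback).
    """
    rng = random.Random(seed)
    weights = [1.0] * n_arms

    def probabilities():
        total = sum(weights)
        # Mix the weight-proportional distribution with uniform exploration.
        return [(1 - gamma) * w / total + gamma / n_arms for w in weights]

    for _ in range(n_rounds):
        probs = probabilities()
        arm = rng.choices(range(n_arms), weights=probs)[0]
        # Importance-weighted estimate keeps the reward estimate unbiased.
        estimate = reward_fn(arm) / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / n_arms)
    return probabilities()


# Hypothetical two-armed problem in which only arm 1 pays off.
final_probs = exp3(lambda arm: 1.0 if arm == 1 else 0.0, n_arms=2, n_rounds=200)
```

    For long runs the weights should be renormalized periodically to avoid floating-point overflow; that step is omitted here for brevity.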
    Federated Learning Algorithms for Generalized Mixed-effects Model (GLMM) on Horizontally Partitioned Data from Distributed Sources. (arXiv:2109.14046v2 [stat.ML] UPDATED)
    Objectives: This paper develops two algorithms to achieve federated generalized linear mixed effect models (GLMM), and compares the developed models' outcomes with each other, as well as with those from the standard R package (`lme4'). Methods: The log-likelihood function of GLMM is approximated by two numerical methods (Laplace approximation and Gauss-Hermite approximation), which supports federated decomposition of GLMM to bring computation to data. Results: Our developed method can handle GLMM to accommodate hierarchical data with multiple non-independent levels of observations in a federated setting. The experiment results demonstrate comparable (Laplace) and superior (Gauss-Hermite) performances with simulated and real-world data. Conclusion: We developed and compared federated GLMMs with different approximations, which can support researchers in analyzing biomedical data to accommodate mixed effects and address non-independence due to hierarchical structures (e.g., institute, region, or country).
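    The Gauss-Hermite idea can be illustrated with a small sketch. It assumes a logistic model with a single Gaussian random intercept per cluster and uses a hard-coded 3-point rule; the function names `gh_expectation` and `cluster_likelihood` are hypothetical, and nothing here reproduces the authors' federated algorithm.

```python
import math

# 3-point Gauss-Hermite rule for integrals of exp(-x^2) * g(x).
_X = math.sqrt(1.5)
_SQRT_PI = math.sqrt(math.pi)
GH3_NODES = (-_X, 0.0, _X)
GH3_WEIGHTS = (_SQRT_PI / 6, 2 * _SQRT_PI / 3, _SQRT_PI / 6)


def gh_expectation(f, sigma):
    """Approximate E[f(b)] for b ~ Normal(0, sigma^2).

    Uses the change of variables b = sqrt(2) * sigma * x.
    """
    total = sum(w * f(math.sqrt(2.0) * sigma * x)
                for x, w in zip(GH3_NODES, GH3_WEIGHTS))
    return total / _SQRT_PI


def cluster_likelihood(y, fixed_eta, sigma):
    """Marginal likelihood of binary outcomes in one cluster.

    y: 0/1 outcomes; fixed_eta: fixed-effect linear predictors;
    sigma: standard deviation of the random intercept.
    """
    def conditional(b):
        lik = 1.0
        for yi, eta in zip(y, fixed_eta):
            p = 1.0 / (1.0 + math.exp(-(eta + b)))
            lik *= p if yi == 1 else 1.0 - p
        return lik
    return gh_expectation(conditional, sigma)


print(cluster_likelihood([1, 0, 1], [0.2, -0.1, 0.4], sigma=1.0))
```

    The total log-likelihood is a sum of per-cluster log terms, so under the assumption that each cluster's data sits at one site, each site could evaluate its own terms locally and share only the aggregates. A practical implementation would use more quadrature nodes than three.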
    COVIDHunter: An Accurate, Flexible, and Environment-Aware Open-Source COVID-19 Outbreak Simulation Model. (arXiv:2102.03667v2 [q-bio.PE] UPDATED)
    Background: Early detection and isolation of COVID-19 patients are essential for successful implementation of mitigation strategies and eventually curbing the disease spread. With a limited number of daily COVID-19 tests performed in every country, simulating the COVID-19 spread along with the potential effect of each mitigation strategy currently remains one of the most effective ways in managing the healthcare system and guiding policy-makers. Methods: We introduce COVIDHunter, a flexible and accurate COVID-19 outbreak simulation model that evaluates the current mitigation measures that are applied to a region and provides suggestions on what strength the upcoming mitigation measure should be. The key idea of COVIDHunter is to quantify the spread of COVID-19 in a geographical region by simulating the average number of new infections caused by an infected person considering the effect of external factors, such as environmental conditions (e.g., climate, temperature, humidity) and mitigation measures. Results: Using Switzerland as a case study, COVIDHunter estimates that if the policy-makers relax the mitigation measures by 50% for 30 days then both the daily capacity need for hospital beds and daily number of deaths increase exponentially by an average of 5.1x, who may occupy ICU beds and ventilators for a period of time. Unlike existing models, the COVIDHunter model accurately monitors and predicts the daily number of cases, hospitalizations, and deaths due to COVID-19. Our model is flexible to configure and simple to modify for modeling different scenarios under different environmental conditions and mitigation measures. Availability: We release the source code of the COVIDHunter implementation at https://github.com/CMU-SAFARI/COVIDHunter and show how to flexibly configure our model for any scenario and easily extend it for different measures and conditions than we account for.
    Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning. (arXiv:2205.04363v2 [cs.CV] UPDATED)
    Significant progress has been made on visual captioning, largely relying on pre-trained features and later fixed object detectors that serve as rich inputs to auto-regressive models. A key limitation of such methods, however, is that the output of the model is conditioned only on the object detector's outputs. The assumption that such outputs can represent all necessary information is unrealistic, especially when the detector is transferred across datasets. In this work, we reason about the graphical model induced by this assumption, and propose to add an auxiliary input to represent missing information such as object relationships. We specifically propose to mine attributes and relationships from the Visual Genome dataset and condition the captioning model on them. Crucially, we propose (and show to be important) the use of a multi-modal pre-trained model (CLIP) to retrieve such contextual descriptions. Further, object detector models are frozen and do not have sufficient richness to allow the captioning model to properly ground them. As a result, we propose to condition both the detector and description outputs on the image, and show qualitatively and quantitatively that this can improve grounding. We validate our method on image captioning, perform thorough analyses of each component and importance of the pre-trained multi-modal model, and demonstrate significant improvements over the current state of the art, specifically +7.5% in CIDEr and +1.3% in BLEU-4 metrics.
    Beyond Just Vision: A Review on Self-Supervised Representation Learning on Multimodal and Temporal Data. (arXiv:2206.02353v2 [cs.LG] UPDATED)
    Recently, Self-Supervised Representation Learning (SSRL) has attracted much attention in the fields of computer vision, speech, and natural language processing (NLP), and more recently for other modalities, including time series from sensors. The popularity of self-supervised learning is driven by the fact that traditional models typically require a huge amount of well-annotated data for training. Acquiring annotated data can be a difficult and costly process. Self-supervised methods have been introduced to improve the efficiency of training data through discriminative pre-training of models using supervisory signals that have been freely obtained from the raw data. Unlike existing reviews of SSRL that have predominantly focused upon methods in the fields of CV or NLP for a single modality, we aim to provide the first comprehensive review of multimodal self-supervised learning methods for temporal data. To this end, we 1) provide a comprehensive categorization of existing SSRL methods, 2) introduce a generic pipeline by defining the key components of an SSRL framework, 3) compare existing models in terms of their objective function, network architecture and potential applications, and 4) review existing multimodal techniques in each category and various modalities. Finally, we present existing weaknesses and future opportunities. We believe our work develops a perspective on the requirements of SSRL in domains that utilise multimodal and/or temporal data.
    Inferring Lexicographically-Ordered Rewards from Preferences. (arXiv:2202.10153v2 [cs.LG] UPDATED)
    Modeling the preferences of agents over a set of alternatives is a principal concern in many areas. The dominant approach has been to find a single reward/utility function with the property that alternatives yielding higher rewards are preferred over alternatives yielding lower rewards. However, in many settings, preferences are based on multiple, often competing, objectives; a single reward function is not adequate to represent such preferences. This paper proposes a method for inferring multi-objective reward-based representations of an agent's observed preferences. We model the agent's priorities over different objectives as entering lexicographically, so that objectives with lower priorities matter only when the agent is indifferent with respect to objectives with higher priorities. We offer two example applications in healthcare, one inspired by cancer treatment, the other inspired by organ transplantation, to illustrate how the lexicographically-ordered rewards we learn can provide a better understanding of a decision-maker's preferences and help improve policies when used in reinforcement learning.
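    The lexicographic ordering described above can be illustrated with a minimal sketch. It only shows how two alternatives would be compared once per-objective rewards are known; it does not reproduce the paper's inference procedure. The function name `lex_prefers` and the indifference tolerance `tol` are hypothetical choices made for illustration.

```python
from typing import Sequence


def lex_prefers(a: Sequence[float], b: Sequence[float], tol: float = 0.0) -> int:
    """Compare two reward vectors listed from highest to lowest priority.

    Returns 1 if `a` is preferred, -1 if `b` is preferred, 0 if indifferent.
    `tol` is an assumed indifference threshold: differences no larger than
    `tol` on an objective are treated as ties, so the next objective decides.
    """
    if len(a) != len(b):
        raise ValueError("reward vectors must have the same length")
    for reward_a, reward_b in zip(a, b):
        if abs(reward_a - reward_b) > tol:
            return 1 if reward_a > reward_b else -1
    return 0
```

    With `tol=0.1`, the vectors `[1.0, 0.0]` and `[1.05, 2.0]` tie on the first objective, so the second objective decides in favour of the latter.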
    Neural Diffusion Processes. (arXiv:2206.03992v1 [stat.ML])
    Gaussian processes provide an elegant framework for specifying prior and posterior distributions over functions. They are, however, also computationally expensive, and limited by the expressivity of their covariance function. We propose Neural Diffusion Processes (NDPs), a novel approach based upon diffusion models, that learn to sample from distributions over functions. Using a novel attention block, we can incorporate properties of stochastic processes, such as exchangeability, directly into the NDP's architecture. We empirically show that NDPs are able to capture functional distributions that are close to the true Bayesian posterior of a Gaussian process. This enables a variety of downstream tasks, including hyperparameter marginalisation and Bayesian optimisation.
    Data-driven hysteretic behavior simulation based on weighted stacked pyramid neural network architecture. (arXiv:2206.03990v1 [cs.LG])
    An accurate and efficient simulation of the hysteretic behavior of materials and components is essential for structural analysis. The surrogate model based on neural networks shows significant potential in balancing efficiency and accuracy. However, its serial information flow and prediction based on single-level features adversely affect the network performance. Therefore, a weighted stacked pyramid neural network architecture is proposed herein. This network establishes a pyramid architecture by introducing multi-level shortcuts to directly integrate features in the output module. In addition, a weighted stacked strategy is proposed to replace the conventional feature fusion method. The weights of the features are determined based on their levels. These basic principles are verified, and key network settings are discussed. Subsequently, the redesigned architectures are compared with other commonly used algorithms. Results show that the testing mean-square error (MSE) loss of the networks on varied datasets can be reduced by an average of 34.7%. The redesigned architectures outperform 87.5% of cases, and the proposed Pyramid-GA network has the best overall performance.
    Modeling Disagreement in Automatic Data Labelling for Semi-Supervised Learning in Clinical Natural Language Processing. (arXiv:2205.14761v2 [cs.LG] UPDATED)
    Computational models providing accurate estimates of their uncertainty are crucial for risk management associated with decision making in healthcare contexts. This is especially true since many state-of-the-art systems are trained using the data which has been labelled automatically (self-supervised mode) and tend to overfit. In this work, we investigate the quality of uncertainty estimates from a range of current state-of-the-art predictive models applied to the problem of observation detection in radiology reports. This problem remains understudied for Natural Language Processing in the healthcare domain. We demonstrate that Gaussian Processes (GPs) provide superior performance in quantifying the risks of 3 uncertainty labels based on the negative log predictive probability (NLPP) evaluation metric and mean maximum predicted confidence levels (MMPCL), whilst retaining strong predictive performance.
    Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks. (arXiv:2206.03826v1 [cs.LG])
    For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches randomly mask input patches and then reconstruct pixels or semantic features of these masked patches via an auto-encoder. Then for a downstream task, supervised fine-tuning of the pretrained encoder remarkably surpasses the conventional supervised learning (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic learning in the pretraining phase and 2) why it helps in downstream tasks. To solve these problems, we theoretically show that on an auto-encoder of a two/one-layered convolution encoder/decoder, MRP can capture all discriminative semantics in the pretraining dataset, and accordingly show its provable improvement over SL on the classification downstream task. Specifically, we assume that the pretraining dataset contains multi-view samples of ratio $1-\mu$ and single-view samples of ratio $\mu$, where multi/single-view samples have multiple/single discriminative semantics. Then for pretraining, we prove that 1) the convolution kernels of the MRP encoder capture all discriminative semantics in the pretraining data; and 2) a convolution kernel captures at most one semantic. Accordingly, in the downstream supervised fine-tuning, most semantics would be captured and different semantics would not be fused together. This helps the downstream fine-tuned network to easily establish the relation between kernels and semantic class labels. In this way, the fine-tuned encoder in MRP provably achieves zero test error with high probability for both multi-view and single-view test data. In contrast, as proved by [3], conventional SL can only obtain a test accuracy of around $0.5\mu$ for single-view test data. These results together explain the benefits of MRP in downstream tasks. Experimental results testify to multi-view data assumptions and our theoretical implications.
    Option Transfer and SMDP Abstraction with Successor Features. (arXiv:2110.09196v2 [cs.LG] UPDATED)
    Abstraction plays an important role in the generalisation of knowledge and skills and is key to sample efficient learning. In this work, we study joint temporal and state abstraction in reinforcement learning, where temporally-extended actions in the form of options induce temporal abstractions, while aggregation of similar states with respect to abstract options induces state abstractions. Many existing abstraction schemes ignore the interplay of state and temporal abstraction. Consequently, the considered option policies often cannot be directly transferred to new environments due to changes in the state space and transition dynamics. To address this issue, we propose a novel abstraction scheme building on successor features. This includes an algorithm for transferring abstract options across different environments and a state abstraction mechanism that allows us to perform efficient planning with the transferred options.
    Action Noise in Off-Policy Deep Reinforcement Learning: Impact on Exploration and Performance. (arXiv:2206.03787v1 [cs.LG])
    Many deep reinforcement learning algorithms rely on simple forms of exploration, such as the additive action-noise often used in continuous control domains. Typically, the scaling factor of this action noise is chosen as a hyper-parameter and kept constant during training. In this paper, we analyze how the learned policy is impacted by the noise type, scale, and reducing of the scaling factor over time. We consider the two most prominent types of action-noise: Gaussian and Ornstein-Uhlenbeck noise, and perform a vast experimental campaign by systematically varying the noise type and scale parameter, and by measuring variables of interest like the expected return of the policy and the state space coverage during exploration. For the latter, we propose a novel state-space coverage measure $\operatorname{X}_{\mathcal{U}\text{rel}}$ that is more robust to boundary artifacts than previously proposed measures. Larger noise scales generally increase state space coverage. However, we found that increasing the space coverage using a larger noise scale is often not beneficial. On the contrary, reducing the noise-scale over the training process reduces the variance and generally improves the learning performance. We conclude that the best noise-type and scale are environment dependent, and based on our observations, derive heuristic rules for guiding the choice of the action noise as a starting point for further optimization.
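    As a rough illustration of the two noise types and of reducing the noise scale over training, here is a minimal standard-library sketch. The class names, the mean-reversion rate `theta`, the time step `dt`, and the linear schedule are all assumptions chosen for illustration; they are not taken from the paper's experimental setup.

```python
import math
import random


class GaussianNoise:
    """Uncorrelated noise: each sample is drawn independently."""

    def __init__(self, scale, seed=None):
        self.scale = scale
        self.rng = random.Random(seed)

    def sample(self):
        return self.scale * self.rng.gauss(0.0, 1.0)


class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise that reverts towards zero."""

    def __init__(self, scale, theta=0.15, dt=0.01, seed=None):
        self.scale = scale
        self.theta = theta  # assumed mean-reversion rate
        self.dt = dt        # assumed discretisation step
        self.rng = random.Random(seed)
        self.state = 0.0

    def sample(self):
        # Euler-Maruyama step of dx = -theta * x dt + scale dW
        drift = -self.theta * self.state * self.dt
        diffusion = self.scale * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.state += drift + diffusion
        return self.state


def linear_decay(initial, final, step, total_steps):
    """Linearly interpolate the noise scale from `initial` to `final`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return (1.0 - frac) * initial + frac * final
```

    In a training loop, one would add `noise.sample()` to each action component and update `noise.scale` from `linear_decay` as training progresses.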
    Entropic Convergence of Random Batch Methods for Interacting Particle Diffusion. (arXiv:2206.03792v1 [math.PR])
    We propose a covariance-corrected random batch method for interacting particle systems. By establishing a certain entropic central limit theorem, we provide entropic convergence guarantees for the law of the entire trajectories of all particles of the proposed method to the law of the trajectories of the discrete time interacting particle system whenever the batch size $B \gg (\alpha n)^{\frac{1}{3}}$ (where $n$ is the number of particles and $\alpha$ is the time discretization parameter). This in turn implies that the outputs of these methods are nearly statistically indistinguishable when $B$ is even moderately large. Previous works mainly considered convergence in Wasserstein distance and either required stringent assumptions on the potentials or gave bounds with an exponential dependence on the time horizon. This work makes minimal assumptions on the interaction potentials and in particular establishes that even when the particle trajectories diverge to infinity, they do so in the same way for both methods. Such guarantees are very useful in light of the recent advances in interacting particle based algorithms for sampling.
    Patch-based Object-centric Transformers for Efficient Video Generation. (arXiv:2206.04003v1 [cs.CV])
    In this work, we present Patch-based Object-centric Video Transformer (POVT), a novel region-based video generation architecture that leverages object-centric information to efficiently model temporal dynamics in videos. We build upon prior work in video prediction via an autoregressive transformer over the discrete latent space of compressed videos, with an added modification to model object-centric information via bounding boxes. Due to better compressibility of object-centric representations, we can improve training efficiency by allowing the model to only access object information for longer horizon temporal information. When evaluated on various difficult object-centric datasets, our method achieves better or equal performance to other video generation models, while remaining computationally more efficient and scalable. In addition, we show that our method is able to perform object-centric controllability through bounding box manipulation, which may aid downstream tasks such as video editing, or visual planning. Samples are available at https://sites.google.com/view/povt-public
    Neural Bandit with Arm Group Graph. (arXiv:2206.03644v1 [cs.LG])
    Contextual bandits aim to identify among a set of arms the optimal one with the highest reward based on their contextual information. Motivated by the fact that the arms usually exhibit group behaviors and the mutual impacts exist among groups, we introduce a new model, Arm Group Graph (AGG), where the nodes represent the groups of arms and the weighted edges formulate the correlations among groups. To leverage the rich information in AGG, we propose a bandit algorithm, AGG-UCB, where the neural networks are designed to estimate rewards, and we propose to utilize graph neural networks (GNN) to learn the representations of arm groups with correlations. To solve the exploitation-exploration dilemma in bandits, we derive a new upper confidence bound (UCB) built on neural networks (exploitation) for exploration. Furthermore, we prove that AGG-UCB can achieve a near-optimal regret bound with over-parameterized neural networks, and provide the convergence analysis of GNN with fully-connected layers which may be of independent interest. In the end, we conduct extensive experiments against state-of-the-art baselines on multiple public data sets, showing the effectiveness of the proposed algorithm.
    Accelerating Score-based Generative Models for High-Resolution Image Synthesis. (arXiv:2206.04029v1 [cs.CV])
    Score-based generative models (SGMs) have recently emerged as a promising class of generative models. The key idea is to produce high-quality images by recurrently adding Gaussian noise and gradients to a Gaussian sample until converging to the target distribution, a.k.a. the diffusion sampling. To ensure stability of convergence in sampling and generation quality, however, this sequential sampling process has to take a small step size and many sampling iterations (e.g., 2000). Several acceleration methods have been proposed with focus on low-resolution generation. In this work, we consider the acceleration of high-resolution generation with SGMs, a more challenging yet more important problem. We prove theoretically that this slow convergence drawback is primarily due to ignorance of the target distribution. Further, we introduce a novel Target Distribution Aware Sampling (TDAS) method by leveraging the structural priors in space and frequency domains. Extensive experiments on CIFAR-10, CelebA, LSUN, and FFHQ datasets validate that TDAS can consistently accelerate state-of-the-art SGMs, particularly on more challenging high resolution (1024x1024) image generation tasks by up to 18.4x, whilst largely maintaining the synthesis quality. With fewer sampling iterations, TDAS can still generate good quality images. In contrast, the existing methods degrade drastically or even fail completely.
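    The baseline sampling loop that the abstract describes (repeatedly adding a gradient term and Gaussian noise to a Gaussian sample) can be sketched as unadjusted Langevin dynamics. This is a toy one-dimensional sketch under strong assumptions: the score is that of a known Gaussian target rather than a learned network, and it does not implement TDAS. The step size, the target parameters, and the function names are hypothetical.

```python
import math
import random


def gaussian_score(x, mean=3.0, std=1.0):
    """Score (gradient of the log density) of an assumed N(mean, std^2) target."""
    return -(x - mean) / (std * std)


def langevin_sample(score, n_steps=2000, step_size=0.01, rng=None):
    """Run one chain of unadjusted Langevin dynamics from a Gaussian start."""
    if rng is None:
        rng = random.Random()
    x = rng.gauss(0.0, 1.0)
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Small gradient step plus injected Gaussian noise
        x += 0.5 * step_size * score(x) + math.sqrt(step_size) * noise
    return x


rng = random.Random(0)
samples = [langevin_sample(gaussian_score, rng=rng) for _ in range(200)]
sample_mean = sum(samples) / len(samples)
```

    The small step size is what forces the many iterations: each update moves the sample only slightly towards the target, which is the cost that acceleration methods aim to reduce.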
    Progress Report: A Deep Learning Guided Exploration of Affine Unimodular Loop Transformations. (arXiv:2206.03684v1 [cs.PL])
    In this paper, we present a work in progress about a deep learning based approach for automatic code optimization in polyhedral compilers. The proposed technique explores combinations of affine and non-affine loop transformations to find the sequence of transformations that minimizes the execution time of a given program. This exploration is guided by a deep learning based cost model that evaluates the speedup that each sequence of transformations would yield. Preliminary results show that the proposed techniques achieve a 2.35x geometric mean speedup over state of the art polyhedral compilers (Pluto).
    Sequential Density Estimation via Nonlinear Continuous Weighted Finite Automata. (arXiv:2206.03923v1 [cs.LG])
    Weighted finite automata (WFAs) have been widely applied in many fields. One of the classic problems for WFAs is probability distribution estimation over sequences of discrete symbols. Although WFAs have been extended to deal with continuous input data, namely continuous WFAs (CWFAs), it is still unclear how to approximate density functions over sequences of continuous random variables using WFA-based models, due to the limitation on the expressiveness of the model as well as the tractability of approximating density functions via CWFAs. In this paper, we first propose a nonlinear extension to the CWFA model to improve its expressiveness, which we refer to as nonlinear continuous WFAs (NCWFAs). Then we leverage the so-called RNADE method, which is a well-known density estimator based on neural networks, and propose the RNADE-NCWFA model. The RNADE-NCWFA model computes a density function by design. We show that this model is strictly more expressive than the Gaussian HMM model, which CWFA cannot approximate. Empirically, we conduct a synthetic experiment using Gaussian HMM generated data. We focus on evaluating the model's ability to estimate densities for sequences of varying lengths (longer length than the training data). We observe that our model performs the best among the compared baseline methods.
    TURJUMAN: A Public Toolkit for Neural Arabic Machine Translation. (arXiv:2206.03933v1 [cs.CL])
    We present TURJUMAN, a neural toolkit for translating from 20 languages into Modern Standard Arabic (MSA). TURJUMAN exploits the recently-introduced text-to-text Transformer AraT5 model, endowing it with a powerful ability to decode into Arabic. The toolkit offers the possibility of employing a number of diverse decoding methods, making it suited for acquiring paraphrases for the MSA translations as an added value. To train TURJUMAN, we sample from publicly available parallel data employing a simple semantic similarity method to ensure data quality. This allows us to prepare and release AraOPUS-20, a new machine translation benchmark. We publicly release our translation toolkit (TURJUMAN) as well as our benchmark dataset (AraOPUS-20).
    Sim2real for Reinforcement Learning Driven Next Generation Networks. (arXiv:2206.03846v1 [cs.LG])
    The next generation of networks will actively embrace artificial intelligence (AI) and machine learning (ML) technologies for automation networks and optimal network operation strategies. The emerging network structure represented by Open RAN (O-RAN) conforms to this trend, and the radio intelligent controller (RIC) at the centre of its specification serves as an ML applications host. Various ML models, especially Reinforcement Learning (RL) models, are regarded as the key to solving RAN-related multi-objective optimization problems. However, it should be recognized that most of the current RL successes are confined to abstract and simplified simulation environments, which may not directly translate to high performance in complex real environments. One of the main reasons is the modelling gap between the simulation and the real environment, which could make the RL agent trained by simulation ill-equipped for the real environment. This issue is termed as the sim2real gap. This article brings to the fore the sim2real challenge within the context of O-RAN. Specifically, it emphasizes the characteristics, and benefits that the digital twins (DT) could have as a place for model development and verification. Several use cases are presented to exemplify and demonstrate failure modes of the simulations trained RL model in real environments. The effectiveness of DT in assisting the development of RL algorithms is discussed. Then the current state of the art learning-based methods commonly used to overcome the sim2real challenge are presented. Finally, the development and deployment concerns for the RL applications realisation in O-RAN are discussed from the view of the potential issues like data interaction, environment bottlenecks, and algorithm design.
    Robust Semantic Communications with Masked VQ-VAE Enabled Codebook. (arXiv:2206.04011v1 [eess.SP])
    Although semantic communications have exhibited satisfactory performance for a large number of tasks, the impact of semantic noise and the robustness of the systems have not been well investigated. Semantic noise refers to the mismatch between the intended semantic symbols and the received ones, which causes tasks to fail. In this paper, we first propose a framework for robust end-to-end semantic communication systems to combat the semantic noise. In particular, we analyze sample-dependent and sample-independent semantic noise. To combat the semantic noise, adversarial training with weight perturbation is developed to incorporate the samples with semantic noise in the training dataset. Then, we propose to mask a portion of the input, where the semantic noise appears frequently, and design the masked vector quantized-variational autoencoder (VQ-VAE) with the noise-related masking strategy. We use a discrete codebook shared by the transmitter and the receiver for encoded feature representation. To further improve the system robustness, we develop a feature importance module (FIM) to suppress the noise-related and task-unrelated features. Thus, the transmitter simply needs to transmit the indices of these important task-related features in the codebook. Simulation results show that the proposed method can be applied in many downstream tasks and significantly improve the robustness against semantic noise with remarkable reduction on the transmission overhead.
    Scaleformer: Iterative Multi-scale Refining Transformers for Time Series Forecasting. (arXiv:2206.04038v1 [cs.LG])
    The performance of time series forecasting has recently been greatly improved by the introduction of transformers. In this paper, we propose a general multi-scale framework that can be applied to state-of-the-art transformer-based time series forecasting models including Autoformer and Informer. By iteratively refining a forecasted time series at multiple scales with shared weights, architecture adaptations and a specially-designed normalization scheme, we are able to achieve significant performance improvements with minimal additional computational overhead. Via detailed ablation studies, we demonstrate the effectiveness of our proposed architectural and methodological innovations. Furthermore, our experiments on four public datasets show that the proposed multi-scale framework outperforms the corresponding baselines with an average improvement of 13% and 38% over Autoformer and Informer, respectively.
    Automatic Personality Prediction; an Enhanced Method Using Ensemble Modeling. (arXiv:2007.04571v3 [cs.CL] UPDATED)
    Human personality is significantly represented by those words which he/she uses in his/her speech or writing. As a consequence of spreading the information infrastructures (specifically the Internet and social media), human communications have reformed notably from face to face communication. Generally, Automatic Personality Prediction (or Perception) (APP) is the automated forecasting of the personality on different types of human generated/exchanged contents (like text, speech, image, video, etc.). The major objective of this study is to enhance the accuracy of APP from the text. To this end, we suggest five new APP methods including term frequency vector-based, ontology-based, enriched ontology-based, latent semantic analysis (LSA)-based, and deep learning-based (BiLSTM) methods. These methods as the base ones, contribute to each other to enhance the APP accuracy through ensemble modeling (stacking) based on a hierarchical attention network (HAN) as the meta-model. The results show that ensemble modeling enhances the accuracy of APP.
    Towards Bridging Algorithm and Theory for Unbiased Recommendation. (arXiv:2206.03851v1 [cs.IR])
    This work studies the problem of learning unbiased algorithms from biased feedback for recommender systems. We address this problem from both theoretical and algorithmic perspectives. Recent works in unbiased learning have advanced the state-of-the-art with various techniques such as meta-learning, knowledge distillation, and information bottleneck. Despite their empirical successes, most of them lack theoretical guarantees, forming non-negligible gaps between the theories and recent algorithms. To this end, we first view the unbiased recommendation problem from a distribution shift perspective. We theoretically analyze the generalization bounds of unbiased learning and suggest their close relations with recent unbiased learning objectives. Based on the theoretical analysis, we further propose a principled framework, Adversarial Self-Training (AST), for unbiased recommendation. Empirical evaluation on real-world and semi-synthetic datasets demonstrates the effectiveness of the proposed AST.
    ConFUDA: Contrastive Fewshot Unsupervised Domain Adaptation for Medical Image Segmentation. (arXiv:2206.03888v1 [cs.CV])
    Unsupervised domain adaptation (UDA) aims to transfer knowledge learned from a labeled source domain to an unlabeled target domain. Contrastive learning (CL) in the context of UDA can help to better separate classes in feature space. However, in image segmentation, the large memory footprint due to the computation of the pixel-wise contrastive loss makes it prohibitive to use. Furthermore, labeled target data is not easily available in medical imaging, and obtaining new samples is not economical. As a result, in this work, we tackle a more challenging UDA task when there are only a few (fewshot) or a single (oneshot) image available from the target domain. We apply a style transfer module to mitigate the scarcity of target samples. Then, to align the source and target features and tackle the memory issue of the traditional contrastive loss, we propose the centroid-based contrastive learning (CCL) and a centroid norm regularizer (CNR) to optimize the contrastive pairs in both direction and magnitude. In addition, we propose multi-partition centroid contrastive learning (MPCCL) to further reduce the variance in the target features. Fewshot evaluation on MS-CMRSeg dataset demonstrates that ConFUDA improves the segmentation performance by 0.34 of the Dice score on the target domain compared with the baseline, and 0.31 Dice score improvement in a more rigorous oneshot setting.
    One Ring to Bring Them All: Towards Open-Set Recognition under Domain Shift. (arXiv:2206.03600v1 [cs.CV])
    In this paper, we investigate open-set recognition with domain shift, where the final goal is to achieve Source-free Universal Domain Adaptation (SF-UNDA), which addresses the situation where there exist both domain and category shifts between source and target domains. Under the SF-UNDA setting, the model cannot access source data anymore during target adaptation, which aims to address data privacy concerns. We propose a novel training scheme to learn a ($n$+1)-way classifier to predict the $n$ source classes and the unknown class, where samples of only known source categories are available for training. Furthermore, for target adaptation, we simply adopt a weighted entropy minimization to adapt the source pretrained model to the unlabeled target domain without source data. In experiments, we show: 1) After source training, the resulting source model can get excellent performance for open-set single domain generalization and also open-set recognition tasks; 2) After target adaptation, our method surpasses current UNDA approaches which demand source data during adaptation on several benchmarks. The versatility to several different tasks strongly proves the efficacy and generalization ability of our method. 3) When augmented with a closed-set domain adaptation approach during target adaptation, our source-free method further outperforms the current state-of-the-art UNDA method by 2.5%, 7.2% and 13% on Office-31, Office-Home and VisDA respectively. Code will be available at https://github.com/Albert0147/OneRing.
    An Analysis of Selection Bias Issue for Online Advertising. (arXiv:2206.03853v1 [cs.IR])
    In online advertising, a set of potential advertisements can be ranked by a certain auction system where usually the top-1 advertisement would be selected and displayed at an advertising space. In this paper, we show a selection bias issue that is present in an auction system. Our analysis shows that the selection bias destroys the truthfulness of the auction, which implies that the buyers (advertisers) in the auction cannot maximize their profits. Although selection bias is well known in the field of statistics and has been studied extensively, our main contribution is to combine the theoretical analysis of the bias with the auction mechanism. In our experiment using online A/B testing, we evaluate the selection bias on an auction system whose ranking score is a function of the predicted CTR (click-through rate) of an advertisement. The experiment showed that the selection bias is drastically reduced by using multi-task learning that learns from the data for all advertisements.
    Escaping the Big Data Paradigm with Compact Transformers. (arXiv:2104.05704v4 [cs.CV] UPDATED)
    With the rise of Transformers as the standard for language processing, and their advancements in computer vision, there has been a corresponding growth in parameter size and amounts of training data. Many have come to believe that because of this, transformers are not suitable for small sets of data. This trend leads to concerns such as: limited availability of data in certain scientific domains and the exclusion of those with limited resource from research in the field. In this paper, we aim to present an approach for small-scale learning by introducing Compact Transformers. We show for the first time that with the right size, convolutional tokenization, transformers can avoid overfitting and outperform state-of-the-art CNNs on small datasets. Our models are flexible in terms of model size, and can have as little as 0.28M parameters while achieving competitive results. Our best model can reach 98% accuracy when training from scratch on CIFAR-10 with only 3.7M parameters, which is a significant improvement in data-efficiency over previous Transformer based models being over 10x smaller than other transformers and is 15% the size of ResNet50 while achieving similar performance. CCT also outperforms many modern CNN based approaches, and even some recent NAS-based approaches. Additionally, we obtain a new SOTA result on Flowers-102 with 99.76% top-1 accuracy, and improve upon the existing baseline on ImageNet (82.71% accuracy with 29% as many parameters as ViT), as well as NLP tasks. Our simple and compact design for transformers makes them more feasible to study for those with limited computing resources and/or dealing with small datasets, while extending existing research efforts in data efficient transformers. Our code and pre-trained models are publicly available at https://github.com/SHI-Labs/Compact-Transformers.
    How unfair is private learning? (arXiv:2206.03985v1 [cs.LG])
    As machine learning algorithms are deployed on sensitive data in critical decision making processes, it is becoming increasingly important that they are also private and fair. In this paper, we show that, when the data has a long-tailed structure, it is not possible to build accurate learning algorithms that are both private and result in higher accuracy on minority subpopulations. We further show that relaxing overall accuracy can lead to good fairness even with strict privacy requirements. To corroborate our theoretical results in practice, we provide an extensive set of experimental results using a variety of synthetic, vision (CIFAR and CelebA), and tabular (Law School) datasets and learning algorithms.
    Diffusion Curvature for Estimating Local Curvature in High Dimensional Data. (arXiv:2206.03977v1 [cs.LG])
    We introduce a new intrinsic measure of local curvature on point-cloud data called diffusion curvature. Our measure uses the framework of diffusion maps, including the data diffusion operator, to structure point cloud data and define local curvature based on the laziness of a random walk starting at a point or region of the data. We show that this laziness directly relates to volume comparison results from Riemannian geometry. We then extend this scalar curvature notion to an entire quadratic form using neural network estimations based on the diffusion map of point-cloud data. We show applications of both estimations on toy data, single-cell data, and on estimating local Hessian matrices of neural network loss landscapes.
    Improving trajectory calculations using deep learning inspired single image superresolution. (arXiv:2206.04015v1 [physics.ao-ph])
    Lagrangian trajectory or particle dispersion models as well as semi-Lagrangian advection schemes require meteorological data such as wind, temperature and geopotential at the exact spatio-temporal locations of the particles that move independently from a regular grid. Traditionally, this high-resolution data has been obtained by interpolating the meteorological parameters from the gridded data of a meteorological model or reanalysis, e.g. using linear interpolation in space and time. However, interpolation errors are a large source of error for these models. Reducing them requires meteorological input fields with high space and time resolution, which may not always be available and can cause severe data storage and transfer problems. Here, we interpret this problem as a single image superresolution task. We interpret meteorological fields available at their native resolution as low-resolution images and train deep neural networks to up-scale them to higher resolution, thereby providing more accurate data for Lagrangian models. We train various versions of the state-of-the-art Enhanced Deep Residual Networks for Superresolution on low-resolution ERA5 reanalysis data with the goal to up-scale these data to arbitrary spatial resolution. We show that the resulting up-scaled wind fields have root-mean-squared errors half the size of the winds obtained with linear spatial interpolation at acceptable computational inference costs. In a test setup using the Lagrangian particle dispersion model FLEXPART and reduced-resolution wind fields, we demonstrate that absolute horizontal transport deviations of calculated trajectories from "ground-truth" trajectories calculated with undegraded 0.5° winds are reduced by at least 49.5% (21.8%) after 48 hours relative to trajectories using linear interpolation of the wind data when training on 2° to 1° (4° to 2°) resolution data.
    SYNERgy between SYNaptic consolidation and Experience Replay for general continual learning. (arXiv:2206.04016v1 [cs.NE])
    Continual learning (CL) in the brain is facilitated by a complex set of mechanisms. This includes the interplay of multiple memory systems for consolidating information as posited by the complementary learning systems (CLS) theory and synaptic consolidation for protecting the acquired knowledge from erasure. Thus, we propose a general CL method that creates a synergy between SYNaptic consolidation and dual memory Experience Replay (SYNERgy). Our method maintains a semantic memory that accumulates and consolidates information across the tasks and interacts with the episodic memory for effective replay. It further employs synaptic consolidation by tracking the importance of parameters during the training trajectory and anchoring them to the consolidated parameters in the semantic memory. To the best of our knowledge, our study is the first to employ dual memory experience replay in conjunction with synaptic consolidation that is suitable for general CL whereby the network does not utilize task boundaries or task labels during training or inference. Our evaluation on various challenging CL scenarios and characteristics analyses demonstrate the efficacy of incorporating both synaptic consolidation and CLS theory in enabling effective CL in DNNs.
    Performance, Transparency and Time. Feature selection to speed up the diagnosis of Parkinson's disease. (arXiv:2206.03716v1 [cs.LG])
    Accurate and early prediction of a disease makes it possible to plan for and improve a patient's future quality of life. During pandemic situations, medical decision-making becomes a race against time in which physicians have to act fast to diagnose the disease and predict the risk of its severity; this is also a high priority for neurodegenerative diseases like Parkinson's disease. Machine Learning (ML) models with Feature Selection (FS) techniques can be applied to help physicians quickly diagnose a disease. FS selects an optimal subset of features that improves model performance and helps reduce the number of tests a patient needs, thereby speeding up the diagnosis. This study shows the results of three FS techniques applied ahead of a classifier algorithm, Logistic Regression, on non-invasive test results data. The three FS techniques are Analysis of Variance (ANOVA) as a filter-based method, Least Absolute Shrinkage and Selection Operator (LASSO) as an embedded method and Sequential Feature Selection (SFS) as a wrapper method. The outcome shows that FS techniques can help build an efficient and effective classifier, improving the performance of the classifier while reducing the computation time.
    FEL: High Capacity Learning for Recommendation and Ranking via Federated Ensemble Learning. (arXiv:2206.03852v1 [cs.IR])
    Federated learning (FL) has emerged as an effective approach to address consumer privacy needs. FL has been successfully applied to certain machine learning tasks, such as training smart keyboard models and keyword spotting. Despite FL's initial success, many important deep learning use cases, such as ranking and recommendation tasks, have been limited from on-device learning. One of the key challenges faced by practical FL adoption for DL-based ranking and recommendation is the prohibitive resource requirements that cannot be satisfied by modern mobile systems. We propose Federated Ensemble Learning (FEL) as a solution to tackle the large memory requirement of deep learning ranking and recommendation tasks. FEL enables large-scale ranking and recommendation model training on-device by simultaneously training multiple model versions on disjoint clusters of client devices. FEL integrates the trained sub-models via an over-arch layer into an ensemble model that is hosted on the server. Our experiments demonstrate that FEL leads to 0.43-2.31% model quality improvement over traditional on-device federated learning - a significant improvement for ranking and recommendation system use cases.
    Blacklight: Defending Black-Box Adversarial Attacks on Deep Neural Networks. (arXiv:2006.14042v2 [cs.CR] UPDATED)
    Deep learning systems are known to be vulnerable to adversarial examples. In particular, query-based black-box attacks do not require knowledge of the deep learning model, but can compute adversarial examples over the network by submitting queries and inspecting returns. Recent work largely improves the efficiency of those attacks, demonstrating their practicality on today's ML-as-a-service platforms. We propose Blacklight, a new defense against query-based black-box adversarial attacks. The fundamental insight driving our design is that, to compute adversarial examples, these attacks perform iterative optimization over the network, producing image queries highly similar in the input space. Blacklight detects query-based black-box attacks by detecting highly similar queries, using an efficient similarity engine operating on probabilistic content fingerprints. We evaluate Blacklight against eight state-of-the-art attacks, across a variety of models and image classification tasks. Blacklight identifies them all, often after only a handful of queries. By rejecting all detected queries, Blacklight prevents any attack from completing, even when attackers persist in submitting queries after account ban or query rejection. Blacklight is also robust against several powerful countermeasures, including an optimal black-box attack that approximates white-box attacks in efficiency. Finally, we illustrate how Blacklight generalizes to other domains like text classification.
    Dual Windows Are Significant: Learning from Mediastinal Window and Focusing on Lung Window. (arXiv:2206.03803v1 [eess.IV])
    Since the pandemic of COVID-19, several deep learning methods were proposed to analyze the chest Computed Tomography (CT) for diagnosis. In the current situation, the disease course classification is significant for medical personnel to decide the treatment. Most previous deep-learning-based methods extract features observed from the lung window. However, it has been proved that some appearances related to diagnosis can be observed better from the mediastinal window rather than the lung window, e.g., the pulmonary consolidation happens more in severe symptoms. In this paper, we propose a novel Dual Window RCNN Network (DWRNet), which mainly learns the distinctive features from the successive mediastinal window. Regarding the features extracted from the lung window, we introduce the Lung Window Attention Block (LWA Block) to pay additional attention to them for enhancing the mediastinal-window features. Moreover, instead of picking up specific slices from the whole CT slices, we use a Recurrent CNN and analyze successive slices as videos. Experimental results show that the fused and representative features improve the predictions of disease course by reaching the accuracy of 90.57%, against the baseline with an accuracy of 84.86%. Ablation studies demonstrate that combined dual window features are more efficient than lung-window features alone, while paying attention to lung-window features can improve the model's stability.
    "GAN I hire you?" -- A System for Personalized Virtual Job Interview Training. (arXiv:2206.03869v1 [cs.HC])
    Job interviews are usually high-stakes social situations where professional and behavioral skills are required for a satisfactory outcome. Professional job interview trainers give educative feedback about the shown behavior according to common standards. This feedback can be helpful concerning the improvement of behavioral skills needed for job interviews. A technological approach for generating such feedback might be a playful and low-key starting point for job interview training. Therefore, we extended an interactive virtual job interview training system with a Generative Adversarial Network (GAN)-based approach that first detects behavioral weaknesses and subsequently generates personalized feedback. To evaluate the usefulness of the generated feedback, we conducted a mixed-methods pilot study using mock-ups from the job interview training system. The overall study results indicate that the GAN-based generated behavioral feedback is helpful. Moreover, participants assessed that the feedback would improve their job interview performance.
    Out-of-Distribution Detection with Class Ratio Estimation. (arXiv:2206.03955v1 [stat.ML])
    Density-based Out-of-distribution (OOD) detection has recently been shown to be unreliable for the task of detecting OOD images. Various density ratio based approaches achieve good empirical performance; however, these methods typically lack a principled probabilistic modelling explanation. In this work, we propose to unify density ratio based methods under a novel framework that builds energy-based models and employs differing base distributions. Under our framework, the density ratio can be viewed as the unnormalized density of an implicit semantic distribution. Further, we propose to directly estimate the density ratio of a data sample through class ratio estimation. We report competitive results on OOD image problems in comparison with recent work that alternatively requires training of deep generative models for the task. Our approach enables a simple and yet effective path towards solving the OOD detection problem.
    PrivHAR: Recognizing Human Actions From Privacy-preserving Lens. (arXiv:2206.03891v1 [cs.CV])
    The accelerated use of digital cameras prompts an increasing concern about privacy and security, particularly in applications such as action recognition. In this paper, we propose an optimizing framework to provide robust visual privacy protection along the human action recognition pipeline. Our framework parameterizes the camera lens to successfully degrade the quality of the videos to inhibit privacy attributes and protect against adversarial attacks while maintaining relevant features for activity recognition. We validate our approach with extensive simulations and hardware experiments.
    Set Interdependence Transformer: Set-to-Sequence Neural Networks for Permutation Learning and Structure Prediction. (arXiv:2206.03720v1 [cs.LG])
    The task of learning to map an input set onto a permuted sequence of its elements is challenging for neural networks. Set-to-sequence problems occur in natural language processing, computer vision and structure prediction, where interactions between elements of large sets define the optimal output. Models must exhibit relational reasoning, handle varying cardinalities and manage combinatorial complexity. Previous attention-based methods require $n$ layers of their set transformations to explicitly represent $n$-th order relations. Our aim is to enhance their ability to efficiently model higher-order interactions through an additional interdependence component. We propose a novel neural set encoding method called the Set Interdependence Transformer, capable of relating the set's permutation invariant representation to its elements within sets of any cardinality. We combine it with a permutation learning module into a complete, 3-part set-to-sequence model and demonstrate its state-of-the-art performance on a number of tasks. These range from combinatorial optimization problems, through permutation learning challenges on both synthetic and established NLP datasets for sentence ordering, to a novel domain of product catalog structure prediction. Additionally, the network's ability to generalize to unseen sequence lengths is investigated and a comparative empirical analysis of the existing methods' ability to learn higher-order interactions is provided.
    A Two-Timescale Framework for Bilevel Optimization: Complexity Analysis and Application to Actor-Critic. (arXiv:2007.05170v4 [math.OC] UPDATED)
    This paper analyzes a two-timescale stochastic algorithm framework for bilevel optimization. Bilevel optimization is a class of problems which exhibit a two-level structure, and its goal is to minimize an outer objective function with variables which are constrained to be the optimal solution to an (inner) optimization problem. We consider the case when the inner problem is unconstrained and strongly convex, while the outer problem is constrained and has a smooth objective function. We propose a two-timescale stochastic approximation (TTSA) algorithm for tackling such a bilevel problem. In the algorithm, a stochastic gradient update with a larger step size is used for the inner problem, while a projected stochastic gradient update with a smaller step size is used for the outer problem. We analyze the convergence rates for the TTSA algorithm under various settings: when the outer problem is strongly convex (resp. weakly convex), the TTSA algorithm finds an $\mathcal{O}(K^{-2/3})$-optimal (resp. $\mathcal{O}(K^{-2/5})$-stationary) solution, where $K$ is the total iteration number. As an application, we show that a two-timescale natural actor-critic proximal policy optimization algorithm can be viewed as a special case of our TTSA framework. Importantly, the natural actor-critic algorithm is shown to converge at a rate of $\mathcal{O}(K^{-1/4})$ in terms of the gap in expected discounted reward compared to a global optimal policy.
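    The two-timescale update described in this abstract can be illustrated on a toy problem. The sketch below is not the paper's implementation: the quadratic objectives, the noise model, the box constraint [-2, 2], the step-size schedules and the function name ttsa_toy are all assumptions chosen so that the exact answer (x = 0.5) is known in closed form.

```python
import random


def ttsa_toy(num_iters=20000, noise=0.1, seed=0):
    """Two-timescale stochastic approximation on a toy bilevel problem.

    Inner problem (unconstrained, strongly convex):
        min_y 0.5 * (y - x)**2, so y*(x) = x.
    Outer problem (constrained, smooth):
        min over x in [-2, 2] of 0.5 * (y*(x) - 1)**2 + 0.5 * x**2.
    Substituting y*(x) = x gives the minimizer x = 0.5.
    """
    rng = random.Random(seed)
    x, y = 2.0, -2.0
    for k in range(num_iters):
        # Illustrative schedules (assumed): the inner step is the larger one.
        beta = 0.5 / (k + 1) ** 0.4   # inner (fast) step size
        alpha = 0.1 / (k + 1) ** 0.6  # outer (slow) step size

        # Inner update: noisy gradient of 0.5 * (y - x)**2 with respect to y.
        grad_y = (y - x) + noise * rng.gauss(0.0, 1.0)
        y -= beta * grad_y

        # Outer update: noisy hypergradient evaluated at the current y.
        # For this toy inner problem dy*/dx = 1, so the hypergradient is
        # df/dx + df/dy = x + (y - 1).
        hypergrad = x + (y - 1.0) + noise * rng.gauss(0.0, 1.0)
        # Projected step: clip back onto the feasible interval [-2, 2].
        x = min(2.0, max(-2.0, x - alpha * hypergrad))
    return x, y
```

    Calling ttsa_toy() returns an (x, y) pair in which the inner variable y tracks the slowly moving outer variable x; the fast inner step is what lets y stay close to y*(x) while x drifts toward the constrained optimum.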
    Motiflets -- Fast and Accurate Detection of Motifs in Time Series. (arXiv:2206.03735v1 [cs.LG])
    A motif intuitively is a short time series that repeats itself approximately the same within a larger time series. Such motifs often represent concealed structures, such as heart beats in an ECG recording, or sleep spindles in EEG sleep data. Motif discovery (MD) is the task of finding such motifs in a given input series. As there are varying definitions of what exactly a motif is, a number of algorithms exist. As central parameters they all take the length l of the motif and the maximal distance r between the motif's occurrences. In practice, however, suitable values for r are very hard to determine upfront, and the found motifs show a high variability. Setting the wrong input value will result in a motif that is not distinguishable from noise. Accordingly, finding an interesting motif with these methods requires extensive trial-and-error. We present a different approach to the MD problem. We define k-Motiflets as the set of exactly k occurrences of a motif of length l, whose maximum pairwise distance is minimal. This turns the MD problem upside-down: Our central parameter is not the distance threshold r, but the desired size k of a motif set, which we show is considerably more intuitive and easier to set. Based on this definition, we present exact and approximate algorithms for finding k-Motiflets and analyze their complexity. To further ease the use of our method, we describe extensions to automatically determine the right/suitable values for its input parameters. Thus, for the first time, extracting meaningful motif sets without any a-priori knowledge becomes feasible. By evaluating real-world use cases and comparison to 4 state-of-the-art MD algorithms, we show that our proposed algorithm is (a) quantitatively superior, finding larger motif sets at higher similarity, (b) qualitatively better, leading to clearer and easier to interpret motifs, and (c) has the lowest runtime.
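    The k-Motiflet definition above (exactly k occurrences of length l whose maximum pairwise distance is minimal) can be made concrete with an exhaustive search. This is only a sketch of the definition, not one of the paper's exact or approximate algorithms: plain Euclidean distance, the non-overlap rule for occurrences, and the function name k_motiflet_bruteforce are assumptions, and the search is exponential in k, so it only suits tiny inputs.

```python
from itertools import combinations
import math


def k_motiflet_bruteforce(series, k, l):
    """Return (extent, starts) for the k non-overlapping length-l subsequences
    whose maximum pairwise distance (the extent) is minimal. Requires k >= 2."""
    starts = range(len(series) - l + 1)

    def dist(i, j):
        # Plain Euclidean distance between the windows starting at i and j.
        return math.sqrt(sum((series[i + t] - series[j + t]) ** 2 for t in range(l)))

    best_extent, best_set = math.inf, None
    for cand in combinations(starts, k):
        # combinations() yields sorted tuples, so checking neighbours is enough
        # to reject sets with overlapping occurrences (assumed rule).
        if any(b - a < l for a, b in zip(cand, cand[1:])):
            continue
        extent = max(dist(i, j) for i, j in combinations(cand, 2))
        if extent < best_extent:
            best_extent, best_set = extent, cand
    return best_extent, best_set
```

    The only user-facing parameters are k and l; no distance threshold r appears, which is the inversion of the motif discovery problem that the abstract describes.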
    Contributor-Aware Defenses Against Adversarial Backdoor Attacks. (arXiv:2206.03583v1 [cs.CR])
    Deep neural networks for image classification are well-known to be vulnerable to adversarial attacks. One such attack that has garnered recent attention is the adversarial backdoor attack, which has demonstrated the capability to perform targeted misclassification of specific examples. In particular, backdoor attacks attempt to force a model to learn spurious relations between backdoor trigger patterns and false labels. In response to this threat, numerous defensive measures have been proposed; however, defenses against backdoor attacks focus on backdoor pattern detection, which may be unreliable against novel or unexpected types of backdoor pattern designs. We introduce a novel re-contextualization of the adversarial setting, where the presence of an adversary implicitly admits the existence of multiple database contributors. Then, under the mild assumption of contributor awareness, it becomes possible to exploit this knowledge to defend against backdoor attacks by destroying the false label associations. We propose a contributor-aware universal defensive framework for learning in the presence of multiple, potentially adversarial data sources that utilizes semi-supervised ensembles and learning from crowds to filter the false labels produced by adversarial triggers. Importantly, this defensive strategy is agnostic to backdoor pattern design, as it functions without needing -- or even attempting -- to perform either adversary identification or backdoor pattern detection during either training or inference. Our empirical studies demonstrate the robustness of the proposed framework against adversarial backdoor attacks from multiple simultaneous adversaries.
    Lower Bounds and Nearly Optimal Algorithms in Distributed Learning with Communication Compression. (arXiv:2206.03665v1 [cs.LG])
    Recent advances in distributed optimization and learning have shown that communication compression is one of the most effective means of reducing communication. While there have been many results on convergence rates under communication compression, a theoretical lower bound is still missing. Analyses of algorithms with communication compression have attributed convergence to two abstract properties: the unbiased property or the contractive property. They can be applied with either unidirectional compression (only messages from workers to server are compressed) or bidirectional compression. In this paper, we consider distributed stochastic algorithms for minimizing smooth and non-convex objective functions under communication compression. We establish a convergence lower bound for algorithms using either unbiased or contractive compressors, with unidirectional or bidirectional compression. To close the gap between the lower bound and the existing upper bounds, we further propose an algorithm, NEOLITHIC, which almost reaches our lower bound (up to logarithmic factors) under mild conditions. Our results also show that using contractive bidirectional compression can yield iterative methods that converge as fast as those using unbiased unidirectional compression. The experimental results validate our findings.
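    The two compressor properties named in this abstract can be illustrated with two standard sparsifiers. The sketch below is not taken from the paper: top-k and rescaled random-k are common textbook examples of a contractive and an unbiased compressor respectively, and the function names are hypothetical.

```python
def top_k(x, k):
    """Contractive compressor: keep the k largest-magnitude entries, zero the rest.
    It satisfies ||top_k(x) - x||^2 <= (1 - k/d) * ||x||^2 for d = len(x)."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def rand_k(x, k, kept_indices):
    """Unbiased compressor: keep the entries in kept_indices and rescale by d / k.
    When kept_indices is a uniformly random k-subset, the expectation equals x."""
    d = len(x)
    return [v * d / k if i in kept_indices else 0.0 for i, v in enumerate(x)]


def sq_norm(x):
    """Squared Euclidean norm."""
    return sum(v * v for v in x)
```

    In practice kept_indices would be drawn with something like set(random.sample(range(len(x)), k)); it is passed in explicitly here so that the unbiasedness can be checked by averaging over every possible subset.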
    Classification of Stochastic Processes with Topological Data Analysis. (arXiv:2206.03973v1 [stat.ML])
    In this study, we examine if engineered topological features can distinguish time series sampled from different stochastic processes with different noise characteristics, in both balanced and unbalanced sampling schemes. We compare our classification results against the results of the same classification tasks built on statistical and raw features. We conclude that in classification tasks of time series, different machine learning models built on engineered topological features perform consistently better than those built on standard statistical and raw features.
    Click Prediction Boosting via Ensemble Learning Pipelines. (arXiv:2206.03592v1 [cs.LG])
    Online travel agencies (OTAs) advertise their website offers on meta-search bidding engines. The problem of predicting the number of clicks a hotel would receive for a given bid amount is an important step in the management of an OTA's advertisement campaign on a meta-search engine because the bid times the number of clicks defines the cost to be generated. Various regressors are ensembled in this work to improve click prediction performance. Following the preprocessing procedures, the feature set is divided into train and test groups depending on the samples' logging dates. The data collection is then subjected to XGBoost-based dimension reduction, which significantly reduces the dimension of features. The optimum hyper-parameters are then found by applying Bayesian hyper-parameter optimization to the XGBoost, LightGBM, and SGD models. Ten distinct machine learning models are tested individually, as well as in combinations that form ensemble models. Three alternative ensemble solutions have been suggested. The same test set is used to test both individual and ensemble models, and the results of 46 model combinations demonstrate that stack ensemble models yield the best R2 score of all. In conclusion, the ensemble model improves the prediction performance by about 10%.
    Decoupled Self-supervised Learning for Non-Homophilous Graphs. (arXiv:2206.03601v1 [cs.LG])
    In this paper, we study the problem of conducting self-supervised learning for node representation learning on non-homophilous graphs. Existing self-supervised learning methods typically assume the graph is homophilous where linked nodes often belong to the same class or have similar features. However, such assumptions of homophily do not always hold true in real-world graphs. We address this problem by developing a decoupled self-supervised learning (DSSL) framework for graph neural networks. DSSL imitates a generative process of nodes and links from latent variable modeling of the semantic structure, which decouples different underlying semantics between different neighborhoods into the self-supervised node learning process. Our DSSL framework is agnostic to the encoders and does not need prefabricated augmentations, and is thus flexible across different graphs. To effectively optimize the framework with latent variables, we derive the evidence lower-bound of the self-supervised objective and develop a scalable training algorithm with variational inference. We provide a theoretical analysis to justify that DSSL enjoys better downstream performance. Extensive experiments on various types of graph benchmarks demonstrate that our proposed framework achieves significantly better performance than competitive self-supervised learning baselines.
    Certifying Data-Bias Robustness in Linear Regression. (arXiv:2206.03575v1 [cs.LG])
    Datasets typically contain inaccuracies due to human error and societal biases, and these inaccuracies can affect the outcomes of models trained on such datasets. We present a technique for certifying whether linear regression models are pointwise-robust to label bias in the training dataset, i.e., whether bounded perturbations to the labels of a training dataset result in models that change the prediction of test points. We show how to solve this problem exactly for individual test points, and provide an approximate but more scalable method that does not require advance knowledge of the test point. We extensively evaluate both techniques and find that linear models -- both regression- and classification-based -- often display high levels of bias-robustness. However, we also unearth gaps in bias-robustness, such as high levels of non-robustness for certain bias assumptions on some datasets. Overall, our approach can serve as a guide for when to trust, or question, a model's output.
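    Why an exact pointwise certificate is tractable for linear regression can be seen in a small worked case: for a fixed test point, the least-squares prediction is a linear function of the training labels. The sketch below is a simplified illustration and not the paper's technique. It assumes a single feature with an intercept and a bias model in which every label may shift by at most eps; the function names are hypothetical.

```python
def prediction_weights(xs, x0):
    """Weights w such that the OLS prediction (with intercept) at x0
    equals sum(w_i * y_i) for any label vector y."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1.0 / n + (x0 - mean_x) * (x - mean_x) / sxx for x in xs]


def certified_prediction_interval(xs, ys, x0, eps):
    """Exact range of the prediction at x0 when each training label
    may be perturbed by at most eps (assumed bias model)."""
    w = prediction_weights(xs, x0)
    pred = sum(wi * yi for wi, yi in zip(w, ys))
    # The worst case shifts label i by eps * sign(w_i), so the radius is
    # eps times the L1 norm of the weight vector.
    radius = eps * sum(abs(wi) for wi in w)
    return pred - radius, pred + radius
```

    Under these assumptions, a test point is robust with respect to a decision threshold whenever the threshold lies outside the returned interval, since no admissible label perturbation can move the prediction across it.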
    Hub-Pathway: Transfer Learning from A Hub of Pre-trained Models. (arXiv:2206.03726v1 [cs.LG])
    Transfer learning aims to leverage knowledge from pre-trained models to benefit the target task. Prior transfer learning work mainly transfers from a single model. However, with the emergence of deep models pre-trained from different resources, model hubs consisting of diverse models with various architectures, pre-trained datasets and learning paradigms are available. Directly applying single-model transfer learning methods to each model wastes the abundant knowledge of the model hub and suffers from high computational cost. In this paper, we propose a Hub-Pathway framework to enable knowledge transfer from a model hub. The framework generates data-dependent pathway weights, based on which we assign the pathway routes at the input level to decide which pre-trained models are activated and passed through, and then set the pathway aggregation at the output level to aggregate the knowledge from different models to make predictions. The proposed framework can be trained end-to-end with the target task-specific loss, where it learns to explore better pathway configurations and exploit the knowledge in pre-trained models for each target datum. We utilize a noisy pathway generator and design an exploration loss to further explore different pathways throughout the model hub. To fully exploit the knowledge in pre-trained models, each model is further trained by specific data that activate it, which ensures its performance and enhances knowledge transfer. Experiment results on computer vision and reinforcement learning tasks demonstrate that the proposed Hub-Pathway framework achieves the state-of-the-art performance for model hub transfer learning.
    A Privacy-Preserving Subgraph-Level Federated Graph Neural Network via Differential Privacy. (arXiv:2206.03492v1 [cs.CR])
    Currently, the federated graph neural network (GNN) has attracted a lot of attention due to its wide range of real-world applications that do not violate privacy regulations. Among all the privacy-preserving technologies, differential privacy (DP) is the most promising one due to its effectiveness and light computational overhead. However, the DP-based federated GNN has not been well investigated, especially in the sub-graph-level setting, such as the scenario of recommendation systems. The biggest challenge is how to guarantee privacy and handle non-independent and identically distributed (non-IID) data in a federated GNN simultaneously. In this paper, we propose DP-FedRec, a DP-based federated GNN to fill the gap. Private Set Intersection (PSI) is leveraged to extend the local graph for each client, and thus solve the non-IID problem. Most importantly, DP is applied not only on the weights but also on the edges of the intersection graph from PSI to fully protect the privacy of clients. The evaluation demonstrates that DP-FedRec achieves better performance with the graph extension and that DP introduces only a small computational overhead.
    Network Report: A Structured Description for Network Datasets. (arXiv:2206.03635v1 [cs.SI])
    The rapid development of network science and technologies depends on shareable datasets. Currently, there is no standard practice for reporting and sharing network datasets. Some network dataset providers only share links, while others provide some context or basic statistics. As a result, critical information may be unintentionally dropped, and network dataset consumers may misunderstand or overlook critical aspects. Inappropriately using a network dataset can lead to severe consequences (e.g., discrimination), especially when machine learning models on networks are deployed in high-stakes domains. Challenges arise as networks are often used across different domains (e.g., network science, physics, etc.) and have complex structures. To facilitate the communication between network dataset providers and consumers, we propose the network report. A network report is a structured description that summarizes and contextualizes a network dataset. The network report extends the idea of dataset reports (e.g., Datasheets for Datasets) from prior work with network-specific descriptions of the non-i.i.d. nature, demographic information, network characteristics, etc. We hope network reports encourage transparency and accountability in network research and development across different fields.
    Toward Certified Robustness Against Real-World Distribution Shifts. (arXiv:2206.03669v1 [cs.LG])
    We consider the problem of certifying the robustness of deep neural networks against real-world distribution shifts. To do so, we bridge the gap between hand-crafted specifications and realistic deployment settings by proposing a novel neural-symbolic verification framework, in which we train a generative model to learn perturbations from data and define specifications with respect to the output of the learned model. A unique challenge arising from this setting is that existing verifiers cannot tightly approximate sigmoid activations, which are fundamental to many state-of-the-art generative models. To address this challenge, we propose a general meta-algorithm for handling sigmoid activations which leverages classical notions of counter-example-guided abstraction refinement. The key idea is to "lazily" refine the abstraction of sigmoid functions to exclude spurious counter-examples found in the previous abstraction, thus guaranteeing progress in the verification process while keeping the state-space small. Experiments on the MNIST and CIFAR-10 datasets show that our framework significantly outperforms existing methods on a range of challenging distribution shifts.
    Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials. (arXiv:2206.03688v1 [cs.LG])
    A recent goal in the theory of deep learning is to identify how neural networks can escape the "lazy training," or Neural Tangent Kernel (NTK) regime, where the network is coupled with its first order Taylor expansion at initialization. While the NTK is minimax optimal for learning dense polynomials (Ghorbani et al, 2021), it cannot learn features, and hence has poor sample complexity for learning many classes of functions including sparse polynomials. Recent works have thus aimed to identify settings where gradient based algorithms provably generalize better than the NTK. One such example is the "QuadNTK" approach of Bai and Lee (2020), which analyzes the second-order term in the Taylor expansion. Bai and Lee (2020) show that the second-order term can learn sparse polynomials efficiently; however, it sacrifices the ability to learn general dense polynomials. In this paper, we analyze how gradient descent on a two-layer neural network can escape the NTK regime by utilizing a spectral characterization of the NTK (Montanari and Zhong, 2020) and building on the QuadNTK approach. We first expand upon the spectral analysis to identify "good" directions in parameter space in which we can move without harming generalization. Next, we show that a wide two-layer neural network can jointly use the NTK and QuadNTK to fit target functions consisting of a dense low-degree term and a sparse high-degree term -- something neither the NTK nor the QuadNTK can do on their own. Finally, we construct a regularizer which encourages our parameter vector to move in the "good" directions, and show that gradient descent on the regularized loss will converge to a global minimizer, which also has low test error. This yields an end to end convergence and generalization guarantee with provable sample complexity improvement over both the NTK and QuadNTK on their own.
    Unsupervised Single-shot Depth Estimation using Perceptual Reconstruction. (arXiv:2201.12170v4 [cs.CV] UPDATED)
    Real-time estimation of actual object depth is an essential module for various autonomous system tasks such as 3D reconstruction, scene understanding and condition assessment. During the last decade of machine learning, extensive deployment of deep learning methods to computer vision tasks has yielded approaches that succeed in achieving realistic depth synthesis out of a simple RGB modality. Most of these models are based on paired RGB-depth data and/or the availability of video sequences and stereo images. The lack of sequences, stereo data and RGB-depth pairs makes depth estimation a fully unsupervised single-image transfer problem that has barely been explored so far. This study builds on recent advances in the field of generative neural networks in order to establish fully unsupervised single-shot depth estimation. Two generators for RGB-to-depth and depth-to-RGB transfer are implemented and simultaneously optimized using the Wasserstein-1 distance, a novel perceptual reconstruction term and hand-crafted image filters. We comprehensively evaluate the models using industrial surface depth data as well as the Texas 3D Face Recognition Database, the CelebAMask-HQ database of human portraits and the SURREAL dataset that records body depth. For each evaluation dataset the proposed method shows a significant increase in depth accuracy compared to state-of-the-art single-image transfer methods.
    High-dimensional limit theorems for SGD: Effective dynamics and critical scaling. (arXiv:2206.04030v1 [stat.ML])
    We study the scaling limits of stochastic gradient descent (SGD) with constant step-size in the high-dimensional regime. We prove limit theorems for the trajectories of summary statistics (i.e., finite-dimensional functions) of SGD as the dimension goes to infinity. Our approach allows one to choose the summary statistics that are tracked, the initialization, and the step-size. It yields both ballistic (ODE) and diffusive (SDE) limits, with the limit depending dramatically on the former choices. Interestingly, we find a critical scaling regime for the step-size below which the effective ballistic dynamics matches gradient flow for the population loss, but at which, a new correction term appears which changes the phase diagram. About the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate. We demonstrate our approach on popular examples including estimation for spiked matrix and tensor models and classification via two-layer networks for binary and XOR-type Gaussian mixture models. These examples exhibit surprising phenomena including multimodal timescales to convergence as well as convergence to sub-optimal solutions with probability bounded away from zero from random (e.g., Gaussian) initializations.
    Probabilistically Robust Learning: Balancing Average- and Worst-case Performance. (arXiv:2202.01136v3 [cs.LG] UPDATED)
    Many of the successes of machine learning are based on minimizing an averaged loss function. However, it is well-known that this paradigm suffers from robustness issues that hinder its applicability in safety-critical domains. These issues are often addressed by training against worst-case perturbations of data, a technique known as adversarial training. Although empirically effective, adversarial training can be overly conservative, leading to unfavorable trade-offs between nominal performance and robustness. To this end, in this paper we propose a framework called probabilistic robustness that bridges the gap between the accurate, yet brittle average case and the robust, yet conservative worst case by enforcing robustness to most rather than to all perturbations. From a theoretical point of view, this framework overcomes the trade-offs between the performance and the sample-complexity of worst-case and average-case learning. From a practical point of view, we propose a novel algorithm based on risk-aware optimization that effectively balances average- and worst-case performance at a considerably lower computational cost relative to adversarial training. Our results on MNIST, CIFAR-10, and SVHN illustrate the advantages of this framework on the spectrum from average- to worst-case robustness.
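    The contrast drawn in the abstract above between average-case, worst-case, and "robust to most perturbations" behaviour can be illustrated with a small sketch. This is not the paper's algorithm; it only assumes that the loss has already been evaluated at a finite set of sampled perturbations. The function name `summarize` and the tolerance parameter `rho` (the fraction of perturbations that may be ignored) are hypothetical names chosen for illustration.

```python
def summarize(losses, rho):
    """Return (average, probabilistic, worst) loss over sampled perturbations.

    The "probabilistic" value is the smallest sampled loss t such that at
    most a fraction rho of the samples have a loss larger than t.
    Illustrative assumption: rho lies in [0, 1] and losses is non-empty.
    """
    ordered = sorted(losses)
    n = len(ordered)
    # Allow up to floor(rho * n) of the largest losses to be ignored.
    k = max(0, n - 1 - int(rho * n))
    average = sum(ordered) / n
    return average, ordered[k], ordered[-1]


# Ten sampled perturbations, one of which is a rare, very harmful outlier.
sampled = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
average, probabilistic, worst = summarize(sampled, rho=0.1)
```

    With `rho = 0` the middle value coincides with the worst case, and as `rho` grows it ignores more of the rare, harmful perturbations, which is the sense in which such a criterion sits between the two extremes.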
    Machine-Learning the Classification of Spacetimes. (arXiv:2201.01644v2 [gr-qc] UPDATED)
    On the long-established classification problems in general relativity we take a novel perspective by adopting fruitful techniques from machine learning and modern data-science. In particular, we model Petrov's classification of spacetimes, and show that a feed-forward neural network can achieve a high degree of success. We also show how data visualization techniques with dimensionality reduction can help analyze the underlying patterns in the structure of the different types of spacetimes.
    Decentralized Safe Multi-agent Stochastic Optimal Control using Deep FBSDEs and ADMM. (arXiv:2202.10658v2 [cs.MA] UPDATED)
    In this work, we propose a novel safe and scalable decentralized solution for multi-agent control in the presence of stochastic disturbances. Safety is mathematically encoded using stochastic control barrier functions and safe controls are computed by solving quadratic programs. Decentralization is achieved by augmenting each agent's optimization variables with copy variables for its neighbors. This allows us to decouple the centralized multi-agent optimization problem. However, to ensure safety, neighboring agents must agree on "what is safe for both of us" and this creates a need for consensus. To enable safe consensus solutions, we incorporate an ADMM-based approach. Specifically, we propose a Merged CADMM-OSQP implicit neural network layer that solves a mini-batch of both the local quadratic programs and the overall consensus problem as a single optimization problem. This layer is embedded within a Deep FBSDEs network architecture at every time step, to facilitate end-to-end differentiable, safe and decentralized stochastic optimal control. The efficacy of the proposed approach is demonstrated on several challenging multi-robot tasks in simulation. By imposing requirements on safety specified by collision avoidance constraints, the safe operation of all agents is ensured during the entire training process. We also demonstrate superior scalability in terms of computational and memory savings as compared to a centralized approach.
    Decision-Focused Learning without Decision-Making: Learning Locally Optimized Decision Losses. (arXiv:2203.16067v2 [cs.LG] UPDATED)
    Decision-Focused Learning (DFL) is a paradigm for tailoring a predictive model to a downstream optimization task that uses its predictions in order to perform better on that specific task. The main technical challenge associated with DFL is that it requires being able to differentiate through the optimization problem, which is difficult due to discontinuous solutions and other challenges. Past work has largely gotten around this issue by handcrafting task-specific surrogates to the original optimization problem that provide informative gradients when differentiated through. However, the need to handcraft surrogates for each new task limits the usability of DFL. In addition, there are often no guarantees about the convexity of the resulting surrogates and, as a result, training a predictive model using them can lead to inferior local optima. In this paper, we do away with surrogates altogether and instead learn loss functions that capture task-specific information. To the best of our knowledge, ours is the first approach that entirely replaces the optimization component of decision-focused learning with a loss that is automatically learned. Our approach (a) only requires access to a black-box oracle that can solve the optimization problem and is thus generalizable, and (b) can be convex by construction and so can be easily optimized over. We evaluate our approach on three resource allocation problems from the literature and find that our approach outperforms learning without taking into account task-structure in all three domains, and even hand-crafted surrogates from the literature.
    Boosting the Confidence of Generalization for $L_2$-Stable Randomized Learning Algorithms. (arXiv:2206.03834v1 [stat.ML])
    Exponential generalization bounds with near-tight rates have recently been established for uniformly stable learning algorithms. The notion of uniform stability, however, is stringent in the sense that it is invariant to the data-generating distribution. Under the weaker and distribution dependent notions of stability such as hypothesis stability and $L_2$-stability, the literature suggests that only polynomial generalization bounds are possible in general cases. The present paper addresses this long standing tension between these two regimes of results and makes progress towards relaxing it inside a classic framework of confidence-boosting. To this end, we first establish an in-expectation first moment generalization error bound for potentially randomized learning algorithms with $L_2$-stability, based on which we then show that a properly designed subbagging process leads to near-tight exponential generalization bounds over the randomness of both data and algorithm. We further substantialize these generic results to stochastic gradient descent (SGD) to derive improved high-probability generalization bounds for convex or non-convex optimization problems with natural time decaying learning rates, which have not been possible to prove with the existing hypothesis stability or uniform stability based results.
    Multi-channel neural networks for predicting influenza A virus hosts and antigenic types. (arXiv:2206.03823v1 [q-bio.QM])
    Influenza occurs every season and occasionally causes pandemics. Despite its low mortality rate, influenza is a major public health concern, as it can be complicated by severe diseases like pneumonia. A fast, accurate and low-cost method to predict the origin host and subtype of influenza viruses could help reduce virus transmission and benefit resource-poor areas. In this work, we propose multi-channel neural networks to predict antigenic types and hosts of influenza A viruses with hemagglutinin and neuraminidase protein sequences. An integrated data set containing complete protein sequences was used to produce a pre-trained model, and two other data sets were used for testing the model's performance. One test set contained complete protein sequences, and another test set contained incomplete protein sequences. The results suggest that multi-channel neural networks are applicable and promising for predicting influenza A virus hosts and antigenic subtypes with complete and partial protein sequences.
    NOMAD: Nonlinear Manifold Decoders for Operator Learning. (arXiv:2206.03551v1 [cs.LG])
    Supervised learning in function spaces is an emerging area of machine learning research with applications to the prediction of complex physical systems such as fluid flows, solid mechanics, and climate modeling. By directly learning maps (operators) between infinite dimensional function spaces, these models are able to learn discretization invariant representations of target functions. A common approach is to represent such target functions as linear combinations of basis elements learned from data. However, there are simple scenarios where, even though the target functions form a low dimensional submanifold, a very large number of basis elements is needed for an accurate linear representation. Here we present NOMAD, a novel operator learning framework with a nonlinear decoder map capable of learning finite dimensional representations of nonlinear submanifolds in function spaces. We show this method is able to accurately learn low dimensional representations of solution manifolds to partial differential equations while outperforming linear models of larger size. Additionally, we compare to state-of-the-art operator learning methods on a complex fluid dynamics benchmark and achieve competitive performance with a significantly smaller model size and training cost.
    Transfer learning to decode brain states reflecting the relationship between cognitive tasks. (arXiv:2206.03950v1 [q-bio.NC])
    Transfer learning improves the performance of the target task by leveraging the data of a specific source task: the closer the relationship between the source and the target tasks, the greater the performance improvement by transfer learning. In neuroscience, the relationship between cognitive tasks is usually represented by similarity of activated brain regions or neural representation. However, no study has linked transfer learning and neuroscience to reveal the relationship between cognitive tasks. In this study, we propose a transfer learning framework to reflect the relationship between cognitive tasks, and compare the task relations reflected by transfer learning and by the overlaps of brain regions (e.g., neurosynth). Our transfer learning results create a cognitive taskonomy that reflects the relationship between cognitive tasks and is well in line with the task relations derived from neurosynth. Transfer learning performs better in task decoding with fMRI data if the source and target cognitive tasks activate similar brain regions. Our study uncovers the relationship of multiple cognitive tasks and provides guidance for source task selection in transfer learning for neural decoding based on small-sample data.
    Hybrid Physics and Deep Learning Model for Interpretable Vehicle State Prediction. (arXiv:2103.06727v3 [cs.LG] UPDATED)
    Physical motion models offer interpretable predictions for the motion of vehicles. However, some model parameters, such as those related to aero- and hydrodynamics, are expensive to measure and are often only roughly approximated, reducing prediction accuracy. Recurrent neural networks achieve high prediction accuracy at low cost, as they can use cheap measurements collected during routine operation of the vehicle, but their results are hard to interpret. To precisely predict vehicle states without expensive measurements of physical parameters, we propose a hybrid approach combining deep learning and physical motion models including a novel two-phase training procedure. We achieve interpretability by restricting the output range of the deep neural network as part of the hybrid model, which limits the uncertainty introduced by the neural network to a known quantity. We have evaluated our approach for the use case of ship and quadcopter motion. The results show that our hybrid model can improve model interpretability with no decrease in accuracy compared to existing deep learning approaches.
    Mathematical model bridges disparate timescales of lifelong learning. (arXiv:2206.03954v1 [physics.soc-ph])
    Lifelong learning occurs on timescales ranging from minutes to decades. People can lose themselves in a new skill, practicing for hours until exhausted. And they can pursue mastery over days or decades, perhaps abandoning old skills entirely to seek out new challenges. A full understanding of learning requires an account that integrates these timescales. Here, we present a minimal quantitative model that unifies the nested timescales of learning. Our dynamical model recovers classic accounts of skill acquisition, and describes how learning emerges from moment-to-moment dynamics of motivation, fatigue, and work, while also situated within longer-term dynamics of skill selection, mastery, and abandonment. We apply this model to explore the benefits and pitfalls of a variety of training regimes and to characterize individual differences in motivation and skill development. Our model connects previously disparate timescales -- and the subdisciplines that typically study each timescale in isolation -- to offer a unified account of the timecourse of skill acquisition.
    FedHPO-B: A Benchmark Suite for Federated Hyperparameter Optimization. (arXiv:2206.03966v1 [cs.LG])
    Hyperparameter optimization (HPO) is crucial for machine learning algorithms to achieve satisfactory performance, whose progress has been boosted by related benchmarks. Nonetheless, existing efforts in benchmarking all focus on HPO for traditional centralized learning while ignoring federated learning (FL), a promising paradigm for collaboratively learning models from dispersed data. In this paper, we first identify some uniqueness of HPO for FL algorithms from various aspects. Due to this uniqueness, existing HPO benchmarks no longer satisfy the need to compare HPO methods in the FL setting. To facilitate the research of HPO in the FL setting, we propose and implement a benchmark suite FedHPO-B that incorporates comprehensive FL tasks, enables efficient function evaluations, and eases continuing extensions. We also conduct extensive experiments based on FedHPO-B to benchmark a few HPO methods. We open-source FedHPO-B at https://github.com/alibaba/FederatedScope/tree/master/benchmark/FedHPOB and will maintain it actively.
    Boundary between noise and information applied to filtering neural network weight matrices. (arXiv:2206.03927v1 [cond-mat.dis-nn])
    Deep neural networks have been successfully applied to a broad range of problems where overparametrization yields weight matrices which are partially random. A comparison of weight matrix singular vectors to the Porter-Thomas distribution suggests that there is a boundary between randomness and learned information in the singular value spectrum. Inspired by this finding, we introduce an algorithm for noise filtering, which both removes small singular values and reduces the magnitude of large singular values to counteract the effect of level repulsion between the noise and the information part of the spectrum. For networks trained in the presence of label noise, we indeed find that the generalization performance improves significantly due to noise filtering.
    Sparse Fusion Mixture-of-Experts are Domain Generalizable Learners. (arXiv:2206.04046v1 [cs.CV])
    Domain generalization (DG) aims at learning generalizable models under distribution shifts to avoid redundantly overfitting massive training data. Previous works with complex loss design and gradient constraint have not yet led to empirical success on large-scale benchmarks. In this work, we reveal the mixture-of-experts (MoE) model's generalizability on DG by leveraging it to distributively handle multiple aspects of the predictive features across domains. To this end, we propose Sparse Fusion Mixture-of-Experts (SF-MoE), which incorporates sparsity and fusion mechanisms into the MoE framework to keep the model both sparse and predictive. SF-MoE has two dedicated modules: 1) sparse block and 2) fusion block, which disentangle and aggregate the diverse learned signals of an object, respectively. Extensive experiments demonstrate that SF-MoE is a domain-generalizable learner on large-scale benchmarks. It outperforms state-of-the-art counterparts by more than 2% across 5 large-scale DG datasets (e.g., DomainNet), with the same or even lower computational costs. We further reveal the internal mechanism of SF-MoE from a distributed representation perspective (e.g., visual attributes). We hope this framework could facilitate future research to push generalizable object recognition to the real world. Code and models are released at https://github.com/Luodian/SF-MoE-DG.
    Scalable Online Disease Diagnosis via Multi-Model-Fused Actor-Critic Reinforcement Learning. (arXiv:2206.03659v1 [cs.LG])
    For those seeking healthcare advice online, AI based dialogue agents capable of interacting with patients to perform automatic disease diagnosis are a viable option. This application necessitates efficient inquiry of relevant disease symptoms in order to make accurate diagnosis recommendations. This can be formulated as a problem of sequential feature (symptom) selection and classification for which reinforcement learning (RL) approaches have been proposed as a natural solution. They perform well when the feature space is small, that is, the number of symptoms and diagnosable disease categories is limited, but they frequently fail in assignments with a large number of features. To address this challenge, we propose a Multi-Model-Fused Actor-Critic (MMF-AC) RL framework that consists of a generative actor network and a diagnostic critic network. The actor incorporates a Variational AutoEncoder (VAE) to model the uncertainty induced by partial observations of features, thereby facilitating appropriate inquiries. In the critic network, a supervised diagnosis model for disease predictions is involved to precisely estimate the state-value function. Furthermore, inspired by the medical concept of differential diagnosis, we combine the generative and diagnosis models to create a novel reward shaping mechanism to address the sparse reward problem in large search spaces. We conduct extensive experiments on both synthetic and real-world datasets for empirical evaluations. The results demonstrate that our approach outperforms state-of-the-art methods in terms of diagnostic accuracy and interaction efficiency while also being more effectively scalable to large search spaces. Besides, our method is adaptable to both categorical and continuous features, making it ideal for online applications.
    Integrating Symmetry into Differentiable Planning. (arXiv:2206.03674v1 [cs.LG])
    We study how group symmetry helps improve data efficiency and generalization for end-to-end differentiable planning algorithms, specifically on 2D robotic path planning problems: navigation and manipulation. We first formalize the idea from Value Iteration Networks (VINs) on using convolutional networks for path planning, because it avoids explicitly constructing equivalence classes and enables end-to-end planning. We then show that value iteration can always be represented as some convolutional form for (2D) path planning, and name the resulting paradigm Symmetric Planner (SymPlan). In implementation, we use steerable convolution networks to incorporate symmetry. Our algorithms on navigation and manipulation, with given or learned maps, improve training efficiency and generalization performance by large margins over non-equivariant counterparts, VIN and GPPN.
    Deeper-GXX: Deepening Arbitrary GNNs. (arXiv:2110.13798v2 [cs.LG] UPDATED)
    Shallow GNNs tend to have sub-optimal performance dealing with large-scale graphs or graphs with missing features. Therefore, it is necessary to increase the depth (i.e., the number of layers) of GNNs to capture more latent knowledge of the input data. On the other hand, including more layers in GNNs typically decreases their performance due to, e.g., vanishing gradient and oversmoothing. Existing methods (e.g., PairNorm and DropEdge) mainly focus on addressing oversmoothing, but they suffer from some drawbacks such as requiring hard-to-acquire knowledge or having large training randomness. In addition, these methods simply incorporate ResNet to address vanishing gradient. They ignore an important fact: by stacking more and more layers with ResNet architecture, the information collected from faraway neighbors becomes dominant, compared with the information collected from the 1-hop and 2-hop neighbors, thus resulting in severe performance degradation. In this paper, we first go deep into the architecture of ResNet and analyze why ResNet is not best suited for deeper GNNs. Then we propose a new residual architecture to attenuate the negative impact caused by ResNet. To address the drawbacks of these existing methods, we introduce the Topology-guided Graph Contrastive Loss named TGCL. It utilizes node topological information and pulls the connected node pairs closer via contrastive learning regularization to obtain discriminative node representations. Combining the new residual architecture with TGCL, an end-to-end framework named Deeper-GXX is proposed towards deeper GNNs. The extensive experiments on real-world data sets demonstrate the effectiveness and efficiency of Deeper-GXX compared with state-of-the-art baselines.
    Mapping the Internet: Modelling Entity Interactions in Complex Heterogeneous Networks. (arXiv:2104.09650v2 [cs.LG] UPDATED)
    Even though machine learning algorithms already play a significant role in data science, many current methods impose unrealistic assumptions on input data. The application of such methods is difficult due to incompatible data formats, or heterogeneous, hierarchical or entirely missing data fragments in the dataset. As a solution, we propose a versatile, unified framework called `HMill' for sample representation, model definition and training. We review in depth a multi-instance paradigm for machine learning that the framework builds on and extends. To theoretically justify the design of key components of HMill, we show an extension of the universal approximation theorem to the set of all functions realized by models implemented in the framework. The text also contains a detailed discussion on technicalities and performance improvements in our implementation, which is published for download under the MIT License. The main asset of the framework is its flexibility, which makes modelling of diverse real-world data sources with the same tool possible. In addition to the standard setting in which a set of attributes is observed for each object individually, we explain how message-passing inference in graphs that represent whole systems of objects can be implemented in the framework. To support our claims, we solve three different problems from the cybersecurity domain using the framework. The first use case concerns IoT device identification from raw network observations. In the second problem, we study how malicious binary files can be classified using a snapshot of the operating system represented as a directed graph. The last provided example is a task of domain blacklist extension through modelling interactions between entities in the network. In all three problems, the solution based on the proposed framework achieves performance comparable to specialized approaches.
    General Greedy De-bias Learning. (arXiv:2112.10572v3 [cs.LG] UPDATED)
    Neural networks often make predictions relying on the spurious correlations from the datasets rather than the intrinsic properties of the task of interest, facing sharp degradation on out-of-distribution (OOD) test data. Existing de-bias learning frameworks try to capture specific dataset bias by annotations but they fail to handle complicated OOD scenarios. Others implicitly identify the dataset bias through specially designed low-capability biased models or losses, but they degrade when the training and testing data are from the same distribution. In this paper, we propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model. The base model is encouraged to focus on examples that are hard to solve with biased models, thus remaining robust against spurious correlations in the test stage. GGD largely improves models' OOD generalization ability on various tasks, but sometimes over-estimates the bias level and degrades on the in-distribution test. We further re-analyze the ensemble process of GGD and introduce the Curriculum Regularization inspired by curriculum learning, which achieves a good trade-off between in-distribution and out-of-distribution performance. Extensive experiments on image classification, adversarial question answering, and visual question answering demonstrate the effectiveness of our method. GGD can learn a more robust base model under the settings of both task-specific biased models with prior knowledge and self-ensemble biased model without prior knowledge.
    Decentralized Online Regularized Learning Over Random Time-Varying Graphs. (arXiv:2206.03861v1 [cs.LG])
    We study the decentralized online regularized linear regression algorithm over random time-varying graphs. At each time step, every node runs an online estimation algorithm consisting of an innovation term processing its own new measurement, a consensus term taking a weighted sum of estimations of its own and its neighbors with additive and multiplicative communication noises and a regularization term preventing over-fitting. It is not required that the regression matrices and graphs satisfy special statistical assumptions such as mutual independence, spatio-temporal independence or stationarity. We develop the nonnegative supermartingale inequality of the estimation error, and prove that the estimations of all nodes converge to the unknown true parameter vector almost surely if the algorithm gains, graphs and regression matrices jointly satisfy the sample path spatio-temporal persistence of excitation condition. In particular, this condition holds by choosing appropriate algorithm gains if the graphs are uniformly conditionally jointly connected and conditionally balanced, and the regression models of all nodes are uniformly conditionally spatio-temporally jointly observable, under which the algorithm converges in mean square and almost surely. In addition, we prove that the regret upper bound is $\mathcal O(T^{1-\tau}\ln T)$, where $\tau\in (0.5,1)$ is a constant depending on the algorithm gains.
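    The three-term update described in the abstract above (innovation, consensus, regularization) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: each node is assumed to observe a scalar measurement, the gains `a`, `b`, and `lam` are held constant rather than time-varying, the graph is fixed, and the additive and multiplicative communication noises are omitted. All names (`step`, `regressors`, `weights`) are hypothetical.

```python
def step(estimates, regressors, measurements, weights, a, b, lam):
    """One synchronous update of every node's parameter estimate.

    estimates[i]    : current estimate of node i (list of floats)
    regressors[i]   : regression vector h_i of node i (list of floats)
    measurements[i] : scalar measurement y_i, modelled as h_i . x + noise
    weights[i][j]   : weight node i places on neighbor j (0 if not adjacent)
    a, b, lam       : innovation, consensus, and regularization gains
    """
    n = len(estimates)
    dim = len(estimates[0])
    updated = []
    for i in range(n):
        x_i, h_i = estimates[i], regressors[i]
        # Prediction error of node i on its own new measurement.
        residual = measurements[i] - sum(h * x for h, x in zip(h_i, x_i))
        row = []
        for d in range(dim):
            innovation = a * h_i[d] * residual
            consensus = b * sum(
                weights[i][j] * (estimates[j][d] - x_i[d]) for j in range(n)
            )
            regularization = -lam * x_i[d]
            row.append(x_i[d] + innovation + consensus + regularization)
        updated.append(row)
    return updated


# Two nodes estimating a scalar parameter whose true value is 1.0.
new_estimates = step(
    estimates=[[0.0], [2.0]],
    regressors=[[1.0], [1.0]],
    measurements=[1.0, 1.0],
    weights=[[0.0, 1.0], [1.0, 0.0]],
    a=0.5, b=0.25, lam=0.0,
)
```

    In this toy run both the innovation and the consensus terms pull the two estimates toward each other and toward the measured value; setting `lam` above zero additionally shrinks each estimate toward zero.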
    pFL-Bench: A Comprehensive Benchmark for Personalized Federated Learning. (arXiv:2206.03655v1 [cs.LG])
    Personalized Federated Learning (pFL) has gained increasing attention in recent years due to its success in handling the statistical heterogeneity of FL clients via utilizing and deploying distinct local models. However, standardized evaluation and systematic analysis of diverse pFL methods remain a challenge. Firstly, the highly varied datasets, FL simulation settings and pFL implementations impede fast and fair pFL comparison. Secondly, the effectiveness and robustness of pFL methods are under-explored in various practical scenarios, such as generalization to new clients and participation of resource-limited clients. Finally, the current pFL literature diverges in the adopted evaluation and ablation protocols. To tackle these challenges, we propose the first comprehensive pFL benchmark, pFL-Bench, for facilitating rapid, reproducible, standardized and thorough pFL evaluation. The proposed benchmark contains 9 datasets in diverse application domains with unified data partition and realistic heterogeneous settings; a modular and easy-to-extend pFL codebase with more than 20 competitive pFL baseline implementations; and systematic evaluations under containerized environments in terms of generalization, fairness, system overhead, and convergence. We highlight the benefits and potential of SOTA pFL methods and hope pFL-Bench enables further pFL research and broad applications that would otherwise be difficult owing to the absence of a dedicated benchmark. The code is released at https://github.com/alibaba/FederatedScope/tree/master/benchmark/pFL-Bench.
    Predict better with less training data using a QNN. (arXiv:2206.03960v1 [quant-ph])
    Over the past decade, machine learning revolutionized vision-based quality assessment for which convolutional neural networks (CNNs) have now become the standard. In this paper, we consider a potential next step in this development and describe a quanvolutional neural network (QNN) algorithm that efficiently maps classical image data to quantum states and allows for reliable image analysis. We practically demonstrate how to leverage quantum devices in computer vision and how to introduce quantum convolutions into classical CNNs. Dealing with a real world use case in industrial quality control, we implement our hybrid QNN model within the PennyLane framework and empirically observe it to achieve better predictions using much fewer training data than classical CNNs. In other words, we empirically observe a genuine quantum advantage for an industrial application where the advantage is due to superior data encoding.  ( 2 min )
    Efficient Resource Allocation with Fairness Constraints in Restless Multi-Armed Bandits. (arXiv:2206.03883v1 [cs.LG])
    Restless Multi-Armed Bandits (RMAB) is an apt model to represent decision-making problems in public health interventions (e.g., tuberculosis, maternal, and child care), anti-poaching planning, sensor monitoring, personalized recommendations and many more. Existing research in RMAB has contributed mechanisms and theoretical results to a wide variety of settings, where the focus is on maximizing expected value. In this paper, we are interested in ensuring that RMAB decision making is also fair to different arms while maximizing expected value. In the context of public health settings, this would ensure that different people and/or communities are fairly represented while making public health intervention decisions. To achieve this goal, we formally define the fairness constraints in RMAB and provide planning and learning methods to solve RMAB in a fair manner. We demonstrate key theoretical properties of fair RMAB and experimentally demonstrate that our proposed methods handle fairness constraints without sacrificing significantly on solution quality.  ( 2 min )
    Error Rates for Kernel Classification under Source and Capacity Conditions. (arXiv:2201.12655v2 [stat.ML] UPDATED)
    We consider the problem of kernel classification. Works on kernel regression have shown that the rate of decay of the prediction error with the number of samples for a large class of data-sets is well characterized by two quantities: the capacity and source of the data-set. In this work, we compute the decay rates for the misclassification (prediction) error under the Gaussian design, for data-sets satisfying source and capacity assumptions. We derive the rates as a function of the source and capacity coefficients for two standard kernel classification settings, namely margin-maximizing Support Vector Machines (SVM) and ridge classification, and contrast the two methods. As a consequence, we find that the known worst-case rates are loose for this class of data-sets. Finally, we show that the rates presented in this work are also observed on real data-sets.  ( 2 min )
    Distributed Newton-Type Methods with Communication Compression and Bernoulli Aggregation. (arXiv:2206.03588v1 [cs.LG])
    Despite their high computation and communication costs, Newton-type methods remain an appealing option for distributed training due to their robustness against ill-conditioned convex problems. In this work, we study communication compression and aggregation mechanisms for curvature information in order to reduce these costs while preserving theoretically superior local convergence guarantees. We prove that the recently developed class of three point compressors (3PC) of Richtarik et al. [2022] for gradient communication can be generalized to Hessian communication as well. This result opens up a wide variety of communication strategies, such as contractive compression and lazy aggregation, available at our disposal to compress prohibitively costly curvature information. Moreover, we discovered several new 3PC mechanisms, such as adaptive thresholding and Bernoulli aggregation, which require reduced communication and occasional Hessian computations. Furthermore, we extend and analyze our approach to bidirectional communication compression and partial device participation setups to cater to the practical considerations of applications in federated learning. For all our methods, we derive fast condition-number-independent local linear and/or superlinear convergence rates. Finally, with extensive numerical evaluations on convex optimization problems, we illustrate that our designed schemes achieve state-of-the-art communication complexity compared to several key baselines using second-order information.
    Modularized Transfer Learning with Multiple Knowledge Graphs for Zero-shot Commonsense Reasoning. (arXiv:2206.03715v1 [cs.AI])
    Commonsense reasoning systems should be able to generalize to diverse reasoning cases. However, most state-of-the-art approaches depend on expensive data annotations and overfit to a specific benchmark without learning how to perform general semantic reasoning. To overcome these drawbacks, zero-shot QA systems have shown promise as a robust learning scheme by transforming a commonsense knowledge graph (KG) into synthetic QA-form samples for model training. Considering the increasing type of different commonsense KGs, this paper aims to extend the zero-shot transfer learning scenario into multiple-source settings, where different KGs can be utilized synergetically. Towards this goal, we propose to mitigate the loss of knowledge from the interference among the different knowledge sources, by developing a modular variant of the knowledge aggregation as a new zero-shot commonsense reasoning framework. Results on five commonsense reasoning benchmarks demonstrate the efficacy of our framework, improving the performance with multiple KGs.
    Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits. (arXiv:2206.03520v1 [stat.ML])
    We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is from a one-dimensional exponential family, which covers many common reward distributions including Bernoulli, Gaussian, Gamma, Exponential, etc. We propose a Thompson sampling algorithm, termed ExpTS, which uses a novel sampling distribution to avoid the under-estimation of the optimal arm. We provide a tight regret analysis for ExpTS, which simultaneously yields both the finite-time regret bound as well as the asymptotic regret bound. In particular, for a $K$-armed bandit with exponential family rewards, ExpTS over a horizon $T$ is sub-UCB (a strong criterion for the finite-time regret that is problem-dependent), minimax optimal up to a factor $\sqrt{\log K}$, and asymptotically optimal, for exponential family rewards. Moreover, we propose ExpTS$^+$, by adding a greedy exploitation step in addition to the sampling distribution used in ExpTS, to avoid the over-estimation of sub-optimal arms. ExpTS$^+$ is an anytime bandit algorithm and achieves the minimax optimality and asymptotic optimality simultaneously for exponential family reward distributions. Our proof techniques are general and conceptually simple and can be easily applied to analyze standard Thompson sampling with specific reward distributions.
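For orientation, the sketch below shows standard Thompson sampling for Bernoulli rewards with a Beta(1, 1) prior. It is not ExpTS or ExpTS$^+$ from the abstract above: the function name `thompson_bernoulli`, the arm means and the horizon are illustrative assumptions, and the sampling distribution is the textbook Beta posterior rather than the novel one the paper proposes.

```python
import random


def thompson_bernoulli(true_means, horizon, seed=0):
    """Standard Beta-Bernoulli Thompson sampling (illustrative, not ExpTS).

    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k  # observed rewards equal to 1, per arm
    failures = [0] * k   # observed rewards equal to 0, per arm
    pulls = [0] * k

    for _ in range(horizon):
        # Draw one sample per arm from its Beta posterior (uniform prior).
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1) for i in range(k)
        ]
        # Play the arm whose sampled mean is largest.
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    print(thompson_bernoulli([0.2, 0.8], horizon=2000))
```

With a clear gap between the arm means, the pull counts should concentrate on the better arm as the horizon grows.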
    Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement Learning with Latent Low-Rank Structure. (arXiv:2206.03569v1 [cs.LG])
    The practicality of reinforcement learning algorithms has been limited due to poor scaling with respect to the problem size, as the sample complexity of learning an $\epsilon$-optimal policy is $\tilde{\Omega}\left(|S||A|H^3 / \epsilon^2\right)$ over worst case instances of an MDP with state space $S$, action space $A$, and horizon $H$. We consider a class of MDPs that exhibit low rank structure, where the latent features are unknown. We argue that a natural combination of value iteration and low-rank matrix estimation results in an estimation error that grows doubly exponentially in the horizon $H$. We then provide a new algorithm along with statistical guarantees that efficiently exploits low rank structure given access to a generative model, achieving a sample complexity of $\tilde{O}\left(d^5(|S|+|A|)\mathrm{poly}(H)/\epsilon^2\right)$ for a rank $d$ setting, which is minimax optimal with respect to the scaling of $|S|, |A|$, and $\epsilon$. In contrast to literature on linear and low-rank MDPs, we do not require a known feature mapping, our algorithm is computationally simple, and our results hold for long time horizons. Our results provide insights on the minimal low-rank structural assumptions required on the MDP with respect to the transition kernel versus the optimal action-value function.
    White-box Membership Attack Against Machine Learning Based Retinopathy Classification. (arXiv:2206.03584v1 [cs.CR])
    Advances in machine learning (ML) have greatly improved AI-based diagnosis aid systems in medical imaging. However, because these systems rely on collecting medical data specific to individuals, they raise several security issues, especially in terms of privacy. Even if the owner of the images, such as a hospital, puts in place strict privacy protection provisions at the level of its information system, the model trained on its images still holds disclosure potential. The trained model may be accessible to an attacker in two ways: 1) white-box, where the attacker has access to the model architecture and parameters; 2) black-box, where the attacker can only query the model with their own inputs through an appropriate interface. Existing attack methods include: feature estimation attacks (FEA), membership inference attack (MIA), model memorization attack (MMA) and identification attacks (IA). In this work we focus on MIA against a model that has been trained to detect diabetic retinopathy from retinal images. Diabetic retinopathy is a condition that can cause vision loss and blindness in people who have diabetes. MIA is the process of determining whether a data sample comes from the training data set of a trained ML model or not. From a privacy perspective, in our use case where a diabetic retinopathy classification model is given to partners that have at their disposal images along with patients' identifiers, inferring the membership status of a data sample can reveal whether a patient has contributed to the training of the model.
    Predictive Modeling of Charge Levels for Battery Electric Vehicles using CNN EfficientNet and IGTD Algorithm. (arXiv:2206.03612v1 [cs.CV])
    Convolutional Neural Networks (CNN) have been a good solution for understanding a vast image dataset. As the increased number of battery-equipped electric vehicles is flourishing globally, there has been much research on understanding which charge levels electric vehicle drivers would choose to charge their vehicles to get to their destination without any prevention. We implemented deep learning approaches to analyze the tabular datasets to understand their state of charge and which charge levels they would choose. In addition, we implemented the Image Generator for Tabular Dataset algorithm to utilize tabular datasets as image datasets to train convolutional neural networks. Also, we integrated other CNN architecture such as EfficientNet to prove that CNN is a great learner for reading information from images that were converted from the tabular dataset, and able to predict charge levels for battery-equipped electric vehicles. We also evaluated several optimization methods to enhance the learning rate of the models and examined further analysis on improving the model architecture.
    On gradient descent training under data augmentation with on-line noisy copies. (arXiv:2206.03734v1 [stat.ML])
    In machine learning, data augmentation (DA) is a technique for improving generalization performance. In this paper, we mainly considered gradient descent for linear regression under DA using noisy copies of datasets, in which noise is injected into the inputs. We analyzed the situation where random noisy copies are newly generated and used at each epoch; i.e., the case of using on-line noisy copies. It can therefore be viewed as an analysis of noise injection into the training process in a DA manner; i.e., an on-line version of DA. We derived the averaged behavior of the training process under three situations: full-batch training under the sum of squared errors, and full-batch and mini-batch training under the mean squared error. We showed that, in all cases, training for DA with on-line copies is approximately equivalent to ridge regression training whose regularization parameter corresponds to the variance of the injected noise. On the other hand, we showed that the learning rate is multiplied by the number of noisy copies plus one in full-batch training under the sum of squared errors and in mini-batch training under the mean squared error; i.e., DA with on-line copies yields an apparent acceleration of training. The apparent acceleration and the regularization effect come from the original part and the noise in the copied data, respectively. These results are confirmed in a numerical experiment. In the numerical experiment, we found that our result can be approximately applied to usual off-line DA in the under-parameterized scenario but not in the over-parameterized scenario. Moreover, we experimentally investigated the training process of neural networks under DA with off-line noisy copies and found that our analysis of linear regression can be applied to neural networks.
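The link between input noise and ridge regression can be checked numerically with the standard expectation argument: for zero-mean input noise with variance sigma^2, the expected squared error equals the clean squared error plus sigma^2 times the squared weight. The sketch below is a one-dimensional illustration, not the paper's derivation; the function names, the toy data and the two-point noise (plus or minus sigma with equal probability, chosen so the expectation is exact) are assumptions.

```python
def mse(w, xs, ys):
    """Mean squared error of the 1-D linear model y ~ w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def expected_noisy_mse(w, xs, ys, sigma):
    """Exact expected MSE when each input is perturbed by +sigma or -sigma.

    The two-point noise has mean 0 and variance sigma**2, so the expectation
    is a plain average over the two perturbations.
    """
    total = 0.0
    for x, y in zip(xs, ys):
        for e in (sigma, -sigma):
            total += 0.5 * (y - w * (x + e)) ** 2
    return total / len(xs)


def ridge_objective(w, xs, ys, lam):
    """Clean MSE plus an L2 penalty lam * w**2."""
    return mse(w, xs, ys) + lam * w ** 2


if __name__ == "__main__":
    xs = [0.5, 1.0, 1.5, 2.0]
    ys = [1.1, 1.9, 3.2, 3.9]
    sigma = 0.3
    for w in (0.0, 1.0, 2.0):
        noisy = expected_noisy_mse(w, xs, ys, sigma)
        ridge = ridge_objective(w, xs, ys, sigma ** 2)
        print(w, noisy, ridge)
```

The two printed columns agree up to floating-point rounding for every `w`, which is the sense in which the noise variance plays the role of the ridge parameter in this simplified setting.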
    Fairness-Aware PAC Learning from Corrupted Data. (arXiv:2102.06004v3 [cs.LG] UPDATED)
    Addressing fairness concerns about machine learning models is a crucial step towards their long-term adoption in real-world automated systems. While many approaches have been developed for training fair models from data, little is known about the robustness of these methods to data corruption. In this work we consider fairness-aware learning under worst-case data manipulations. We show that an adversary can in some situations force any learner to return an overly biased classifier, regardless of the sample size and with or without degrading accuracy, and that the strength of the excess bias increases for learning problems with underrepresented protected groups in the data. We also prove that our hardness results are tight up to constant factors. To this end, we study two natural learning algorithms that optimize for both accuracy and fairness and show that these algorithms enjoy guarantees that are order-optimal in terms of the corruption ratio and the protected groups frequencies in the large data limit.
    Neural Collapse: A Review on Modelling Principles and Generalization. (arXiv:2206.04041v1 [cs.LG])
    With the recent observation of the "Neural Collapse (NC)" phenomenon by Papyan et al., various efforts have been made to model it and analyse its implications. Neural collapse describes how, in deep classifier networks, the class features of the final hidden layer associated with training data tend to collapse to the respective class feature means, thus simplifying the behaviour of the last layer classifier to that of a nearest-class center decision rule. In this work, we analyse the principles which aid in modelling such a phenomenon from the ground up and show how they can build a common understanding of the recently proposed models that try to explain NC. We hope that our analysis presents a multifaceted perspective on modelling NC and aids in forming connections with the generalization capabilities of neural networks. Finally, we conclude by discussing the avenues for further research and propose potential research problems.
    Dataset Condensation with Contrastive Signals. (arXiv:2202.02916v2 [cs.CV] UPDATED)
    Recent studies have demonstrated that gradient matching-based dataset synthesis, or dataset condensation (DC), methods can achieve state-of-the-art performance when applied to data-efficient learning tasks. However, in this study, we prove that the existing DC methods can perform worse than the random selection method when task-irrelevant information forms a significant part of the training dataset. We attribute this to the lack of participation of the contrastive signals between the classes resulting from the class-wise gradient matching strategy. To address this problem, we propose Dataset Condensation with Contrastive signals (DCC) by modifying the loss function to enable the DC methods to effectively capture the differences between classes. In addition, we analyze the new loss function in terms of training dynamics by tracking the kernel velocity. Furthermore, we introduce a bi-level warm-up strategy to stabilize the optimization. Our experimental results indicate that while the existing methods are ineffective for fine-grained image classification tasks, the proposed method can successfully generate informative synthetic datasets for the same tasks. Moreover, we demonstrate that the proposed method outperforms the baselines even on benchmark datasets such as SVHN, CIFAR-10, and CIFAR-100. Finally, we demonstrate the high applicability of the proposed method by applying it to continual learning tasks.
    Continuous LWE is as Hard as LWE & Applications to Learning Gaussian Mixtures. (arXiv:2204.02550v2 [cs.CR] UPDATED)
    We show direct and conceptually simple reductions between the classical learning with errors (LWE) problem and its continuous analog, CLWE (Bruna, Regev, Song and Tang, STOC 2021). This allows us to bring to bear the powerful machinery of LWE-based cryptography to the applications of CLWE. For example, we obtain the hardness of CLWE under the classical worst-case hardness of the gap shortest vector problem. Previously, this was known only under quantum worst-case hardness of lattice problems. More broadly, with our reductions between the two problems, any future developments to LWE will also apply to CLWE and its downstream applications. As a concrete application, we show an improved hardness result for density estimation for mixtures of Gaussians. In this computational problem, given sample access to a mixture of Gaussians, the goal is to output a function that estimates the density function of the mixture. Under the (plausible and widely believed) exponential hardness of the classical LWE problem, we show that Gaussian mixture density estimation in $\mathbb{R}^n$ with roughly $\log n$ Gaussian components given $\mathsf{poly}(n)$ samples requires time quasi-polynomial in $n$. Under the (conservative) polynomial hardness of LWE, we show hardness of density estimation for $n^{\epsilon}$ Gaussians for any constant $\epsilon > 0$, which improves on Bruna, Regev, Song and Tang (STOC 2021), who show hardness for at least $\sqrt{n}$ Gaussians under polynomial (quantum) hardness assumptions. Our key technical tool is a reduction from classical LWE to LWE with $k$-sparse secrets where the multiplicative increase in the noise is only $O(\sqrt{k})$, independent of the ambient dimension $n$.
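As a reminder of the two problems being related, the usual sample distributions can be written side by side. This is a sketch from the standard definitions rather than from the abstract itself; the symbols $q$, $\chi$, $\gamma$, $\beta$ and the Gaussian normalization are assumed notation, and conventions differ between papers.

```latex
% LWE sample (secret s in Z_q^n, small noise e drawn from a distribution chi):
\[
  (\mathbf{a},\, b), \qquad
  \mathbf{a} \sim \mathrm{Unif}(\mathbb{Z}_q^n), \qquad
  b = \langle \mathbf{a}, \mathbf{s} \rangle + e \bmod q .
\]
% CLWE sample (hidden unit vector w, Gaussian y, Gaussian noise e of width beta):
\[
  (\mathbf{y},\, z), \qquad
  \mathbf{y} \sim \text{Gaussian on } \mathbb{R}^n, \qquad
  z = \gamma \langle \mathbf{w}, \mathbf{y} \rangle + e \bmod 1 .
\]
```

In both cases the task is to distinguish such samples from ones where the second component is uniform, or to recover the hidden vector.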
    Causal inference for observational longitudinal studies using deep survival models. (arXiv:2101.10643v12 [stat.ML] UPDATED)
    Causal inference for observational longitudinal studies often requires the accurate estimation of treatment effects on time-to-event outcomes in the presence of time-dependent patient history and time-dependent covariates. To tackle this longitudinal treatment effect estimation problem, we have developed a time-variant causal survival (TCS) model that uses the potential outcomes framework with an ensemble of recurrent subnetworks to estimate the difference in survival probabilities and its confidence interval over time as a function of time-dependent covariates and treatments. Using simulated survival datasets, the TCS model showed good causal effect estimation performance across scenarios of varying sample dimensions, event rates, confounding and overlapping. However, increasing the sample size was not effective in alleviating the adverse impact of a high level of confounding. In a large clinical cohort study, TCS identified the expected conditional average treatment effect and detected individual treatment effect heterogeneity over time. TCS provides an efficient way to estimate and update individualized treatment effects over time, in order to improve clinical decisions. The use of a propensity score layer and potential outcome subnetworks helps correcting for selection bias. However, the proposed model is limited in its ability to correct the bias from unmeasured confounding, and more extensive testing of TCS under extreme scenarios such as low overlapping and the presence of unmeasured confounders is desired and left for future work.  ( 3 min )
    Learning Pruned Structure and Weights Simultaneously from Scratch: an Attention based Approach. (arXiv:2111.02399v2 [cs.LG] UPDATED)
    As a deep learning model typically contains millions of trainable weights, there has been a growing demand for a more efficient network structure with reduced storage space and improved run-time efficiency. Pruning is one of the most popular network compression techniques. In this paper, we propose a novel unstructured pruning pipeline, Attention-based Simultaneous sparse structure and Weight Learning (ASWL). Unlike traditional channel-wise or weight-wise attention mechanism, ASWL proposed an efficient algorithm to calculate the pruning ratio through layer-wise attention for each layer, and both weights for the dense network and the sparse network are tracked so that the pruned structure is simultaneously learned from randomly initialized weights. Our experiments on MNIST, Cifar10, and ImageNet show that ASWL achieves superior pruning results in terms of accuracy, pruning ratio and operating efficiency when compared with state-of-the-art network pruning methods.  ( 2 min )
    Quantum continual learning of quantum data realizing knowledge backward transfer. (arXiv:2203.14032v2 [quant-ph] UPDATED)
    For the goal of strong artificial intelligence that can mimic human-level intelligence, AI systems should have the ability to adapt to ever-changing scenarios and learn new knowledge continuously without forgetting previously acquired knowledge. When a machine learning model is consecutively trained on multiple tasks that come in sequence, its performance on previously learned tasks may drop dramatically during the learning process of the newly seen task. To avoid this phenomenon, termed catastrophic forgetting, continual learning, also known as lifelong learning, has been proposed and has become one of the most up-to-date research areas of machine learning. As quantum machine learning has blossomed in recent years, it is interesting to develop quantum continual learning. This paper focuses on the case of quantum models for quantum data where the computation model and the data to be processed are both quantum. The gradient episodic memory method is incorporated to design a quantum continual learning scheme that overcomes catastrophic forgetting and realizes knowledge backward transfer. Specifically, a sequence of quantum state classification tasks is continually learned by a variational quantum classifier whose parameters are optimized by a classical gradient-based optimizer. The gradient of the current task is projected to the closest gradient, avoiding the increase of the loss at previous tasks, but allowing the decrease. Numerical simulation results show that our scheme not only overcomes catastrophic forgetting, but also realizes knowledge backward transfer, which means the classifier's performance on previous tasks is enhanced rather than compromised while learning a new task.  ( 2 min )
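The gradient projection step can be illustrated for the simplest case of a single previous task, where the closest gradient that does not increase the old loss (to first order) has a closed form. The sketch below is a classical, plain-Python illustration under that single-constraint assumption; with several previous tasks, gradient episodic memory solves a small quadratic program instead. The function names are hypothetical and nothing here models the quantum classifier.

```python
def dot(a, b):
    """Inner product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))


def project_gradient(g, g_prev):
    """Return the vector closest to g whose inner product with g_prev is >= 0.

    g      : gradient of the current task's loss
    g_prev : gradient of the previous task's loss (from stored examples)
    A non-negative inner product means a descent step along the result does
    not increase the previous task's loss to first order.
    """
    inner = dot(g, g_prev)
    if inner >= 0:
        # No conflict: the current gradient already respects the constraint.
        return list(g)
    # Conflict: remove the component of g that points against g_prev.
    scale = inner / dot(g_prev, g_prev)
    return [gi - scale * pi for gi, pi in zip(g, g_prev)]


if __name__ == "__main__":
    print(project_gradient([1.0, -2.0], [0.0, 1.0]))  # conflicting gradients
    print(project_gradient([1.0, 1.0], [0.0, 1.0]))   # no conflict
```

In the first call the conflicting component is removed and the result is orthogonal to `g_prev`; in the second the gradient is returned unchanged.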
    Narrowing the Coordinate-frame Gap in Behavior Prediction Models: Distillation for Efficient and Accurate Scene-centric Motion Forecasting. (arXiv:2206.03970v1 [cs.CV])
    Behavior prediction models have proliferated in recent years, especially in the popular real-world robotics application of autonomous driving, where representing the distribution over possible futures of moving agents is essential for safe and comfortable motion planning. In these models, the choice of coordinate frames to represent inputs and outputs has crucial trade offs which broadly fall into one of two categories. Agent-centric models transform inputs and perform inference in agent-centric coordinates. These models are intrinsically invariant to translation and rotation between scene elements, are best-performing on public leaderboards, but scale quadratically with the number of agents and scene elements. Scene-centric models use a fixed coordinate system to process all agents. This gives them the advantage of sharing representations among all agents, offering efficient amortized inference computation which scales linearly with the number of agents. However, these models have to learn invariance to translation and rotation between scene elements, and typically underperform agent-centric models. In this work, we develop knowledge distillation techniques between probabilistic motion forecasting models, and apply these techniques to close the gap in performance between agent-centric and scene-centric models. This improves scene-centric model performance by 13.2% on the public Argoverse benchmark, 7.8% on Waymo Open Dataset and up to 9.4% on a large In-House dataset. These improved scene-centric models rank highly in public leaderboards and are up to 15 times more efficient than their agent-centric teacher counterparts in busy scenes.  ( 2 min )
    Between Stochastic and Adversarial Online Convex Optimization: Improved Regret Bounds via Smoothness. (arXiv:2202.07554v2 [cs.LG] UPDATED)
    Stochastic and adversarial data are two widely studied settings in online learning. But many optimization tasks are neither i.i.d. nor fully adversarial, which makes it of fundamental interest to get a better theoretical understanding of the world between these extremes. In this work we establish novel regret bounds for online convex optimization in a setting that interpolates between stochastic i.i.d. and fully adversarial losses. By exploiting smoothness of the expected losses, these bounds replace a dependence on the maximum gradient length by the variance of the gradients, which was previously known only for linear losses. In addition, they weaken the i.i.d. assumption by allowing, for example, adversarially poisoned rounds, which were previously considered in the expert and bandit setting. Our results extend this to the online convex optimization framework. In the fully i.i.d. case, our bounds match the rates one would expect from results in stochastic acceleration, and in the fully adversarial case they gracefully deteriorate to match the minimax regret. We further provide lower bounds showing that our regret upper bounds are tight for all intermediate regimes in terms of the stochastic variance and the adversarial variation of the loss gradients.  ( 2 min )
    Model-Free $\mu$ Synthesis via Adversarial Reinforcement Learning. (arXiv:2111.15537v2 [cs.LG] UPDATED)
    Motivated by the recent empirical success of policy-based reinforcement learning (RL), there has been a research trend studying the performance of policy-based RL methods on standard control benchmark problems. In this paper, we examine the effectiveness of policy-based RL methods on an important robust control problem, namely $\mu$ synthesis. We build a connection between robust adversarial RL and $\mu$ synthesis, and develop a model-free version of the well-known $DK$-iteration for solving state-feedback $\mu$ synthesis with static $D$-scaling. In the proposed algorithm, the $K$ step mimics the classical central path algorithm via incorporating a recently-developed double-loop adversarial RL method as a subroutine, and the $D$ step is based on model-free finite difference approximation. Extensive numerical study is also presented to demonstrate the utility of our proposed model-free algorithm. Our study sheds new light on the connections between adversarial RL and robust control.  ( 2 min )
    Diversity vs. Recognizability: Human-like generalization in one-shot generative models. (arXiv:2205.10370v2 [cs.AI] UPDATED)
    Robust generalization to new concepts has long remained a distinctive feature of human intelligence. However, recent progress in deep generative models has now led to neural architectures capable of synthesizing novel instances of unknown visual concepts from a single training example. Yet, a more precise comparison between these models and humans is not possible because existing performance metrics for generative models (i.e., FID, IS, likelihood) are not appropriate for the one-shot generation scenario. Here, we propose a new framework to evaluate one-shot generative models along two axes: sample recognizability vs. diversity (i.e., intra-class variability). Using this framework, we perform a systematic evaluation of representative one-shot generative models on the Omniglot handwritten dataset. We first show that GAN-like and VAE-like models fall on opposite ends of the diversity-recognizability space. Extensive analyses of the effect of key model parameters further revealed that spatial attention and context integration have a linear contribution to the diversity-recognizability trade-off. In contrast, disentanglement transports the model along a parabolic curve that could be used to maximize recognizability. Using the diversity-recognizability framework, we were able to identify models and parameters that closely approximate human data.  ( 2 min )
    Few-Shot Audio-Visual Learning of Environment Acoustics. (arXiv:2206.04006v1 [cs.SD])
    Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener, with implications for various applications in AR, VR, and robotics. Whereas traditional methods to estimate RIRs assume dense geometry and/or sound measurements throughout the environment, we explore how to infer RIRs based on a sparse set of images and echoes observed in the space. Towards that goal, we introduce a transformer-based method that uses self-attention to build a rich acoustic context, then predicts RIRs of arbitrary query source-receiver locations through cross-attention. Additionally, we design a novel training objective that improves the match in the acoustic signature between the RIR predictions and the targets. In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs, outperforming state-of-the-art methods and--in a major departure from traditional methods--generalizing to novel environments in a few-shot manner.  ( 2 min )
    Improve Generalization and Robustness of Neural Networks via Weight Scale Shifting Invariant Regularizations. (arXiv:2008.02965v2 [cs.LG] UPDATED)
    Using weight decay to penalize the L2 norms of weights in neural networks has been a standard training practice to regularize the complexity of networks. In this paper, we show that a family of regularizers, including weight decay, is ineffective at penalizing the intrinsic norms of weights for networks with positively homogeneous activation functions, such as linear, ReLU and max-pooling functions. As a result of homogeneity, functions specified by the networks are invariant to the shifting of weight scales between layers. The ineffective regularizers are sensitive to such shifting and thus poorly regularize the model capacity, leading to overfitting. To address this shortcoming, we propose an improved regularizer that is invariant to weight scale shifting and thus effectively constrains the intrinsic norm of a neural network. The derived regularizer is an upper bound for the input gradient of the network so minimizing the improved regularizer also benefits the adversarial robustness. Residual connections are also considered and we show that our regularizer also forms an upper bound to input gradients of such a residual network. We demonstrate the efficacy of our proposed regularizer on various datasets and neural network architectures at improving generalization and adversarial robustness.  ( 2 min )
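The scale-shifting invariance described above is easy to verify on a tiny two-layer ReLU network: multiplying the first layer by c > 0 and dividing the second by c leaves the output unchanged while the weight-decay penalty changes. The sketch below only demonstrates this invariance; it does not implement the regularizer proposed in the paper, and the helper names and weight values are illustrative assumptions.

```python
def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def matvec(W, x):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def forward(W1, W2, x):
    """Two-layer network: W2 * relu(W1 * x), no biases."""
    return matvec(W2, relu(matvec(W1, x)))


def l2_penalty(*mats):
    """Sum of squared weights, i.e. the quantity weight decay penalizes."""
    return sum(w * w for W in mats for row in W for w in row)


def shift_scale(W1, W2, c):
    """Move a positive scale factor c from the second layer into the first."""
    return [[c * w for w in row] for row in W1], [[w / c for w in row] for row in W2]


if __name__ == "__main__":
    W1 = [[1.0, -2.0], [0.5, 1.0]]
    W2 = [[2.0, -1.0]]
    x = [1.0, 0.25]
    V1, V2 = shift_scale(W1, W2, 4.0)
    print(forward(W1, W2, x), forward(V1, V2, x))  # same function value
    print(l2_penalty(W1, W2), l2_penalty(V1, V2))  # different penalties
```

Because ReLU is positively homogeneous, relu(c * z) = c * relu(z) for c > 0, which is why the rescaling cancels in the output but not in the sum of squared weights.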
    Designing Reinforcement Learning Algorithms for Digital Interventions: Pre-implementation Guidelines. (arXiv:2206.03944v1 [cs.LG])
    Online reinforcement learning (RL) algorithms are increasingly used to personalize digital interventions in the fields of mobile health and online education. Common challenges in designing and testing an RL algorithm in these settings include ensuring the RL algorithm can learn and run stably under real-time constraints, and accounting for the complexity of the environment, e.g., a lack of accurate mechanistic models for the user dynamics. To guide how one can tackle these challenges, we extend the PCS (Predictability, Computability, Stability) framework, a data science framework that incorporates best practices from machine learning and statistics in supervised learning (Yu and Kumbier, 2020), to the design of RL algorithms for the digital interventions setting. Further, we provide guidelines on how to design simulation environments, a crucial tool for evaluating RL candidate algorithms using the PCS framework. We illustrate the use of the PCS framework for designing an RL algorithm for Oralytics, a mobile health study aiming to improve users' tooth-brushing behaviors through the personalized delivery of intervention messages. Oralytics will go into the field in late 2022.  ( 2 min )
    A Primal-Dual Approach to Bilevel Optimization with Multiple Inner Minima. (arXiv:2203.01123v2 [math.OC] UPDATED)
    Bilevel optimization has found extensive applications in modern machine learning problems such as hyperparameter optimization, neural architecture search, meta-learning, etc. While bilevel problems with a unique inner minimal point (e.g., where the inner function is strongly convex) are well understood, problems with multiple inner minimal points remain challenging and open. Existing algorithms designed for such problems are applicable only to restricted situations and do not come with a full guarantee of convergence. In this paper, we adopt a reformulation of bilevel optimization to constrained optimization, and solve the problem via a primal-dual bilevel optimization (PDBO) algorithm. PDBO not only addresses the multiple inner minima challenge, but also features fully first-order efficiency without involving second-order Hessian and Jacobian computations, as opposed to most existing gradient-based bilevel algorithms. We further characterize the convergence rate of PDBO, which serves as the first known non-asymptotic convergence guarantee for bilevel optimization with multiple inner minima. Our experiments demonstrate the desired performance of the proposed approach.  ( 2 min )
    Using Mixed-Effect Models to Learn Bayesian Networks from Related Data Sets. (arXiv:2206.03743v1 [stat.ML])
    We commonly assume that data are a homogeneous set of observations when learning the structure of Bayesian networks. However, they often comprise different data sets that are related but not homogeneous because they have been collected in different ways or from different populations. In our previous work (Azzimonti, Corani and Scutari, 2021), we proposed a closed-form Bayesian Hierarchical Dirichlet score for discrete data that pools information across related data sets to learn a single encompassing network structure, while taking into account the differences in their probabilistic structures. In this paper, we provide an analogous solution for learning a Bayesian network from continuous data using mixed-effects models to pool information across the related data sets. We study its structural, parametric, predictive and classification accuracy and we show that it outperforms both conditional Gaussian Bayesian networks (that do not perform any pooling) and classical Gaussian Bayesian networks (that disregard the heterogeneous nature of the data). The improvement is marked for low sample sizes and for unbalanced data sets.
    A Unified Convergence Theorem for Stochastic Optimization Methods. (arXiv:2206.03907v1 [math.OC])
    In this work, we provide a fundamental unified convergence theorem used for deriving expected and almost sure convergence results for a series of stochastic optimization methods. Our unified theorem only requires verifying several representative conditions and is not tailored to any specific algorithm. As a direct application, we recover expected and almost sure convergence results of the stochastic gradient method (SGD) and random reshuffling (RR) under more general settings. Moreover, we establish new expected and almost sure convergence results for the stochastic proximal gradient method (prox-SGD) and stochastic model-based methods (SMM) for nonsmooth nonconvex optimization problems. These applications reveal that our unified theorem provides a plugin-type convergence analysis and strong convergence guarantees for a wide class of stochastic optimization methods.
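    For readers unfamiliar with the two baseline methods named above, the sketch below contrasts how SGD and random reshuffling (RR) pick samples. It is an illustration only, assuming a hypothetical noise-free least-squares problem; the helper names, step size and epoch count are assumptions, and nothing here reproduces the paper's theorem or its conditions.

```python
import numpy as np

def sgd_indices(n, rng):
    """SGD: draw n indices independently, with replacement."""
    return rng.integers(0, n, size=n)

def rr_indices(n, rng):
    """Random reshuffling: a fresh permutation, so each sample is used once per epoch."""
    return rng.permutation(n)

def run(index_sampler, X, y, lr=0.05, epochs=200, seed=0):
    """Per-sample gradient descent on 0.5 * (x.w - y)^2 with the given sampler."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in index_sampler(len(y), rng):
            grad = (X[i] @ w - y[i]) * X[i]
            w -= lr * grad
    return w

# Hypothetical toy data: consistent linear system, so both methods can reach w_true.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w_sgd = run(sgd_indices, X, y)
w_rr = run(rr_indices, X, y)
print("SGD error:", np.linalg.norm(w_sgd - w_true))
print("RR error:", np.linalg.norm(w_rr - w_true))
```

    The only difference between the two runs is the index sampler, which makes it easy to see why a single analysis covering both is attractive.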
    Stabilizing Voltage in Power Distribution Networks via Multi-Agent Reinforcement Learning with Transformer. (arXiv:2206.03721v1 [cs.MA])
    The increased integration of renewable energy poses a slew of technical challenges for the operation of power distribution networks. Among them, voltage fluctuations caused by the instability of renewable energy are receiving increasing attention. Using multi-agent reinforcement learning (MARL) algorithms to coordinate multiple control units in the grid, which can handle rapid changes in power systems, has recently been widely studied for the active voltage control task. However, existing approaches based on MARL ignore the unique nature of the grid and achieve limited performance. In this paper, we introduce the transformer architecture to extract representations adapting to power network problems and propose a Transformer-based Multi-Agent Actor-Critic framework (T-MAAC) to stabilize voltage in power distribution networks. In addition, we adopt a novel auxiliary-task training process tailored to the voltage control task, which improves the sample efficiency and facilitates the representation learning of the transformer-based model. We couple T-MAAC with different multi-agent actor-critic algorithms, and the consistent improvements on the active voltage control task demonstrate the effectiveness of the proposed method.
    Subject Granular Differential Privacy in Federated Learning. (arXiv:2206.03617v1 [cs.LG])
    This paper introduces subject granular privacy in the Federated Learning (FL) setting, where a subject is an individual whose private information is embodied by several data items either confined within a single federation user or distributed across multiple federation users. We formally define the notion of subject level differential privacy for FL. We propose three new algorithms that enforce subject level DP. Two of these algorithms are based on notions of user level local differential privacy (LDP) and group differential privacy respectively. The third algorithm is based on a novel idea of hierarchical gradient averaging (HiGradAvgDP) for subjects participating in a training mini-batch. We also introduce horizontal composition of privacy loss for a subject across multiple federation users. We show that horizontal composition is equivalent to sequential composition in the worst case. We prove the subject level DP guarantee for all our algorithms and empirically analyze them using the FEMNIST and Shakespeare datasets. Our evaluation shows that, of our three algorithms, HiGradAvgDP delivers the best model performance, approaching that of a model trained using a DP-SGD based algorithm that provides a weaker item level privacy guarantee.  ( 2 min )
    Metric Based Few-Shot Graph Classification. (arXiv:2206.03695v1 [cs.LG])
    Many modern deep-learning techniques do not work without enormous datasets. At the same time, several fields demand methods that work when data is scarce. This problem is even more complex when the samples have varying structures, as in the case of graphs. Graph representation learning techniques have recently proven successful in a variety of domains. Nevertheless, the employed architectures perform miserably when faced with data scarcity. On the other hand, few-shot learning allows employing modern deep learning models in scarce data regimes without waiving their effectiveness. In this work, we tackle the problem of few-shot graph classification, showing that equipping a simple distance metric learning baseline with a state-of-the-art graph embedder yields competitive results on the task. While the simplicity of the architecture is enough to outperform more complex ones, it also allows straightforward additions. To this end, we show that additional improvements may be obtained by encouraging a task-conditioned embedding space. Finally, we propose a MixUp-based online data augmentation technique acting in the latent space and show its effectiveness on the task.  ( 2 min )
    Solving the Spike Feature Information Vanishing Problem in Spiking Deep Q Network with Potential Based Normalization. (arXiv:2206.03654v1 [cs.NE])
    Brain-inspired spiking neural networks (SNNs) have been successfully applied to many pattern recognition domains. Deep SNN-based structures have achieved considerable results in perceptual tasks such as image classification and target detection. However, the application of deep SNNs in reinforcement learning (RL) tasks is still a problem to be explored. Although there have been previous studies on the combination of SNNs and RL, most of them focus on robotic control problems with shallow networks or use ANN-SNN conversion methods to implement a spiking deep Q network (SDQN). In this work, we mathematically analyze the problem of the disappearance of spiking signal features in SDQN and propose a potential based layer normalization (pbLN) method to directly train spiking deep Q networks. Experiments show that, compared with state-of-the-art ANN-SNN conversion methods and other SDQN works, the proposed pbLN spiking deep Q networks (PL-SDQN) achieve better performance on Atari game tasks.  ( 2 min )
    EiX-GNN : Concept-level eigencentrality explainer for graph neural networks. (arXiv:2206.03491v1 [cs.AI])
    Explaining is a process of transferring knowledge about a phenomenon from an explainer to an explainee. Each word used to explain the phenomenon must be carefully selected by the explainer according to the explainee's current knowledge of the phenomenon and the phenomenon itself, so that the explainee understands it well. Nowadays, deep models, especially graph neural networks, have a major place in daily life, even in critical applications. In this context, those models need to be highly interpretable by humans, also referred to as being explainable, in order to improve trust in their use in sensitive cases. Explaining is also a human-dependent task, and methods that explain deep model behavior must account for these social concerns to provide useful, high-quality explanations. Current explanation methods often overlook this social aspect and focus only on the signal aspect of the question. In this contribution, we propose a reliable social-aware explanation method suited to graph neural networks that includes this social feature as a modular concept generator and leverages both signal and graph domain aspects through an eigencentrality concept ordering approach. Besides taking into account the human-dependent aspect underlying any explanation process, our method also achieves high scores on state-of-the-art objective metrics for assessing explanation methods for graph neural network models.  ( 2 min )
    Autoregressive Perturbations for Data Poisoning. (arXiv:2206.03693v1 [cs.LG])
    The prevalence of data scraping from social media as a means to obtain datasets has led to growing concerns regarding unauthorized use of data. Data poisoning attacks have been proposed as a bulwark against scraping, as they make data "unlearnable" by adding small, imperceptible perturbations. Unfortunately, existing methods require knowledge of both the target architecture and the complete dataset so that a surrogate network can be trained, the parameters of which are used to generate the attack. In this work, we introduce autoregressive (AR) poisoning, a method that can generate poisoned data without access to the broader dataset. The proposed AR perturbations are generic, can be applied across different datasets, and can poison different architectures. Compared to existing unlearnable methods, our AR poisons are more resistant against common defenses such as adversarial training and strong data augmentations. Our analysis further provides insight into what makes an effective data poison.  ( 2 min )
    Ensembles for Uncertainty Estimation: Benefits of Prior Functions and Bootstrapping. (arXiv:2206.03633v1 [cs.LG])
    In machine learning, an agent needs to estimate uncertainty to efficiently explore and adapt and to make effective decisions. A common approach to uncertainty estimation maintains an ensemble of models. In recent years, several approaches have been proposed for training ensembles, and conflicting views prevail with regards to the importance of various ingredients of these approaches. In this paper, we aim to address the benefits of two ingredients -- prior functions and bootstrapping -- which have come into question. We show that prior functions can significantly improve an ensemble agent's joint predictions across inputs and that bootstrapping affords additional benefits if the signal-to-noise ratio varies across inputs. Our claims are justified by both theoretical and experimental results.  ( 2 min )
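    The two ingredients discussed above can be illustrated with a very small ensemble. This is a minimal sketch under stated assumptions: the members are ridge-regularized linear models rather than neural networks, each "prior function" is a fixed random linear map, and the names `fit_member` and `ensemble_predict` are hypothetical. It is not the paper's experimental setup.

```python
import numpy as np

def fit_member(X, y, rng, lam=1.0, prior_scale=1.0, bootstrap=True):
    """Fit one member: a fixed random prior plus a ridge-trained correction."""
    n, d = X.shape
    theta_prior = prior_scale * rng.normal(size=d)  # fixed prior, never trained
    # Bootstrapping: resample the training set with replacement for this member.
    idx = rng.integers(0, n, size=n) if bootstrap else np.arange(n)
    Xb, yb = X[idx], y[idx]
    # Train the correction on what the prior does not already explain.
    residual = yb - Xb @ theta_prior
    w = np.linalg.solve(Xb.T @ Xb + lam * np.eye(d), Xb.T @ residual)
    return theta_prior + w  # the member predicts x @ (prior + correction)

def ensemble_predict(members, X_query):
    """Return predictions with shape (num_members, num_queries)."""
    return np.stack([X_query @ theta for theta in members])

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=20)

members = [fit_member(X, y, rng) for _ in range(30)]
x_near = np.array([[0.1, 0.1]])
x_far = np.array([[10.0, 10.0]])
std_near = ensemble_predict(members, x_near).std()
std_far = ensemble_predict(members, x_far).std()
print("spread near origin:", std_near, "spread far away:", std_far)
```

    The spread of member predictions serves as the uncertainty estimate; setting `bootstrap=False` or `prior_scale=0.0` removes one ingredient at a time so their separate effects can be compared.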
    A Penny for Your (visual) Thoughts: Self-Supervised Reconstruction of Natural Movies from Brain Activity. (arXiv:2206.03544v1 [cs.CV])
    Reconstructing natural videos from fMRI brain recordings is very challenging, for two main reasons: (i) As fMRI data acquisition is difficult, we only have a limited amount of supervised samples, which is not enough to cover the huge space of natural videos; and (ii) The temporal resolution of fMRI recordings is much lower than the frame rate of natural videos. In this paper, we propose a self-supervised approach for natural movie reconstruction. By employing cycle consistency over Encoding-Decoding natural videos, we can: (i) exploit the full framerate of the training videos, and not be limited only to clips that correspond to fMRI recordings; (ii) exploit massive amounts of external natural videos which the subjects never saw inside the fMRI machine. These enable increasing the applicable training data by several orders of magnitude, introducing natural video priors to the decoding network, as well as temporal coherence. Our approach significantly outperforms competing methods, since those train only on the limited supervised data. We further introduce a new and simple temporal prior of natural videos, which when folded into our fMRI decoder further allows us to reconstruct videos at a higher framerate (HFR) of up to x8 of the original fMRI sample rate.  ( 2 min )
    Towards Scalable Hyperbolic Neural Networks using Taylor Series Approximations. (arXiv:2206.03610v1 [cs.LG])
    Hyperbolic networks have shown prominent improvements over their Euclidean counterparts in several areas involving hierarchical datasets in various domains such as computer vision, graph analysis, and natural language processing. However, their adoption in practice remains restricted due to (i) non-scalability on accelerated deep learning hardware, (ii) vanishing gradients due to the closure of hyperbolic space, and (iii) information loss due to frequent mapping between local tangent space and fully hyperbolic space. To tackle these issues, we propose the approximation of hyperbolic operators using Taylor series expansions, which allows us to reformulate the computationally expensive tangent and cosine hyperbolic functions into their polynomial equivariants which are more efficient. This allows us to retain the benefits of preserving the hierarchical anatomy of the hyperbolic space, while maintaining the scalability over current accelerated deep learning infrastructure. The polynomial formulation also enables us to utilize the advancements in Euclidean networks such as gradient clipping and ReLU activation to avoid vanishing gradients and remove errors due to frequent switching between tangent space and hyperbolic space. Our empirical evaluation on standard benchmarks in the domain of graph analysis and computer vision shows that our polynomial formulation is as scalable as Euclidean architectures, both in terms of memory and time complexity, while providing results as effective as hyperbolic models. Moreover, our formulation also shows a considerable improvement over its baselines due to our solution to vanishing gradients and information loss.  ( 2 min )
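    The idea of replacing hyperbolic functions with polynomials can be sketched with truncated Maclaurin series. The truncation orders below are an assumption chosen for illustration (the abstract does not state which orders the authors use), and `tanh_taylor` and `cosh_taylor` are hypothetical helper names.

```python
import numpy as np

def tanh_taylor(x):
    """Truncated Maclaurin series of tanh: x - x^3/3 + 2x^5/15."""
    x = np.asarray(x, dtype=float)
    return x - x**3 / 3.0 + 2.0 * x**5 / 15.0

def cosh_taylor(x):
    """Truncated Maclaurin series of cosh: 1 + x^2/2 + x^4/24."""
    x = np.asarray(x, dtype=float)
    return 1.0 + x**2 / 2.0 + x**4 / 24.0

# Compare the polynomial forms against the exact functions at a few points.
xs = np.array([0.0, 0.1, 0.5, 1.0])
tanh_err = np.abs(tanh_taylor(xs) - np.tanh(xs))
cosh_err = np.abs(cosh_taylor(xs) - np.cosh(xs))

for x, te, ce in zip(xs, tanh_err, cosh_err):
    print(f"x={x:.1f}  tanh error={te:.2e}  cosh error={ce:.2e}")
```

    The error is tiny near zero and grows with the magnitude of the input, so a scheme of this kind depends on keeping inputs in a range where the truncation is accurate.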
    Asymptotic Stability in Reservoir Computing. (arXiv:2206.03854v1 [cs.NE])
    Reservoir Computing is a class of Recurrent Neural Networks with internal weights fixed at random. Stability relates to the sensitivity of the network state to perturbations. It is an important property in Reservoir Computing as it directly impacts performance. In practice, it is desirable to stay in a stable regime, where the effect of perturbations does not explode exponentially, but also close to the chaotic frontier where reservoir dynamics are rich. Open questions remain today regarding input regularization and discontinuous activation functions. In this work, we use the recurrent kernel limit to draw new insights on stability in reservoir computing. This limit corresponds to large reservoir sizes, and it already becomes relevant for reservoirs with a few hundred neurons. We obtain a quantitative characterization of the frontier between stability and chaos, which can greatly benefit hyperparameter tuning. In a broader sense, our results contribute to understanding the complex dynamics of Recurrent Neural Networks.  ( 2 min )
    DeepCAVE: An Interactive Analysis Tool for Automated Machine Learning. (arXiv:2206.03493v1 [cs.LG])
    Automated Machine Learning (AutoML) is used more than ever before to support users in determining efficient hyperparameters, neural architectures, or even full machine learning pipelines. However, users tend to mistrust the optimization process and its results due to a lack of transparency, making manual tuning still widespread. We introduce DeepCAVE, an interactive framework to analyze and monitor state-of-the-art optimization procedures for AutoML easily and ad hoc. By aiming for full and accessible transparency, DeepCAVE builds a bridge between users and AutoML and contributes to establishing trust. Our framework's modular and easy-to-extend nature provides users with automatically generated text, tables, and graphic visualizations. We show the value of DeepCAVE in an exemplary use-case of outlier detection, in which our framework makes it easy to identify problems, compare multiple runs and interpret optimization processes. The package is freely available on GitHub https://github.com/automl/DeepCAVE.  ( 2 min )
    FedPop: A Bayesian Approach for Personalised Federated Learning. (arXiv:2206.03611v1 [cs.LG])
    Personalised federated learning (FL) aims at collaboratively learning a machine learning model tailored for each client. Although promising advances have been made in this direction, most existing approaches do not allow for uncertainty quantification, which is crucial in many applications. In addition, personalisation in the cross-device setting still involves important issues, especially for new clients or those having a small number of observations. This paper aims at filling these gaps. To this end, we propose a novel methodology coined FedPop by recasting personalised FL into the population modeling paradigm where clients' models involve fixed common population parameters and random effects, aiming at explaining data heterogeneity. To derive convergence guarantees for our scheme, we introduce a new class of federated stochastic optimisation algorithms which relies on Markov chain Monte Carlo methods. Compared to existing personalised FL methods, the proposed methodology has important benefits: it is robust to client drift, practical for inference on new clients, and above all, enables uncertainty quantification under mild computational and memory overheads. We provide non-asymptotic convergence guarantees for the proposed algorithms and illustrate their performances on various personalised federated learning tasks.  ( 2 min )
    Towards Practical Differential Privacy in Data Analysis: Understanding the Effect of Epsilon on Utility in Private ERM. (arXiv:2206.03488v1 [cs.CR])
    In this paper, we focus our attention on private Empirical Risk Minimization (ERM), which is one of the most commonly used data analysis methods. We take the first step towards solving the above problem by theoretically exploring the effect of epsilon (the parameter of differential privacy that determines the strength of privacy guarantee) on utility of the learning model. We trace the change of utility with modification of epsilon and reveal an established relationship between epsilon and utility. We then formalize this relationship and propose a practical approach for estimating the utility under an arbitrary value of epsilon. Both theoretical analysis and experimental results demonstrate high estimation accuracy and broad applicability of our approach in practical applications. As providing algorithms with strong utility guarantees that also give privacy when possible becomes more and more accepted, our approach would have high practical value and is likely to be adopted by companies and organizations that would like to preserve privacy but are unwilling to compromise on utility.  ( 2 min )
  • Open

    Error Rates for Kernel Classification under Source and Capacity Conditions. (arXiv:2201.12655v2 [stat.ML] UPDATED)
    We consider the problem of kernel classification. Works on kernel regression have shown that the rate of decay of the prediction error with the number of samples for a large class of data-sets is well characterized by two quantities: the capacity and source of the data-set. In this work, we compute the decay rates for the misclassification (prediction) error under the Gaussian design, for data-sets satisfying source and capacity assumptions. We derive the rates as a function of the source and capacity coefficients for two standard kernel classification settings, namely margin-maximizing Support Vector Machines (SVM) and ridge classification, and contrast the two methods. As a consequence, we find that the known worst-case rates are loose for this class of data-sets. Finally, we show that the rates presented in this work are also observed on real data-sets.  ( 2 min )
    Neural Bandit with Arm Group Graph. (arXiv:2206.03644v1 [cs.LG])
    Contextual bandits aim to identify among a set of arms the optimal one with the highest reward based on their contextual information. Motivated by the fact that arms usually exhibit group behaviors and that groups influence one another, we introduce a new model, Arm Group Graph (AGG), where the nodes represent the groups of arms and the weighted edges formulate the correlations among groups. To leverage the rich information in AGG, we propose a bandit algorithm, AGG-UCB, in which neural networks are designed to estimate rewards and graph neural networks (GNN) learn the representations of arm groups with correlations. To solve the exploitation-exploration dilemma in bandits, we derive a new upper confidence bound (UCB) built on neural networks (exploitation) for exploration. Furthermore, we prove that AGG-UCB can achieve a near-optimal regret bound with over-parameterized neural networks, and provide the convergence analysis of GNN with fully-connected layers which may be of independent interest. In the end, we conduct extensive experiments against state-of-the-art baselines on multiple public data sets, showing the effectiveness of the proposed algorithm.  ( 2 min )
    Causal inference for observational longitudinal studies using deep survival models. (arXiv:2101.10643v12 [stat.ML] UPDATED)
    Causal inference for observational longitudinal studies often requires the accurate estimation of treatment effects on time-to-event outcomes in the presence of time-dependent patient history and time-dependent covariates. To tackle this longitudinal treatment effect estimation problem, we have developed a time-variant causal survival (TCS) model that uses the potential outcomes framework with an ensemble of recurrent subnetworks to estimate the difference in survival probabilities and its confidence interval over time as a function of time-dependent covariates and treatments. Using simulated survival datasets, the TCS model showed good causal effect estimation performance across scenarios of varying sample dimensions, event rates, confounding and overlapping. However, increasing the sample size was not effective in alleviating the adverse impact of a high level of confounding. In a large clinical cohort study, TCS identified the expected conditional average treatment effect and detected individual treatment effect heterogeneity over time. TCS provides an efficient way to estimate and update individualized treatment effects over time, in order to improve clinical decisions. The use of a propensity score layer and potential outcome subnetworks helps correct for selection bias. However, the proposed model is limited in its ability to correct the bias from unmeasured confounding, and more extensive testing of TCS under extreme scenarios such as low overlapping and the presence of unmeasured confounders is desired and left for future work.
    Resolving the Human Subjects Status of Machine Learning's Crowdworkers. (arXiv:2206.04039v1 [cs.CY])
    In recent years, machine learning (ML) has come to rely more heavily on crowdworkers, both for building bigger datasets and for addressing research questions requiring human interaction or judgment. Owing to the diverse tasks performed by crowdworkers, and the myriad ways the resulting datasets are used, it can be difficult to determine when these individuals are best thought of as workers, versus as human subjects. These difficulties are compounded by conflicting policies, with some institutions and researchers treating all ML crowdwork as human subjects research, and other institutions holding that ML crowdworkers rarely constitute human subjects. Additionally, few ML papers involving crowdwork mention IRB oversight, raising the prospect that many might not be in compliance with ethical and regulatory requirements. In this paper, we focus on research in natural language processing to investigate the appropriate designation of crowdsourcing studies and the unique challenges that ML research poses for research oversight. Crucially, under the U.S. Common Rule, these judgments hinge on determinations of "aboutness", both whom (or what) the collected data is about and whom (or what) the analysis is about. We highlight two challenges posed by ML: (1) the same set of workers can serve multiple roles and provide many sorts of information; and (2) compared to the life sciences and social sciences, ML research tends to embrace a dynamic workflow, where research questions are seldom stated ex ante and data sharing opens the door for future studies to ask questions about different targets from the original study. In particular, our analysis exposes a potential loophole in the Common Rule, where researchers can elude research ethics oversight by splitting data collection and analysis into distinct studies. We offer several policy recommendations to address these concerns.
    Fairness-Aware PAC Learning from Corrupted Data. (arXiv:2102.06004v3 [cs.LG] UPDATED)
    Addressing fairness concerns about machine learning models is a crucial step towards their long-term adoption in real-world automated systems. While many approaches have been developed for training fair models from data, little is known about the robustness of these methods to data corruption. In this work we consider fairness-aware learning under worst-case data manipulations. We show that an adversary can in some situations force any learner to return an overly biased classifier, regardless of the sample size and with or without degrading accuracy, and that the strength of the excess bias increases for learning problems with underrepresented protected groups in the data. We also prove that our hardness results are tight up to constant factors. To this end, we study two natural learning algorithms that optimize for both accuracy and fairness and show that these algorithms enjoy guarantees that are order-optimal in terms of the corruption ratio and the protected group frequencies in the large data limit.
    Estimation of Predictive Performance in High-Dimensional Data Settings using Learning Curves. (arXiv:2206.03825v1 [stat.ME])
    In high-dimensional prediction settings, it remains challenging to reliably estimate the test performance. To address this challenge, a novel performance estimation framework is presented. This framework, called Learn2Evaluate, is based on learning curves by fitting a smooth monotone curve depicting test performance as a function of the sample size. Learn2Evaluate has several advantages compared to commonly applied performance estimation methodologies. Firstly, a learning curve offers a graphical overview of a learner. This overview assists in assessing the potential benefit of adding training samples and it provides a more complete comparison between learners than performance estimates at a fixed subsample size. Secondly, a learning curve facilitates estimating the performance at the total sample size rather than a subsample size. Thirdly, Learn2Evaluate allows the computation of a theoretically justified and useful lower confidence bound. Furthermore, this bound may be tightened by performing a bias correction. The benefits of Learn2Evaluate are illustrated by a simulation study and applications to omics data.
    Decoupled Self-supervised Learning for Non-Homophilous Graphs. (arXiv:2206.03601v1 [cs.LG])
    In this paper, we study the problem of conducting self-supervised learning for node representation learning on non-homophilous graphs. Existing self-supervised learning methods typically assume the graph is homophilous where linked nodes often belong to the same class or have similar features. However, such assumptions of homophily do not always hold true in real-world graphs. We address this problem by developing a decoupled self-supervised learning (DSSL) framework for graph neural networks. DSSL imitates a generative process of nodes and links from latent variable modeling of the semantic structure, which decouples different underlying semantics between different neighborhoods into the self-supervised node learning process. Our DSSL framework is agnostic to the encoders and does not need prefabricated augmentations, and is thus flexible across different graphs. To effectively optimize the framework with latent variables, we derive the evidence lower-bound of the self-supervised objective and develop a scalable training algorithm with variational inference. We provide a theoretical analysis to justify that DSSL enjoys better downstream performance. Extensive experiments on various types of graph benchmarks demonstrate that our proposed framework achieves significantly better performance than competitive self-supervised learning baselines.
    Classification of Stochastic Processes with Topological Data Analysis. (arXiv:2206.03973v1 [stat.ML])
    In this study, we examine if engineered topological features can distinguish time series sampled from different stochastic processes with different noise characteristics, in both balanced and unbalanced sampling schemes. We compare our classification results against the results of the same classification tasks built on statistical and raw features. We conclude that in classification tasks of time series, different machine learning models built on engineered topological features perform consistently better than those built on standard statistical and raw features.
    Out-of-Distribution Detection with Class Ratio Estimation. (arXiv:2206.03955v1 [stat.ML])
    Density-based Out-of-distribution (OOD) detection has recently been shown to be unreliable for the task of detecting OOD images. Various density ratio based approaches achieve good empirical performance; however, these methods typically lack a principled probabilistic modelling explanation. In this work, we propose to unify density ratio based methods under a novel framework that builds energy-based models and employs differing base distributions. Under our framework, the density ratio can be viewed as the unnormalized density of an implicit semantic distribution. Further, we propose to directly estimate the density ratio of a data sample through class ratio estimation. We report competitive results on OOD image problems in comparison with recent work that alternatively requires training of deep generative models for the task. Our approach enables a simple and yet effective path towards solving the OOD detection problem.
    Improve Generalization and Robustness of Neural Networks via Weight Scale Shifting Invariant Regularizations. (arXiv:2008.02965v2 [cs.LG] UPDATED)
    Using weight decay to penalize the L2 norms of weights in neural networks has been a standard training practice to regularize the complexity of networks. In this paper, we show that a family of regularizers, including weight decay, is ineffective at penalizing the intrinsic norms of weights for networks with positively homogeneous activation functions, such as linear, ReLU and max-pooling functions. As a result of homogeneity, functions specified by the networks are invariant to the shifting of weight scales between layers. The ineffective regularizers are sensitive to such shifting and thus poorly regularize the model capacity, leading to overfitting. To address this shortcoming, we propose an improved regularizer that is invariant to weight scale shifting and thus effectively constrains the intrinsic norm of a neural network. The derived regularizer is an upper bound for the input gradient of the network so minimizing the improved regularizer also benefits the adversarial robustness. Residual connections are also considered and we show that our regularizer also forms an upper bound to input gradients of such a residual network. We demonstrate the efficacy of our proposed regularizer on various datasets and neural network architectures at improving generalization and adversarial robustness.
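The scale-shifting invariance described in the abstract above can be checked numerically. The following is a minimal sketch, assuming a bias-free two-layer ReLU network with hypothetical weights `W1`, `W2`; it is not the paper's regularizer. It shows that multiplying one layer by `c` and dividing the next by `c` leaves the output unchanged while the weight-decay penalty changes, and it uses the product of Frobenius norms only as one simple example of a quantity that is invariant to this shift.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(W1, W2, x):
    """Bias-free two-layer ReLU network: W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))

def sq_norm(W):
    """Squared Frobenius norm, i.e. the per-layer weight-decay penalty."""
    return sum(w * w for row in W for w in row)

def scale(W, c):
    return [[c * w for w in row] for row in W]

# Hypothetical weights and input, chosen only for illustration.
W1 = [[1.0, -2.0], [0.5, 3.0]]
W2 = [[2.0, -1.0]]
x = [1.0, 1.0]

# Shift scale between layers: W1 -> c * W1, W2 -> W2 / c (valid for c > 0).
c = 10.0
W1_s, W2_s = scale(W1, c), scale(W2, 1.0 / c)

out = forward(W1, W2, x)
out_s = forward(W1_s, W2_s, x)

# The weight-decay penalty is sensitive to the shift ...
l2 = sq_norm(W1) + sq_norm(W2)
l2_s = sq_norm(W1_s) + sq_norm(W2_s)

# ... while the product of layer norms is not.
prod = math.sqrt(sq_norm(W1) * sq_norm(W2))
prod_s = math.sqrt(sq_norm(W1_s) * sq_norm(W2_s))

print(out, out_s, l2, l2_s, prod, prod_s)
```

The invariance relies on positive homogeneity, relu(c z) = c relu(z) for c > 0, so the same check would fail for activations such as tanh.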
    Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits. (arXiv:2206.03520v1 [stat.ML])
    We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is from a one-dimensional exponential family, which covers many common reward distributions including Bernoulli, Gaussian, Gamma, Exponential, etc. We propose a Thompson sampling algorithm, termed ExpTS, which uses a novel sampling distribution to avoid the under-estimation of the optimal arm. We provide a tight regret analysis for ExpTS, which simultaneously yields both the finite-time regret bound and the asymptotic regret bound. In particular, for a $K$-armed bandit with exponential family rewards, ExpTS over a horizon $T$ is sub-UCB (a strong criterion for the finite-time regret that is problem-dependent), minimax optimal up to a factor $\sqrt{\log K}$, and asymptotically optimal. Moreover, we propose ExpTS$^+$, by adding a greedy exploitation step in addition to the sampling distribution used in ExpTS, to avoid the over-estimation of sub-optimal arms. ExpTS$^+$ is an anytime bandit algorithm and achieves the minimax optimality and asymptotic optimality simultaneously for exponential family reward distributions. Our proof techniques are general and conceptually simple and can be easily applied to analyze standard Thompson sampling with specific reward distributions.
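For readers unfamiliar with the baseline, here is a minimal sketch of standard Thompson sampling for Bernoulli rewards with Beta(1, 1) priors. This is the textbook algorithm, not ExpTS or ExpTS$^+$, whose sampling distributions are not specified in the abstract. The arm means, horizon, and seed are hypothetical.

```python
import random

def thompson_bernoulli(means, horizon, seed=0):
    """Standard Beta-Bernoulli Thompson sampling; returns pull counts per arm."""
    rng = random.Random(seed)
    k = len(means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(horizon):
        # Draw one sample from each arm's Beta posterior.
        samples = [rng.betavariate(1 + successes[a], 1 + failures[a])
                   for a in range(k)]
        # Play the arm whose sampled mean is largest.
        arm = max(range(k), key=lambda a: samples[a])
        reward = 1 if rng.random() < means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

# Hypothetical three-armed problem; the last arm is optimal.
pulls = thompson_bernoulli([0.2, 0.5, 0.8], horizon=2000)
print(pulls)
```

Most pulls concentrate on the best arm as the posteriors sharpen; the regret analysis in the abstract concerns how quickly this happens for general one-dimensional exponential families.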
    Ensembles for Uncertainty Estimation: Benefits of Prior Functions and Bootstrapping. (arXiv:2206.03633v1 [cs.LG])
    In machine learning, an agent needs to estimate uncertainty to efficiently explore and adapt and to make effective decisions. A common approach to uncertainty estimation maintains an ensemble of models. In recent years, several approaches have been proposed for training ensembles, and conflicting views prevail with regards to the importance of various ingredients of these approaches. In this paper, we aim to address the benefits of two ingredients -- prior functions and bootstrapping -- which have come into question. We show that prior functions can significantly improve an ensemble agent's joint predictions across inputs and that bootstrapping affords additional benefits if the signal-to-noise ratio varies across inputs. Our claims are justified by both theoretical and experimental results.
    Probabilistically Robust Learning: Balancing Average- and Worst-case Performance. (arXiv:2202.01136v3 [cs.LG] UPDATED)
    Many of the successes of machine learning are based on minimizing an averaged loss function. However, it is well-known that this paradigm suffers from robustness issues that hinder its applicability in safety-critical domains. These issues are often addressed by training against worst-case perturbations of data, a technique known as adversarial training. Although empirically effective, adversarial training can be overly conservative, leading to unfavorable trade-offs between nominal performance and robustness. To this end, in this paper we propose a framework called probabilistic robustness that bridges the gap between the accurate, yet brittle average case and the robust, yet conservative worst case by enforcing robustness to most rather than to all perturbations. From a theoretical point of view, this framework overcomes the trade-offs between the performance and the sample-complexity of worst-case and average-case learning. From a practical point of view, we propose a novel algorithm based on risk-aware optimization that effectively balances average- and worst-case performance at a considerably lower computational cost relative to adversarial training. Our results on MNIST, CIFAR-10, and SVHN illustrate the advantages of this framework on the spectrum from average- to worst-case robustness.
    Using Mixed-Effect Models to Learn Bayesian Networks from Related Data Sets. (arXiv:2206.03743v1 [stat.ML])
    We commonly assume that data are a homogeneous set of observations when learning the structure of Bayesian networks. However, they often comprise different data sets that are related but not homogeneous because they have been collected in different ways or from different populations. In our previous work (Azzimonti, Corani and Scutari, 2021), we proposed a closed-form Bayesian Hierarchical Dirichlet score for discrete data that pools information across related data sets to learn a single encompassing network structure, while taking into account the differences in their probabilistic structures. In this paper, we provide an analogous solution for learning a Bayesian network from continuous data using mixed-effects models to pool information across the related data sets. We study its structural, parametric, predictive and classification accuracy and we show that it outperforms both conditional Gaussian Bayesian networks (that do not perform any pooling) and classical Gaussian Bayesian networks (that disregard the heterogeneous nature of the data). The improvement is marked for low sample sizes and for unbalanced data sets.
    Neural Diffusion Processes. (arXiv:2206.03992v1 [stat.ML])
    Gaussian processes provide an elegant framework for specifying prior and posterior distributions over functions. They are, however, also computationally expensive, and limited by the expressivity of their covariance function. We propose Neural Diffusion Processes (NDPs), a novel approach based upon diffusion models, that learn to sample from distributions over functions. Using a novel attention block, we can incorporate properties of stochastic processes, such as exchangeability, directly into the NDP's architecture. We empirically show that NDPs are able to capture functional distributions that are close to the true Bayesian posterior of a Gaussian process. This enables a variety of downstream tasks, including hyperparameter marginalisation and Bayesian optimisation.
    Inverse Contextual Bandits: Learning How Behavior Evolves over Time. (arXiv:2107.06317v3 [cs.LG] UPDATED)
    Understanding a decision-maker's priorities by observing their behavior is critical for transparency and accountability in decision processes, such as in healthcare. Though conventional approaches to policy learning almost invariably assume stationarity in behavior, this is hardly true in practice: Medical practice is constantly evolving as clinical professionals fine-tune their knowledge over time. For instance, as the medical community's understanding of organ transplantations has progressed over the years, a pertinent question is: How have actual organ allocation policies been evolving? To give an answer, we desire a policy learning method that provides interpretable representations of decision-making, in particular capturing an agent's non-stationary knowledge of the world, as well as operating in an offline manner. First, we model the evolving behavior of decision-makers in terms of contextual bandits, and formalize the problem of Inverse Contextual Bandits (ICB). Second, we propose two concrete algorithms as solutions, learning parametric and nonparametric representations of an agent's behavior. Finally, using both real and simulated data for liver transplantations, we illustrate the applicability and explainability of our method, as well as benchmarking and validating its accuracy.
    $p$-Sparsified Sketches for Fast Multiple Output Kernel Methods. (arXiv:2206.03827v1 [stat.ML])
    Kernel methods are learning algorithms that enjoy solid theoretical foundations while suffering from important computational limitations. Sketching, which consists of looking for solutions within a subspace of reduced dimension, is a widely studied approach to alleviate this numerical burden. However, fast sketching strategies, such as non-adaptive subsampling, significantly degrade the guarantees of the algorithms, while theoretically-accurate sketches, such as the Gaussian one, remain relatively slow in practice. In this paper, we introduce the $p$-sparsified sketches, which combine the benefits of both approaches to achieve a good tradeoff between statistical accuracy and computational efficiency. To support our method, we derive excess risk bounds for both single and multiple output problems, with generic Lipschitz losses, providing new guarantees for a wide range of applications, from robust regression to multiple quantile regression. We also provide empirical evidence of the superiority of our sketches over recent SOTA approaches.
    An Information-Theoretic Framework for Supervised Learning. (arXiv:2203.00246v5 [cs.LG] UPDATED)
    Each year, deep learning demonstrates new and improved empirical results with deeper and wider neural networks. Meanwhile, with existing theoretical frameworks, it is difficult to analyze networks deeper than two layers without resorting to counting parameters or encountering sample complexity bounds that are exponential in depth. Perhaps it may be fruitful to try to analyze modern machine learning under a different lens. In this paper, we propose a novel information-theoretic framework with its own notions of regret and sample complexity for analyzing the data requirements of machine learning. With our framework, we first work through some classical examples such as scalar estimation and linear regression to build intuition and introduce general techniques. Then, we use the framework to study the sample complexity of learning from data generated by deep sign neural networks, deep ReLU neural networks, and deep networks that are infinitely wide but have a bounded sum of weights. For sign neural networks, we recover sample-complexity bounds that follow from VC-dimension based arguments. For the latter two neural network environments, we establish new results that suggest that the sample complexity of learning under these data generating processes is at most linear and quadratic, respectively, in network depth.
    Modeling Disagreement in Automatic Data Labelling for Semi-Supervised Learning in Clinical Natural Language Processing. (arXiv:2205.14761v2 [cs.LG] UPDATED)
    Computational models providing accurate estimates of their uncertainty are crucial for risk management associated with decision making in healthcare contexts. This is especially true since many state-of-the-art systems are trained using the data which has been labelled automatically (self-supervised mode) and tend to overfit. In this work, we investigate the quality of uncertainty estimates from a range of current state-of-the-art predictive models applied to the problem of observation detection in radiology reports. This problem remains understudied for Natural Language Processing in the healthcare domain. We demonstrate that Gaussian Processes (GPs) provide superior performance in quantifying the risks of 3 uncertainty labels based on the negative log predictive probability (NLPP) evaluation metric and mean maximum predicted confidence levels (MMPCL), whilst retaining strong predictive performance.
    Boosting the Confidence of Generalization for $L_2$-Stable Randomized Learning Algorithms. (arXiv:2206.03834v1 [stat.ML])
    Exponential generalization bounds with near-tight rates have recently been established for uniformly stable learning algorithms. The notion of uniform stability, however, is stringent in the sense that it is invariant to the data-generating distribution. Under the weaker and distribution dependent notions of stability such as hypothesis stability and $L_2$-stability, the literature suggests that only polynomial generalization bounds are possible in general cases. The present paper addresses this long standing tension between these two regimes of results and makes progress towards relaxing it inside a classic framework of confidence-boosting. To this end, we first establish an in-expectation first moment generalization error bound for potentially randomized learning algorithms with $L_2$-stability, based on which we then show that a properly designed subbagging process leads to near-tight exponential generalization bounds over the randomness of both data and algorithm. We further substantialize these generic results to stochastic gradient descent (SGD) to derive improved high-probability generalization bounds for convex or non-convex optimization problems with natural time decaying learning rates, which have not been possible to prove with the existing hypothesis stability or uniform stability based results.
    Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks. (arXiv:2206.03826v1 [cs.LG])
    For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches randomly mask input patches and then reconstruct pixels or semantic features of these masked patches via an auto-encoder. Then for a downstream task, supervised fine-tuning of the pretrained encoder remarkably surpasses conventional supervised learning (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic learning in the pretraining phase and 2) why it helps in downstream tasks. To solve these problems, we theoretically show that on an auto-encoder of a two/one-layered convolution encoder/decoder, MRP can capture all discriminative semantics in the pretraining dataset, and accordingly show its provable improvement over SL on the classification downstream task. Specifically, we assume that the pretraining dataset contains multi-view samples of ratio $1-\mu$ and single-view samples of ratio $\mu$, where multi/single-view samples have multiple/single discriminative semantics. Then for pretraining, we prove that 1) the convolution kernels of the MRP encoder capture all discriminative semantics in the pretraining data; and 2) a convolution kernel captures at most one semantic. Accordingly, in the downstream supervised fine-tuning, most semantics would be captured and different semantics would not be fused together. This helps the downstream fine-tuned network to easily establish the relation between kernels and semantic class labels. In this way, the fine-tuned encoder in MRP provably achieves zero test error with high probability for both multi-view and single-view test data. In contrast, as proved by [3], conventional SL can only obtain a test accuracy of around $0.5\mu$ for single-view test data. These results together explain the benefits of MRP in downstream tasks. Experimental results testify to multi-view data assumptions and our theoretical implications.
    How unfair is private learning? (arXiv:2206.03985v1 [cs.LG])
    As machine learning algorithms are deployed on sensitive data in critical decision making processes, it is becoming increasingly important that they are also private and fair. In this paper, we show that, when the data has a long-tailed structure, it is not possible to build accurate learning algorithms that are both private and achieve higher accuracy on minority subpopulations. We further show that relaxing overall accuracy can lead to good fairness even with strict privacy requirements. To corroborate our theoretical results in practice, we provide an extensive set of experimental results using a variety of synthetic, vision (CIFAR and CelebA), and tabular (Law School) datasets and learning algorithms.
    Between Stochastic and Adversarial Online Convex Optimization: Improved Regret Bounds via Smoothness. (arXiv:2202.07554v2 [cs.LG] UPDATED)
    Stochastic and adversarial data are two widely studied settings in online learning. But many optimization tasks are neither i.i.d. nor fully adversarial, which makes it of fundamental interest to get a better theoretical understanding of the world between these extremes. In this work we establish novel regret bounds for online convex optimization in a setting that interpolates between stochastic i.i.d. and fully adversarial losses. By exploiting smoothness of the expected losses, these bounds replace a dependence on the maximum gradient length by the variance of the gradients, which was previously known only for linear losses. In addition, they weaken the i.i.d. assumption by allowing, for example, adversarially poisoned rounds, which were previously considered in the expert and bandit setting. Our results extend this to the online convex optimization framework. In the fully i.i.d. case, our bounds match the rates one would expect from results in stochastic acceleration, and in the fully adversarial case they gracefully deteriorate to match the minimax regret. We further provide lower bounds showing that our regret upper bounds are tight for all intermediate regimes in terms of the stochastic variance and the adversarial variation of the loss gradients.
    Predicting Census Survey Response Rates via Interpretable Nonparametric Additive Models with Structured Interactions. (arXiv:2108.11328v2 [stat.ML] UPDATED)
    Accurate and interpretable prediction of survey response rates is important from an operational standpoint. The US Census Bureau's well-known ROAM application uses principled statistical models trained on the US Census Planning Database data to identify hard-to-survey areas. An earlier crowdsourcing competition revealed that an ensemble of regression trees led to the best performance in predicting survey response rates; however, the corresponding models could not be adopted for the intended application due to limited interpretability. In this paper, we present new interpretable statistical methods to predict, with high accuracy, response rates in surveys. We study sparse nonparametric additive models with pairwise interactions via $\ell_0$-regularization, as well as hierarchically structured variants that provide enhanced interpretability. Despite strong methodological underpinnings, such models can be computationally challenging -- we present new scalable algorithms for learning these models. We also establish novel non-asymptotic error bounds for the proposed estimators. Experiments based on the US Census Planning Database demonstrate that our methods lead to high-quality predictive models that permit actionable interpretability for different segments of the population. Interestingly, our methods provide significant gains in interpretability without losing in predictive performance to state-of-the-art black-box machine learning methods based on gradient boosting and feedforward neural networks. Our code implementation in python is available at https://github.com/ShibalIbrahim/Additive-Models-with-Structured-Interactions.
    Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials. (arXiv:2206.03688v1 [cs.LG])
    A recent goal in the theory of deep learning is to identify how neural networks can escape the "lazy training," or Neural Tangent Kernel (NTK) regime, where the network is coupled with its first order Taylor expansion at initialization. While the NTK is minimax optimal for learning dense polynomials (Ghorbani et al, 2021), it cannot learn features, and hence has poor sample complexity for learning many classes of functions including sparse polynomials. Recent works have thus aimed to identify settings where gradient based algorithms provably generalize better than the NTK. One such example is the "QuadNTK" approach of Bai and Lee (2020), which analyzes the second-order term in the Taylor expansion. Bai and Lee (2020) show that the second-order term can learn sparse polynomials efficiently; however, it sacrifices the ability to learn general dense polynomials. In this paper, we analyze how gradient descent on a two-layer neural network can escape the NTK regime by utilizing a spectral characterization of the NTK (Montanari and Zhong, 2020) and building on the QuadNTK approach. We first expand upon the spectral analysis to identify "good" directions in parameter space in which we can move without harming generalization. Next, we show that a wide two-layer neural network can jointly use the NTK and QuadNTK to fit target functions consisting of a dense low-degree term and a sparse high-degree term -- something neither the NTK nor the QuadNTK can do on their own. Finally, we construct a regularizer which encourages our parameter vector to move in the "good" directions, and show that gradient descent on the regularized loss will converge to a global minimizer, which also has low test error. This yields an end to end convergence and generalization guarantee with provable sample complexity improvement over both the NTK and QuadNTK on their own.
    Federated Learning Algorithms for Generalized Mixed-effects Model (GLMM) on Horizontally Partitioned Data from Distributed Sources. (arXiv:2109.14046v2 [stat.ML] UPDATED)
    Objectives: This paper develops two algorithms for federated generalized linear mixed-effects models (GLMM) and compares their outcomes with each other, as well as with those from the standard R package (`lme4'). Methods: The log-likelihood function of GLMM is approximated by two numerical methods (Laplace approximation and Gaussian-Hermite approximation), which support federated decomposition of GLMM to bring computation to data. Results: Our developed method can handle GLMM to accommodate hierarchical data with multiple non-independent levels of observations in a federated setting. The experiment results demonstrate comparable (Laplace) and superior (Gaussian-Hermite) performances with simulated and real-world data. Conclusion: We developed and compared federated GLMMs with different approximations, which can support researchers in analyzing biomedical data to accommodate mixed effects and address non-independence due to hierarchical structures (e.g., institutes, regions, countries).
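To make the Gauss-Hermite idea concrete, the sketch below approximates the marginal likelihood of one cluster in a random-intercept logistic model by integrating out the Gaussian random effect with a 3-point Gauss-Hermite rule. This is an illustrative single-site computation under assumed names (`gaussian_expectation`, `cluster_marginal_likelihood`), not the paper's federated algorithm; in practice more quadrature points would be used.

```python
import math

# 3-point Gauss-Hermite rule for integrals of exp(-x^2) * f(x).
GH_NODES = [-math.sqrt(1.5), 0.0, math.sqrt(1.5)]
GH_WEIGHTS = [math.sqrt(math.pi) / 6,
              2 * math.sqrt(math.pi) / 3,
              math.sqrt(math.pi) / 6]

def gaussian_expectation(g, sigma):
    """Approximate E[g(b)] for b ~ N(0, sigma^2).

    Uses the change of variables b = sqrt(2) * sigma * x, which turns the
    Gaussian density into the exp(-x^2) weight of the Gauss-Hermite rule.
    """
    total = sum(w * g(math.sqrt(2) * sigma * x)
                for x, w in zip(GH_NODES, GH_WEIGHTS))
    return total / math.sqrt(math.pi)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cluster_marginal_likelihood(ys, etas, sigma):
    """Marginal likelihood of one cluster's binary outcomes.

    ys: 0/1 outcomes; etas: fixed-effect linear predictors (hypothetical
    inputs); sigma: standard deviation of the random intercept.
    """
    def conditional(b):
        p = 1.0
        for y, eta in zip(ys, etas):
            q = sigmoid(eta + b)
            p *= q if y == 1 else 1.0 - q
        return p
    return gaussian_expectation(conditional, sigma)

# Sanity check: the rule is exact for low-degree polynomials, so E[b^2] = sigma^2.
second_moment = gaussian_expectation(lambda b: b * b, 2.0)
lik = cluster_marginal_likelihood([1, 0, 1], [0.3, -0.2, 1.1], sigma=1.0)
print(second_moment, lik)
```

Because the approximated likelihood is a sum over clusters of such terms, each site can evaluate its own clusters locally, which is presumably what makes this kind of approximation amenable to federated decomposition.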
    Predictions of Electromotive Force of Magnetic Shape Memory Alloy (MSMA) Using Constitutive Model and Generalized Regression Neural Network. (arXiv:2206.03701v1 [cond-mat.mtrl-sci])
    Ferromagnetic shape memory alloys (MSMAs), such as Ni-Mn-Ga single crystals, can exhibit the shape memory effect due to an applied magnetic field at room temperature. Under a variable magnetic field and a constant bias stress loading, MSMAs have been used for actuation applications. This work introduces a new feature to the existing macroscale magneto-mechanical model for Ni-Mn-Ga single crystal. This model includes the fact that the magnetic easy axis in the two variants is not exactly perpendicular, as observed by D silva et al. This offset helps explain some of the power harvesting capabilities of MSMAs. Model predictions are compared to experimental data collected on a Ni-Mn-Ga single crystal. The experiments include both stress-controlled loading with constant bias magnetic field load (which mimics power harvesting or sensing) and field-controlled loading with constant bias compressive stress (which mimics actuation). Each type of test was performed at several different load levels, and the applied field was measured without the MSMA specimen present so that demagnetization does not affect the experimentally measured field, as suggested by Eberle et al. Results show decent agreement between model predictions and experimental data, although the model does not capture all the features of the experimental data. To capture these remaining features, a generalized regression neural network (GRNN) was trained on the experimental data (stress, strain, magnetic field, and emf) so that it can make better predictions.
    Inferring Lexicographically-Ordered Rewards from Preferences. (arXiv:2202.10153v2 [cs.LG] UPDATED)
    Modeling the preferences of agents over a set of alternatives is a principal concern in many areas. The dominant approach has been to find a single reward/utility function with the property that alternatives yielding higher rewards are preferred over alternatives yielding lower rewards. However, in many settings, preferences are based on multiple, often competing, objectives; a single reward function is not adequate to represent such preferences. This paper proposes a method for inferring multi-objective reward-based representations of an agent's observed preferences. We model the agent's priorities over different objectives as entering lexicographically, so that objectives with lower priorities matter only when the agent is indifferent with respect to objectives with higher priorities. We offer two example applications in healthcare, one inspired by cancer treatment, the other inspired by organ transplantation, to illustrate how the lexicographically-ordered rewards we learn can provide a better understanding of a decision-maker's preferences and help improve policies when used in reinforcement learning.
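The lexicographic ordering described above can be sketched as a simple comparison rule. This is a deterministic illustration with a hypothetical indifference tolerance `tol`; the paper's actual preference model and inference procedure are not given in the abstract and may well be probabilistic.

```python
def lex_prefers(rewards_a, rewards_b, tol=1e-9):
    """Compare two alternatives under lexicographically ordered objectives.

    rewards_a, rewards_b: per-objective rewards, highest priority first.
    Returns 1 if a is preferred, -1 if b is preferred, 0 if indifferent.
    A lower-priority objective is consulted only when all higher-priority
    objectives differ by at most `tol` (the agent is indifferent on them).
    """
    for ra, rb in zip(rewards_a, rewards_b):
        if abs(ra - rb) > tol:
            return 1 if ra > rb else -1
    return 0

# The first objective decides, however large the gap on the second.
first = lex_prefers((1.0, 0.0), (0.9, 5.0))
# Tie on the first objective, so the second breaks it.
second = lex_prefers((0.9, 0.1), (0.9, 0.7))
# Identical on all objectives.
third = lex_prefers((0.5, 0.5), (0.5, 0.5))
print(first, second, third)
```

No single scalar reward can reproduce the first comparison for every possible gap on the second objective, which is the inadequacy of single-reward models that the abstract points to.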
    A Primal-Dual Approach to Bilevel Optimization with Multiple Inner Minima. (arXiv:2203.01123v2 [math.OC] UPDATED)
    Bilevel optimization has found extensive applications in modern machine learning problems such as hyperparameter optimization, neural architecture search, meta-learning, etc. While bilevel problems with a unique inner minimal point (e.g., where the inner function is strongly convex) are well understood, problems with multiple inner minimal points remain challenging and open. Existing algorithms designed for such problems are applicable only to restricted situations and do not come with a full guarantee of convergence. In this paper, we adopt a reformulation of bilevel optimization to constrained optimization, and solve the problem via a primal-dual bilevel optimization (PDBO) algorithm. PDBO not only addresses the multiple inner minima challenge, but also features fully first-order efficiency without involving second-order Hessian and Jacobian computations, as opposed to most existing gradient-based bilevel algorithms. We further characterize the convergence rate of PDBO, which serves as the first known non-asymptotic convergence guarantee for bilevel optimization with multiple inner minima. Our experiments demonstrate the desired performance of the proposed approach.
    High-dimensional limit theorems for SGD: Effective dynamics and critical scaling. (arXiv:2206.04030v1 [stat.ML])
    We study the scaling limits of stochastic gradient descent (SGD) with constant step-size in the high-dimensional regime. We prove limit theorems for the trajectories of summary statistics (i.e., finite-dimensional functions) of SGD as the dimension goes to infinity. Our approach allows one to choose the summary statistics that are tracked, the initialization, and the step-size. It yields both ballistic (ODE) and diffusive (SDE) limits, with the limit depending dramatically on the former choices. Interestingly, we find a critical scaling regime for the step-size below which the effective ballistic dynamics matches gradient flow for the population loss, but at which, a new correction term appears which changes the phase diagram. About the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate. We demonstrate our approach on popular examples including estimation for spiked matrix and tensor models and classification via two-layer networks for binary and XOR-type Gaussian mixture models. These examples exhibit surprising phenomena including multimodal timescales to convergence as well as convergence to sub-optimal solutions with probability bounded away from zero from random (e.g., Gaussian) initializations.
    Model-Based Reinforcement Learning Is Minimax-Optimal for Offline Zero-Sum Markov Games. (arXiv:2206.04044v1 [cs.LG])
    This paper makes progress towards learning Nash equilibria in two-player zero-sum Markov games from offline data. Specifically, consider a $\gamma$-discounted infinite-horizon Markov game with $S$ states, where the max-player has $A$ actions and the min-player has $B$ actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds -- called VI-LCB-Game -- that provably finds an $\varepsilon$-approximate Nash equilibrium with a sample complexity no larger than $\frac{C_{\mathsf{clipped}}^{\star}S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}$ (up to some log factor). Here, $C_{\mathsf{clipped}}^{\star}$ is some unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-\`a-vis the target data), and the target accuracy $\varepsilon$ can be any value within $\big(0,\frac{1}{1-\gamma}\big]$. Our sample complexity bound strengthens prior art by a factor of $\min\{A,B\}$, achieving minimax optimality for the entire $\varepsilon$-range. An appealing feature of our result lies in algorithmic simplicity, which reveals the unnecessity of variance reduction and sample splitting in achieving sample optimality.
    FedPop: A Bayesian Approach for Personalised Federated Learning. (arXiv:2206.03611v1 [cs.LG])
    Personalised federated learning (FL) aims at collaboratively learning a machine learning model tailored for each client. Although promising advances have been made in this direction, most existing approaches do not allow for uncertainty quantification, which is crucial in many applications. In addition, personalisation in the cross-device setting still involves important issues, especially for new clients or those having a small number of observations. This paper aims at filling these gaps. To this end, we propose a novel methodology coined FedPop by recasting personalised FL into the population modeling paradigm where clients' models involve fixed common population parameters and random effects, aiming at explaining data heterogeneity. To derive convergence guarantees for our scheme, we introduce a new class of federated stochastic optimisation algorithms which relies on Markov chain Monte Carlo methods. Compared to existing personalised FL methods, the proposed methodology has important benefits: it is robust to client drift, practical for inference on new clients, and above all, enables uncertainty quantification under mild computational and memory overheads. We provide non-asymptotic convergence guarantees for the proposed algorithms and illustrate their performances on various personalised federated learning tasks.
    Attribution of Predictive Uncertainties in Classification Models. (arXiv:2107.08756v3 [cs.LG] UPDATED)
    Predictive uncertainties in classification tasks are often a consequence of model inadequacy or insufficient training data. In popular applications, such as image processing, we are often required to scrutinise these uncertainties by meaningfully attributing them to input features. This helps to improve interpretability assessments. However, there exist few effective frameworks for this purpose. Vanilla forms of popular methods for the provision of saliency masks, such as SHAP or integrated gradients, adapt poorly to target measures of uncertainty. Thus, state-of-the-art tools instead proceed by creating counterfactual or adversarial feature vectors, and assign attributions by direct comparison to original images. In this paper, we present a novel framework that combines path integrals, counterfactual explanations and generative models, in order to procure attributions that contain few observable artefacts or noise. We evidence that this outperforms existing alternatives through quantitative evaluations with popular benchmarking methods and data sets of varying complexity.
    Decentralized Online Regularized Learning Over Random Time-Varying Graphs. (arXiv:2206.03861v1 [cs.LG])
    We study the decentralized online regularized linear regression algorithm over random time-varying graphs. At each time step, every node runs an online estimation algorithm consisting of an innovation term processing its own new measurement, a consensus term taking a weighted sum of estimations of its own and its neighbors with additive and multiplicative communication noises, and a regularization term preventing over-fitting. It is not required that the regression matrices and graphs satisfy special statistical assumptions such as mutual independence, spatio-temporal independence or stationarity. We develop the nonnegative supermartingale inequality of the estimation error, and prove that the estimations of all nodes converge to the unknown true parameter vector almost surely if the algorithm gains, graphs and regression matrices jointly satisfy the sample path spatio-temporal persistence of excitation condition. In particular, this condition holds by choosing appropriate algorithm gains if the graphs are uniformly conditionally jointly connected and conditionally balanced, and the regression models of all nodes are uniformly conditionally spatio-temporally jointly observable, under which the algorithm converges in mean square and almost surely. In addition, we prove a regret upper bound of $\mathcal O(T^{1-\tau}\ln T)$, where $\tau\in (0.5,1)$ is a constant depending on the algorithm gains.
    Asymptotic Stability in Reservoir Computing. (arXiv:2206.03854v1 [cs.NE])
    Reservoir Computing is a class of Recurrent Neural Networks with internal weights fixed at random. Stability relates to the sensitivity of the network state to perturbations. It is an important property in Reservoir Computing as it directly impacts performance. In practice, it is desirable to stay in a stable regime, where the effect of perturbations does not explode exponentially, but also close to the chaotic frontier where reservoir dynamics are rich. Open questions remain today regarding input regularization and discontinuous activation functions. In this work, we use the recurrent kernel limit to draw new insights on stability in reservoir computing. This limit corresponds to large reservoir sizes, and it already becomes relevant for reservoirs with a few hundred neurons. We obtain a quantitative characterization of the frontier between stability and chaos, which can greatly benefit hyperparameter tuning. In a broader sense, our results contribute to understanding the complex dynamics of Recurrent Neural Networks.
    On gradient descent training under data augmentation with on-line noisy copies. (arXiv:2206.03734v1 [stat.ML])
    In machine learning, data augmentation (DA) is a technique for improving generalization performance. In this paper, we mainly consider gradient descent for linear regression under DA using noisy copies of datasets, in which noise is injected into the inputs. We analyze the situation where random noisy copies are newly generated and used at each epoch; i.e., the case of using on-line noisy copies. It can therefore be viewed as an analysis of a method that injects noise into the training process in a DA manner; i.e., an on-line version of DA. We derive the averaged behavior of the training process under three situations: full-batch training under the sum of squared errors, and full-batch and mini-batch training under the mean squared error. We show that, in all cases, training for DA with on-line copies is approximately equivalent to ridge regression training whose regularization parameter corresponds to the variance of the injected noise. On the other hand, we show that the learning rate is multiplied by the number of noisy copies plus one in full-batch training under the sum of squared errors and in mini-batch training under the mean squared error; i.e., DA with on-line copies yields an apparent acceleration of training. The apparent acceleration and the regularization effect come from the original part and the noise in the copied data, respectively. These results are confirmed in a numerical experiment. In the numerical experiment, we found that our result can be approximately applied to the usual off-line DA in the under-parameterized scenario but not in the over-parameterized scenario. Moreover, we experimentally investigated the training process of neural networks under DA with off-line noisy copies and found that our analysis of linear regression can be applied to neural networks.
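    The ridge-regression equivalence stated in the abstract can be checked numerically in the simplest case. The sketch below is not the paper's setup: it assumes a scalar weight `w`, a sum-of-squared-errors loss, and zero-mean Gaussian noise with standard deviation `sigma` added to the inputs only. Under those assumptions the expected gradient equals the ridge gradient with penalty `n * sigma**2`. All function names are illustrative.

```python
import random

def sse_grad(w, xs, ys):
    """Gradient of 0.5 * sum((x*w - y)^2) with respect to the scalar weight w."""
    return sum(x * (x * w - y) for x, y in zip(xs, ys))

def ridge_grad(w, xs, ys, lam):
    """Gradient of the same loss plus the ridge penalty 0.5 * lam * w^2."""
    return sse_grad(w, xs, ys) + lam * w

def mean_noisy_grad(w, xs, ys, sigma, draws, rng):
    """Average gradient over freshly drawn noisy copies of the inputs."""
    total = 0.0
    for _ in range(draws):
        noisy_xs = [x + rng.gauss(0.0, sigma) for x in xs]
        total += sse_grad(w, noisy_xs, ys)
    return total / draws

if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.5]
    sigma = 0.5
    rng = random.Random(0)
    print(mean_noisy_grad(0.5, xs, ys, sigma, 20000, rng))
    print(ridge_grad(0.5, xs, ys, len(xs) * sigma ** 2))
```

    The two printed values should agree up to Monte Carlo error. The sketch covers only the regularization effect, not the learning-rate scaling that the abstract also describes.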
    Structure-Aware Transformer for Graph Representation Learning. (arXiv:2202.03036v2 [stat.ML] UPDATED)
    The Transformer architecture has gained growing attention in graph representation learning recently, as it naturally overcomes several limitations of graph neural networks (GNNs) by avoiding their strict structural inductive biases and instead only encoding the graph structure via positional encoding. Here, we show that the node representations generated by the Transformer with positional encoding do not necessarily capture structural similarity between them. To address this issue, we propose the Structure-Aware Transformer, a class of simple and flexible graph Transformers built upon a new self-attention mechanism. This new self-attention incorporates structural information into the original self-attention by extracting a subgraph representation rooted at each node before computing the attention. We propose several methods for automatically generating the subgraph representation and show theoretically that the resulting representations are at least as expressive as the subgraph representations. Empirically, our method achieves state-of-the-art performance on five graph prediction benchmarks. Our structure-aware framework can leverage any existing GNN to extract the subgraph representation, and we show that it systematically improves performance relative to the base GNN model, successfully combining the advantages of GNNs and Transformers. Our code is available at https://github.com/BorgwardtLab/SAT .
    Learning Interpretable Decision Rule Sets: A Submodular Optimization Approach. (arXiv:2206.03718v1 [cs.LG])
    Rule sets are highly interpretable logical models in which the predicates for decision are expressed in disjunctive normal form (DNF, OR-of-ANDs), or, equivalently, the overall model comprises an unordered collection of if-then decision rules. In this paper, we consider a submodular optimization based approach for learning rule sets. The learning problem is framed as a subset selection task in which a subset of all possible rules needs to be selected to form an accurate and interpretable rule set. We employ an objective function that exhibits submodularity and thus is amenable to submodular optimization techniques. To overcome the difficulty arising from dealing with the exponential-sized ground set of rules, the subproblem of searching for a rule is cast as another subset selection task that asks for a subset of features. We show it is possible to write the induced objective function for the subproblem as a difference of two submodular (DS) functions to make it approximately solvable by DS optimization algorithms. Overall, the proposed approach is simple, scalable, and likely to benefit from further research on submodular optimization. Experiments on real datasets demonstrate the effectiveness of our method.
    Logistic Regression Through the Veil of Imprecise Data. (arXiv:2106.00492v2 [stat.ME] UPDATED)
    Logistic regression is an important statistical tool for assessing the probability of an outcome based upon some predictive variables. Standard methods can only deal with precisely known data; however, many datasets have uncertainties which traditional methods either reduce to a single point or completely disregard. In this paper we show that it is possible to include these uncertainties by considering an imprecise logistic regression model using the set of possible models that can be obtained from values from within the intervals. This has the advantage of clearly expressing the epistemic uncertainty removed by traditional methods.
    An Analysis of Selection Bias Issue for Online Advertising. (arXiv:2206.03853v1 [cs.IR])
    In online advertising, a set of potential advertisements can be ranked by a certain auction system where usually the top-1 advertisement would be selected and displayed at an advertising space. In this paper, we show a selection bias issue that is present in an auction system. We show that the selection bias destroys the truthfulness of the auction, which implies that the buyers (advertisers) in the auction cannot maximize their profits. Although selection bias is well known in the field of statistics and there are a lot of studies on it, our main contribution is to combine the theoretical analysis of the bias with the auction mechanism. In our experiment using online A/B testing, we evaluate the selection bias on an auction system whose ranking score is a function of the predicted CTR (click-through rate) of an advertisement. The experiment showed that the selection bias is drastically reduced by using multi-task learning, which learns from the data for all advertisements.
    Data fission: splitting a single data point. (arXiv:2112.11079v3 [stat.ME] UPDATED)
    Suppose we observe a random vector $X$ from some distribution $P$ in a known family with unknown parameters. We ask the following question: when is it possible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part is sufficient to reconstruct $X$ by itself, but both together can recover $X$ fully, and the joint distribution of $(f(X),g(X))$ is tractable? As one example, if $X=(X_1,\dots,X_n)$ and $P$ is a product distribution, then for any $m<n$, we can split the sample to define $f(X)=(X_1,\dots,X_m)$ and $g(X)=(X_{m+1},\dots,X_n)$. Rasines and Young (2021) offer an alternative route to accomplishing this task through randomization of $X$ with additive Gaussian noise, which enables post-selection inference in finite samples for Gaussian distributed data and asymptotically for non-Gaussian additive models. In this paper, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on a few prototypical applications, such as post-selection inference for trend filtering and other regression problems.
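    The splitting question in the abstract can be illustrated with a single Gaussian observation. This is a minimal sketch under assumptions, not necessarily the paper's construction: it takes $X \sim N(\mu, \sigma^2)$ with known `sigma`, draws independent noise $Z \sim N(0, \sigma^2)$, and sets $f(X) = X + Z$ and $g(X) = X - Z$. Neither part determines $X$ alone, the pair recovers it exactly, and the two parts are uncorrelated because $\mathrm{Var}(X) - \mathrm{Var}(Z) = 0$. The function names are illustrative.

```python
import random

def fission_gaussian(x, sigma, rng):
    """Split one Gaussian observation x into two noisy parts (f, g)."""
    z = rng.gauss(0.0, sigma)  # external randomization, same scale as x
    return x + z, x - z

def recover(f, g):
    """Both parts together reconstruct the original observation."""
    return (f + g) / 2.0

if __name__ == "__main__":
    rng = random.Random(0)
    x = rng.gauss(2.0, 1.0)
    f, g = fission_gaussian(x, 1.0, rng)
    print(x, f, g, recover(f, g))
```

    One use under this assumption is to select a model on `f` and run inference on `g`, much as one would with a train/test split.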

  • Open

    [P] WebtoonMe Project: Selfie to Webtoon style (you can try the demo app for free)
    https://www.reddit.com/r/MachineLearning/comments/sfbtds/p_webtoonme_project_selfie_to_webtoon_style/?utm_source=share&utm_medium=web2x&context=3 project page: https://github.com/webtoon/WebtoonMe demo page: https://webtoon.github.io/WebtoonMe/app.html submitted by /u/jis478 [link] [comments]
    [P] GPT3 generation of news stories about AI
    Here's a fun little project I did today on a whim. I happen to have access to the OpenAI API, so I used their playground feature to generate AI headlines with their taglines. I fed it this prompt (sourced from the latest edition of Last Week in AI; I co-run it, apologies for the plug): Last week's top AI news: * Caltech unit creates AI helping drones to withstand violent winds - "Caltech researchers are developing a drone with rapidly reacting artificial intelligence (AI) capacities that allow it to adapt in flight to extreme wind similar to tornado or hurricane conditions." * How Deep Squeak, an AI program with a weird name, is detecting whales - "Artificial Intelligence is booming. And now an AI program is being used to search for whales." * Ex-golf pro links with Seattle-area AI e…  ( 3 min )
    [P] Lorcan Mini robot running fast with AOgmaNeo reinforcement learning
    Hi everyone, I decided to write another blog post finally. This one is about a RL demo we gave at a local conference, involving a tiny quadruped robot that learns to scramble across the floor very quickly. It learns by first mimicking a hand-made policy, and is then trained further in the real-world. Our technology is called Sparse Predictive Hierarchies (SPH), and the library that implements it is called AOgmaNeo. It's a biologically-inspired low-compute sparse online learning system. We are also working on a GPU version of SPH again, so I also included that in the post as well. Enjoy! https://ogma.ai/2022/06/aogmaneo-lorcan-mini-robot-demo-clogmaneo/ submitted by /u/CireNeikual [link] [comments]  ( 1 min )
    [R] Reading list of #ImplicitRepresentations and #NeRF papers relating to #Robotics
    Interested in a reading list of #ImplicitRepresentations and #NeRF papers relating to #Robotics? Check out this list of papers inspired by awesome-computer-vision. https://github.com/zubair-irshad/Awesome-Implicit-NeRF-Robotics… Feel free to share with others! Contributions/Suggestions are welcome. submitted by /u/KaleidoscopeBest1569 [link] [comments]  ( 1 min )
    [D][P] Grounding language to visual observation
    Hi, In my current project, I have a language observation and a visual observation that I would like to encode, both to the same context embedding. The language observation is a description of the visual observation. The goal is to ground the language in the observation. Ultimately, I need to have one Observation Encoder and one Language Encoder that take different inputs, but both output similar context vectors. What would be a technique to make that possible ? My first idea was to learn the Observation Encoder on another task, and then teach the Language Encoder to predict the same context vector as the Observation Encoder (minimizing cross-entropy). But there may be some better approach, maybe using techniques I'm not aware of. I looked briefly into Shared Latent Spaces, but was not sure that it would fit my problem statement. Was I wrong ? Do you guys know any other method I could look into ? Thanks ! submitted by /u/Maxtoq [link] [comments]  ( 1 min )
    [D] Looking for paper on infinite stacking of hyperparam optimizers
    A few years ago, I remember seeing a paper on using optimizers to optimize optimizers. The initial premise was that if you have a model and an optimizer, you need to optimize the hyperparams of the optimizer so you can add a sort of hyperoptimizer on top. But this hyperoptimizer also has hyperparams so they then explore what happens when you start stacking more and more of these hyperoptimizers on top of each other. I believe one of the conclusions was that in the limit, model behaviour ends up being independent of the top-level choice of hyperparameters. I've been trying to find this paper again recently but haven't been able to. Would greatly appreciate any help finding it! submitted by /u/ilia10000 [link] [comments]  ( 1 min )
    [P] Real-time AR for jewelry virtual try on that looks real, done with joliGAN, based on a few 2D videos and no 3D model
    A work from us with GANs recently emerged from stealth https://www.linkedin.com/feed/update/urn:li:activity:6939837590304899072/ The hands are real, but the rings are rendered with a GAN in real-time. A first network detects where to render the ring, a second network does the rendering. There's no 3D model, it's purely 2D to 2D. https://preview.redd.it/9qlgbkyeue491.png?width=1936&format=png&auto=webp&s=ceadda604db236dd3f7d8b665843e786512128b8 We thought we'd share some technical details since the underlying code, JoliGAN, is Open Source, https://github.com/jolibrain/joliGAN - The GAN uses a combination of mobile ResNets with attention as a Generator, along with a projected Discriminator [1]. Depending on the stone, we sometimes use transformers as well (customized Segformers and ViT mostly). A series of additional neural networks act as semantic constraints to the space of GAN transforms. - Real-time is achieved through our full C++ Open Source backend DeepDetect, https://github.com/jolibrain/deepdetect. We use CUDA along with OpenCV and TensorRT to chain multiple models (ring detection and generator mostly), and we make sure the data remain within CUDA memory at all times. This allows us to reach ~60 FPS on 1080Ti and 20% more on average on an RTX3090. JoliGAN is a powerful tool for domain to domain adaptation, with applications to AR, dataset augmentation, and sim2real transformation mostly. Documentation is scarce as the software is essentially used by us for solving our customers' problems. But hey, it's open :) [1] https://arxiv.org/abs/2111.01007 submitted by /u/pilooch [link] [comments]  ( 1 min )
    [Discussion] Should we still fly to conferences?
    Now that COVID appears to be less of a problem in many parts of the world, conferences are gradually returning to a physical format. But something has changed: we now know that online conferences are possible. Many here have probably had a mixed (very negative?) experience with the virtual conferences. It probably hasn't yet reached its best form to foster collaboration for the worldwide research community. But what would have been almost unimaginable before 2020 has now been tested repeatedly! This brings me to my question: Should we still burn insane amounts of plane fuel to fly to the other end of the planet to present a paper/poster and come back home 3 days later? Also, as a Ph.D. student, should I refuse to attend a conference because it is too far from where I work, knowing that th…  ( 6 min )
    [Discussion] Why is the Competing Conventions Problem in Neuroevolution a problem?
    The Competing Conventions problem or Permutation problem is a problem that occurs in neuroevolution. It arises when there is more than one way to represent a network as a genotype. The competing conventions problem. [Evolving neural networks through augmenting topologies; Stanley, Miikulainen; 2002] When two different genotypes that represent the same neural network are recombined during crossover, the emerging offspring is likely to be damaged and missing information. The figure above visualizes the problem for a small neural network. Since the order of the three hidden nodes A, B and C has no influence on the resulting function, the network can be represented by 3! = 6 different permutations. When two of these permutations are recombined during crossover, the resulting offspring is missing information. As depicted in the figure, the combination of [A, B, C] and [C, B, A] will result in either [A, B, A] or [C, B, C], both of which lack 1/3 of the main components that both their parents had. There is also the problem that the search space is enormously enlarged by all the permutations, but my question refers to the first part of the problem. Why is it a problem that the children of two genotype permutations of the same underlying neural network miss information from their parents? From my understanding, the point of crossover is also exploration, so why are these networks considered damaged, while in other situations it is considered innovation? Offspring is supposed to be different from its parents, otherwise change would only happen through mutations and be completely random. I have tried to find an explanation, but every paper just seems to see it as a given that the offspring is damaged. submitted by /u/loeffner [link] [comments]  ( 4 min )
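    The [A, B, C] x [C, B, A] example from the post can be reproduced with plain single-point crossover on lists. This is a minimal sketch assuming a genotype is just an ordered list of hidden-node labels. The function name is illustrative and does not come from NEAT or any library.

```python
def single_point_crossover(parent1, parent2, point):
    """Swap the tails of two equal-length genotypes at the given cut point."""
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

if __name__ == "__main__":
    p1 = ["A", "B", "C"]
    p2 = ["C", "B", "A"]  # same network, hidden nodes listed in another order
    for cut in (1, 2):
        print(cut, single_point_crossover(p1, p2, cut))
```

    For both interior cut points the children are [A, B, A] and [C, B, C]: each has lost one of the three hidden units, although both parents encode the same function.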
    [D] Extracting next action from conversation
    Hello people, I have an NLP problem and I would like some pointers about how to approach it. The problem is the following: I want to extract an action from a conversation transcript. Let's say we have a transcript of a conversation that ends in a certain decision (meet again, do this thing or send a message/email, etc.). I want to extract a sentence that summarizes the final intent of the conversation, for example, "Meet again tomorrow". I have considered different approaches for now: - Intent extraction models such as https://github.com/thuiar/textoir. My problem with this approach is that they are multi-label classifiers and usually focused on single-sentence classification: "Can you get me a table?" would be assigned to the "Reservation" label. I feel that I would lose information such as "Meet at 10PM in this address." - Question answering models that answer a question such as "What will they do after the conversation?". I have the feeling that QA models are not designed for this kind of task. I would really appreciate some pointers such as the name of this task in the NLP field. Thanks a lot for reading my post! submitted by /u/LanverYT [link] [comments]  ( 1 min )
    [P] Featureform: Open-Source Virtual Feature Store
    Hey everyone! We’re excited to announce the open-source version of Featureform, an extensible feature store. We’ve found that existing feature stores are either too heavy and replace your existing infrastructure, or don’t handle transformations at all and simply store features. We built a feature store that’s a happy medium between the two, it orchestrates your existing infrastructure to work like a feature store. We wrote more about this in our blog post. Check out the repo: https://github.com/featureform/featureform ​ https://preview.redd.it/vwpe0uypje491.png?width=2084&format=png&auto=webp&s=f81f7447f2c35081b2ae63e885506b9187a73d7b What Is Featureform Featureform is a virtual feature store. It enables data scientists to define, manage, and serve their ML model's features. Featuref…  ( 2 min )
    Measuring distances from known objects [P]
    I am a member of a Formula Student team that is building its first autonomous race car. Our track limits are defined by cones of known size placed on each side of the road, yellow on the right-hand side and blue on the left (see Images). Naturally, we are interested in measuring our distance from them so that we can map the circuit. I want your opinion on which method would yield the most accurate results. What we are currently doing is running Yolo(v5) to extract bounding boxes and then each box goes through an additional neural network that outputs 7 keypoints of the cone (see Images) and just because we know the exact positions of these keypoints relative to each other we can then turn it into a Perspective n-Point problem. https://preview.redd.it/nzbv4m1byd491.png?width=1218&format=png&auto=webp&s=f3f605767109cfc2a8e41d9b279779c542e27b10 https://preview.redd.it/n0y28oi9yd491.png?width=200&format=png&auto=webp&s=a90057af43b3f185e6fdd4755037505eee96c0b9 submitted by /u/Commercial_Put577 [link] [comments]  ( 1 min )
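    For comparison with the keypoint plus Perspective-n-Point pipeline described above, a much cruder baseline uses only the bounding-box height and the pinhole camera model. This is a sketch under stated assumptions, not the team's method: it assumes a calibrated focal length in pixels, an upright cone, and a box that tightly spans the cone's full height. The numeric values are made up for illustration.

```python
def distance_from_height(focal_px, real_height_m, pixel_height):
    """Pinhole model: distance = focal length (px) * real height (m) / image height (px)."""
    if pixel_height <= 0:
        raise ValueError("pixel_height must be positive")
    return focal_px * real_height_m / pixel_height

if __name__ == "__main__":
    # Hypothetical numbers: 1000 px focal length, 0.325 m cone, 65 px tall box.
    print(distance_from_height(1000.0, 0.325, 65.0))
```

    Running both estimators on the same frames gives a quick sanity check: large disagreement usually points to a loose bounding box or a bad keypoint rather than to the geometry.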
    [P][N] Just launched - nebulgym, a new open-source that accelerates AI training (~1.5-2x as of now) in a few lines of code without requiring you to change your training setup
    Training always takes too long. If it takes an hour, it would be better if it took 30 minutes, or maybe 15 minutes... or just 1 minute, why not? And if you want to speed up training, the techniques available usually require increasing the complexity of the training process, whether that means making a trade-off in terms of accuracy or spending developer time to learn a new framework. Oftentimes it's trial and error, playing with parameters, training recipes, or switching framework/model. That's definitely not ideal. “Fast & easy-to-use” These were the keywords that motivated me to work on a new way of doing training, the library nebulgym, which is now open-source (github link). Fast Training should be fast, period. Wouldn't it be great if in the near future you could train a GPT3 from scratch on your l…  ( 3 min )
    [D] What object detectors have the capability to harness relationship between its detected boxes?
    Typical object detectors do not employ relationships between the detected boxes. No context is involved. In my problem's case, there are two requirements that would lead to drastically better results if some form of context were shared across detected boxes. Requirement #1 It is a multi-class, but single-label problem. There are N classes, but each class can only appear with a minimum of 0 and a maximum of 1 instance. Hence, the detector kind of needs to know whether the other detections have already predicted a given class. Requirement #2 There is some form of ordering between the predictions based on their proximity to each other. For example, Class 4 should only appear near Class 5-6 and Class 2-3, but should not be anywhere near Class 32. Is there any architecture that is optimized for this kind of object detection? submitted by /u/sarmientoj24 [link] [comments]  ( 1 min )
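    Requirement #1 can at least be enforced after the fact, whatever detector is used. The sketch below is a post-processing step, not an architecture: it keeps only the highest-scoring box per class. The detection format (dicts with `cls`, `score` and `box` keys) is an assumption made for illustration.

```python
def keep_best_per_class(detections):
    """Keep at most one detection per class: the one with the highest score."""
    best = {}
    for det in detections:
        cls = det["cls"]
        if cls not in best or det["score"] > best[cls]["score"]:
            best[cls] = det
    # Sort by class id so the output order is deterministic.
    return sorted(best.values(), key=lambda d: d["cls"])

if __name__ == "__main__":
    dets = [
        {"cls": 4, "score": 0.60, "box": (10, 10, 50, 50)},
        {"cls": 4, "score": 0.90, "box": (12, 11, 52, 49)},
        {"cls": 5, "score": 0.75, "box": (60, 10, 100, 50)},
    ]
    print(keep_best_per_class(dets))
```

    This does nothing for Requirement #2. The neighbourhood constraint needs either a learned relation module or a separate assignment step over the candidate boxes.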
    [D] ML/DL computer build with PCIe 5.0 x8 lanes for RTX 3090
    I'm building my first ML/DL computer around ASUS ProArt Z690 motherboard, which has 2 PCIe 5.0 slots (x8 each) and PCIe 3.0 x16 slot. The CPU is i9-12900K, which comes with 20 PCIe lanes. Since 4 lanes will go to a single NVMe, I think the motherboard will split the two PCIe 5.0 slots into 8x lanes each. My current build is with a single RTX 3080 12GB, but I want to be able to upgrade to 2x3090Ti (or 2x4090) in the future, if needed. This article from 2018 seems to imply that DL is unaffected even when PCIe 4.0 4x are used for up to 2 GPUs. I just want to confirm that this is still considered sound advice. In other words, 2x3090 GPUs won't be throttled by PCIe 5.0 x8 lanes, which is equivalent to PCIe 4.0 x16 (see matrix below). In fact, I'm also wondering if running PCIe 5.0 even at x2 lanes each won't throttle the GPUs since the equivalent transfer rate is still PCIe 4.0 x4, as mentioned in that article. Or does the fact that the motherboard interface is PCIe 5.0 not matter since the GPU can support only up to PCIe 4.0 speeds? Any other comments on my build would be welcome: PCPartpicker. This will be an everyday computer as well to edit photos/videos and used for other analyses, hence the more powerful CPU and the NVMe, which might not matter as much for ML. ​ PCIe Lane vs Speed matrix submitted by /u/Scapius [link] [comments]  ( 7 min )
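    The lane arithmetic in the post can be checked with the published per-lane transfer rates (8, 16 and 32 GT/s for PCIe 3.0, 4.0 and 5.0, all with 128b/130b encoding). The sketch below computes theoretical one-direction bandwidth and ignores protocol overhead. The helper names are illustrative, and `negotiated_link` encodes the rule that a link runs at the highest generation and width both ends support.

```python
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}  # transfer rate per lane, by PCIe generation
ENCODING = 128.0 / 130.0               # 128b/130b line coding efficiency

def bandwidth_gb_s(gen, lanes):
    """Theoretical one-direction bandwidth in GB/s for a PCIe link."""
    return GT_PER_S[gen] * ENCODING / 8.0 * lanes

def negotiated_link(slot_gen, slot_lanes, card_gen, card_lanes):
    """A link trains to the lowest common generation and lane count."""
    return min(slot_gen, card_gen), min(slot_lanes, card_lanes)

if __name__ == "__main__":
    print(bandwidth_gb_s(5, 8), bandwidth_gb_s(4, 16))  # equal on paper
    gen, lanes = negotiated_link(5, 8, 4, 16)           # PCIe 4.0 x16 card in a 5.0 x8 slot
    print(gen, lanes, bandwidth_gb_s(gen, lanes))
```

    On paper PCIe 5.0 x8 does match PCIe 4.0 x16, but a PCIe 4.0 card in a 5.0 x8 slot trains to 4.0 x8, roughly 15.75 GB/s per direction. That is the figure to compare against the older benchmarks.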
    [R] What are some interesting and mysterious open problems of generalization in ML?
    I find the generalization problems of machine learning, especially in deep learning, very attractive, and I wonder what some of the attractive open problems are nowadays. I know about the double descent problem, which I believe is quite interesting and does not have a settled answer at this moment. I also know about the implicit inductive bias introduced by SGD, but it seems to have been studied widely recently, especially with the tool of NTK. What are some other interesting phenomena like these? submitted by /u/pizzaUnderSea [link] [comments]  ( 2 min )
    [R] Differentiable Finite State Machines (Blog Post)
    submitted by /u/hardmaru [link] [comments]
    [R] Intra-agent speech permits zero-shot task acquisition
    submitted by /u/hardmaru [link] [comments]
    [R] From data to functa: Your data point is a function and you can treat it like one
    submitted by /u/hardmaru [link] [comments]  ( 1 min )
  • Open

    Avoid PyBullet collision between gripper and object
    Hello, I am developing an environment in pyBullet for RL policies and I am trying to simplify some stuff. Basically, I have a Sawyer robot that would need to grip something. Let me show you a video so I can explain the issue: https://i.redd.it/pcp2lqf67h491.gif As you can see when the gripper collides with the 'table' it closes due to the collision forces (i am assuming). However, I would like to 'disable' such a thing and make sure that the gripper doesn't move further due to external forces. How could I do this? Is there a pyBullet method to do so? Would I need to change the URDF of the robot? Thanks for the help submitted by /u/gabrigoo [link] [comments]  ( 1 min )
    [P] Lorcan Mini robot running fast with AOgmaNeo reinforcement learning
    submitted by /u/CireNeikual [link] [comments]  ( 1 min )
    How to run parallel for-loop with reinforcement learning inside? Parallelized version gives incorrect output.
    I cannot for the life of me figure out what I'm doing wrong. I'm using StableBaselines3 in Google Colab. I am trying to basically do some cross validation to search for hyperparams for a reinforcement learning model. I know that SB3 has some functions to allow parallelization of agents (multiple agents, multiprocessing), but I cannot use it because I am using a wrapper called ActionMasker, which doesn't work with the multiprocessing of SB3. To be clear: My RL agent's environment is determined by a data table (not an 'simulation" environment like a game). Basically, the code is running an outer for-loop which is supposed to shift a window along a data table, where models are trained with different parameters (inner for loop), best parameters determined, and then one model is trained on …  ( 2 min )
    should different actions have their own output slot even if not a valid action based on state?
    For my problem at every state of an episode the agent will always have two actions to choose from. One action is actually a "non-action" and the other is the "action", but depending on the state the action can mean two very different things, such that they are really two separate actions. My current line of thinking is that I do not want my model to spend any effort on trying to predict the reward for an invalid action, so I just have two outputs and try to let my model decide what action it is actually taking based on the input state. (To be clear I have about 300 inputs, and one input is a binary that defines which action is actually taken). I think that theoretically the model should be able to figure this out. Does this method have any merit? Or should I really have 3 outputs, let my model waste efforts modelling the reward for an invalid output (and just do an argmax for the valid actions), for the tradeoff of us clearly delineating the different types of actions that can be taken so that my model doesn't have to try and figure it out based on the input state. submitted by /u/Yogi_DMT [link] [comments]  ( 1 min )
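    The three-output alternative described above amounts to action masking: keep one output per distinct action and take the argmax over the valid ones only. A minimal sketch, assuming output 0 is the non-action and the binary input selects which of outputs 1 and 2 is currently legal. The names and the index layout are illustrative.

```python
def valid_mask(mode_flag):
    """Index 0: non-action (always valid); 1 and 2: the two state-dependent actions."""
    return [True, mode_flag == 0, mode_flag == 1]

def masked_argmax(q_values, valid):
    """Index of the largest Q-value among valid actions."""
    best = None
    for i, (q, ok) in enumerate(zip(q_values, valid)):
        if ok and (best is None or q > q_values[best]):
            best = i
    if best is None:
        raise ValueError("no valid action")
    return best

if __name__ == "__main__":
    q = [0.1, 0.9, 0.5]
    print(masked_argmax(q, valid_mask(0)))  # output 1 is legal here
    print(masked_argmax(q, valid_mask(1)))  # output 1 is masked, falls back to 2
```

    With a Q-learning style update only the output of the action actually taken receives a gradient, so the masked output is never trained on invalid transitions. The same mask should also be applied when computing the bootstrap target.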
    Theoretical Research in RL?
    Hello! I am currently doing a course in reinforcement learning and am planning to do my master's thesis in the fall term, so I am starting to think about a topic. I would definitely like to do theoretical research without much coding. Coding for experiments is cool and fine, but the main part of the work shouldn't be coding. As far as I can see, the whole field is dominated by 'hands-on coding research'. Therefore, I am asking here: Are there research topics in reinforcement learning which target theoretical aspects? (Convergence, Analysis of algorithms, Approximation guarantees, ...) If any of you has an idea or a starting paper for me I would really appreciate it! submitted by /u/Insighteous [link] [comments]  ( 1 min )
    Let’s learn about Deep Q-Learning by training our agent to play Space Invaders (Deep Reinforcement Learning Free Class by Hugging Face 🤗)
    Hey there! We just published the third Unit of Deep Reinforcement Learning Class 🥳. In this Unit, you'll learn about Deep Q-Learning and train a DQN agent to play Atari games using RL-Baselines3-Zoo. You’ll be able to compare the results of your Q-Learning agent using the leaderboard. The Deep Q-Learning chapter 👉 https://huggingface.co/blog/deep-rl-dqn The hands-on 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit3/unit3.ipynb The leaderboard 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard https://i.redd.it/mq8fqnmkxe491.gif Deep RL Class is a free, self-paced course, from beginner to expert, where you’ll get solid foundations of Deep Reinforcement Learning in theory and practice, with hands-on exercises using famous RL libraries such as SB3, RL-Baselines3-Zoo, RLlib, CleanRL… You can sign up here 👉 http://eepurl.com/h1pElX And if you have questions or feedback, I would love to hear them. submitted by /u/cranthir_ [link] [comments]  ( 1 min )
    Performance of RL vs supervised learning
    I was wondering if there were any studies directly comparing the two. I want to predict the next state in an environment and can either use RL to do so or generate a dataset and do supervised learning on that. Which do you hypothesise to be better and why? submitted by /u/SuperDuperDooken [link] [comments]  ( 1 min )
    Looking for implementation of normalised percentiles for evaluating RL agents
    I was wondering where I could find a software implementation of the technique used in the work "Open-Ended Learning Leads to Generally Capable Agents" (https://arxiv.org/abs/2107.12808) for evaluating agents. It requires computing normalised percentiles and pareto dominance and is described in Section 4.1. submitted by /u/dr_cosmicomical [link] [comments]  ( 1 min )
    Inference with Rainbow
    Hi guys! I am using Rainbow for an environment, and I see progress in the training logs. However, when I test my model checkpoints, the agent commits to only one action and, of course, does not achieve the performance shown in training. What do you think the causes could be? Is there anything specific that has to be done with Rainbow at inference time? Thank you! submitted by /u/xWh0am1 [link] [comments]  ( 1 min )
    Have you used any good DRL library?
    Hey, friends, have you used any good DRL libraries? I hope you can recommend some to me! Also, what should I pay attention to when choosing a library? I found this summary on GitHub, and it looks pretty complete: https://github.com/wwxFromTju/awesome-reinforcement-learning-lib submitted by /u/AnnualGas3585 [link] [comments]  ( 1 min )
  • Open

    Stanford AI Researchers Propose ‘LinkBERT’: A New Pretraining Method That Improves Language Model Training with Document Links
    👉 LinkBERT consists of three steps: (1) obtaining links between documents to build a document graph from the text corpus, (2) creating link-aware training instances from the graph by placing linked documents together, and finally (3) pretraining the LM with link-aware self-supervised tasks: masked language modeling (MLM) and document relation prediction (DRP). 👉 LinkBERT is especially effective for multi-hop reasoning and few-shot QA (+5% absolute improvement on HotpotQA and TriviaQA) Continue reading | Check out the paper, github and blog post submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    DISCO DIFFUSION 3D AI ART ANIMATION | TRANQUIL BLISS
    submitted by /u/Available_Tadpole829 [link] [comments]
    New Artificial Skin Lets Bionic Arm Or AI Robot Touch & Feel With Extreme Sensitivity | Photonic Chip Processes & Classifies 2 Billion Images Per Second Without Memory Device
    submitted by /u/getrich_or_diemining [link] [comments]
    DISCO DIFFUSION 3D AI ART ANIMATION | VANAHEIM HOME OF THE VANIR GODS
    submitted by /u/Available_Tadpole829 [link] [comments]
    Lamp Vase.
    submitted by /u/cookingandcraft [link] [comments]
    in this article, we showcase how to build an NLP project from zero to hero
    submitted by /u/UBIAI [link] [comments]
    Aquaman - Neural-Art Parody / [4K] Creative Experiment w/ GPT-3, VQGAN+CLIP
    submitted by /u/MLInsights [link] [comments]
    Is it possible
    Is it possible to make an AI that plays games with you, for any game that allows two players or split screen? submitted by /u/OrdinarySlight6992 [link] [comments]  ( 1 min )
    Self study plan for AI?
    I am a recent high school graduate. I have been very eager to begin dabbling with AI this summer. So far, I have been following "Artificial Intelligence: A Modern Approach" and I have reached the second chapter over the past few weeks, but I do not yet have a solid learning plan. I just study bit by bit every other day. I would like to form a solid plan for this summer and I was wondering if anyone has any advice for me. I've completed Calculus 1 in school and I am considering studying Linear Algebra along with AI, but I would like to have some advice on that as well. Is going through AIMA over summer a good plan? Should I start linear algebra along with it? How do I make a study plan that will make me end up actually learning something by the end of summer? If AIMA is not the best resource for my case, what do you recommend for me to follow and what kind of plan should I build? Thank you so much in advance! submitted by /u/obvslynot [link] [comments]  ( 2 min )
    ML is way more fun when you learn/work with someone. A Discord server where anyone learning/working in ML can come and share their projects, learn together, find jobs, and much more now with 25'000+ members.
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
    Open AI...asking for a phone number => not Open then (personal data)
    submitted by /u/the_anonymizer [link] [comments]
    Just launched - nebulgym, a new open-source that accelerates AI training (~1.5-2x as of now) in a few lines of code without requiring you to change your training setup
    Training always takes too long. If it takes an hour, it would be better if it took 30 minutes, or maybe 15 minutes... or just 1 minute, why not? And if you want to speed up training, the techs available usually require to increase the complexity of the training process, whether it's making trade-off in terms of accuracy or time for the developer to learn a new framework. Often times it's trial and error, playing with parameters, training recipes, or switching framework/model. That's definitely not ideal. “Fast & easy-to-use” These were keywords that motivated me to work on a new way of doing training, the library nebulgym, which now is open-source (github link). Fast Training should be fast, period. Wouldn't it be great if in the near future you could train a GPT3 from scratch on your l…  ( 2 min )
    Awesome AI R&D content (with code!) on Computer Vision News of June 2022
    Dear all, Here is awesome AI R&D content (with code!) on Computer Vision News of June 2022. Many great articles (with videos) about AI, Deep Learning, Computer Vision and more... Review of award-winning CRAS2022 and ICLR2022 papers. HTML5 version (recommended) PDF version Dilbert on page 2. Free subscription on page 66. Enjoy! https://preview.redd.it/892fxez70d491.jpg?width=400&format=pjpg&auto=webp&s=92fd215861578e2c8082ba1c60d2643749eb36a5 submitted by /u/Gletta [link] [comments]
    Doctor Strange in the Multiverse of Madness - Neural-Art Parody [4K 60 FPS]
    submitted by /u/MLInsights [link] [comments]
    DALL-E Mini nailed it
    submitted by /u/OneFinding1429 [link] [comments]  ( 1 min )
    Love: A Powerful Force! - [4K 60 FPS] Computer Generated Art
    submitted by /u/MLInsights [link] [comments]  ( 1 min )
  • Open

    Integrate Amazon Lex and Uneeq’s digital human platform
    In today’s digital landscape, customers are expecting a high-quality experience that is responsive and delightful. Chatbots and virtual assistants have transformed the customer experience from a point-and-click or a drag-and-drop experience to one that is driven by voice or text. You can create a more engaging experience by further augmenting the interaction with a visual […]  ( 6 min )
    Easily create and store features in Amazon SageMaker without code
    Data scientists and machine learning (ML) engineers often prepare their data before building ML models. Data preparation typically includes data preprocessing and feature engineering. You preprocess data by transforming data into the right shape and quality for training, and you engineer features by selecting, transforming, and creating variables when building a predictive model. Amazon SageMaker […]  ( 9 min )
  • Open

    New Photonics AI Chip Processes & Classifies 2 Billion Images Per Second Without Using Memory Device
    submitted by /u/tohelpyou88 [link] [comments]
    This cheat sheet provides you with six steps that you can go through to make neural networks in Python with the Keras library.
    submitted by /u/joanna58 [link] [comments]
  • Open

    Infinite periodic table
    All the chemical elements discovered or created so far follow a regular pattern in how their electrons are arranged: the nth shell contains up to 2n – 1 suborbitals that each contain up to two electrons. For a given atomic number, you can determine how its electrons are distributed into shells and suborbitals using the […] Infinite periodic table first appeared on John D. Cook.  ( 2 min )
  • Open

    Stunning Insights from James Webb Space Telescope Are Coming, Thanks to GPU-Powered Deep Learning
    NVIDIA GPUs will play a key role interpreting data streaming in from the James Webb Space Telescope, with NASA preparing to release next month the first full-color images from the $10 billion scientific instrument. The telescope’s iconic array of 18 interlocking hexagonal mirrors, which span a total of 21 feet 4 inches, will be able […] The post Stunning Insights from James Webb Space Telescope Are Coming, Thanks to GPU-Powered Deep Learning appeared first on NVIDIA Blog.  ( 4 min )
  • Open

    DSC Weekly 7 June 2022
    Announcements Building a successful data architecture strategy continues to challenge businesses as data management growth and innovation continues through 2022. Discover the blueprint for managing data by joining the Data Architecture & Engineering summit and get ahead with the latest technologies to remain competitive. Companies must effectively manage hybrid cloud operations to manage risk and leverage its… The post DSC Weekly 7 June 2022 appeared first on Data Science Central.  ( 7 min )
  • Open

    Parotid Gland MRI Segmentation Based on Swin-Unet and Multimodal Images. (arXiv:2206.03336v1 [eess.IV])
    Parotid gland tumors account for approximately 2% to 10% of head and neck tumors. Preoperative tumor localization, differential diagnosis, and subsequent selection of appropriate treatment for parotid gland tumors are critical. However, the relative rarity of these tumors and the highly dispersed tissue types have left an unmet need for a subtle differential diagnosis of such neoplastic lesions based on preoperative radiomics. Recently, deep learning methods have developed rapidly, and Transformers in particular have outperformed traditional convolutional neural networks in computer vision. Many new Transformer-based networks have been proposed for computer vision tasks. In this study, multicenter multimodal parotid gland MRI images were collected, and the Transformer-based Swin-Unet was used. MRI images of the STIR, T1 and T2 modalities were combined into three-channel data to train the network. We achieved segmentation of the region of interest for the parotid gland and tumor. The DSC of the model on the test set was 88.63%, MPA was 99.31%, MIoU was 83.99%, and HD was 3.04. A series of comparison experiments was then designed to further validate the segmentation performance of the algorithm.  ( 2 min )
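    For readers unfamiliar with the overlap metrics quoted above, here is a small sketch of how a Dice similarity coefficient (DSC) and an intersection-over-union (IoU) score are commonly computed for one pair of binary masks. It is not the paper's evaluation code; the flat 0/1 list representation, the function name, and the empty-mask convention are assumptions made for illustration. MIoU would be the mean of such IoU values over classes.

```python
def dice_and_iou(pred, target):
    """Compute Dice and IoU for two equally sized binary masks (flat lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    # Convention (an assumption here): two empty masks count as a perfect match.
    if pred_sum + target_sum == 0:
        return 1.0, 1.0
    union = pred_sum + target_sum - intersection
    dice = 2.0 * intersection / (pred_sum + target_sum)
    iou = intersection / union
    return dice, iou


# Toy masks (made up): 2 overlapping pixels, 3 predicted, 3 in the ground truth.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
dice, iou = dice_and_iou(pred, target)
print(round(dice, 4), round(iou, 4))
```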
    Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse. (arXiv:2206.03126v1 [cs.LG])
    Transformers have achieved remarkable success in several domains, ranging from natural language processing to computer vision. Nevertheless, it has been recently shown that stacking self-attention layers - the distinctive architectural component of Transformers - can result in rank collapse of the tokens' representations at initialization. The question of if and how rank collapse affects training is still largely unanswered, and its investigation is necessary for a more comprehensive understanding of this architecture. In this work, we shed new light on the causes and the effects of this phenomenon. First, we show that rank collapse of the tokens' representations hinders training by causing the gradients of the queries and keys to vanish at initialization. Furthermore, we provide a thorough description of the origin of rank collapse and discuss how to prevent it via an appropriate depth-dependent scaling of the residual branches. Finally, our analysis unveils that specific architectural hyperparameters affect the gradients of queries and values differently, leading to disproportionate gradient norms. This suggests an explanation for the widespread use of adaptive methods for Transformers' optimization.  ( 2 min )
    Deep Learning-based FEA surrogate for sub-sea pressure vessel. (arXiv:2206.03322v1 [cs.LG])
    During the design process of an autonomous underwater vehicle (AUV), the pressure vessel has a critical role. The pressure vessel contains dry electronics, power sources, and other sensors that cannot be flooded. A traditional design approach for a pressure vessel involves running multiple Finite Element Analysis (FEA) based simulations and optimizing the design to find the most suitable one that meets the requirements. Running these FEAs is computationally very costly for any optimization process, and it becomes difficult to run even hundreds of evaluations. In such a case, a better approach is surrogate design, with the goal of replacing FEA-based prediction with a learning-based regressor. Once the surrogate is trained for a class of problems, the learned response surface can be used to analyze the stress effect without running the FEA for that class of problems. The challenge of creating a surrogate for a class of problems is data generation. Since the process is computationally costly, it is not possible to densely sample the design space, and learning the response surface on a sparse data set becomes difficult. During experimentation, we observed that a Deep Learning-based surrogate outperforms other regression models on such sparse data. In the present work, we utilize the Deep Learning-based model to replace the costly finite element analysis-based simulation process. With the surrogate, predictions for other designs are obtained much faster than with direct Finite Element Analysis. We also compared our DL-based surrogate with other classical Machine Learning (ML) based regression models (random forest and gradient boosting regressors). We observed that on sparser data, the DL-based surrogate performs much better than the other regression models.  ( 2 min )
    On Recoverability of Graph Neural Network Representations. (arXiv:2201.12843v2 [cs.LG] UPDATED)
    Despite their growing popularity, graph neural networks (GNNs) still have multiple unsolved problems, including lack of embedding expressiveness, propagation of information to distant nodes, and training on large-scale graphs. Understanding the roots of and providing solutions for such problems require developing analytic tools and techniques. In this work, we propose the notion of recoverability, which measures the amount of information contained in a random variable for being able to recover another one from it. We provide a method for an efficient empirical estimation of recoverability, demonstrate a tight relationship of it to information aggregation in GNNs, and show how this new concept can be used in unsupervised graph representation learning. We demonstrate, through extensive experimental results on various datasets and different GNN architectures, that estimated recoverability correlates with aggregation method expressivity and graph sparsification quality, that GNN representations can be learned using our unsupervised approach, and that recoverability regularization can mitigate the accuracy drop caused by increasing GNN depth. The code to reproduce our experiments is available at https://github.com/Anonymous1252022/Recoverability  ( 2 min )
    Accurate Virus Identification with Interpretable Raman Signatures by Machine Learning. (arXiv:2206.02788v1 [q-bio.QM])
    Rapid identification of newly emerging or circulating viruses is an important first step toward managing the public health response to potential outbreaks. A portable virus capture device coupled with label-free Raman Spectroscopy holds the promise of fast detection by rapidly obtaining the Raman signature of a virus followed by a machine learning approach applied to recognize the virus based on its Raman spectrum, which is used as a fingerprint. We present such a machine learning approach for analyzing Raman spectra of human and avian viruses. A Convolutional Neural Network (CNN) classifier specifically designed for spectral data achieves very high accuracy for a variety of virus type or subtype identification tasks. In particular, it achieves 99% accuracy for classifying influenza virus type A vs. type B, 96% accuracy for classifying four subtypes of influenza A, 95% accuracy for differentiating enveloped and non-enveloped viruses, and 99% accuracy for differentiating avian coronavirus (infectious bronchitis virus, IBV) from other avian viruses. Furthermore, interpretation of neural net responses in the trained CNN model using a full-gradient algorithm highlights Raman spectral ranges that are most important to virus identification. By correlating ML-selected salient Raman ranges with the signature ranges of known biomolecules and chemical functional groups (for example, amide, amino acid, carboxylic acid), we verify that our ML model effectively recognizes the Raman signatures of proteins, lipids and other vital functional groups present in different viruses and uses a weighted combination of these signatures to identify viruses.  ( 3 min )
    Look Back When Surprised: Stabilizing Reverse Experience Replay for Neural Approximation. (arXiv:2206.03171v1 [cs.LG])
    Experience replay methods, which are an essential part of reinforcement learning (RL) algorithms, are designed to mitigate spurious correlations and biases while learning from temporally dependent data. Roughly speaking, these methods allow us to draw batched data from a large buffer such that these temporal correlations do not hinder the performance of descent algorithms. In this experimental work, we consider the recently developed and theoretically rigorous reverse experience replay (RER), which has been shown to remove such spurious biases in simplified theoretical settings. We combine RER with optimistic experience replay (OER) to obtain RER++, which is stable under neural function approximation. We show via experiments that this has a better performance than techniques like prioritized experience replay (PER) on various tasks, with a significantly smaller computational complexity. It is well known in the RL literature that choosing examples greedily with the largest TD error (as in OER) or forming mini-batches with consecutive data points (as in RER) leads to poor performance. However, our method, which combines these techniques, works very well.  ( 2 min )
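    The two ingredients named in the abstract can be illustrated with a toy sketch: pick the transition with the largest TD error, then form a batch of the consecutive transitions that precede it, newest first. This is only one plausible way to combine the two ideas and is not the authors' RER++ algorithm; the function names, the string placeholders for transitions, and the TD-error values are hypothetical.

```python
def most_surprising_index(td_errors):
    """Index of the transition with the largest absolute TD error."""
    if not td_errors:
        raise ValueError("empty buffer")
    return max(range(len(td_errors)), key=lambda i: abs(td_errors[i]))


def reverse_window(buffer, end_index, batch_size):
    """Return up to batch_size consecutive transitions ending at end_index,
    ordered from newest to oldest (the reverse-replay order)."""
    if not 0 <= end_index < len(buffer):
        raise IndexError("end_index outside the buffer")
    start = max(0, end_index - batch_size + 1)
    return buffer[start:end_index + 1][::-1]


# Toy buffer: strings stand in for (state, action, reward, next_state) tuples.
buffer = [f"t{i}" for i in range(8)]
td_errors = [0.1, 0.3, 0.2, 0.05, 0.9, 0.4, 0.1, 0.2]  # made-up values

pivot = most_surprising_index(td_errors)
batch = reverse_window(buffer, pivot, batch_size=3)
print(pivot, batch)
```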
    Cycle-Consistent World Models for Domain Independent Latent Imagination. (arXiv:2110.00808v2 [cs.LG] UPDATED)
    End-to-end autonomous driving seeks to solve the perception, decision, and control problems in an integrated way, which can be easier to generalize at scale and can adapt better to new scenarios. However, high costs and risks make it very hard to train autonomous cars in the real world. Simulations can therefore be a powerful tool to enable training. Due to slightly different observations, agents trained and evaluated solely in simulation often perform well there but have difficulties in real-world environments. To tackle this problem, we propose a novel model-based reinforcement learning approach called Cycle-consistent World Models (CCWM). Contrary to related approaches, our model can embed two modalities in a shared latent space and thereby learn from samples in one modality (e.g., simulated data) and be used for inference in a different domain (e.g., real-world data). Our experiments using different modalities in the CARLA simulator showed that this enables CCWM to outperform state-of-the-art domain adaptation approaches. Furthermore, we show that CCWM can decode a given latent representation into semantically coherent observations in both modalities.  ( 2 min )
    Mean Estimation in High-Dimensional Binary Markov Gaussian Mixture Models. (arXiv:2206.02455v2 [math.ST] UPDATED)
    We consider a high-dimensional mean estimation problem over a binary hidden Markov model, which illuminates the interplay between memory in data, sample size, dimension, and signal strength in statistical inference. In this model, an estimator observes $n$ samples of a $d$-dimensional parameter vector $\theta_{*}\in\mathbb{R}^{d}$, multiplied by a random sign $S_i$ ($1\le i\le n$), and corrupted by isotropic standard Gaussian noise. The sequence of signs $\{S_{i}\}_{i\in[n]}\in\{-1,1\}^{n}$ is drawn from a stationary homogeneous Markov chain with flip probability $\delta\in[0,1/2]$. As $\delta$ varies, this model smoothly interpolates two well-studied models: the Gaussian Location Model for which $\delta=0$ and the Gaussian Mixture Model for which $\delta=1/2$. Assuming that the estimator knows $\delta$, we establish a nearly minimax optimal (up to logarithmic factors) estimation error rate, as a function of $\|\theta_{*}\|,\delta,d,n$. We then provide an upper bound to the case of estimating $\delta$, assuming a (possibly inaccurate) knowledge of $\theta_{*}$. The bound is proved to be tight when $\theta_{*}$ is an accurately known constant. These results are then combined into an algorithm which estimates $\theta_{*}$ with $\delta$ unknown a priori, and theoretical guarantees on its error are stated.  ( 2 min )
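    The observation model in the abstract can be written as $X_i = S_i\theta_{*} + Z_i$, where the names $X_i$ and $Z_i$ (standard Gaussian noise) are notation introduced here rather than taken from the abstract. The sketch below draws samples from that model so the two endpoints, $\delta=0$ and $\delta=1/2$, can be inspected; it is a simulation aid under these assumptions, not the paper's estimator, and the function name and seed handling are hypothetical.

```python
import random


def simulate_markov_sign_model(theta, n, delta, seed=0):
    """Draw n samples X_i = S_i * theta + Z_i with Markov-dependent signs.

    theta : list of floats, the d-dimensional parameter vector
    delta : probability that the sign flips between consecutive samples
    """
    if not 0.0 <= delta <= 0.5:
        raise ValueError("delta must lie in [0, 1/2]")
    rng = random.Random(seed)
    # The symmetric two-state chain has a uniform stationary distribution,
    # so the first sign is drawn uniformly from {-1, +1}.
    sign = rng.choice([-1, 1])
    signs, samples = [], []
    for _ in range(n):
        signs.append(sign)
        samples.append([sign * t + rng.gauss(0.0, 1.0) for t in theta])
        if rng.random() < delta:
            sign = -sign
    return signs, samples


# delta = 0 never flips the sign (Gaussian location model up to a global sign);
# delta = 1/2 makes consecutive signs independent (Gaussian mixture model).
signs, samples = simulate_markov_sign_model([1.0, -2.0], n=50, delta=0.0, seed=1)
print(len(samples), len(samples[0]), set(signs))
```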
    A Machine Learning Tutorial for Operational Meteorology, Part I: Traditional Machine Learning. (arXiv:2204.07492v2 [physics.ao-ph] UPDATED)
    Recently, the use of machine learning in meteorology has increased greatly. While many machine learning methods are not new, university classes on machine learning are largely unavailable to meteorology students and are not required to become a meteorologist. The lack of formal instruction has contributed to the perception that machine learning methods are 'black boxes', and thus end-users are hesitant to apply the machine learning methods in their everyday workflow. To reduce the opaqueness of machine learning methods and lower hesitancy towards machine learning in meteorology, this paper provides a survey of some of the most common machine learning methods. A familiar meteorological example is used to contextualize the machine learning methods while also discussing machine learning topics using plain language. The following machine learning methods are demonstrated: linear regression; logistic regression; decision trees; random forest; gradient boosted decision trees; naive Bayes; and support vector machines. Beyond discussing the different methods, the paper also contains discussions on the general machine learning process as well as best practices to enable readers to apply machine learning to their own datasets. Furthermore, all code (in the form of Jupyter notebooks and Google Colaboratory notebooks) used to make the examples in the paper is provided in an effort to catalyse the use of machine learning in meteorology.  ( 2 min )
    Forecasting COVID-19 cases using Statistical Models and Ontology-based Semantic Modelling: A real time data analytics approach. (arXiv:2206.02795v1 [q-bio.PE])
    COVID-19 is the most prominent issue that many countries face today. The frequent changes in the numbers of infections, recoveries and deaths reflect the dynamic nature of this pandemic. It is crucial to predict the spreading rate of the virus in order to make accurate decisions about preventing infection and about tracking and controlling virus transmission in the community. We develop a prediction model using statistical time series models such as SARIMA and FBProphet to monitor the daily active, recovered and death cases of COVID-19 accurately. Then, using various details of each individual patient (such as height, weight and gender), we design a set of rules using the Semantic Web Rule Language and some mathematical models for dealing with COVID-19 infected cases on an individual basis. After combining all the models, a COVID-19 ontology is developed, and various SPARQL queries are run on it to accumulate risk factors and provide appropriate diagnoses, precautions and preventive suggestions for COVID patients. After comparing the performance of SARIMA and FBProphet, we observe that the SARIMA model performs better in forecasting COVID cases. For individual COVID case prediction, approximately 497 individual samples were tested and classified into five COVID classes: Having COVID, No COVID, High Risk COVID case, Medium to High Risk case, and Control needed case.  ( 2 min )
    Future Artificial Intelligence tools and perspectives in medicine. (arXiv:2206.03289v1 [cs.LG])
    Purpose of review: Artificial intelligence (AI) has become popular in medical applications, specifically as a clinical support tool for computer-aided diagnosis. These tools are typically employed on medical data (e.g., images, molecular data, clinical variables) and use statistical and machine learning methods to measure model performance. In this review, we summarize and discuss the most recent radiomic pipeline used for clinical analysis. Recent findings: Currently, only a limited part of cancer management benefits from artificial intelligence, mostly related to computer-aided diagnosis that avoids a biopsy, which presents additional risks and costs. Most AI tools are based on imaging features, known as radiomic analysis, which can be refined into predictive models from non-invasively acquired imaging data. This review explores the progress of AI-based radiomic tools for clinical applications with a brief description of the necessary technical steps. A description of new radiomic approaches based on deep learning techniques shows how the new radiomic models (deep radiomic analysis) can benefit from deep convolutional neural networks and be applied to limited data sets. Summary: To consider the radiomic algorithms, further investigations are recommended to involve deep learning in radiomic models with additional validation steps on various cancer types.  ( 2 min )
    Beyond Faithfulness: A Framework to Characterize and Compare Saliency Methods. (arXiv:2206.02958v1 [cs.LG])
    Saliency methods calculate how important each input feature is to a machine learning model's prediction, and are commonly used to understand model reasoning. "Faithfulness", or how fully and accurately the saliency output reflects the underlying model, is an oft-cited desideratum for these methods. However, explanation methods must necessarily sacrifice certain information in service of user-oriented goals such as simplicity. To that end, and akin to performance metrics, we frame saliency methods as abstractions: individual tools that provide insight into specific aspects of model behavior and entail tradeoffs. Using this framing, we describe a framework of nine dimensions to characterize and compare the properties of saliency methods. We group these dimensions into three categories that map to different phases of the interpretation process: methodology, or how the saliency is calculated; sensitivity, or relationships between the saliency result and the underlying model or input; and, perceptibility, or how a user interprets the result. As we show, these dimensions give us a granular vocabulary for describing and comparing saliency methods -- for instance, allowing us to develop "saliency cards" as a form of documentation, or helping downstream users understand tradeoffs and choose a method for a particular use case. Moreover, by situating existing saliency methods within this framework, we identify opportunities for future work, including filling gaps in the landscape and developing new evaluation metrics.  ( 2 min )
    UTTS: Unsupervised TTS with Conditional Disentangled Sequential Variational Auto-encoder. (arXiv:2206.02512v2 [eess.AS] UPDATED)
    In this paper, we propose a novel unsupervised text-to-speech (UTTS) framework which does not require text-audio pairs for the TTS acoustic modeling (AM). UTTS is a multi-speaker speech synthesizer developed from the perspective of disentangled speech representation learning. The framework offers a flexible choice of a speaker's duration model, timbre feature (identity) and content for TTS inference. We leverage recent advancements in self-supervised speech representation learning as well as speech synthesis front-end techniques for the system development. Specifically, we utilize a lexicon to map input text to the phoneme sequence, which is expanded to the frame-level forced alignment (FA) with a speaker-dependent duration model. Then, we develop an alignment mapping module that converts the FA to the unsupervised alignment (UA). Finally, a Conditional Disentangled Sequential Variational Auto-encoder (C-DSVAE), serving as the self-supervised TTS AM, takes the predicted UA and a target speaker embedding to generate the mel spectrogram, which is ultimately converted to waveform with a neural vocoder. We show how our method enables speech synthesis without using a paired TTS corpus. Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility measured by human and objective evaluations.  ( 2 min )
    On the balance between the training time and interpretability of neural ODE for time series modelling. (arXiv:2206.03304v1 [cs.LG])
    Most machine learning methods are used as a black box for modelling. We may try to extract some knowledge from physics-based training methods, such as neural ODE (ordinary differential equation). Neural ODE has advantages such as a possibly larger class of represented functions, extended interpretability compared to black-box machine learning models, and the ability to describe both trend and local behaviour. Such advantages are especially critical for time series with complicated trends. However, the known drawback is the high training time compared to the autoregressive models and long short-term memory (LSTM) networks widely used for data-driven time series modelling. Therefore, we should be able to balance interpretability and training time to apply neural ODE in practice. The paper shows that modern neural ODE cannot be reduced to simpler models for time-series modelling applications. The complexity of neural ODE is comparable to or exceeds that of the conventional time-series modelling tools. The only interpretation that could be extracted is the eigenspace of the operator, which is an ill-posed problem for a large system. Spectra could be extracted using different classical analysis methods that do not have the drawback of extended time. Consequently, we reduce the neural ODE to a simpler linear form and propose a new view on time-series modelling using combined neural networks and an ODE system approach.
    Hierarchical Graph-Convolutional Variational AutoEncoding for Generative Modelling of Human Motion. (arXiv:2111.12602v4 [cs.CV] UPDATED)
    Models of human motion commonly focus either on trajectory prediction or action classification but rarely both. The marked heterogeneity and intricate compositionality of human motion render each task vulnerable to the data degradation and distributional shift common to real-world scenarios. A sufficiently expressive generative model of action could in theory enable data conditioning and distributional resilience within a unified framework applicable to both tasks. Here we propose a novel architecture based on hierarchical variational autoencoders and deep graph convolutional neural networks for generating a holistic model of action over multiple time-scales. We show this Hierarchical Graph-convolutional Variational Autoencoder (HG-VAE) to be capable of generating coherent actions, detecting out-of-distribution data, and imputing missing data by gradient ascent on the model's posterior. Trained and evaluated on H3.6M and the largest collection of open source human motion data, AMASS, we show HG-VAE can facilitate downstream discriminative learning better than baseline models.  ( 2 min )
    Time-series image denoising of pressure-sensitive paint data by projected multivariate singular spectrum analysis. (arXiv:2203.07574v2 [eess.IV] UPDATED)
    Time-series data, such as unsteady pressure-sensitive paint (PSP) measurement data, may contain a significant amount of random noise. Thus, in this study, we investigated a noise-reduction method that combines multivariate singular spectrum analysis (MSSA) with low-dimensional data representation. MSSA is a state-space reconstruction technique that utilizes time-delay embedding, and the low-dimensional representation is achieved by projecting data onto the singular value decomposition (SVD) basis. The noise-reduction performance of the proposed method for unsteady PSP data, i.e., the projected MSSA, is compared with that of the truncated SVD method, one of the most employed noise-reduction methods. The result shows that the projected MSSA exhibits better performance in reducing random noise than the truncated SVD method. Additionally, in contrast to that of the truncated SVD method, the performance of the projected MSSA is less sensitive to the truncation rank. Furthermore, the projected MSSA achieves denoising effectively by extracting smooth trajectories in a state space from noisy input data. Expectedly, the projected MSSA will be effective for reducing random noise in not only PSP measurement data, but also various high-dimensional time-series data.  ( 2 min )
    Combining physics-based and data-driven techniques for reliable hybrid analysis and modeling using the corrective source term approach. (arXiv:2206.03451v1 [cs.LG])
    Upcoming technologies like digital twins and autonomous and artificially intelligent systems involving safety-critical applications require models which are accurate, interpretable, computationally efficient, and generalizable. Unfortunately, the two most commonly used modeling approaches, physics-based modeling (PBM) and data-driven modeling (DDM), fail to satisfy all these requirements. In the current work, we demonstrate how a hybrid approach combining the best of PBM and DDM can result in models which can outperform them both. We do so by combining partial differential equations based on first principles describing partially known physics with a black box DDM, in this case, a deep neural network model compensating for the unknown physics. First, we present a mathematical argument for why this approach should work and then apply the hybrid approach to model a two-dimensional heat diffusion problem with an unknown source term. The result demonstrates the method's superior performance in terms of accuracy and generalizability. Additionally, it is shown how the DDM part can be interpreted within the hybrid framework to make the overall approach reliable.  ( 2 min )
    Robust Adversarial Attacks Detection based on Explainable Deep Reinforcement Learning For UAV Guidance and Planning. (arXiv:2206.02670v2 [cs.LG] UPDATED)
    The danger of adversarial attacks to unprotected Uncrewed Aerial Vehicle (UAV) agents operating in public is growing. Adopting AI-based techniques, and more specifically Deep Learning (DL) approaches, to control and guide these UAVs can be beneficial in terms of performance but adds concerns regarding the safety of those techniques and their vulnerability to adversarial attacks, which raise the chances of collisions as the agent becomes confused. This paper proposes an innovative approach based on the explainability of DL methods to build an efficient detector that will protect these DL schemes, and thus the UAVs adopting them, from potential attacks. The agent adopts a Deep Reinforcement Learning (DRL) scheme for guidance and planning. It is formed and trained with a Deep Deterministic Policy Gradient (DDPG) with Prioritised Experience Replay (PER) DRL scheme that utilises Artificial Potential Field (APF) to improve training times and obstacle avoidance performance. The adversarial attacks are generated by the Fast Gradient Sign Method (FGSM) and Basic Iterative Method (BIM) algorithms and reduced obstacle course completion rates from 80% to 35%. A realistic synthetic environment for UAV explainable DRL-based planning and guidance, including obstacles and adversarial attacks, is built. Two adversarial attack detectors are proposed. The first one adopts a Convolutional Neural Network (CNN) architecture and achieves a detection accuracy of 80%. The second detector is developed based on a Long Short-Term Memory (LSTM) network and achieves an accuracy of 91% with much faster computing times when compared to the CNN-based detector.  ( 2 min )
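    The two attack generators named in this abstract can be illustrated on a toy model. The sketch below is not the paper's DRL setup: it assumes a hypothetical two-feature logistic classifier (the weights `w`, bias `b`, input `x`, and step sizes are made up for illustration) and shows only the generic FGSM update, x_adv = x + eps * sign(grad_x loss), and BIM as repeated small FGSM steps clipped to an eps-ball around the original input.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    # Probability of class 1 under a logistic model (illustrative stand-in for a DL policy).
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def bce(w, b, x, y):
    # Binary cross-entropy loss for a single example.
    p = predict(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def loss_grad_wrt_input(w, b, x, y):
    # For logistic regression, d(BCE)/dx = (p - y) * w.
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(t):
    return (t > 0) - (t < 0)

def fgsm(w, b, x, y, eps):
    # Single step of size eps along the sign of the input gradient.
    g = loss_grad_wrt_input(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

def bim(w, b, x, y, eps, alpha, steps):
    # Repeated FGSM steps of size alpha, clipped to stay within eps of the original x.
    x_adv = list(x)
    for _ in range(steps):
        stepped = fgsm(w, b, x_adv, y, alpha)
        x_adv = [min(max(s, xi - eps), xi + eps) for s, xi in zip(stepped, x)]
    return x_adv

# Hypothetical model and input, chosen only for illustration.
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1

x_fgsm = fgsm(w, b, x, y, eps=0.1)
x_bim = bim(w, b, x, y, eps=0.1, alpha=0.05, steps=4)
```

    Both perturbed inputs stay within `eps` of `x` in the max norm while increasing the loss, which is the property an attack detector would have to pick up on.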
    An Embedding of ReLU Networks and an Analysis of their Identifiability. (arXiv:2107.09370v5 [cs.LG] UPDATED)
    Neural networks with the Rectified Linear Unit (ReLU) nonlinearity are described by a vector of parameters $\theta$, and realized as a piecewise linear continuous function $R_{\theta}: x \in \mathbb R^{d} \mapsto R_{\theta}(x) \in \mathbb R^{k}$. Natural scalings and permutations operations on the parameters $\theta$ leave the realization unchanged, leading to equivalence classes of parameters that yield the same realization. These considerations in turn lead to the notion of identifiability -- the ability to recover (the equivalence class of) $\theta$ from the sole knowledge of its realization $R_{\theta}$. The overall objective of this paper is to introduce an embedding for ReLU neural networks of any depth, $\Phi(\theta)$, that is invariant to scalings and that provides a locally linear parameterization of the realization of the network. Leveraging these two key properties, we derive some conditions under which a deep ReLU network is indeed locally identifiable from the knowledge of the realization on a finite set of samples $x_{i} \in \mathbb R^{d}$. We study the shallow case in more depth, establishing necessary and sufficient conditions for the network to be identifiable from a bounded subset $\mathcal X \subseteq \mathbb R^{d}$.  ( 2 min )
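    The scaling invariance mentioned in this abstract can be checked numerically. The sketch below assumes a small two-layer network with made-up weights (all names and values are illustrative, not from the paper): multiplying the incoming weights and bias of one hidden neuron by a factor lam > 0 and dividing its outgoing weights by lam changes the parameter vector but not the realization, because relu(lam * z) = lam * relu(z) for lam > 0.

```python
def relu(v):
    return [max(0.0, t) for t in v]

def matvec(W, x):
    return [sum(w * t for w, t in zip(row, x)) for row in W]

def realize(W1, b1, W2, b2, x):
    # Two-layer ReLU network: R_theta(x) = W2 relu(W1 x + b1) + b2.
    h = relu([a + c for a, c in zip(matvec(W1, x), b1)])
    return [a + c for a, c in zip(matvec(W2, h), b2)]

def rescale_neuron(W1, b1, W2, j, lam):
    # Scale hidden neuron j: incoming weights and bias by lam, outgoing weights by 1/lam.
    W1s = [[w * lam for w in row] if i == j else list(row) for i, row in enumerate(W1)]
    b1s = [c * lam if i == j else c for i, c in enumerate(b1)]
    W2s = [[w / lam if k == j else w for k, w in enumerate(row)] for row in W2]
    return W1s, b1s, W2s

# Illustrative parameters: 2 inputs, 3 hidden neurons, 1 output.
W1 = [[1.0, -2.0], [0.5, 3.0], [-1.5, 0.25]]
b1 = [0.1, -0.2, 0.3]
W2 = [[2.0, -1.0, 0.5]]
b2 = [0.05]
x = [0.7, -0.4]

W1s, b1s, W2s = rescale_neuron(W1, b1, W2, j=1, lam=4.0)
y = realize(W1, b1, W2, b2, x)
y_scaled = realize(W1s, b1s, W2s, b2, x)
```

    Here `y` and `y_scaled` agree up to floating-point error even though `W1s` differs from `W1`, so the two parameter vectors fall in the same equivalence class.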
    Explaining the physics of transfer learning a data-driven subgrid-scale closure to a different turbulent flow. (arXiv:2206.03198v1 [physics.flu-dyn])
    Transfer learning (TL) is becoming a powerful tool in scientific applications of neural networks (NNs), such as weather/climate prediction and turbulence modeling. TL enables out-of-distribution generalization (e.g., extrapolation in parameters) and effective blending of disparate training sets (e.g., simulations and observations). In TL, selected layers of a NN, already trained for a base system, are re-trained using a small dataset from a target system. For effective TL, we need to know 1) what are the best layers to re-train? and 2) what physics are learned during TL? Here, we present novel analyses and a new framework to address (1)-(2) for a broad range of multi-scale, nonlinear systems. Our approach combines spectral analyses of the systems' data with spectral analyses of convolutional NN's activations and kernels, explaining the inner-workings of TL in terms of the system's nonlinear physics. Using subgrid-scale modeling of several setups of 2D turbulence as test cases, we show that the learned kernels are combinations of low-, band-, and high-pass filters, and that TL learns new filters whose nature is consistent with the spectral differences of base and target systems. We also find the shallowest layers are the best to re-train in these cases, which is against the common wisdom guiding TL in machine learning literature. Our framework identifies the best layer(s) to re-train beforehand, based on physics and NN theory. Together, these analyses explain the physics learned in TL and provide a framework to guide TL for wide-ranging applications in science and engineering, such as climate change modeling.  ( 2 min )
    Neural Network Decoders for Permutation Codes Correcting Different Errors. (arXiv:2206.03315v1 [cs.IT])
    Permutation codes were extensively studied in order to correct different types of errors for the applications on power line communication and rank modulation for flash memory. In this paper, we introduce the neural network decoders for permutation codes to correct these errors with one-shot decoding, which treat the decoding as $n$ classification tasks for non-binary symbols for a code of length $n$. These are actually the first general decoders introduced to deal with any error type for these two applications. The performance of the decoders is evaluated by simulations with different error models.  ( 2 min )
    Utility of Equivariant Message Passing in Cortical Mesh Segmentation. (arXiv:2206.03164v1 [cs.CV])
    The automated segmentation of cortical areas has been a long-standing challenge in medical image analysis. The complex geometry of the cortex is commonly represented as a polygon mesh, whose segmentation can be addressed by graph-based learning methods. When cortical meshes are misaligned across subjects, current methods produce significantly worse segmentation results, limiting their ability to handle multi-domain data. In this paper, we investigate the utility of E(n)-equivariant graph neural networks (EGNNs), comparing their performance against plain graph neural networks (GNNs). Our evaluation shows that GNNs outperform EGNNs on aligned meshes, due to their ability to leverage the presence of a global coordinate system. On misaligned meshes, the performance of plain GNNs drops considerably, while E(n)-equivariant message passing maintains the same segmentation results. The best results can also be obtained by using plain GNNs on realigned data (co-registered meshes in a global coordinate system).
    Unstructured Handwashing Recognition using Smartwatch to Reduce Contact Transmission of Pathogens. (arXiv:2107.13405v4 [cs.LG] UPDATED)
    Current guidelines from the World Health Organization indicate that the SARS-CoV-2 coronavirus, which results in the novel coronavirus disease (COVID-19), is transmitted through respiratory droplets or by contact. Contact transmission occurs when contaminated hands touch the mucous membrane of the mouth, nose, or eyes, so hand hygiene is extremely important to prevent the spread of SARS-CoV-2 as well as of other pathogens. The vast proliferation of wearable devices, such as smartwatches, containing acceleration, rotation, magnetic field sensors, etc., together with the modern technologies of artificial intelligence, such as machine learning and more recently deep learning, allows the development of accurate applications for recognition and classification of human activities such as walking, climbing stairs, running, clapping, sitting, sleeping, etc. In this work, we evaluate the feasibility of a machine learning based system which, starting from inertial signals collected from wearable devices such as current smartwatches, recognizes when a subject is washing or rubbing their hands. Preliminary results, obtained over two different datasets, show a classification accuracy of about 95% and of about 94% for deep and standard learning techniques, respectively.
    ByteComp: Revisiting Gradient Compression in Distributed Training. (arXiv:2205.14465v2 [cs.LG] UPDATED)
    Gradient compression (GC) is a promising approach to addressing the communication bottleneck in distributed deep learning (DDL). However, it is challenging to find the optimal compression strategy for applying GC to DDL because of the intricate interactions among tensors. To fully unleash the benefits of GC, two questions must be addressed: 1) How to express all compression strategies and the corresponding interactions among tensors of any DDL training job? 2) How to quickly select a near-optimal compression strategy? In this paper, we propose ByteComp to answer these questions. It first designs a decision tree abstraction to express all the compression strategies and develops empirical models to timeline tensor computation, communication, and compression to enable ByteComp to derive the intricate interactions among tensors. It then designs a compression decision algorithm that analyzes tensor interactions to eliminate and prioritize strategies and optimally offloads compression to CPUs. Experimental evaluations show that ByteComp can improve the training throughput over the state-of-the-art compression-enabled system by up to 77% for representative DDL training jobs. Moreover, the computational time needed to select the compression strategy is measured in milliseconds, and the selected strategy is only a few percent from optimal.
    Neuro-Symbolic Causal Language Planning with Commonsense Prompting. (arXiv:2206.02928v1 [cs.CL])
    Language planning aims to implement complex high-level goals by decomposition into sequential simpler low-level steps. Such procedural reasoning ability is essential for applications such as household robots and virtual assistants. Although language planning is a basic skill set for humans in daily life, it remains a challenge for large language models (LLMs) that lack deep-level commonsense knowledge in the real world. Previous methods require either manual exemplars or annotated programs to acquire such ability from LLMs. In contrast, this paper proposes Neuro-Symbolic Causal Language Planner (CLAP) that elicits procedural knowledge from the LLMs with commonsense-infused prompting. Pre-trained knowledge in LLMs is essentially an unobserved confounder that causes spurious correlations between tasks and action plans. Through the lens of a Structural Causal Model (SCM), we propose an effective strategy in CLAP to construct prompts as a causal intervention toward our SCM. Using graph sampling techniques and symbolic program executors, our strategy formalizes the structured causal prompts from commonsense knowledge bases. CLAP obtains state-of-the-art performance on WikiHow and RobotHow, achieving a relative improvement of 5.28% in human evaluations under the counterfactual setting. This indicates the superiority of CLAP in causal language planning semantically and sequentially.
    Building Robust Ensembles via Margin Boosting. (arXiv:2206.03362v1 [cs.LG])
    In the context of adversarial robustness, a single model does not usually have enough power to defend against all possible adversarial attacks, and as a result, has sub-optimal robustness. Consequently, an emerging line of work has focused on learning an ensemble of neural networks to defend against adversarial attacks. In this work, we take a principled approach towards building robust ensembles. We view this problem from the perspective of margin-boosting and develop an algorithm for learning an ensemble with maximum margin. Through extensive empirical evaluation on benchmark datasets, we show that our algorithm not only outperforms existing ensembling techniques, but also large models trained in an end-to-end fashion. An important byproduct of our work is a margin-maximizing cross-entropy (MCE) loss, which is a better alternative to the standard cross-entropy (CE) loss. Empirically, we show that replacing the CE loss in state-of-the-art adversarial training techniques with our MCE loss leads to significant performance improvement.
    Learning Backward Compatible Embeddings. (arXiv:2206.03040v1 [stat.ML])
    Embeddings, low-dimensional vector representation of objects, are fundamental in building modern machine learning systems. In industrial settings, there is usually an embedding team that trains an embedding model to solve intended tasks (e.g., product recommendation). The produced embeddings are then widely consumed by consumer teams to solve their unintended tasks (e.g., fraud detection). However, as the embedding model gets updated and retrained to improve performance on the intended task, the newly-generated embeddings are no longer compatible with the existing consumer models. This means that historical versions of the embeddings can never be retired or all consumer teams have to retrain their models to make them compatible with the latest version of the embeddings, both of which are extremely costly in practice. Here we study the problem of embedding version updates and their backward compatibility. We formalize the problem where the goal is for the embedding team to keep updating the embedding version, while the consumer teams do not have to retrain their models. We develop a solution based on learning backward compatible embeddings, which allows the embedding model version to be updated frequently, while also allowing the latest version of the embedding to be quickly transformed into any backward compatible historical version of it, so that consumer teams do not have to retrain their models. Under our framework, we explore six methods and systematically evaluate them on a real-world recommender system application. We show that the best method, which we call BC-Aligner, maintains backward compatibility with existing unintended tasks even after multiple model version updates. Simultaneously, BC-Aligner achieves the intended task performance similar to the embedding model that is solely optimized for the intended task.
    Deconstructing Distributions: A Pointwise Framework of Learning. (arXiv:2202.09931v2 [cs.LG] UPDATED)
    In machine learning, we traditionally evaluate the performance of a single model, averaged over a collection of test inputs. In this work, we propose a new approach: we measure the performance of a collection of models when evaluated on a $\textit{single input point}$. Specifically, we study a point's $\textit{profile}$: the relationship between models' average performance on the test distribution and their pointwise performance on this individual point. We find that profiles can yield new insights into the structure of both models and data -- in- and out-of-distribution. For example, we empirically show that real data distributions consist of points with qualitatively different profiles. On one hand, there are "compatible" points with strong correlation between the pointwise and average performance. On the other hand, there are points with weak and even $\textit{negative}$ correlation: cases where improving overall model accuracy actually $\textit{hurts}$ performance on these inputs. We prove that these experimental observations are inconsistent with the predictions of several simplified models of learning proposed in prior work. As an application, we use profiles to construct a dataset we call CIFAR-10-NEG: a subset of CINIC-10 such that for standard models, accuracy on CIFAR-10-NEG is $\textit{negatively correlated}$ with accuracy on CIFAR-10 test. This illustrates, for the first time, an OOD dataset that completely inverts "accuracy-on-the-line" (Miller, Taori, Raghunathan, Sagawa, Koh, Shankar, Liang, Carmon, and Schmidt 2021).
    Computational Doob's $h$-transforms for Online Filtering of Discretely Observed Diffusions. (arXiv:2206.03369v1 [stat.ML])
    This paper is concerned with online filtering of discretely observed nonlinear diffusion processes. Our approach is based on the fully adapted auxiliary particle filter, which involves Doob's $h$-transforms that are typically intractable. We propose a computational framework to approximate these $h$-transforms by solving the underlying backward Kolmogorov equations using nonlinear Feynman-Kac formulas and neural networks. The methodology allows one to train a locally optimal particle filter prior to the data-assimilation procedure. Numerical experiments illustrate that the proposed approach can be orders of magnitude more efficient than the bootstrap particle filter in the regime of highly informative observations, when the observations are extreme under the model, and if the state dimension is large.
    Inferring Unfairness and Error from Population Statistics in Binary and Multiclass Classification. (arXiv:2206.03234v1 [cs.LG])
    We propose methods for making inferences on the fairness and accuracy of a given classifier, using only aggregate population statistics. This is necessary when it is impossible to obtain individual classification data, for instance when there is no access to the classifier or to a representative individual-level validation set. We study fairness with respect to the equalized odds criterion, which we generalize to multiclass classification. We propose a measure of unfairness with respect to this criterion, which quantifies the fraction of the population that is treated unfairly. We then show how inferences on the unfairness and error of a given classifier can be obtained using only aggregate label statistics such as the rate of prediction of each label in each sub-population, as well as the true rate of each label. We derive inference procedures for binary classifiers and for multiclass classifiers, for the case where confusion matrices in each sub-population are known, and for the significantly more challenging case where they are unknown. We report experiments on data sets representing diverse applications, which demonstrate the effectiveness and the wide range of possible uses of the proposed methodology.
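    For the simpler case the abstract describes, where the confusion matrix of each sub-population is known, the equalized odds criterion can be checked directly. The sketch below is a generic gap computation under assumed inputs (the group names and counts are hypothetical), not the unfairness measure or the inference procedures proposed in the paper: equalized odds holds for a binary classifier when the true-positive and false-positive rates are equal across groups, so the largest across-group difference in either rate indicates how far the classifier is from satisfying it.

```python
def rates(tp, fn, fp, tn):
    # True-positive rate and false-positive rate from one group's confusion counts.
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr

def equalized_odds_gap(confusions):
    # confusions: dict mapping group name -> (tp, fn, fp, tn).
    # Returns the largest across-group difference in TPR or FPR (0 means equalized odds holds).
    tprs, fprs = zip(*(rates(*c) for c in confusions.values()))
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))

# Hypothetical confusion counts for two sub-populations.
confusions = {
    "group_a": (40, 10, 5, 45),  # TPR 0.8, FPR 0.1
    "group_b": (30, 20, 5, 45),  # TPR 0.6, FPR 0.1
}
gap = equalized_odds_gap(confusions)
```

    In this example the false-positive rates match but the true-positive rates differ by about 0.2, so the classifier violates equalized odds through its treatment of positive examples.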
    On the Convergence of Clustered Federated Learning. (arXiv:2202.06187v2 [cs.LG] UPDATED)
    Knowledge sharing and model personalization are essential components to tackle the non-IID challenge in federated learning (FL). Most existing FL methods focus on two extremes: 1) to learn a shared model to serve all clients with non-IID data, and 2) to learn personalized models for each client, namely personalized FL. There is a trade-off solution, namely clustered FL or cluster-wise personalized FL, which aims to cluster similar clients into one cluster, and then learn a shared model for all clients within a cluster. This paper revisits clustered FL by formulating it as a bi-level optimization framework that could unify existing methods. We propose a new theoretical analysis framework to prove the convergence by considering the clusterability among clients. In addition, we embody this framework in an algorithm, named Weighted Clustered Federated Learning (WeCFL). Empirical analysis verifies the theoretical results and demonstrates the effectiveness of the proposed WeCFL under the proposed cluster-wise non-IID settings.
    Assessing Project-Level Fine-Tuning of ML4SE Models. (arXiv:2206.03333v1 [cs.SE])
    Machine Learning for Software Engineering (ML4SE) is an actively growing research area that focuses on methods that help programmers in their work. In order to apply the developed methods in practice, they need to achieve reasonable quality in order to help rather than distract developers. While the development of new approaches to code representation and data collection improves the overall quality of the models, it does not take into account the information that we can get from the project at hand. In this work, we investigate how the model's quality can be improved if we target a specific project. We develop a framework to assess quality improvements that models can get after fine-tuning for the method name prediction task on a particular project. We evaluate three models of different complexity and compare their quality in three settings: trained on a large dataset of Java projects, further fine-tuned on the data from a particular project, and trained from scratch on this data. We show that per-project fine-tuning can greatly improve the models' quality as they capture the project's domain and naming conventions. We open-source the tool we used for data collection, as well as the code to run the experiments: https://zenodo.org/record/6040745.
    Lottery Tickets with Nonzero Biases. (arXiv:2110.11150v2 [cs.LG] UPDATED)
    The strong lottery ticket hypothesis holds the promise that pruning randomly initialized deep neural networks could offer a computationally efficient alternative to deep learning with stochastic gradient descent. Common parameter initialization schemes and existence proofs, however, are focused on networks with zero biases, thus foregoing the potential universal approximation property of pruning. To fill this gap, we extend multiple initialization schemes and existence proofs to nonzero biases, including explicit 'looks-linear' approaches for ReLU activation functions. These do not only enable truly orthogonal parameter initialization but also reduce potential pruning errors. In experiments on standard benchmark data, we further highlight the practical benefits of nonzero bias initialization schemes, and present theoretically inspired extensions for state-of-the-art strong lottery ticket pruning.
    GAAF: Searching Activation Functions for Binary Neural Networks through Genetic Algorithm. (arXiv:2206.03291v1 [cs.NE])
    Binary neural networks (BNNs) show promising utilization in cost- and power-restricted domains such as edge devices and mobile systems. This is due to their significantly lower computation and storage demands, which come at the cost of degraded performance. To close the accuracy gap, in this paper we propose to add a complementary activation function (AF) ahead of the sign-based binarization, and rely on the genetic algorithm (GA) to automatically search for the ideal AFs. These AFs can help extract extra information from the input data in the forward pass, while allowing improved gradient approximation in the backward pass. Fifteen novel AFs are identified through our GA-based search, and most of them show improved performance (up to 2.54% on ImageNet) when tested on different datasets and network models. Our method offers a novel approach for designing general and application-specific BNN architectures. Our code is available at this http URL
    Adversarial Reprogramming Revisited. (arXiv:2206.03466v1 [cs.LG])
    Adversarial reprogramming, introduced by Elsayed, Goodfellow, and Sohl-Dickstein, seeks to repurpose a neural network to perform a different task, by manipulating its input without modifying its weights. We prove that two-layer ReLU neural networks with random weights can be adversarially reprogrammed to achieve arbitrarily high accuracy on Bernoulli data models over hypercube vertices, provided the network width is no greater than its input dimension. We also substantially strengthen a recent result of Phuong and Lampert on directional convergence of gradient flow, and obtain as a corollary that training two-layer ReLU neural networks on orthogonally separable datasets can cause their adversarial reprogramming to fail. We support these theoretical results by experiments that demonstrate that, as long as batch normalisation layers are suitably initialised, even untrained networks with random weights are susceptible to adversarial reprogramming. This is in contrast to observations in several recent works that suggested that adversarial reprogramming is not possible for untrained networks to any degree of reliability.
    Neural Lagrangian Schrödinger Bridge. (arXiv:2204.04853v3 [cs.LG] UPDATED)
    Population dynamics is the study of temporal and spatial variation in the size of populations of organisms and is a major part of population ecology. One of the main difficulties in analyzing population dynamics is that we can only obtain observation data with coarse time intervals from fixed-point observations due to experimental costs or measurement constraints. Recently, modeling population dynamics by using continuous normalizing flows (CNFs) and dynamic optimal transport has been proposed to infer the sample trajectories from a fixed-point observed population. While the sample behavior in CNFs is deterministic, the actual sample in biological systems moves in an essentially random yet directional manner. Moreover, when a sample moves from point A to point B in dynamical systems, its trajectory typically follows the principle of least action in which the corresponding action has the smallest possible value. To satisfy these requirements of the sample trajectories, we formulate the Lagrangian Schrödinger bridge (LSB) problem and propose to solve it approximately using neural SDE with regularization. We also develop a model architecture that enables faster computation. Experimental results show that the proposed method can efficiently approximate the population-level dynamics even for high-dimensional data and that using the prior knowledge introduced by the Lagrangian enables us to estimate the trajectories of individual samples with stochastic behavior.
    Improving the Diagnosis of Psychiatric Disorders with Self-Supervised Graph State Space Models. (arXiv:2206.03331v1 [cs.LG])
    Single subject prediction of brain disorders from neuroimaging data has gained increasing attention in recent years. Yet, for some heterogeneous disorders such as major depressive disorder (MDD) and autism spectrum disorder (ASD), the performance of prediction models on large-scale multi-site datasets remains poor. We present a two-stage framework to improve the diagnosis of heterogeneous psychiatric disorders from resting-state functional magnetic resonance imaging (rs-fMRI). First, we propose a self-supervised mask prediction task on data from healthy individuals that can exploit differences between healthy controls and patients in clinical datasets. Next, we train a supervised classifier on the learned discriminative representations. To model rs-fMRI data, we develop Graph-S4, an extension of the recently proposed state-space model S4 to graph settings where the underlying graph structure is not known in advance. We show that combining the framework and Graph-S4 can significantly improve the diagnostic performance of neuroimaging-based single subject prediction models of MDD and ASD on three open-source multi-center rs-fMRI clinical datasets.
    Learning in Observable POMDPs, without Computationally Intractable Oracles. (arXiv:2206.03446v1 [cs.LG])
    Much of reinforcement learning theory is built on top of oracles that are computationally hard to implement. Specifically for learning near-optimal policies in Partially Observable Markov Decision Processes (POMDPs), existing algorithms either need to make strong assumptions about the model dynamics (e.g. deterministic transitions) or assume access to an oracle for solving a hard optimistic planning or estimation problem as a subroutine. In this work we develop the first oracle-free learning algorithm for POMDPs under reasonable assumptions. Specifically, we give a quasipolynomial-time end-to-end algorithm for learning in "observable" POMDPs, where observability is the assumption that well-separated distributions over states induce well-separated distributions over observations. Our techniques circumvent the more traditional approach of using the principle of optimism under uncertainty to promote exploration, and instead give a novel application of barycentric spanners to constructing policy covers.
    On Efficient Approximate Queries over Machine Learning Models. (arXiv:2206.02845v1 [cs.DB])
    The question of answering queries over ML predictions has been gaining attention in the database community. This question is challenging because the cost of finding high quality answers corresponds to invoking an oracle such as a human expert or an expensive deep neural network model on every single item in the DB and then applying the query. We develop a novel unified framework for approximate query answering by leveraging a proxy to minimize the oracle usage of finding high quality answers for both Precision-Target (PT) and Recall-Target (RT) queries. Our framework uses a judicious combination of invoking the expensive oracle on data samples and applying the cheap proxy on the objects in the DB. It relies on two assumptions. Under the Proxy Quality assumption, proxy quality can be quantified in a probabilistic manner w.r.t. the oracle. This allows us to develop two algorithms: PQA that efficiently finds high quality answers with high probability and no oracle calls, and PQE, a heuristic extension that achieves empirically good performance with a small number of oracle calls. Alternatively, under the Core Set Closure assumption, we develop two algorithms: CSC that efficiently returns high quality answers with high probability and minimal oracle usage, and CSE, which extends it to more general settings. Our extensive experiments on five real-world datasets on both query types, PT and RT, demonstrate that our algorithms outperform the state-of-the-art and achieve high result quality with provable statistical guarantees.
    Patch-based image Super Resolution using generalized Gaussian mixture model. (arXiv:2206.03069v1 [eess.IV])
    Single Image Super Resolution (SISR) methods aim to recover the clean images in high resolution from low resolution observations. A family of patch-based approaches have received considerable attention and development. The minimum mean square error (MMSE) method is a powerful image restoration method that uses a probability model on the patches of images. This paper proposes an algorithm to learn a joint generalized Gaussian mixture model (GGMM) from a pair of the low resolution patches and the corresponding high resolution patches from the reference data. We then reconstruct the high resolution image based on the MMSE method. Our numerical evaluations indicate that the MMSE-GGMM method competes with other state of the art methods.
    Improving Fairness in Graph Neural Networks via Mitigating Sensitive Attribute Leakage. (arXiv:2206.03426v1 [cs.LG])
    Graph Neural Networks (GNNs) have shown great power in learning node representations on graphs. However, they may inherit historical prejudices from training data, leading to discriminatory bias in predictions. Although some work has developed fair GNNs, most of them directly borrow fair representation learning techniques from non-graph domains without considering the potential problem of sensitive attribute leakage caused by feature propagation in GNNs. However, we empirically observe that feature propagation could vary the correlation of previously innocuous non-sensitive features to the sensitive ones. This can be viewed as a leakage of sensitive information which could further exacerbate discrimination in predictions. Thus, we design two feature masking strategies according to feature correlations to highlight the importance of considering feature propagation and correlation variation in alleviating discrimination. Motivated by our analysis, we propose Fair View Graph Neural Network (FairVGNN) to generate fair views of features by automatically identifying and masking sensitive-correlated features considering correlation variation after feature propagation. Given the learned fair views, we adaptively clamp weights of the encoder to avoid using sensitive-related features. Experiments on real-world datasets demonstrate that FairVGNN enjoys a better trade-off between model utility and fairness. Our code is publicly available at https://github.com/YuWVandy/FairVGNN.
    Yet Another Representation of Binary Decision Trees: A Mathematical Demonstration. (arXiv:2101.07077v5 [cs.LG] UPDATED)
    A decision tree looks like a simple computational graph without cycles, where only the leaf nodes specify the output values and the non-terminals specify their tests or split conditions. From the numerical perspective, we express decision trees in the language of computational graphs. We explicitly parameterize the test phase, traversal phase and prediction phase of decision trees based on the bitvectors of non-terminal nodes. As shown later, the decision tree is a shallow binary network in some sense. In particular, we introduce the bitvector matrix to implement the tree traversal numerically, where the core is to convert the logical `AND' operation to arithmetic operations. We apply this numerical representation to extend and unify diverse decision trees in concept.
    Efficient entity-based reinforcement learning. (arXiv:2206.02855v1 [cs.LG])
    Recent deep reinforcement learning (DRL) successes rely on end-to-end learning from fixed-size observational inputs (e.g. image, state-variables). However, many challenging and interesting problems in decision making involve observations or intermediary representations which are best described as a set of entities: either the image-based approach would miss small but important details in the observations (e.g. objects on a radar, vehicles on satellite images, etc.), the number of sensed objects is not fixed (e.g. robotic manipulation), or the problem simply cannot be represented in a meaningful way as an image (e.g. power grid control, or logistics). This type of structured representation is not directly compatible with current DRL architectures; however, there has been an increase in machine learning techniques directly targeting structured information, potentially addressing this issue. We propose to combine recent advances in set representations with slot attention and graph neural networks to process structured data, broadening the range of applications of DRL algorithms. This approach allows entity-based problems to be addressed in an efficient and scalable way. We show that it can improve training time and robustness significantly, and demonstrate its potential to handle structured as well as purely visual domains, on multiple environments from the Atari Learning Environment and Simple Playgrounds.
    Singapore Soundscape Site Selection Survey (S5): Identification of Characteristic Soundscapes of Singapore via Weighted k-means Clustering. (arXiv:2206.03112v1 [cs.LG])
    The ecological validity of soundscape studies usually rests on a choice of soundscapes that are representative of the perceptual space under investigation. For example, a soundscape pleasantness study might investigate locations with soundscapes ranging from "pleasant" to "annoying". The choice of soundscapes is typically researcher-led, but a participant-led process can reduce selection bias and improve result reliability. Hence, we propose a robust participant-led method to pinpoint characteristic soundscapes possessing arbitrary perceptual attributes. We validate our method by identifying Singaporean soundscapes spanning the perceptual quadrants generated from the "Pleasantness" and "Eventfulness" axes of the ISO 12913-2 circumplex model of soundscape perception, as perceived by local experts. From memory and experience, 67 participants first selected locations corresponding to each perceptual quadrant in each major planning region of Singapore. We then performed weighted k-means clustering on the selected locations, with weights for each location derived from previous frequencies and durations spent in each location by each participant. Weights hence acted as proxies for participant confidence. In total, 62 locations were thereby identified as suitable locations with characteristic soundscapes for further research utilizing the ISO 12913-2 perceptual quadrants. Audio-visual recordings and acoustic characterization of the soundscapes will be made in a future study.
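    The weighted k-means step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `weighted_kmeans`, the random initialization, and the use of squared Euclidean distance on plain coordinate tuples are all assumptions, and the abstract does not state how visit frequency and duration are combined into a single weight.

```python
import random


def weighted_kmeans(points, weights, k, iters=50, seed=0):
    """Lloyd-style k-means in which each point pulls its centroid
    in proportion to its weight (hypothetical helper for illustration)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random initial centroids
    labels = [0] * len(points)
    dim = len(points[0])
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: weighted mean of the points assigned to each cluster.
        new_centroids = []
        for j in range(k):
            members = [(p, w) for p, w, l in zip(points, weights, labels) if l == j]
            total = sum(w for _, w in members)
            if total == 0:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
                continue
            new_centroids.append(
                tuple(sum(w * p[d] for p, w in members) / total for d in range(dim))
            )
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, labels
```

    With unit weights this reduces to ordinary k-means; larger weights shift a centroid toward the locations that participants reported with more confidence.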
    Efficient Machine Learning, Compilers, and Optimizations for Embedded Systems. (arXiv:2206.03326v1 [cs.LG])
    Deep Neural Networks (DNNs) have achieved great success in a massive number of artificial intelligence (AI) applications by delivering high-quality computer vision, natural language processing, and virtual reality applications. However, these emerging AI applications also come with increasing computation and memory demands, which are challenging to handle especially for the embedded systems where limited computation/memory resources, tight power budgets, and small form factors are demanded. Challenges also come from the diverse application-specific requirements, including real-time responses, high-throughput performance, and reliable inference accuracy. To address these challenges, we will introduce a series of effective design methods in this book chapter to enable efficient algorithms, compilers, and various optimizations for embedded systems.
    Plant 'n' Seek: Can You Find the Winning Ticket?. (arXiv:2111.11153v2 [cs.LG] UPDATED)
    The lottery ticket hypothesis has sparked the rapid development of pruning algorithms that aim to reduce the computational costs associated with deep learning during training and model deployment. Currently, such algorithms are primarily evaluated on imaging data, for which we lack ground truth information and thus the understanding of how sparse lottery tickets could be. To fill this gap, we develop a framework that allows us to plant and hide winning tickets with desirable properties in randomly initialized neural networks. To analyze the ability of state-of-the-art pruning to identify tickets of extreme sparsity, we design and hide such tickets solving four challenging tasks. In extensive experiments, we observe similar trends as in imaging studies, indicating that our framework can provide transferable insights into realistic problems. Additionally, we can now see beyond such relative trends and highlight limitations of current pruning methods. Based on our results, we conclude that the current limitations in ticket sparsity are likely of algorithmic rather than fundamental nature. We anticipate that comparisons to planted tickets will facilitate future developments of efficient pruning algorithms.
    SubStrat: A Subset-Based Strategy for Faster AutoML. (arXiv:2206.03070v1 [cs.LG])
    Automated machine learning (AutoML) frameworks have become important tools in the data scientists' arsenal, as they dramatically reduce the manual work devoted to the construction of ML pipelines. Such frameworks intelligently search among millions of possible ML pipelines - typically containing feature engineering, model selection and hyperparameter tuning steps - and finally output an optimal pipeline in terms of predictive accuracy. However, when the dataset is large, each individual configuration takes longer to execute, so the overall AutoML running times become increasingly high. To this end, we present SubStrat, an AutoML optimization strategy that tackles the data size, rather than configuration space. It wraps existing AutoML tools, and instead of executing them directly on the entire dataset, SubStrat uses a genetic-based algorithm to find a small yet representative data subset which preserves a particular characteristic of the full data. It then employs the AutoML tool on the small subset, and finally, it refines the resulting pipeline by executing a restricted, much shorter, AutoML process on the large dataset. Our experimental results, performed on two popular AutoML frameworks, Auto-Sklearn and TPOT, show that SubStrat reduces their running times by 79% (on average), with less than 2% average loss in the accuracy of the resulting ML pipeline.
    Molecular Representation Learning via Heterogeneous Motif Graph Neural Networks. (arXiv:2202.00529v2 [cs.LG] UPDATED)
    We consider the feature representation learning problem for molecular graphs. Graph Neural Networks have been widely used in feature representation learning of molecular graphs. However, most existing methods deal with molecular graphs individually while neglecting their connections, such as motif-level relationships. We propose a novel molecular graph representation learning method by constructing a heterogeneous motif graph to address this issue. In particular, we build a heterogeneous motif graph that contains motif nodes and molecular nodes. Each motif node corresponds to a motif extracted from molecules. Then, we propose a Heterogeneous Motif Graph Neural Network (HM-GNN) to learn feature representations for each node in the heterogeneous motif graph. Our heterogeneous motif graph also enables effective multi-task learning, especially for small molecular datasets. To address the potential efficiency issue, we propose to use an edge sampler, which can significantly reduce computational resource usage. The experimental results show that our model consistently outperforms previous state-of-the-art models. Under multi-task settings, the promising performance of our methods on combined datasets sheds light on a new learning paradigm for small molecular datasets. Finally, we show that our model achieves similar performance with significantly fewer computational resources by using our edge sampler.
    Variable-rate hierarchical CPC leads to acoustic unit discovery in speech. (arXiv:2206.02211v2 [cs.SD] UPDATED)
    The success of deep learning comes from its ability to capture the hierarchical structure of data by learning high-level representations defined in terms of low-level ones. In this paper we explore self-supervised learning of hierarchical representations of speech by applying multiple levels of Contrastive Predictive Coding (CPC). We observe that simply stacking two CPC models does not yield significant improvements over single-level architectures. Inspired by the fact that speech is often described as a sequence of discrete units unevenly distributed in time, we propose a model in which the output of a low-level CPC module is non-uniformly downsampled to directly minimize the loss of a high-level CPC module. The latter is designed to also enforce a prior of separability and discreteness in its representations by enforcing dissimilarity of successive high-level representations through focused negative sampling, and by quantization of the prediction targets. Accounting for the structure of the speech signal improves upon single-level CPC features and enhances the disentanglement of the learned representations, as measured by downstream speech recognition tasks, while resulting in a meaningful segmentation of the signal that closely resembles phone boundaries.
    FairVFL: A Fair Vertical Federated Learning Framework with Contrastive Adversarial Learning. (arXiv:2206.03200v1 [cs.LG])
    Vertical federated learning (VFL) is a privacy-preserving machine learning paradigm that can learn models from features distributed on different platforms in a privacy-preserving way. Since in real-world applications the data may contain bias on fairness-sensitive features (e.g., gender), VFL models may inherit bias from training data and become unfair for some user groups. However, existing fair ML methods usually rely on the centralized storage of fairness-sensitive features to achieve model fairness, which are usually inapplicable in federated scenarios. In this paper, we propose a fair vertical federated learning framework (FairVFL), which can improve the fairness of VFL models. The core idea of FairVFL is to learn unified and fair representations of samples based on the decentralized feature fields in a privacy-preserving way. Specifically, each platform with fairness-insensitive features first learns local data representations from local features. Then, these local representations are uploaded to a server and aggregated into a unified representation for the target task. In order to learn fair unified representations, we send them to each platform storing fairness-sensitive features and apply adversarial learning to remove bias from the unified representations inherited from the biased data. Moreover, for protecting user privacy, we further propose a contrastive adversarial learning method to remove privacy information from the unified representations in server before sending them to the platforms keeping fairness-sensitive features. Experiments on two real-world datasets validate that our method can effectively improve model fairness with user privacy well-protected.
    TUNet: A Block-online Bandwidth Extension Model based on Transformers and Self-supervised Pretraining. (arXiv:2110.13492v5 [cs.LG] UPDATED)
    We introduce a block-online variant of the temporal feature-wise linear modulation (TFiLM) model to achieve bandwidth extension. The proposed architecture simplifies the UNet backbone of the TFiLM to reduce inference time and employs an efficient transformer at the bottleneck to alleviate performance degradation. We also utilize self-supervised pretraining and data augmentation to enhance the quality of bandwidth extended signals and reduce the sensitivity with respect to downsampling methods. Experiment results on the VCTK dataset show that the proposed method outperforms several recent baselines in both intrusive and non-intrusive metrics. Pretraining and filter augmentation also help stabilize and enhance the overall performance.
    Federated Spatial Reuse Optimization in Next-Generation Decentralized IEEE 802.11 WLANs. (arXiv:2203.10472v2 [cs.NI] UPDATED)
    As wireless standards evolve, more complex functionalities are introduced to address the increasing requirements in terms of throughput, latency, security, and efficiency. To unleash the potential of such new features, artificial intelligence (AI) and machine learning (ML) are currently being exploited for deriving models and protocols from data, rather than by hand-programming. In this paper, we explore the feasibility of applying ML in next-generation wireless local area networks (WLANs). More specifically, we focus on the IEEE 802.11ax spatial reuse (SR) problem and predict its performance through federated learning (FL) models. The set of FL solutions overviewed in this work is part of the 2021 International Telecommunication Union (ITU) AI for 5G Challenge.
    Reachability Constrained Reinforcement Learning. (arXiv:2205.07536v2 [cs.LG] UPDATED)
    Constrained reinforcement learning (CRL) has gained significant interest recently, since safety constraints satisfaction is critical for real-world problems. However, existing CRL methods constraining discounted cumulative costs generally lack rigorous definition and guarantee of safety. In contrast, in the safe control research, safety is defined as persistently satisfying certain state constraints. Such persistent safety is possible only on a subset of the state space, called the feasible set, where an optimal largest feasible set exists for a given environment. Recent studies incorporate feasible sets into CRL with energy-based methods such as the control barrier function (CBF) and safety index (SI), but these leverage prior conservative estimations of feasible sets, which harms the performance of the learned policy. To deal with this problem, this paper proposes the reachability CRL (RCRL) method by using reachability analysis to establish the novel self-consistency condition and characterize the feasible sets. The feasible sets are represented by the safety value function, which is used as the constraint in CRL. We use the multi-time scale stochastic approximation theory to prove that the proposed algorithm converges to a local optimum, where the largest feasible set can be guaranteed. Empirical results on different benchmarks validate the learned feasible set, the policy performance, and constraint satisfaction of RCRL, compared to CRL and safe control baselines.
    Harnessing spectral representations for subgraph alignment. (arXiv:2205.14938v2 [cs.LG] UPDATED)
    With the rise and advent of graph learning techniques, graph data has become ubiquitous. However, while several efforts are being devoted to the design of new convolutional architectures, pooling or positional encoding schemes, less effort is being spent on problems involving maps between (possibly very large) graphs, such as signal transfer, graph isomorphism and subgraph correspondence. With this paper, we anticipate the need for a convenient framework to deal with such problems, and focus in particular on the challenging subgraph alignment scenario. We claim that, first and foremost, the representation of a map plays a central role in how these problems should be modeled. Taking the hint from recent work in geometry processing, we propose the adoption of a spectral representation for maps that is compact, easy to compute, robust to topological changes, easy to plug into existing pipelines, and is especially effective for subgraph alignment problems. We report for the first time a surprising phenomenon where the partiality arising in the subgraph alignment task is manifested as a special structure of the map coefficients, even in the absence of exact subgraph isomorphism, and which is consistently observed over different families of graphs up to several thousand nodes.
    GradMax: Growing Neural Networks using Gradient Information. (arXiv:2201.05125v3 [cs.LG] UPDATED)
    The architecture and the parameters of neural networks are often optimized independently, which requires costly retraining of the parameters whenever the architecture is modified. In this work we instead focus on growing the architecture without requiring costly retraining. We present a method that adds new neurons during training without impacting what is already learned, while improving the training dynamics. We achieve the latter by maximizing the gradients of the new weights and find the optimal initialization efficiently by means of the singular value decomposition (SVD). We call this technique Gradient Maximizing Growth (GradMax) and demonstrate its effectiveness in a variety of vision tasks and architectures.
    Scientific Machine Learning through Physics-Informed Neural Networks: Where we are and What's next. (arXiv:2201.05624v4 [cs.LG] UPDATED)
    Physics-Informed Neural Networks (PINNs) are neural networks (NNs) that encode model equations, like Partial Differential Equations (PDEs), as a component of the neural network itself. PINNs are nowadays used to solve PDEs, fractional equations, integral-differential equations, and stochastic PDEs. This novel methodology has arisen as a multi-task learning framework in which a NN must fit observed data while reducing a PDE residual. This article provides a comprehensive review of the literature on PINNs: while the primary goal of the study was to characterize these networks and their related advantages and disadvantages, the review also attempts to incorporate publications on a broader range of collocation-based physics-informed neural networks, which starts from the vanilla PINN, as well as many other variants, such as physics-constrained neural networks (PCNN), variational hp-VPINN, and conservative PINN (CPINN). The study indicates that most research has focused on customizing the PINN through different activation functions, gradient optimization techniques, neural network structures, and loss function structures. Despite the wide range of applications for which PINNs have been used, by demonstrating their ability to be more feasible in some contexts than classical numerical techniques like Finite Element Method (FEM), advancements are still possible, most notably theoretical issues that remain unresolved.
    Consistency Regularization for Variational Auto-Encoders. (arXiv:2105.14859v2 [cs.LG] UPDATED)
    Variational auto-encoders (VAEs) are a powerful approach to unsupervised learning. They enable scalable approximate posterior inference in latent-variable models using variational inference (VI). A VAE posits a variational family parameterized by a deep neural network called an encoder that takes data as input. This encoder is shared across all the observations, which amortizes the cost of inference. However, the encoder of a VAE has the undesirable property that it maps a given observation and a semantics-preserving transformation of it to different latent representations. This "inconsistency" of the encoder lowers the quality of the learned representations, especially for downstream tasks, and also negatively affects generalization. In this paper, we propose a regularization method to enforce consistency in VAEs. The idea is to minimize the Kullback-Leibler (KL) divergence between the variational distribution when conditioning on the observation and the variational distribution when conditioning on a random semantics-preserving transformation of this observation. This regularization is applicable to any VAE. In our experiments we apply it to four different VAE variants on several benchmark datasets and find that it not only always improves the quality of the learned representations but also leads to better generalization. In particular, when applied to the Nouveau Variational Auto-Encoder (NVAE), our regularization method yields state-of-the-art performance on MNIST and CIFAR-10. We also applied our method to 3D data and found it learns representations of superior quality as measured by accuracy on a downstream classification task.
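    The consistency term described above can be sketched for the common case of a diagonal-Gaussian encoder, where the KL divergence has a closed form. This is an illustrative sketch only: the diagonal-Gaussian assumption, the direction of the KL, the function names and the weight `lam` are assumptions and are not taken from the abstract.

```python
import math


def kl_diag_gaussians(mu_p, logvar_p, mu_q, logvar_q):
    """Closed-form KL( N(mu_p, diag(exp(logvar_p))) || N(mu_q, diag(exp(logvar_q))) )."""
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        # Per dimension: 0.5 * (log(var_q / var_p) + (var_p + (mu_p - mu_q)^2) / var_q - 1)
        total += 0.5 * (lq - lp + (math.exp(lp) + (mp - mq) ** 2) / math.exp(lq) - 1.0)
    return total


def regularized_loss(neg_elbo, enc_x, enc_x_aug, lam=1.0):
    """Hypothetical objective: negative ELBO plus lam times the consistency KL.

    enc_x and enc_x_aug are (mu, logvar) pairs produced by the same encoder for
    an observation and for a transformed copy of that observation.
    """
    mu_a, logvar_a = enc_x_aug
    mu_x, logvar_x = enc_x
    return neg_elbo + lam * kl_diag_gaussians(mu_a, logvar_a, mu_x, logvar_x)
```

    The penalty vanishes exactly when the encoder assigns the same distribution to an observation and its transformed copy, which is the consistency property the abstract asks for.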
    CANShield: Signal-based Intrusion Detection for Controller Area Networks. (arXiv:2205.01306v3 [cs.CR] UPDATED)
    Modern vehicles rely on a fleet of electronic control units (ECUs) connected through controller area network (CAN) buses for critical vehicular control. However, with the expansion of advanced connectivity features in automobiles and the elevated risks of internal system exposure, the CAN bus is increasingly prone to intrusions and injection attacks. Ordinary injection attacks disrupt the typical timing properties of the CAN data stream, and rule-based intrusion detection systems (IDS) can easily detect them. However, advanced attackers can inject false data into the time series sensory data (signal), while looking innocuous in the pattern/frequency of the CAN messages. Such attacks can bypass the rule-based IDS or any anomaly-based IDS built on binary payload data. To make vehicles robust against such intelligent attacks, we propose CANShield, a signal-based intrusion detection framework for the CAN bus. CANShield consists of three modules: a data preprocessing module that handles the high-dimensional CAN data stream at the signal level and makes it suitable for a deep learning model; a data analyzer module consisting of multiple deep autoencoder (AE) networks, each analyzing the time-series data from a different temporal perspective; and finally an attack detection module that uses an ensemble method to make the final decision. Evaluation results on two high-fidelity signal-based CAN attack datasets show the high accuracy and responsiveness of CANShield in detecting a wide range of advanced intrusion attacks.
    DeepMTS: Deep Multi-task Learning for Survival Prediction in Patients with Advanced Nasopharyngeal Carcinoma using Pretreatment PET/CT. (arXiv:2109.07711v2 [eess.IV] UPDATED)
    Nasopharyngeal Carcinoma (NPC) is a malignant epithelial cancer arising from the nasopharynx. Survival prediction is a major concern for NPC patients, as it provides early prognostic information to plan treatments. Recently, deep survival models based on deep learning have demonstrated the potential to outperform traditional radiomics-based survival prediction models. Deep survival models usually use image patches covering the whole target regions (e.g., nasopharynx for NPC) or containing only segmented tumor regions as the input. However, the models using the whole target regions will also include non-relevant background information, while the models using segmented tumor regions will disregard potentially prognostic information existing out of primary tumors (e.g., local lymph node metastasis and adjacent tissue invasion). In this study, we propose a 3D end-to-end Deep Multi-Task Survival model (DeepMTS) for joint survival prediction and tumor segmentation in advanced NPC from pretreatment PET/CT. Our novelty is the introduction of a hard-sharing segmentation backbone to guide the extraction of local features related to the primary tumors, which reduces the interference from non-relevant background information. In addition, we also introduce a cascaded survival network to capture the prognostic information existing out of primary tumors and further leverage the global tumor information (e.g., tumor size, shape, and locations) derived from the segmentation backbone. Our experiments with two clinical datasets demonstrate that our DeepMTS can consistently outperform traditional radiomics-based survival prediction models and existing deep survival models.
    Learning in High-Dimensional Feature Spaces Using ANOVA-Based Fast Matrix-Vector Multiplication. (arXiv:2111.10140v2 [cs.LG] UPDATED)
    Kernel matrices are crucial in many learning tasks such as support vector machines or kernel ridge regression. The kernel matrix is typically dense and large-scale. Depending on the dimension of the feature space even the computation of all of its entries in reasonable time becomes a challenging task. For such dense matrices the cost of a matrix-vector product scales quadratically with the dimensionality N, if no customized methods are applied. We propose the use of an ANOVA kernel, where we construct several kernels based on lower-dimensional feature spaces for which we provide fast algorithms realizing the matrix-vector products. We employ the non-equispaced fast Fourier transform (NFFT), which is of linear complexity for fixed accuracy. Based on a feature grouping approach, we then show how the fast matrix-vector products can be embedded into a learning method choosing kernel ridge regression and the conjugate gradient solver. We illustrate the performance of our approach on several data sets.
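    To illustrate why a fast matrix-vector product is enough, the sketch below solves the kernel ridge regression system (K + lam I) alpha = y with conjugate gradients, touching the kernel only through a `matvec` callback. The dense Gaussian-kernel product here is an O(N^2) stand-in for the NFFT-based ANOVA product of the paper, which is not reproduced; the function names, the kernel choice and the parameters `lam` and `sigma` are assumptions made for illustration.

```python
import math


def gaussian_kernel_matvec(X, v, sigma=1.0):
    """Dense O(N^2) product K @ v for a Gaussian kernel (stand-in for a fast method)."""
    out = []
    for xi in X:
        s = 0.0
        for xj, vj in zip(X, v):
            d2 = sum((a - b) ** 2 for a, b in zip(xi, xj))
            s += math.exp(-d2 / (2.0 * sigma ** 2)) * vj
        out.append(s)
    return out


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the zero initial guess
    p = list(r)
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if math.sqrt(rs) < tol:
            break
        Ap = matvec(p)
        alpha = rs / sum(pi * ai for pi, ai in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def krr_fit(X, y, lam=0.1, sigma=1.0):
    """Kernel ridge regression coefficients via CG on (K + lam * I) alpha = y."""
    def matvec(v):
        Kv = gaussian_kernel_matvec(X, v, sigma)
        return [kv + lam * vi for kv, vi in zip(Kv, v)]
    return conjugate_gradient(matvec, y)
```

    Because the solver never forms K explicitly, swapping `gaussian_kernel_matvec` for a faster approximate product changes the per-iteration cost without changing the solver.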
    Identifiability of Causal-based Fairness Notions: A State of the Art. (arXiv:2203.05900v2 [cs.LG] UPDATED)
    Machine learning algorithms can produce biased outcomes or predictions, typically against minorities and under-represented sub-populations. Therefore, fairness is emerging as an important requirement for the large scale application of machine learning based technologies. The most commonly used fairness notions (e.g. statistical parity, equalized odds, predictive parity, etc.) are observational and rely on mere correlation between variables. These notions fail to identify bias in case of statistical anomalies such as Simpson's or Berkson's paradoxes. Causality-based fairness notions (e.g. counterfactual fairness, no-proxy discrimination, etc.) are immune to such anomalies and hence more reliable to assess fairness. The problem of causality-based fairness notions, however, is that they are defined in terms of quantities (e.g. causal, counterfactual, and path-specific effects) that are not always measurable. This is known as the identifiability problem and is the topic of a large body of work in the causal inference literature. This paper is a compilation of the major identifiability results which are of particular relevance for machine learning fairness. The results are illustrated using a large number of examples and causal graphs. The paper would be of particular interest to fairness researchers, practitioners, and policy makers who are considering the use of causality-based fairness notions as it summarizes and illustrates the major identifiability results.
    Improved Cardiac Arrhythmia Prediction Based on Heart Rate Variability Analysis. (arXiv:2206.03222v1 [cs.LG])
    Many types of ventricular and atrial cardiac arrhythmias have been discovered in clinical practice in the past 100 years, and these arrhythmias are a major contributor to sudden cardiac death. Ventricular tachycardia, ventricular fibrillation, and paroxysmal atrial fibrillation are the most commonly-occurring and dangerous arrhythmias, therefore early detection is crucial to prevent any further complications and reduce fatalities. Implantable devices such as pacemakers are commonly used in patients at high risk of sudden cardiac death. While great advances have been made in medical technology, there remain significant challenges in effective management of common arrhythmias. This thesis proposes novel arrhythmia detection and prediction methods to differentiate cardiac arrhythmias from non-life-threatening cardiac events, to increase the likelihood of detecting events that may lead to mortality, as well as reduce the incidence of unnecessary therapeutic intervention. The methods are based on detailed analysis of Heart Rate Variability (HRV) information. The results of the work show good performance of the proposed methods and support the potential for their deployment in resource-constrained devices for ventricular and atrial arrhythmia prediction, such as implantable pacemakers and defibrillators.
    Debiased Self-Training for Semi-Supervised Learning. (arXiv:2202.07136v3 [cs.LG] UPDATED)
    Deep neural networks achieve remarkable performances on a wide range of tasks with the aid of large-scale labeled datasets. Yet these datasets are time-consuming and labor-exhaustive to obtain on realistic tasks. To mitigate the requirement for labeled data, self-training is widely used in semi-supervised learning by iteratively assigning pseudo labels to unlabeled samples. Despite its popularity, self-training is widely believed to be unreliable and often leads to training instability. Our experimental studies further reveal that the bias in semi-supervised learning arises from both the problem itself and the inappropriate training with potentially incorrect pseudo labels, which accumulates the error in the iterative self-training process. To reduce the above bias, we propose Debiased Self-Training (DST). First, the generation and utilization of pseudo labels are decoupled by two parameter-independent classifier heads to avoid direct error accumulation. Second, we estimate the worst case of self-training bias, where the pseudo labeling function is accurate on labeled samples, yet makes as many mistakes as possible on unlabeled samples. We then adversarially optimize the representations to improve the quality of pseudo labels by avoiding the worst case. Extensive experiments justify that DST achieves an average improvement of 6.3% against state-of-the-art methods on standard semi-supervised learning benchmark datasets and 18.9% against FixMatch on 13 diverse tasks. Furthermore, DST can be seamlessly adapted to other self-training methods and help stabilize their training and balance performance across classes in both cases of training from scratch and finetuning from pre-trained models.
    Adaptive Weighted Nonnegative Matrix Factorization for Robust Feature Representation. (arXiv:2206.03020v1 [cs.LG])
    Nonnegative matrix factorization (NMF) has been widely used for dimensionality reduction in machine learning. However, the traditional NMF does not properly handle outliers, so it is sensitive to noise. In order to improve the robustness of NMF, this paper proposes an adaptive weighted NMF, which introduces weights to emphasize the different importance of each data point, thereby decreasing the algorithmic sensitivity to noisy data. It is very different from the existing robust NMFs that use a slow growth similarity measure. Specifically, two strategies are proposed to achieve this: a fuzzier weighted technique and an entropy weighted regularized technique, both of which lead to an iterative solution with a simple form. Experimental results showed that the new methods have a more robust feature representation on several real datasets with noise than existing methods.
    Joint Manifold Learning and Density Estimation Using Normalizing Flows. (arXiv:2206.03293v1 [cs.LG])
    Based on the manifold hypothesis, real-world data often lie on a low-dimensional manifold, while normalizing flows as a likelihood-based generative model are incapable of finding this manifold due to their structural constraints. So, one interesting question arises: $\textit{"Can we find sub-manifold(s) of data in normalizing flows and estimate the density of the data on the sub-manifold(s)?"}$. In this paper, we introduce two approaches, namely per-pixel penalized log-likelihood and hierarchical training, to answer the mentioned question. We propose a single-step method for joint manifold learning and density estimation by disentangling the transformed space obtained by normalizing flows to manifold and off-manifold parts. This is done by a per-pixel penalized likelihood function for learning a sub-manifold of the data. Normalizing flows assume the transformed data is Gaussianized, but this imposed assumption is not necessarily true, especially in high dimensions. To tackle this problem, a hierarchical training approach is employed to improve the density estimation on the sub-manifold. The results validate the superiority of the proposed methods in simultaneous manifold learning and density estimation using normalizing flows in terms of generated image quality and likelihood.
    Specification-Guided Learning of Nash Equilibria with High Social Welfare. (arXiv:2206.03348v1 [cs.GT])
    Reinforcement learning has been shown to be an effective strategy for automatically training policies for challenging control problems. Focusing on non-cooperative multi-agent systems, we propose a novel reinforcement learning framework for training joint policies that form a Nash equilibrium. In our approach, rather than providing low-level reward functions, the user provides high-level specifications that encode the objective of each agent. Then, guided by the structure of the specifications, our algorithm searches over policies to identify one that provably forms an $\epsilon$-Nash equilibrium (with high probability). Importantly, it prioritizes policies in a way that maximizes social welfare across all agents. Our empirical evaluation demonstrates that our algorithm computes equilibrium policies with high social welfare, whereas state-of-the-art baselines either fail to compute Nash equilibria or compute ones with comparatively lower social welfare.
    Data Stealing Attack on Medical Images: Is it Safe to Export Networks from Data Lakes?. (arXiv:2206.03391v1 [cs.CR])
    In privacy-preserving machine learning, it is common that the owner of the learned model does not have any physical access to the data. Instead, only a secured remote access to a data lake is granted to the model owner without any ability to retrieve data from the data lake. Yet, the model owner may want to export the trained model periodically from the remote repository, and a question arises whether this may cause a risk of data leakage. In this paper, we introduce the concept of data stealing attack during the export of neural networks. It consists in hiding some information in the exported network that allows the reconstruction outside the data lake of images initially stored in that data lake. More precisely, we show that it is possible to train a network that can perform lossy image compression and at the same time solve some utility tasks such as image segmentation. The attack then proceeds by exporting the compression decoder network together with some image codes that lead to the image reconstruction outside the data lake. We explore the feasibility of such attacks on databases of CT and MR images, showing that it is possible to obtain perceptually meaningful reconstructions of the target dataset, and that the stolen dataset can be used in turn to solve a broad range of tasks. Comprehensive experiments and analyses show that data stealing attacks should be considered as a threat for sensitive imaging data sources.
    Plug & Play Attacks: Towards Robust and Flexible Model Inversion Attacks. (arXiv:2201.12179v3 [cs.LG] UPDATED)
    Model inversion attacks (MIAs) aim to create synthetic images that reflect the class-wise characteristics from a target classifier's private training data by exploiting the model's learned knowledge. Previous research has developed generative MIAs that use generative adversarial networks (GANs) as image priors tailored to a specific target model. This makes the attacks time- and resource-consuming, inflexible, and susceptible to distributional shifts between datasets. To overcome these drawbacks, we present Plug & Play Attacks, which relax the dependency between the target model and image prior, and enable the use of a single GAN to attack a wide range of targets, requiring only minor adjustments to the attack. Moreover, we show that powerful MIAs are possible even with publicly available pre-trained GANs and under strong distributional shifts, for which previous approaches fail to produce meaningful results. Our extensive evaluation confirms the improved robustness and flexibility of Plug & Play Attacks and their ability to create high-quality images revealing sensitive class characteristics.
    Towards Meta-learned Algorithm Selection using Implicit Fidelity Information. (arXiv:2206.03130v1 [cs.LG])
    Automatically selecting the best performing algorithm for a given dataset or ranking multiple of them by their expected performance supports users in developing new machine learning applications. Most approaches for this problem rely on dataset meta-features and landmarking performances to capture the salient topology of the datasets and those topologies that the algorithms attend to. Landmarking usually exploits cheap algorithms not necessarily in the pool of candidate algorithms to get inexpensive approximations of the topology. While somewhat indicative, handcrafted dataset meta-features and landmarks are likely insufficient descriptors, strongly depending on the alignment of the geometries the landmarks and candidates search for. We propose IMFAS, a method to exploit multi-fidelity landmarking information directly from the candidate algorithms in the form of non-parametrically non-myopic meta-learned learning curves via LSTM networks in a few-shot setting during testing. Using this mechanism, IMFAS jointly learns the topology of the datasets and the inductive biases of algorithms without expensively training them to convergence. IMFAS produces informative landmarks, easily enriched by arbitrary meta-features at a low computational cost, capable of producing the desired ranking using cheaper fidelities. We additionally show that it is able to beat Successive Halving with at most half the fidelity sequence during test time.
    Efficient and Accurate Physics-aware Multiplex Graph Neural Networks for 3D Small Molecules and Macromolecule Complexes. (arXiv:2206.02789v1 [q-bio.BM])
    Recent advances in applying Graph Neural Networks (GNNs) to molecular science have showcased the power of learning three-dimensional (3D) structure representations with GNNs. However, most existing GNNs suffer from the limitations of insufficient modeling of diverse interactions, computationally expensive operations, and ignorance of vectorial values. Here, we tackle these limitations by proposing a novel GNN model, Physics-aware Multiplex Graph Neural Network (PaxNet), to efficiently and accurately learn the representations of 3D molecules for both small organic compounds and macromolecule complexes. PaxNet separates the modeling of local and non-local interactions inspired by molecular mechanics, and reduces the expensive angle-related computations. Besides scalar properties, PaxNet can also predict vectorial properties by learning an associated vector for each atom. To evaluate the performance of PaxNet, we compare it with state-of-the-art baselines in two tasks. On a small-molecule dataset for predicting quantum chemical properties, PaxNet reduces the prediction error by 15% and uses 73% less memory than the best baseline. On a macromolecule dataset for predicting protein-ligand binding affinities, PaxNet outperforms the best baseline while reducing the memory consumption by 33% and the inference time by 85%. Thus, PaxNet provides a universal, robust and accurate method for large-scale machine learning of molecules.
    On Outer Bi-Lipschitz Extensions of Linear Johnson-Lindenstrauss Embeddings of Low-Dimensional Submanifolds of $\mathbb{R}^N$. (arXiv:2206.03376v1 [math.NA])
    Let $\mathcal{M}$ be a compact $d$-dimensional submanifold of $\mathbb{R}^N$ with reach $\tau$ and volume $V_{\mathcal M}$. Fix $\epsilon \in (0,1)$. In this paper we prove that a nonlinear function $f: \mathbb{R}^N \rightarrow \mathbb{R}^{m}$ exists with $m \leq C \left(d / \epsilon^2 \right) \log \left(\frac{\sqrt[d]{V_{\mathcal M}}}{\tau} \right)$ such that $$(1 - \epsilon) \| {\bf x} - {\bf y} \|_2 \leq \left\| f({\bf x}) - f({\bf y}) \right\|_2 \leq (1 + \epsilon) \| {\bf x} - {\bf y} \|_2$$ holds for all ${\bf x} \in \mathcal{M}$ and ${\bf y} \in \mathbb{R}^N$. In effect, $f$ not only serves as a bi-Lipschitz function from $\mathcal{M}$ into $\mathbb{R}^{m}$ with bi-Lipschitz constants close to one, but also approximately preserves all distances from points not in $\mathcal{M}$ to all points in $\mathcal{M}$ in its image. Furthermore, the proof is constructive and yields an algorithm which works well in practice. In particular, it is empirically demonstrated herein that such nonlinear functions allow for more accurate compressive nearest neighbor classification than standard linear Johnson-Lindenstrauss embeddings do in practice.
    Adaptive Regularization for Adversarial Training. (arXiv:2206.03353v1 [stat.ML])
    Adversarial training, which is to enhance robustness against adversarial attacks, has received much attention because it is easy to generate human-imperceptible perturbations of data to deceive a given deep neural network. In this paper, we propose a new adversarial training algorithm that is theoretically well motivated and empirically superior to other existing algorithms. A novel feature of the proposed algorithm is to use a data-adaptive regularization for robustifying a prediction model. We apply more regularization to data which are more vulnerable to adversarial attacks and vice versa. Even though the idea of data-adaptive regularization is not new, our data-adaptive regularization has a firm theoretical base of reducing an upper bound of the robust risk. Numerical experiments illustrate that our proposed algorithm improves the generalization (accuracy on clean samples) and robustness (accuracy on adversarial attacks) simultaneously to achieve the state-of-the-art performance.
    Sample Complexity of Nonparametric Off-Policy Evaluation on Low-Dimensional Manifolds using Deep Networks. (arXiv:2206.02887v1 [cs.LG])
    We consider the off-policy evaluation problem of reinforcement learning using deep neural networks. We analyze the deep fitted Q-evaluation method for estimating the expected cumulative reward of a target policy, when the data are generated from an unknown behavior policy. We show that, by choosing network size appropriately, one can leverage the low-dimensional manifold structure in the Markov decision process and obtain a sample-efficient estimator without suffering from the curse of high representation dimensionality. Specifically, we establish a sharp error bound for the fitted Q-evaluation that depends on the intrinsic low dimension, the smoothness of the state-action space, and a function class-restricted $\chi^2$-divergence. It is noteworthy that the restricted $\chi^2$-divergence measures the behavior and target policies' mismatch in the function space, which can be small even if the two policies are not close to each other in their tabular forms. Numerical experiments are provided to support our theoretical analysis.
    Schema-Guided Event Graph Completion. (arXiv:2206.02921v1 [cs.LG])
    We tackle a new task, event graph completion, which aims to predict missing event nodes for event graphs. Existing link prediction or graph completion methods have difficulty dealing with event graphs because they are usually designed for a single large graph such as a social network or a knowledge graph, rather than multiple small dynamic event graphs. Moreover, they can only predict missing edges rather than missing nodes. In this work, we propose to utilize event schema, a template that describes the stereotypical structure of event graphs, to address the above issues. Our schema-guided event graph completion approach first maps an instance event graph to a subgraph of the schema graph by a heuristic subgraph matching algorithm. Then it predicts whether a candidate event node in the schema graph should be added to the instantiated schema subgraph by characterizing two types of local topology of the schema graph: neighbors of the candidate node and the subgraph, and paths that connect the candidate node and the subgraph. These two modules are later combined together for the final prediction. We also propose a self-supervised strategy to construct training samples, as well as an inference algorithm that is specifically designed to complete event graphs. Extensive experimental results on four datasets demonstrate that our proposed method achieves state-of-the-art performance, with 4.3% to 19.4% absolute F1 gains over the best baseline method on the four datasets.
    The Survival Bandit Problem. (arXiv:2206.03019v1 [cs.LG])
    We study the survival bandit problem, a variant of the multi-armed bandit problem introduced in an open problem by Perotto et al. (2019), with a constraint on the cumulative reward; at each time step, the agent receives a (possibly negative) reward and if the cumulative reward becomes lower than a prespecified threshold, the procedure stops, and this phenomenon is called ruin. This is the first paper studying a framework where the ruin might occur but not always. We first discuss that a sublinear regret is unachievable under a naive definition of the regret. Next, we provide tight lower bounds on the probability of ruin (as well as matching policies). Based on this lower bound, we define the survival regret as an objective to minimize and provide a policy achieving a sublinear survival regret (at least in the case of integral rewards) when the time horizon $T$ is known.
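    The stopping rule described above can be sketched as a small simulation loop. This is only an illustration of the ruin condition (play stops once the cumulative reward falls below a threshold), not a policy from the paper; the function `run_survival_bandit`, its arguments, and the two toy arms are assumptions made for this example.

```python
def run_survival_bandit(arms, choose_arm, threshold, horizon):
    """Play up to `horizon` rounds and stop early at ruin.

    arms:       list of zero-argument callables, each returning a reward
                (rewards may be negative).
    choose_arm: function (t, history) -> index of the arm to pull.
    threshold:  ruin occurs once the cumulative reward drops below this value.

    Returns (cumulative_reward, ruined, rounds_played).
    """
    cumulative = 0.0
    history = []
    for t in range(horizon):
        k = choose_arm(t, history)
        reward = arms[k]()
        cumulative += reward
        history.append((k, reward))
        if cumulative < threshold:
            return cumulative, True, t + 1   # ruin: the procedure stops here
    return cumulative, False, horizon


# Two deterministic toy arms: one always loses 1, one always gains 1.
arms = [lambda: -1.0, lambda: 1.0]

always_losing = run_survival_bandit(arms, lambda t, h: 0, threshold=-2.5, horizon=10)
always_winning = run_survival_bandit(arms, lambda t, h: 1, threshold=-2.5, horizon=10)
```

    With the losing arm the run ends after three rounds, while the winning arm survives the full horizon, which shows why regret has to account for rounds that are never played after ruin.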
    Benign Underfitting of Stochastic Gradient Descent. (arXiv:2202.13361v3 [cs.LG] UPDATED)
    We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex optimization framework, where (one pass, without-replacement) SGD is classically known to minimize the population risk at rate $O(1/\sqrt n)$, and prove that, surprisingly, there exist problem instances where the SGD solution exhibits both empirical risk and generalization gap of $\Omega(1)$. Consequently, it turns out that SGD is not algorithmically stable in any sense, and its generalization ability cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis). We then continue to analyze the closely related with-replacement SGD, for which we show that an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate. Finally, we interpret our main results in the context of without-replacement SGD for finite-sum convex optimization problems, and derive upper and lower bounds for the multi-epoch regime that significantly improve upon previously known results.
    Dual Decomposition of Convex Optimization Layers for Consistent Attention in Medical Images. (arXiv:2206.02761v2 [cs.CV] UPDATED)
    A key concern in integrating machine learning models in medicine is the ability to interpret their reasoning. Popular explainability methods have demonstrated satisfactory results in natural image recognition, yet in medical image analysis, many of these approaches provide partial and noisy explanations. Recently, attention mechanisms have shown compelling results both in their predictive performance and in their interpretable qualities. A fundamental trait of attention is that it leverages salient parts of the input which contribute to the model's prediction. To this end, our work focuses on the explanatory value of attention weight distributions. We propose a multi-layer attention mechanism that enforces consistent interpretations between attended convolutional layers using convex optimization. We apply duality to decompose the consistency constraints between the layers by reparameterizing their attention probability distributions. We further suggest learning the dual witness by optimizing with respect to our objective; thus, our implementation uses standard back-propagation, hence it is highly efficient. While preserving predictive performance, our proposed method leverages weakly annotated medical imaging data and provides complete and faithful explanations to the model's prediction.
    Spectral Bias Outside the Training Set for Deep Networks in the Kernel Regime. (arXiv:2206.02927v1 [stat.ML])
    We provide quantitative bounds measuring the $L^2$ difference in function space between the trajectory of a finite-width network trained on finitely many samples from the idealized kernel dynamics of infinite width and infinite data. An implication of the bounds is that the network is biased to learn the top eigenfunctions of the Neural Tangent Kernel not just on the training set but over the entire input space. This bias depends on the model architecture and input distribution alone and thus does not depend on the target function which does not need to be in the RKHS of the kernel. The result is valid for deep architectures with fully connected, convolutional, and residual layers. Furthermore the width does not need to grow polynomially with the number of samples in order to obtain high probability bounds up to a stopping time. The proof exploits the low-effective-rank property of the Fisher Information Matrix at initialization, which implies a low effective dimension of the model (far smaller than the number of parameters). We conclude that local capacity control from the low effective rank of the Fisher Information Matrix is still underexplored theoretically.
    Improving Knowledge Graph Embedding via Iterative Self-Semantic Knowledge Distillation. (arXiv:2206.02963v1 [cs.LG])
    Knowledge graph embedding (KGE) has been intensively investigated for link prediction by projecting entities and relations into continuous vector spaces. Current popular high-dimensional KGE methods obtain only slight performance gains while requiring enormous computation and memory costs. In contrast to high-dimensional KGE models, training low-dimensional models is more efficient and worthwhile for better deployments to practical intelligent systems. However, the model expressiveness of semantic information in knowledge graphs (KGs) is highly limited in the low dimension parameter space. In this paper, we propose an iterative self-semantic knowledge distillation strategy to improve the KGE model expressiveness in the low dimension space. A KGE model combined with our proposed strategy plays the teacher and student roles alternately during the whole training process. Specifically, at a certain iteration, the model is regarded as a teacher to provide semantic information for the student. At the next iteration, the model is regarded as a student to incorporate the semantic information transferred from the teacher. We also design a novel semantic extraction block to extract iteration-based semantic information for the training model self-distillation. Iteratively incorporating and accumulating iteration-based semantic information enables the low-dimensional model to be more expressive for better link prediction in KGs. There is only one model during the whole training, which alleviates the increase of computational expensiveness and memory requirements. Furthermore, the proposed strategy is model-agnostic and can be seamlessly combined with other KGE models. Consistent and significant performance gains in experimental evaluations on four standard datasets demonstrate the effectiveness of the proposed self-distillation strategy.
    Adaptive Rollout Length for Model-Based RL Using Model-Free Deep RL. (arXiv:2206.02380v2 [cs.LG] UPDATED)
    Model-based reinforcement learning promises to learn an optimal policy from fewer interactions with the environment compared to model-free reinforcement learning by learning an intermediate model of the environment in order to predict future interactions. When predicting a sequence of interactions, the rollout length, which limits the prediction horizon, is a critical hyperparameter as accuracy of the predictions diminishes in the regions that are further away from real experience. As a result, with a longer rollout length, an overall worse policy is learned in the long run. Thus, the hyperparameter provides a trade-off between quality and efficiency. In this work, we frame the problem of tuning the rollout length as a meta-level sequential decision-making problem that optimizes the final policy learned by model-based reinforcement learning given a fixed budget of environment interactions by adapting the hyperparameter dynamically based on feedback from the learning process, such as accuracy of the model and the remaining budget of interactions. We use model-free deep reinforcement learning to solve the meta-level decision problem and demonstrate that our approach outperforms common heuristic baselines on two well-known reinforcement learning environments.
    Bump Hunting in Latent Space. (arXiv:2103.06595v2 [hep-ph] UPDATED)
    Unsupervised anomaly detection could be crucial in future analyses searching for rare phenomena in large datasets, as for example collected at the LHC. To this end, we introduce a physics inspired variational autoencoder (VAE) architecture which performs competitively and robustly on the LHC Olympics Machine Learning Challenge datasets. We demonstrate how embedding some physical observables directly into the VAE latent space, while at the same time keeping the classifier manifestly agnostic to them, can help to identify and characterise features in measured spectra as caused by the presence of anomalies in a dataset.
    Beyond Lipschitz: Sharp Generalization and Excess Risk Bounds for Full-Batch GD. (arXiv:2204.12446v3 [stat.ML] UPDATED)
    We provide sharp path-dependent generalization and excess risk guarantees for the full-batch Gradient Descent (GD) algorithm on smooth losses (possibly non-Lipschitz, possibly nonconvex), under an interpolation regime. At the heart of our analysis is a new generalization error bound for deterministic symmetric algorithms, which implies that average output stability and a bounded expected optimization error at termination lead to generalization. This result shows that small generalization error occurs along the optimization path, and allows us to bypass Lipschitz or sub-Gaussian assumptions on the loss prevalent in previous works. For nonconvex, Polyak-Lojasiewicz (PL), convex and strongly convex losses, we show the explicit dependence of the generalization error in terms of the accumulated path-dependent optimization error, terminal optimization error, number of samples, and number of iterations. For nonconvex smooth losses, we prove that full-batch GD efficiently generalizes close to any stationary point at termination, under the proper choice of a decreasing step size. Further, if the loss is nonconvex but the objective is PL, we derive quadratically vanishing bounds on the generalization error and the corresponding excess risk, for a choice of a large constant step size. For (resp. strongly-) convex smooth losses, we prove that full-batch GD also generalizes for large constant step sizes, and achieves (resp. quadratically) small excess risk while training fast. In all cases, we close the generalization error gap, by showing matching generalization and optimization error rates. Our full-batch GD generalization error and excess risk bounds are strictly tighter than existing bounds for (stochastic) GD, when the loss is smooth (but possibly non-Lipschitz).
    On the Use and Misuse of Absorbing States in Multi-agent Reinforcement Learning. (arXiv:2111.05992v2 [cs.LG] UPDATED)
    The creation and destruction of agents in cooperative multi-agent reinforcement learning (MARL) is a critically under-explored area of research. Current MARL algorithms often assume that the number of agents within a group remains fixed throughout an experiment. However, in many practical problems, an agent may terminate before their teammates. This early termination issue presents a challenge: the terminated agent must learn from the group's success or failure which occurs beyond its own existence. We refer to propagating value from rewards earned by remaining teammates to terminated agents as the Posthumous Credit Assignment problem. Current MARL methods handle this problem by placing these agents in an absorbing state until the entire group of agents reaches a termination condition. Although absorbing states enable existing algorithms and APIs to handle terminated agents without modification, practical training efficiency and resource use problems exist. In this work, we first demonstrate that sample complexity increases with the quantity of absorbing states in a toy supervised learning task for a fully connected network, while attention is more robust to variable size input. Then, we present a novel architecture for an existing state-of-the-art MARL algorithm which uses attention instead of a fully connected layer with absorbing states. Finally, we demonstrate that this novel architecture significantly outperforms the standard architecture on tasks in which agents are created or destroyed within episodes as well as standard multi-agent coordination tasks.
    Adversarial Bandits Robust to $S$-Switch Regret. (arXiv:2205.14839v2 [cs.LG] UPDATED)
    We study the adversarial bandit problem under $S$ number of switching best arms for unknown $S$. For handling this problem, we adopt the master-base framework using the online mirror descent method (OMD). We first provide a master-base algorithm with basic OMD, achieving $\tilde{O}(S^{1/2}K^{1/3}T^{2/3})$. For improving the regret bound with respect to $T$, we propose to use adaptive learning rates for OMD to control variance of loss estimators, and achieve $\tilde{O}(\min\{\mathbb{E}[\sqrt{SKT\rho_T(h^\dagger)}],S\sqrt{KT}\})$, where $\rho_T(h^\dagger)$ is a variance term for loss estimators.
    The Pareto Frontier of Instance-Dependent Guarantees in Multi-Player Multi-Armed Bandits with no Communication. (arXiv:2202.09653v2 [cs.LG] UPDATED)
    We study the stochastic multi-player multi-armed bandit problem. In this problem, $m$ players cooperate to maximize their total reward from $K > m$ arms. However the players cannot communicate and are penalized (e.g. receive no reward) if they pull the same arm at the same time. We ask whether it is possible to obtain optimal instance-dependent regret $\tilde{O}(1/\Delta)$ where $\Delta$ is the gap between the $m$-th and $m+1$-st best arms. Such guarantees were recently achieved in a model allowing the players to implicitly communicate through intentional collisions. Surprisingly, we show that with no communication at all, such guarantees are not achievable. In fact, obtaining the optimal $\tilde{O}(1/\Delta)$ regret for some values of $\Delta$ necessarily implies strictly sub-optimal regret in other regimes. Our main result is a complete characterization of the Pareto optimal instance-dependent trade-offs that are possible with no communication. Our algorithm generalizes that of Bubeck, Budzinski, and the second author. As there, our algorithm succeeds even when feedback upon collision can be corrupted by an adaptive adversary, thanks to a strong no-collision property. Our lower bound is based on topological obstructions at multiple scales and is completely new.
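    As a small illustration of the setting, the collision rule and the gap $\Delta$ can be written down directly. This sketch only encodes the problem definition with the "receive no reward" penalty mentioned above as an example; it is not the algorithm of the paper, and the helper names `round_rewards` and `gap` are assumptions made for this example.

```python
from collections import Counter


def round_rewards(pulls, arm_rewards):
    """Rewards for one round of a multi-player bandit with collisions.

    pulls[i]       : index of the arm pulled by player i.
    arm_rewards[k] : reward drawn for arm k in this round.
    A player who shares an arm with another player receives 0.
    """
    counts = Counter(pulls)
    return [arm_rewards[k] if counts[k] == 1 else 0.0 for k in pulls]


def gap(means, m):
    """Gap between the m-th and (m+1)-st best arm means (m is 1-based)."""
    ordered = sorted(means, reverse=True)
    return ordered[m - 1] - ordered[m]


# Three players, three arms: players 1 and 2 collide on arm 1.
rewards = round_rewards([0, 1, 1], [0.5, 0.9, 0.2])

# Four arms and m = 2 players: the gap separates the 2nd and 3rd best means.
delta = gap([0.9, 0.5, 0.2, 0.7], m=2)
```

    Here only the first player is rewarded, and `delta` is the difference between the means 0.7 and 0.5.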
    Reachability In Simple Neural Networks. (arXiv:2203.07941v2 [cs.CC] UPDATED)
    We investigate the complexity of the reachability problem for (deep) neural networks: does it compute valid output given some valid input? It was recently claimed that the problem is NP-complete for general neural networks and specifications over the input/output dimension given by conjunctions of linear inequalities. We recapitulate the proof and repair some flaws in the original upper and lower bound proofs. Motivated by the general result, we show that NP-hardness already holds for restricted classes of simple specifications and neural networks. Allowing for a single hidden layer and an output dimension of one as well as neural networks with just one negative, zero and one positive weight or bias is sufficient to ensure NP-hardness. Additionally, we give a thorough discussion and outlook of possible extensions for this direction of research on neural network verification.
    Few-Shot Learning on Graphs. (arXiv:2203.09308v2 [cs.LG] UPDATED)
    Graph representation learning has attracted tremendous attention due to its remarkable performance in many real-world applications. However, prevailing supervised graph representation learning models for specific tasks often suffer from label sparsity, as data labeling is time- and resource-consuming. In light of this, few-shot learning on graphs (FSLG), which combines the strengths of graph representation learning and few-shot learning, has been proposed to tackle the performance degradation caused by limited annotated data. There have been many studies working on FSLG recently. In this paper, we comprehensively survey this work in the form of a series of methods and applications. Specifically, we first introduce FSLG challenges and bases, then categorize and summarize existing work on FSLG in terms of three major graph mining tasks at different granularity levels, i.e., node, edge, and graph. Finally, we share our thoughts on some future research directions of FSLG. The authors of this survey have contributed significantly to the AI literature on FSLG over the last few years.
    KPGT: Knowledge-Guided Pre-training of Graph Transformer for Molecular Property Prediction. (arXiv:2206.03364v1 [q-bio.BM])
    Designing accurate deep learning models for molecular property prediction plays an increasingly essential role in drug and material discovery. Recently, due to the scarcity of labeled molecules, self-supervised learning methods for learning generalizable and transferable representations of molecular graphs have attracted lots of attention. In this paper, we argue that there exist two major issues hindering current self-supervised learning methods from obtaining desired performance on molecular property prediction, that is, the ill-defined pre-training tasks and the limited model capacity. To this end, we introduce Knowledge-guided Pre-training of Graph Transformer (KPGT), a novel self-supervised learning framework for molecular graph representation learning, to alleviate the aforementioned issues and improve the performance on the downstream molecular property prediction tasks. More specifically, we first introduce a high-capacity model, named Line Graph Transformer (LiGhT), which emphasizes the importance of chemical bonds and is mainly designed to model the structural information of molecular graphs. Then, a knowledge-guided pre-training strategy is proposed to exploit the additional knowledge of molecules to guide the model to capture the abundant structural and semantic information from large-scale unlabeled molecular graphs. Extensive computational tests demonstrated that KPGT can offer superior performance over current state-of-the-art methods on several molecular property prediction tasks.
    Gender Bias in Word Embeddings: A Comprehensive Analysis of Frequency, Syntax, and Semantics. (arXiv:2206.03390v1 [cs.CY])
    The statistical regularities in language corpora encode well-known social biases into word embeddings. Here, we focus on gender to provide a comprehensive analysis of group-based biases in widely-used static English word embeddings trained on internet corpora (GloVe 2014, fastText 2017). Using the Single-Category Word Embedding Association Test, we demonstrate the widespread prevalence of gender biases that also show differences in: (a) frequencies of words associated with men versus women; (b) part-of-speech tags in gender-associated words; (c) semantic categories in gender-associated words; and (d) valence, arousal, and dominance in gender-associated words. First, in terms of word frequency: we find that, of the 1,000 most frequent words in the vocabulary, 77% are more associated with men than women, providing direct evidence of a masculine default in the everyday language of the English-speaking world. Second, turning to parts-of-speech: the top male-associated words are typically verbs (e.g., fight, overpower) while the top female-associated words are typically adjectives and adverbs (e.g., giving, emotionally). Gender biases in embeddings also permeate parts-of-speech. Third, for semantic categories: bottom-up cluster analyses of the top 1,000 words associated with each gender show that the top male-associated concepts include roles and domains of big tech, engineering, religion, sports, and violence; in contrast, the top female-associated concepts are less focused on roles, including, instead, female-specific slurs and sexual content, as well as appearance and kitchen terms. Fourth, using human ratings of word valence, arousal, and dominance from a ~20,000 word lexicon, we find that male-associated words are higher on arousal and dominance, while female-associated words are higher on valence.
    Task-aware Privacy Preservation for Multi-dimensional Data. (arXiv:2110.02329v2 [cs.CR] UPDATED)
    Local differential privacy (LDP) can be adopted to anonymize richer user data attributes that will be input to sophisticated machine learning (ML) tasks. However, today's LDP approaches are largely task-agnostic and often lead to severe performance loss -- they simply inject noise to all data attributes according to a given privacy budget, regardless of what features are most relevant for the ultimate task. In this paper, we address how to significantly improve the ultimate task performance with multi-dimensional user data by considering a task-aware privacy preservation problem. The key idea is to use an encoder-decoder framework to learn (and anonymize) a task-relevant latent representation of user data. We obtain an analytical near-optimal solution for the linear setting with mean-squared error (MSE) task loss. We also provide an approximate solution through a gradient-based learning algorithm for general nonlinear cases. Extensive experiments demonstrate that our task-aware approach significantly improves ultimate task accuracy compared to standard benchmark LDP approaches with the same level of privacy guarantee.
    Reweighing auxiliary losses in supervised learning. (arXiv:2202.03250v2 [cs.LG] UPDATED)
    Apart from the standard supervised learning using hard labels, often auxiliary losses are used in many supervised learning settings to improve the model's generalisation. For example, knowledge distillation adds a second, teacher mimicking loss to the training of a model, where the teacher may be a pretrained model that outputs a richer distribution over labels. Similarly, in settings with limited labelled data, weak labelling information is used in form of labelling functions. Auxiliary losses are introduced here to combat labelling functions that may be noisy rule-based approximations of true labels. We tackle the problem of learning to combine these losses in a principled manner. We introduce AMAL which learns instance-specific weights using meta learning on a validation metric to achieve optimal mixing of losses. Experiments in a number of knowledge distillation and rule denoising domains show that AMAL provides noticeable gains over competitive baselines in those domains. We empirically analyze our method and share insights into the mechanisms through which it provides performance gains.
    Physics-Inspired Temporal Learning of Quadrotor Dynamics for Accurate Model Predictive Trajectory Tracking. (arXiv:2206.03305v1 [cs.RO])
    Accurately modeling quadrotor's system dynamics is critical for guaranteeing agile, safe, and stable navigation. The model needs to capture the system behavior in multiple flight regimes and operating conditions, including those producing highly nonlinear effects such as aerodynamic forces and torques, rotor interactions, or possible system configuration modifications. Classical approaches rely on handcrafted models and struggle to generalize and scale to capture these effects. In this paper, we present a novel Physics-Inspired Temporal Convolutional Network (PI-TCN) approach to learning quadrotor's system dynamics purely from robot experience. Our approach combines the expressive power of sparse temporal convolutions and dense feed-forward connections to make accurate system predictions. In addition, physics constraints are embedded in the training process to facilitate the network's generalization capabilities to data outside the training distribution. Finally, we design a model predictive control approach that incorporates the learned dynamics for accurate closed-loop trajectory tracking fully exploiting the learned model predictions in a receding horizon fashion. Experimental results demonstrate that our approach accurately extracts the structure of the quadrotor's dynamics from data, capturing effects that would remain hidden to classical approaches. To the best of our knowledge, this is the first time physics-inspired deep learning is successfully applied to temporal convolutional networks and to the system identification task, while concurrently enabling predictive control.
    Few-Shot Learning by Dimensionality Reduction in Gradient Space. (arXiv:2206.03483v1 [cs.LG])
    We introduce SubGD, a novel few-shot learning method which is based on the recent finding that stochastic gradient descent updates tend to live in a low-dimensional parameter subspace. In experimental and theoretical analyses, we show that models confined to a suitable predefined subspace generalize well for few-shot learning. A suitable subspace fulfills three criteria across the given tasks: it (a) allows to reduce the training error by gradient flow, (b) leads to models that generalize well, and (c) can be identified by stochastic gradient descent. SubGD identifies these subspaces from an eigendecomposition of the auto-correlation matrix of update directions across different tasks. Demonstrably, we can identify low-dimensional suitable subspaces for few-shot learning of dynamical systems, which have varying properties described by one or few parameters of the analytical system description. Such systems are ubiquitous among real-world applications in science and engineering. We experimentally corroborate the advantages of SubGD on three distinct dynamical systems problem settings, significantly outperforming popular few-shot learning methods both in terms of sample efficiency and performance.
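    The subspace identification described in the SubGD abstract can be illustrated with a minimal NumPy sketch. It assumes the update directions from all tasks are stacked as rows of a matrix, forms their auto-correlation matrix, and keeps the leading eigenvectors as a basis. The names `fit_subspace` and `project`, the synthetic data, and the plain orthogonal projection of a gradient are illustrative assumptions, not the authors' implementation; the actual SubGD update rule may differ.

```python
import numpy as np


def fit_subspace(updates: np.ndarray, dim: int) -> np.ndarray:
    """Return an orthonormal basis (num_params x dim) spanned by the top
    eigenvectors of the auto-correlation matrix of the update directions.

    `updates` has shape (num_updates, num_params); hypothetical layout.
    """
    autocorr = updates.T @ updates / updates.shape[0]
    eigvals, eigvecs = np.linalg.eigh(autocorr)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:dim]        # indices of the largest ones
    return eigvecs[:, top]


def project(vector: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Orthogonally project a vector (or columns of a matrix) onto the subspace."""
    return basis @ (basis.T @ vector)


# Synthetic check: updates that live (up to small noise) in a 2-D subspace
# of a 20-D parameter space.
rng = np.random.default_rng(0)
true_basis = np.linalg.qr(rng.normal(size=(20, 2)))[0]
updates = rng.normal(size=(50, 2)) @ true_basis.T + 1e-3 * rng.normal(size=(50, 20))

basis = fit_subspace(updates, dim=2)
gradient = rng.normal(size=20)
gradient_in_subspace = project(gradient, basis)

# Residual of the true basis after projection; small if the subspace is recovered.
recovery_error = np.linalg.norm(true_basis - project(true_basis, basis))
print(basis.shape, recovery_error)
```

    In this toy setting the recovered basis spans nearly the same subspace as the one used to generate the updates, which is the property a few-shot learner confined to that subspace would rely on.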
    Truncated Diffusion Probabilistic Models. (arXiv:2202.09671v2 [stat.ML] UPDATED)
    Employing a forward Markov diffusion chain to gradually map the data to a noise distribution, diffusion probabilistic models learn how to generate the data by inferring a reverse Markov diffusion chain to invert the forward diffusion process. To achieve competitive data generation performance, they demand a long diffusion chain that makes them computationally intensive in not only training but also generation. To significantly improve the computation efficiency, we propose to truncate the forward diffusion chain by abolishing the requirement of diffusing the data to random noise. Consequently, we start the inverse diffusion chain from an implicit generative distribution, rather than random noise, and learn its parameters by matching it to the distribution of the data corrupted by the truncated forward diffusion chain. Experimental results show our truncated diffusion probabilistic models provide consistent improvements over the non-truncated ones in terms of the generation performance and the number of required inverse diffusion steps.
    First is Better Than Last for Training Data Influence. (arXiv:2202.11844v2 [cs.LG] UPDATED)
    The ability to identify influential training examples enables us to debug training data and explain model behavior. Existing techniques to do so are based on the flow of training data influence through the model parameters. For large models in NLP applications, it is often computationally infeasible to study this flow through all model parameters, therefore techniques usually pick the last layer of weights. However, we observe that since the activation connected to the last layer of weights contains ``shared logic'', the data influence calculated via the last layer weights is prone to a ``cancellation effect'', where the data influences of different examples have large magnitudes that contradict each other. The cancellation effect lowers the discriminative power of the influence score, and deleting influential examples according to this measure often does not change the model's behavior by much. To mitigate this, we propose a technique called TracIn-WE that modifies a method called TracIn to operate on the word embedding layer instead of the last layer, where the cancellation effect is less severe. One potential concern is that influence based on the word embedding layer may not encode sufficient high level information. However, we find that gradients (unlike embeddings) do not suffer from this, possibly because they chain through higher layers. We show that TracIn-WE significantly outperforms other data influence methods applied on the last layer by 4-10 on the case deletion evaluation on three language classification tasks. In addition, TracIn-WE can produce scores not just at the level of the overall training input, but also at the level of words within the training input, a further aid in debugging.
    Towards a General Purpose CNN for Long Range Dependencies in $\mathrm{N}$D. (arXiv:2206.03398v1 [cs.LG])
    The use of Convolutional Neural Networks (CNNs) is widespread in Deep Learning due to a range of desirable model properties which result in an efficient and effective machine learning framework. However, performant CNN architectures must be tailored to specific tasks in order to incorporate considerations such as the input length, resolution, and dimensionality. In this work, we overcome the need for problem-specific CNN architectures with our Continuous Convolutional Neural Network (CCNN): a single CNN architecture equipped with continuous convolutional kernels that can be used for tasks on data of arbitrary resolution, dimensionality and length without structural changes. Continuous convolutional kernels model long range dependencies at every layer, and remove the need for downsampling layers and task-dependent depths needed in current CNN architectures. We show the generality of our approach by applying the same CCNN to a wide set of tasks on sequential (1$\mathrm{D}$) and visual data (2$\mathrm{D}$). Our CCNN performs competitively and often outperforms the current state-of-the-art across all tasks considered.
    A Robust Classification-autoencoder to Defend Outliers and Adversaries. (arXiv:2106.15927v2 [cs.LG] UPDATED)
    In this paper, a robust classification-autoencoder (CAE) is proposed, which has strong ability to recognize outliers and defend adversaries. The main idea is to change the autoencoder from an unsupervised learning model into a classifier, where the encoder is used to compress samples with different labels into disjoint compression spaces and the decoder is used to recover samples from their compression spaces. The encoder is used both as a compressed feature learner and as a classifier, and the decoder is used to decide whether the classification given by the encoder is correct by comparing the input sample with the output. Since adversary samples are seemingly inevitable for the current DNN framework, the list classifier to defend adversaries is introduced based on CAE, which outputs several labels and the corresponding samples recovered by the CAE. Extensive experimental results are used to show that the CAE achieves state of the art to recognize outliers by finding almost all outliers; the list classifier gives near lossless classification in the sense that the output list contains the correct label for almost all adversaries and the size of the output list is reasonably small.
    DNNFuser: Generative Pre-Trained Transformer as a Generalized Mapper for Layer Fusion in DNN Accelerators. (arXiv:2201.11218v2 [cs.LG] UPDATED)
    Dataflow/mapping decides the compute and energy efficiency of DNN accelerators. Many mappers have been proposed to tackle the intra-layer map-space. However, mappers for inter-layer map-space (aka layer-fusion map-space), have been rarely discussed. In this work, we propose a mapper, DNNFuser, specifically focusing on this layer-fusion map-space. While existing SOTA DNN mapping explorations rely on search-based mappers, this is the first work, to the best of our knowledge, to propose a one-shot inference-based mapper. We leverage Transformer as our DNN architecture to learn layer-fusion optimization as a sequence modeling problem. Further, the trained DNNFuser can generalize its knowledge and infer new solutions for unseen conditions. Within one inference pass, DNNFuser can infer solutions with compatible performance to the ones found by a highly optimized search-based mapper while being 66x-127x faster.
    Generative modeling via tensor train sketching. (arXiv:2202.11788v2 [math.NA] UPDATED)
    In this paper we introduce a sketching algorithm for constructing a tensor train representation of a probability density from its samples. Our method deviates from the standard recursive SVD-based procedure for constructing a tensor train. Instead we formulate and solve a sequence of small linear systems for the individual tensor train cores. This approach can avoid the curse of dimensionality that threatens both the algorithmic and sample complexities of the recovery problem. Specifically, for Markov models, we prove that the tensor cores can be recovered with a sample complexity that is constant with respect to the dimension. Finally, we illustrate the performance of the method with several numerical experiments.
    Progressive Distillation for Fast Sampling of Diffusion Models. (arXiv:2202.00512v2 [cs.LG] UPDATED)
    Diffusion models have recently shown great promise for generative modeling, outperforming GANs on perceptual quality and autoregressive models at density estimation. A remaining downside is their slow sampling time: generating high quality samples takes many hundreds or thousands of model evaluations. Here we make two contributions to help eliminate this downside: First, we present new parameterizations of diffusion models that provide increased stability when using few sampling steps. Second, we present a method to distill a trained deterministic diffusion sampler, using many steps, into a new diffusion model that takes half as many sampling steps. We then keep progressively applying this distillation procedure to our model, halving the number of required sampling steps each time. On standard image generation benchmarks like CIFAR-10, ImageNet, and LSUN, we start out with state-of-the-art samplers taking as many as 8192 steps, and are able to distill down to models taking as few as 4 steps without losing much perceptual quality; achieving, for example, a FID of 3.0 on CIFAR-10 in 4 steps. Finally, we show that the full progressive distillation procedure does not take more time than it takes to train the original model, thus representing an efficient solution for generative modeling using diffusion at both train and test time.
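    The halving arithmetic in the progressive distillation abstract can be made concrete with a short sketch. It only counts how many distillation rounds are needed to go from the starting number of sampling steps to the final one when each round halves the count; `halving_schedule` is a hypothetical helper, and no model training is performed.

```python
def halving_schedule(start_steps: int, final_steps: int) -> list[int]:
    """List the sampler step counts visited when every round halves the count."""
    if final_steps < 1 or start_steps < final_steps:
        raise ValueError("need start_steps >= final_steps >= 1")
    schedule = [start_steps]
    while schedule[-1] > final_steps:
        if schedule[-1] % 2:
            raise ValueError("step count is odd and cannot be halved exactly")
        schedule.append(schedule[-1] // 2)
    if schedule[-1] != final_steps:
        raise ValueError("final_steps is not reachable by repeated halving")
    return schedule


schedule = halving_schedule(8192, 4)
rounds = len(schedule) - 1  # one distillation round per halving
print(schedule)
print(rounds)  # 11
```

    Going from 8192 steps to 4 steps therefore takes 11 successive halvings, since 8192 = 2^13 and 4 = 2^2.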
    Computing Graph Descriptors on Edge Streams. (arXiv:2109.01494v4 [cs.LG] UPDATED)
    Feature extraction is an essential task in graph analytics. These feature vectors, called graph descriptors, are used in downstream vector-space-based graph analysis models. This idea has proved fruitful in the past, with spectral-based graph descriptors providing state-of-the-art classification accuracy. However, known algorithms to compute meaningful descriptors do not scale to large graphs since: (1) they require storing the entire graph in memory, and (2) the end-user has no control over the algorithm's runtime. In this paper, we present streaming algorithms to approximately compute three different graph descriptors capturing the essential structure of graphs. Operating on edge streams allows us to avoid storing the entire graph in memory, and controlling the sample size enables us to keep the runtime of our algorithms within desired bounds. We demonstrate the efficacy of the proposed descriptors by analyzing the approximation error and classification accuracy. Our scalable algorithms compute descriptors of graphs with millions of edges within minutes. Moreover, these descriptors yield predictive accuracy comparable to the state-of-the-art methods but can be computed using only 25% as much memory.
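    The two scalability points in the graph descriptor abstract (no full graph in memory, runtime controlled through the sample size) can be illustrated with a generic streaming primitive. The sketch below uses classical reservoir sampling over an edge stream; this is an assumption chosen for illustration, not necessarily the sampling scheme used in the paper, and `reservoir_sample_edges` is a hypothetical name.

```python
import random
from typing import Iterable, Iterator


def reservoir_sample_edges(
    edge_stream: Iterable[tuple[int, int]], sample_size: int, rng: random.Random
) -> list[tuple[int, int]]:
    """Keep a uniform random sample of `sample_size` edges from a stream
    while storing at most `sample_size` edges at any time."""
    reservoir: list[tuple[int, int]] = []
    for index, edge in enumerate(edge_stream):
        if index < sample_size:
            reservoir.append(edge)
        else:
            # Replace a stored edge with probability sample_size / (index + 1).
            slot = rng.randint(0, index)
            if slot < sample_size:
                reservoir[slot] = edge
    return reservoir


def path_graph_edges(num_nodes: int) -> Iterator[tuple[int, int]]:
    """Yield the edges of a path graph one at a time (toy edge stream)."""
    for node in range(num_nodes - 1):
        yield (node, node + 1)


sampled_edges = reservoir_sample_edges(path_graph_edges(10_000), 100, random.Random(0))
print(len(sampled_edges))
```

    Memory use depends only on the chosen sample size, not on the number of edges in the stream, which is the kind of control the abstract refers to.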
    Unbiased estimators for random design regression. (arXiv:1907.03411v2 [stat.ML] UPDATED)
    In linear regression we wish to estimate the optimum linear least squares predictor for a distribution over $d$-dimensional input points and real-valued responses, based on a small sample. Under standard random design analysis, where the sample is drawn i.i.d. from the input distribution, the least squares solution for that sample can be viewed as the natural estimator of the optimum. Unfortunately, this estimator almost always incurs an undesirable bias coming from the randomness of the input points, which is a significant bottleneck in model averaging. In this paper we show that it is possible to draw a non-i.i.d. sample of input points such that, regardless of the response model, the least squares solution is an unbiased estimator of the optimum. Moreover, this sample can be produced efficiently by augmenting a previously drawn i.i.d. sample with an additional set of $d$ points, drawn jointly according to a certain determinantal point process constructed from the input distribution rescaled by the squared volume spanned by the points. Motivated by this, we develop a theoretical framework for studying volume-rescaled sampling, and in the process prove a number of new matrix expectation identities. We use them to show that for any input distribution and $\epsilon>0$ there is a random design consisting of $O(d\log d+ d/\epsilon)$ points from which an unbiased estimator can be constructed whose expected square loss over the entire distribution is bounded by $1+\epsilon$ times the loss of the optimum. We provide efficient algorithms for generating such unbiased estimators in a number of practical settings and support our claims experimentally.
    Analyzing the impact of feature selection on the accuracy of heart disease prediction. (arXiv:2206.03239v1 [cs.LG])
    Heart disease has become one of the most serious diseases and has a significant impact on human life. It has emerged as one of the leading causes of mortality across the globe during the last decade. To prevent further harm to patients, an accurate and timely diagnosis of heart disease is essential. Recently, non-invasive approaches such as artificial intelligence-based techniques have come into use in the medical field. In particular, machine learning employs several algorithms and techniques that are widely used and highly useful for diagnosing heart disease accurately and in less time. However, the prediction of heart disease is not an easy task. The increasing size of medical datasets has made it complicated for practitioners to understand the complex feature relations and make disease predictions. Accordingly, the aim of this research is to identify the most important risk factors from a high-dimensional dataset, which helps in the accurate classification of heart disease with fewer complications. For a broader analysis, we have used two heart disease datasets with various medical features. The classification results of the benchmarked models show that relevant features have a high impact on classification accuracy. Even with a reduced number of features, the performance of the classification models improved significantly, with reduced training time, compared with models trained on the full feature set.
    Generating Long Videos of Dynamic Scenes. (arXiv:2206.03429v1 [cs.CV])
    We present a video generation model that accurately reproduces object motion, changes in camera viewpoint, and new content that arises over time. Existing video generation methods often fail to produce new content as a function of time while maintaining consistencies expected in real environments, such as plausible dynamics and object persistence. A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency, such as a single latent code that dictates content for the entire video. On the other extreme, without long-term consistency, generated videos may morph unrealistically between different scenes. To address these limitations, we prioritize the time axis by redesigning the temporal latent representation and learning long-term consistency from data by training on longer videos. To this end, we leverage a two-phase training strategy, where we separately train using longer videos at a low resolution and shorter videos at a high resolution. To evaluate the capabilities of our model, we introduce two new benchmark datasets with explicit focus on long-term temporal dynamics.
    On Transportation of Mini-batches: A Hierarchical Approach. (arXiv:2102.05912v5 [stat.ML] UPDATED)
    Mini-batch optimal transport (m-OT) has been successfully used in practical applications that involve probability measures with a very high number of supports. The m-OT solves several smaller optimal transport problems and then returns the average of their costs and transportation plans. Despite its scalability advantage, the m-OT does not consider the relationship between mini-batches which leads to undesirable estimation. Moreover, the m-OT does not approximate a proper metric between probability measures since the identity property is not satisfied. To address these problems, we propose a novel mini-batch scheme for optimal transport, named Batch of Mini-batches Optimal Transport (BoMb-OT), that finds the optimal coupling between mini-batches and it can be seen as an approximation to a well-defined distance on the space of probability measures. Furthermore, we show that the m-OT is a limit of the entropic regularized version of the BoMb-OT when the regularized parameter goes to infinity. Finally, we carry out experiments on various applications including deep generative models, deep domain adaptation, approximate Bayesian computation, color transfer, and gradient flow to show that the BoMb-OT can be widely applied and performs well in various applications.
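    The claim in the m-OT abstract that the identity property is not satisfied can be checked numerically with a small sketch. It assumes one-dimensional samples with uniform weights and cost |x - y|, where the exact optimal transport cost between two equal-size samples is obtained by matching sorted values. The function names, the Gaussian toy data, and the scheme of drawing independent random mini-batch pairs are illustrative assumptions rather than the paper's exact procedure.

```python
import random


def ot_cost_1d(xs: list[float], ys: list[float]) -> float:
    """Exact OT cost between two equal-size, uniformly weighted 1-D samples
    under cost |x - y| (sorted values are matched in order)."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def minibatch_ot_cost(
    xs: list[float], ys: list[float], batch_size: int, num_pairs: int, rng: random.Random
) -> float:
    """Average the OT cost over independently drawn mini-batch pairs."""
    total = 0.0
    for _ in range(num_pairs):
        batch_x = rng.sample(xs, batch_size)
        batch_y = rng.sample(ys, batch_size)
        total += ot_cost_1d(batch_x, batch_y)
    return total / num_pairs


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)]

full_cost = ot_cost_1d(data, data)  # a measure compared with itself
mini_cost = minibatch_ot_cost(data, data, batch_size=8, num_pairs=50, rng=rng)
print(full_cost, mini_cost)
```

    The full-sample cost of a measure against itself is exactly zero, while the averaged mini-batch cost is strictly positive because the two mini-batches in each pair differ. This is the failure of the identity property that motivates coupling the mini-batches themselves.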
    Towards Fairness-Aware Federated Learning. (arXiv:2111.01872v2 [cs.LG] UPDATED)
    Recent advances in Federated Learning (FL) have brought large-scale collaborative machine learning opportunities for massively distributed clients with performance and data privacy guarantees. However, most current works focus on the interest of the central controller in FL, and overlook the interests of the FL clients. This may result in unfair treatment of clients which discourages them from actively participating in the learning process and damages the sustainability of the FL ecosystem. Therefore, the topic of ensuring fairness in FL is attracting a great deal of research interest. In recent years, diverse Fairness-Aware FL (FAFL) approaches have been proposed in an effort to achieve fairness in FL from different perspectives. However, there is no comprehensive survey which helps readers gain insight into this interdisciplinary field. This paper aims to provide such a survey. By examining the fundamental and simplifying assumptions, as well as the notions of fairness adopted by existing literature in this field, we propose a taxonomy of FAFL approaches covering major steps in FL, including client selection, optimization, contribution evaluation and incentive distribution. In addition, we discuss the main metrics for experimentally evaluating the performance of FAFL approaches, and suggest promising future research directions towards fairness-aware federated learning.
    Towards Understanding and Mitigating Audio Adversarial Examples for Speaker Recognition. (arXiv:2206.03393v1 [cs.SD])
    Speaker recognition systems (SRSs) have recently been shown to be vulnerable to adversarial attacks, raising significant security concerns. In this work, we systematically investigate transformation and adversarial training based defenses for securing SRSs. Based on the characteristics of SRSs, we present 22 diverse transformations and thoroughly evaluate them using 7 recent promising adversarial attacks (4 white-box and 3 black-box) on speaker recognition. With careful regard for best practices in defense evaluations, we analyze the strength of transformations to withstand adaptive attacks. We also evaluate and understand their effectiveness against adaptive attacks when combined with adversarial training. Our study provides many useful insights and findings, many of which are new or inconsistent with the conclusions in the image and speech recognition domains, e.g., variable and constant bit rate speech compressions have different performance, and some non-differentiable transformations remain effective against current promising evasion techniques which often work well in the image domain. We demonstrate that the proposed novel feature-level transformation combined with adversarial training is rather effective compared to the sole adversarial training in a complete white-box setting, e.g., increasing the accuracy by 13.62% and attack cost by two orders of magnitude, while other transformations do not necessarily improve the overall defense capability. This work sheds further light on the research directions in this field. We also release our evaluation platform SPEAKERGUARD to foster further research.
    Machine learning fairness notions: Bridging the gap with real-world applications. (arXiv:2006.16745v5 [cs.LG] UPDATED)
    Fairness emerged as an important requirement to guarantee that Machine Learning (ML) predictive systems do not discriminate against specific individuals or entire sub-populations, in particular, minorities. Given the inherent subjectivity of viewing the concept of fairness, several notions of fairness have been introduced in the literature. This paper is a survey that illustrates the subtleties between fairness notions through a large number of examples and scenarios. In addition, unlike other surveys in the literature, it addresses the question of: which notion of fairness is most suited to a given real-world scenario and why? Our attempt to answer this question consists in (1) identifying the set of fairness-related characteristics of the real-world scenario at hand, (2) analyzing the behavior of each fairness notion, and then (3) fitting these two elements to recommend the most suitable fairness notion in every specific setup. The results are summarized in a decision diagram that can be used by practitioners and policymakers to navigate the relatively large catalog of ML.
    Survey Descent: A Multipoint Generalization of Gradient Descent for Nonsmooth Optimization. (arXiv:2111.15645v4 [math.OC] UPDATED)
    For strongly convex objectives that are smooth, the classical theory of gradient descent ensures linear convergence relative to the number of gradient evaluations. An analogous nonsmooth theory is challenging. Even when the objective is smooth at every iterate, the corresponding local models are unstable and the number of cutting planes invoked by traditional remedies is difficult to bound, leading to convergence guarantees that are sublinear relative to the cumulative number of gradient evaluations. We instead propose a multipoint generalization of the gradient descent iteration for local optimization. While designed with general objectives in mind, we are motivated by a ``max-of-smooth'' model that captures the subdifferential dimension at optimality. We prove linear convergence when the objective is itself max-of-smooth, and experiments suggest a more general phenomenon.
    On the Role of Discount Factor in Offline Reinforcement Learning. (arXiv:2206.03383v1 [cs.LG])
    Offline reinforcement learning (RL) enables effective learning from previously collected data without exploration, which shows great promise in real-world applications when exploration is expensive or even infeasible. The discount factor, $\gamma$, plays a vital role in improving online RL sample efficiency and estimation accuracy, but the role of the discount factor in offline RL is not well explored. This paper examines two distinct effects of $\gamma$ in offline RL with theoretical analysis, namely the regularization effect and the pessimism effect. On the one hand, $\gamma$ is a regulator that trades off optimality against sample efficiency upon existing offline techniques. On the other hand, lower guidance $\gamma$ can also be seen as a way of pessimism where we optimize the policy's performance in the worst possible models. We empirically verify the above theoretical observation with tabular MDPs and standard D4RL tasks. The results show that the discount factor plays an essential role in the performance of offline RL algorithms, both under small data regimes upon existing offline methods and in large data regimes without other conservatisms.
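    To make the role of $\gamma$ concrete, here is a minimal sketch of how a discount factor weights a finite reward sequence. The function names and the `1 / (1 - gamma)` "effective horizon" rule of thumb are illustrative assumptions, not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def effective_horizon(gamma):
    """Rule-of-thumb planning horizon 1 / (1 - gamma), valid for 0 <= gamma < 1."""
    return 1.0 / (1.0 - gamma)


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    # A lower gamma gives less weight to later rewards.
    print(discounted_return(rewards, 0.5))
    print(discounted_return(rewards, 0.99))
    print(effective_horizon(0.99))
```

    A smaller `gamma` shrinks the contribution of distant rewards, which is one informal way to read the regularization effect described above.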
    The Fragility of Optimized Bandit Algorithms. (arXiv:2109.13595v2 [cs.LG] UPDATED)
    Much of the literature on optimal design of bandit algorithms is based on minimization of expected regret. It is well known that designs that are optimal over certain exponential families can achieve expected regret that grows logarithmically in the number of arm plays, at a rate governed by the Lai-Robbins lower bound. In this paper, we show that when one uses such optimized designs, the regret distribution of the associated algorithms necessarily has a very heavy tail, specifically, that of a truncated Cauchy distribution. Furthermore, for $p>1$, the $p$'th moment of the regret distribution grows much faster than poly-logarithmically, in particular as a power of the total number of arm plays. We show that optimized UCB bandit designs are also fragile in an additional sense, namely when the problem is even slightly mis-specified, the regret can grow much faster than the conventional theory suggests. Our arguments are based on standard change-of-measure ideas, and indicate that the most likely way that regret becomes larger than expected is when the optimal arm returns below-average rewards in the first few arm plays, thereby causing the algorithm to believe that the arm is sub-optimal. To alleviate the fragility issues exposed, we show that UCB algorithms can be modified so as to ensure a desired degree of robustness to mis-specification. In doing so, we also provide a sharp trade-off between the amount of UCB exploration and the tail exponent of the resulting regret distribution.
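    For readers unfamiliar with UCB designs, the following is a minimal sketch of the textbook UCB1 index rule. The `exploration` parameter and the `reward_fn` callback are hypothetical names used for illustration; this is not the modified algorithm proposed in the paper.

```python
import math


def ucb1(reward_fn, n_arms, horizon, exploration=2.0):
    """Run a UCB1-style index policy and return the number of plays per arm.

    reward_fn(arm) returns the observed reward for one play of `arm`.
    `exploration` scales the confidence bonus (2.0 is the classic UCB1 choice).
    """
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # play every arm once to initialise the estimates
        else:
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(exploration * math.log(t) / counts[a]),
            )
        counts[arm] += 1
        sums[arm] += reward_fn(arm)
    return counts


if __name__ == "__main__":
    means = [0.2, 0.8]
    # Deterministic rewards keep the example reproducible.
    print(ucb1(lambda arm: means[arm], n_arms=2, horizon=500))
```

    Increasing `exploration` makes the policy sample apparently sub-optimal arms more often, which is the kind of knob the exploration versus tail-exponent trade-off refers to.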
    Concentration bounds for SSP Q-learning for average cost MDPs. (arXiv:2206.03328v1 [cs.LG])
    We derive a concentration bound for a Q-learning algorithm for average cost Markov decision processes based on an equivalent shortest path problem, and compare it numerically with the alternative scheme based on relative value iteration.
    Recent Advances for Quantum Neural Networks in Generative Learning. (arXiv:2206.03066v1 [quant-ph])
    Quantum computers are next-generation devices that hold promise to perform calculations beyond the reach of classical computers. A leading method towards achieving this goal is through quantum machine learning, especially quantum generative learning. Due to the intrinsic probabilistic nature of quantum mechanics, it is reasonable to postulate that quantum generative learning models (QGLMs) may surpass their classical counterparts. As such, QGLMs are receiving growing attention from the quantum physics and computer science communities, where various QGLMs that can be efficiently implemented on near-term quantum machines with potential computational advantages are proposed. In this paper, we review the current progress of QGLMs from the perspective of machine learning. Particularly, we interpret these QGLMs, covering quantum circuit Born machines, quantum generative adversarial networks, quantum Boltzmann machines, and quantum autoencoders, as the quantum extension of classical generative learning models. In this context, we explore their intrinsic relation and their fundamental differences. We further summarize the potential applications of QGLMs in both conventional machine learning tasks and quantum physics. Last, we discuss the challenges and further research directions for QGLMs.
    Demystifying the Global Convergence Puzzle of Learning Over-parameterized ReLU Nets in Very High Dimensions. (arXiv:2206.03254v1 [cs.LG])
    This theoretical paper is devoted to developing a rigorous theory for demystifying the global convergence phenomenon in a challenging scenario: learning over-parameterized Rectified Linear Unit (ReLU) nets for very high dimensional datasets under very mild assumptions. A major ingredient of our analysis is a fine-grained analysis of random activation matrices. The essential virtue of dissecting activation matrices is that it bridges the dynamics of optimization and angular distribution in high-dimensional data space. This angle-based detailed analysis leads to asymptotic characterizations of gradient norm and directional curvature of objective function at each gradient descent iteration, revealing that the empirical loss function enjoys nice geometrical properties in the overparameterized setting. Along the way, we significantly improve existing theoretical bounds on both over-parameterization condition and learning rate with very mild assumptions for learning very high dimensional data. Moreover, we uncover the role of the geometrical and spectral properties of the input data in determining desired over-parameterization size and global convergence rate. All these clues allow us to discover a novel geometric picture of nonconvex optimization in deep learning: angular distribution in high-dimensional data space $\mapsto$ spectrums of overparameterized activation matrices $\mapsto$ favorable geometrical properties of empirical loss landscape $\mapsto$ global convergence phenomenon. Furthermore, our theoretical results imply that gradient-based nonconvex optimization algorithms have much stronger statistical guarantees with much milder over-parameterization condition than existing theory states for learning very high dimensional data, which is rarely explored so far.
    Searching Similarity Measure for Binarized Neural Networks. (arXiv:2206.03325v1 [cs.LG])
    Being a promising model to be deployed in resource-limited devices, Binarized Neural Networks (BNNs) have drawn extensive attention from both academia and industry. However, compared to the full-precision deep neural networks (DNNs), BNNs suffer from non-trivial accuracy degradation, limiting their applicability in various domains. This is partially because existing network components, such as the similarity measure, are specially designed for DNNs, and might be sub-optimal for BNNs. In this work, we focus on the key component of BNNs -- the similarity measure, which quantifies the distance between input feature maps and filters, and propose an automatic searching method, based on genetic algorithm, for BNN-tailored similarity measure. Evaluation results on Cifar10 and Cifar100 using ResNet, NIN and VGG show that most of the identified similarity measures can achieve considerable accuracy improvement (up to 3.39%) over the commonly-used cross-correlation approach.
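    As background for the cross-correlation baseline, the sketch below shows the well-known identity that a dot product between two {-1, +1} vectors can be computed with XNOR and popcount. This is general BNN background rather than the searched similarity measure from the paper; `pack_bits` and `binary_dot` are hypothetical helper names.

```python
def pack_bits(signs):
    """Pack a sequence of +1/-1 values into an int (bit i set means signs[i] == +1)."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n {-1, +1} vectors given in packed form."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask  # XNOR: 1 where the signs match
    matches = bin(agree).count("1")    # popcount
    # Each match contributes +1 and each mismatch -1.
    return 2 * matches - n


if __name__ == "__main__":
    a = [1, -1, 1, 1]
    b = [1, 1, -1, 1]
    print(binary_dot(pack_bits(a), pack_bits(b), len(a)))
    print(sum(x * y for x, y in zip(a, b)))  # reference value
```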
    Rites de Passage: Elucidating Displacement to Emplacement of Refugees. (arXiv:2206.03248v1 [cs.CY])
    Social media deliberations allow refugee-related issues to be explored. AI-based studies have investigated refugee issues mostly around a specific event and considered unimodal approaches. In contrast, we have employed a multimodal architecture for probing the refugee journeys from their home to host nations. We draw insights from Arnold van Gennep's anthropological work 'Les Rites de Passage', which systematically analyzed an individual's transition from one group or society to another. Based on Gennep's separation-transition-incorporation framework, we have identified four phases of refugee journeys: Arrival of Refugees, Temporal stay at Asylums, Rehabilitation, and Integration of Refugees into the host nation. We collected 0.23 million multimodal tweets from April 2020 to March 2021 for testing this proposed framework. We find that a combination of transformer-based language models and state-of-the-art image recognition models, such as fusion of BERT+LSTM and InceptionV4, can outperform unimodal models. Subsequently, to test the practical implication of our proposed model in real-time, we have considered 0.01 million multimodal tweets related to the 2022 Ukrainian refugee crisis. An F1-score of 71.88% for this 2022 crisis confirms the generalizability of our proposed framework.
    FedRel: An Adaptive Federated Relevance Framework for Spatial Temporal Graph Learning. (arXiv:2206.03420v1 [cs.LG])
    Spatial-temporal data contains rich information and has been widely studied in recent years due to the rapid development of relevant applications in many fields. For instance, medical institutions often use electrodes attached to different parts of a patient to analyse the electroencephalogram data rich with spatial and temporal features for health assessment and disease diagnosis. Existing research has mainly used deep learning techniques such as convolutional neural network (CNN) or recurrent neural network (RNN) to extract hidden spatial-temporal features. Yet, it is challenging to incorporate both interdependent spatial information and dynamic temporal changes simultaneously. In reality, for a model that leverages these spatial-temporal features to fulfil complex prediction tasks, it often requires a colossal amount of training data in order to obtain satisfactory model performance. Considering the above-mentioned challenges, we propose an adaptive federated relevance framework, namely FedRel, for spatial-temporal graph learning in this paper. After transforming the raw spatial-temporal data into high quality features, the core Dynamic Inter-Intra Graph (DIIG) module in the framework is able to use these features to generate the spatial-temporal graphs capable of capturing the hidden topological and long-term temporal correlation information in these graphs. To improve the model generalization ability and performance while preserving the local data privacy, we also design a relevance-driven federated learning module in our framework to leverage diverse data distributions from different participants with attentive aggregations of their models.
    Deep Neural Patchworks: Coping with Large Segmentation Tasks. (arXiv:2206.03210v1 [cs.CV])
    Convolutional neural networks are the way to solve arbitrary image segmentation tasks. However, when images are large, memory demands often exceed the available resources, in particular on a common GPU. Especially in biomedical imaging, where 3D images are common, the problems are apparent. A typical approach to solve this limitation is to break the task into smaller subtasks by dividing images into smaller image patches. Another approach, if applicable, is to look at the 2D image sections separately, and to solve the problem in 2D. Often, the loss of global context makes such approaches less effective; important global information might not be present in the current image patch, or the selected 2D image section. Here, we propose Deep Neural Patchworks (DNP), a segmentation framework that is based on hierarchical and nested stacking of patch-based networks that solves the dilemma between global context and memory limitations.
    FDGNN: Fully Dynamic Graph Neural Network. (arXiv:2206.03469v1 [cs.LG])
    Dynamic Graph Neural Networks recently became more and more important as graphs from many scientific fields, ranging from mathematics, biology, social sciences, and physics to computer science, are dynamic by nature. While temporal changes (dynamics) play an essential role in many real-world applications, most of the models in the literature on Graph Neural Networks (GNN) process static graphs. The few GNN models on dynamic graphs only consider exceptional cases of dynamics, e.g., node attribute-dynamic graphs or structure-dynamic graphs limited to additions or changes to the graph's edges, etc. Therefore, we present a novel Fully Dynamic Graph Neural Network (FDGNN) that can handle fully-dynamic graphs in continuous time. The proposed method provides a node and an edge embedding that includes their activity to address added and deleted nodes or edges, and possible attributes. Furthermore, the embeddings specify Temporal Point Processes for each event to encode the distributions of the structure- and attribute-related incoming graph events. In addition, our model can be updated efficiently by considering single events for local retraining.
    Quantum Neural Network Classifiers: A Tutorial. (arXiv:2206.02806v1 [quant-ph])
    Machine learning has achieved dramatic success over the past decade, with applications ranging from face recognition to natural language processing. Meanwhile, rapid progress has been made in the field of quantum computation including developing both powerful quantum algorithms and advanced quantum devices. The interplay between machine learning and quantum physics holds the intriguing potential for bringing practical applications to modern society. Here, we focus on quantum neural networks in the form of parameterized quantum circuits. We will mainly discuss different structures and encoding strategies of quantum neural networks for supervised learning tasks, and benchmark their performance utilizing Yao.jl, a quantum simulation package written in the Julia language. The codes are efficient, aiming to provide convenience for beginners in scientific works such as developing powerful variational quantum learning models and assisting the corresponding experimental demonstrations.
    A Contribution-based Device Selection Scheme in Federated Learning. (arXiv:2203.05369v2 [cs.LG] UPDATED)
    In a Federated Learning (FL) setup, a number of devices contribute to the training of a common model. We present a method for selecting the devices that provide updates in order to achieve improved generalization, fast convergence, and better device-level performance. We formulate a min-max optimization problem and decompose it into a primal-dual setup, where the duality gap is used to quantify the device-level performance. Our strategy combines \emph{exploration} of data freshness through a random device selection with \emph{exploitation} through simplified estimates of device contributions. This improves the performance of the trained model both in terms of generalization and personalization. A modified Truncated Monte-Carlo (TMC) method is applied during the exploitation phase to estimate the device's contribution and lower the communication overhead. The experimental results show that the proposed approach has a competitive performance, with lower communication overhead and competitive personalization performance against the baseline schemes.
    Early Abnormal Detection of Sewage Pipe Network: Bagging of Various Abnormal Detection Algorithms. (arXiv:2206.03321v1 [cs.LG])
    Abnormalities of the sewage pipe network will affect the normal operation of the whole city. Therefore, it is important to detect the abnormalities early. This paper proposes an early abnormal-detection method. The abnormalities are detected using conventional algorithms, such as the isolation forest algorithm, and two innovations are given: (1) The current and historical data measured by the sensors placed in the sewage pipe network (such as ultrasonic Doppler flowmeters) are taken as the overall dataset, and then the overall dataset is examined using a conventional anomaly detection method to diagnose anomalies in the data. An anomaly refers to a sample that differs from the other samples in the whole dataset. Because the anomaly is defined not by the algorithm but by the whole dataset, the construction of the whole dataset is the key to early abnormal-detection algorithms. (2) A bagging strategy for a variety of conventional anomaly detection algorithms is proposed to achieve early detection of anomalies with high precision and recall. The results show that this method can achieve early anomaly detection with the highest precision of 98.21%, a recall of 63.58% and an F1-score of 0.774.
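    The idea of combining several anomaly detectors by voting can be sketched as follows. This is only an illustration under stated assumptions: it swaps in three simple statistical detectors (z-score, interquartile range, median absolute deviation) instead of the isolation forest mentioned in the abstract, and the thresholds, function names and sample readings are hypothetical.

```python
import statistics


def zscore_flags(values, threshold=2.5):
    """Flag points whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [False] * len(values)
    return [abs(v - mean) / std > threshold for v in values]


def iqr_flags(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v < low or v > high for v in values]


def mad_flags(values, threshold=3.5):
    """Flag points with a large modified z-score based on the median absolute deviation."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


def vote(values, detectors, min_votes=2):
    """Mark a point as anomalous when at least `min_votes` detectors flag it."""
    all_flags = [detector(values) for detector in detectors]
    return [sum(column) >= min_votes for column in zip(*all_flags)]


if __name__ == "__main__":
    # Hypothetical flow readings combining historical and current samples.
    flow = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 10.1, 9.9, 25.0]
    print(vote(flow, [zscore_flags, iqr_flags, mad_flags]))
```

    Because every detector scores a point relative to the full series, which samples end up in `flow` determines what counts as anomalous, in line with the abstract's emphasis on how the overall dataset is constructed.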
    Short Blocklength Wiretap Channel Codes via Deep Learning: Design and Performance Evaluation. (arXiv:2206.03477v1 [cs.IT])
    We design short blocklength codes for the Gaussian wiretap channel under information-theoretic security guarantees. Our approach consists in decoupling the reliability and secrecy constraints in our code design. Specifically, we handle the reliability constraint via an autoencoder, and handle the secrecy constraint with hash functions. For blocklengths smaller than or equal to 16, we evaluate through simulations the probability of error at the legitimate receiver and the leakage at the eavesdropper for our code construction. This leakage is defined as the mutual information between the confidential message and the eavesdropper's channel observations, and is empirically measured via a neural network-based mutual information estimator. Our simulation results provide examples of codes with positive secrecy rates that outperform the best known achievable secrecy rates obtained non-constructively for the Gaussian wiretap channel. Additionally, we show that our code design is suitable for the compound and arbitrarily varying Gaussian wiretap channels, for which the channel statistics are not perfectly known but only known to belong to a pre-specified uncertainty set. These models not only capture uncertainty related to channel statistics estimation, but also scenarios where the eavesdropper jams the legitimate transmission or influences its own channel statistics by changing its location.
    An efficient semi-supervised quality control system trained using physics-based MRI-artefact generators and adversarial training. (arXiv:2206.03359v1 [eess.IV])
    Large medical imaging data sets are becoming increasingly available. A common challenge in these data sets is to ensure that each sample meets minimum quality requirements devoid of significant artefacts. Despite a wide range of existing automatic methods having been developed to identify imperfections and artefacts in medical imaging, they mostly rely on data-hungry methods. In particular, the lack of sufficient scans with artefacts available for training has created a barrier in designing and deploying machine learning in clinical research. To tackle this problem, we propose a novel framework having four main components: (1) a set of artefact generators inspired by magnetic resonance physics to corrupt brain MRI scans and augment a training dataset, (2) a set of abstract and engineered features to represent images compactly, (3) a feature selection process that depends on the class of artefact to improve classification performance, and (4) a set of Support Vector Machine (SVM) classifiers trained to identify artefacts. Our novel contributions are threefold: first, we use the novel physics-based artefact generators to generate synthetic brain MRI scans with controlled artefacts as a data augmentation technique. This will avoid the labour-intensive collection and labelling process of scans with rare artefacts. Second, we propose a large pool of abstract and engineered image features developed to identify 9 different artefacts for structural MRI. Finally, we use an artefact-based feature selection block that, for each class of artefacts, finds the set of features that provide the best classification performance. We performed validation experiments on a large data set of scans with artificially-generated artefacts, and in a multiple sclerosis clinical trial where real artefacts were identified by experts, showing that the proposed pipeline outperforms traditional methods.
    Unsupervised Domain Adaptation across FMCW Radar Configurations Using Margin Disparity Discrepancy. (arXiv:2203.04588v2 [eess.SP] UPDATED)
    Commercial radar sensing is gaining relevance and machine learning algorithms constitute one of the key components that are enabling the spread of this radio technology into areas like surveillance or healthcare. However, radar datasets are still scarce and generalization cannot yet be achieved for all radar systems, environment conditions or design parameters. A certain degree of fine tuning is, therefore, usually required to deploy machine-learning-enabled radar applications. In this work, we consider the problem of unsupervised domain adaptation across radar configurations in the context of deep-learning human activity classification using frequency-modulated continuous-wave. For that, we focus on the theory-inspired technique of Margin Disparity Discrepancy, which has already proved successful in the area of computer vision. Our experiments extend this technique to radar data, achieving a comparable accuracy to few-shot supervised approaches for the same classification problem.
    Improving Mini-batch Optimal Transport via Partial Transportation. (arXiv:2108.09645v4 [stat.ML] UPDATED)
    Mini-batch optimal transport (m-OT) has been widely used recently to deal with the memory issue of OT in large-scale applications. Despite its practicality, m-OT suffers from misspecified mappings, namely, mappings that are optimal on the mini-batch level but are partially wrong in comparison with the optimal transportation plan between the original measures. Motivated by the misspecified mappings issue, we propose a novel mini-batch method by using partial optimal transport (POT) between mini-batch empirical measures, which we refer to as mini-batch partial optimal transport (m-POT). Leveraging the insight from the partial transportation, we explain the source of misspecified mappings from the m-OT and motivate why limiting the amount of transported masses among mini-batches via POT can alleviate the incorrect mappings. Finally, we carry out extensive experiments on various applications such as deep domain adaptation, partial domain adaptation, deep generative model, color transfer, and gradient flow to demonstrate the favorable performance of m-POT compared to current mini-batch methods.
    DeepOPF-AL: Augmented Learning for Solving AC-OPF Problems with Multiple Load-Solution Mappings. (arXiv:2206.03365v1 [cs.LG])
    The existence of multiple load-solution mappings of non-convex AC-OPF problems poses a fundamental challenge to deep neural network (DNN) schemes. As the training dataset may contain a mixture of data points corresponding to different load-solution mappings, the DNN can fail to learn a legitimate mapping and generate inferior solutions. We propose DeepOPF-AL as an augmented-learning approach to tackle this issue. The idea is to train a DNN to learn a unique mapping from an augmented input, i.e., (load, initial point), to the solution generated by an iterative OPF solver with the load and initial point as intake. We then apply the learned augmented mapping to solve AC-OPF problems much faster than conventional solvers. Simulation results over IEEE test cases show that DeepOPF-AL achieves noticeably better optimality and similar feasibility and speedup performance, as compared to a recent DNN scheme, with the same DNN size yet elevated training complexity.
    DETR++: Taming Your Multi-Scale Detection Transformer. (arXiv:2206.02977v1 [cs.CV])
    Convolutional Neural Networks (CNN) have dominated the field of detection ever since the success of AlexNet in ImageNet classification [12]. With the sweeping reform of Transformers [27] in natural language processing, Carion et al. [2] introduce the Transformer-based detection method, i.e., DETR. However, due to the quadratic complexity in the self-attention mechanism in the Transformer, DETR is never able to incorporate multi-scale features as performed in existing CNN-based detectors, leading to inferior results in small object detection. To mitigate this issue and further improve performance of DETR, in this work, we investigate different methods to incorporate multi-scale features and find that a Bi-directional Feature Pyramid (BiFPN) works best with DETR in further raising the detection precision. With this discovery, we propose DETR++, a new architecture that improves detection results by 1.9% AP on MS COCO 2017, 11.5% AP on RICO icon detection, and 9.1% AP on RICO layout extraction over existing baselines.
    Generalized Data Distribution Iteration. (arXiv:2206.03192v1 [cs.LG])
    Obtaining higher sample efficiency and superior final performance simultaneously has been one of the major challenges for deep reinforcement learning (DRL). Previous work could handle one of these challenges but typically failed to address them concurrently. In this paper, we try to tackle these two challenges simultaneously. To achieve this, we first decouple these challenges into two classic RL problems: data richness and exploration-exploitation trade-off. Then, we cast these two problems into the training data distribution optimization problem, namely to obtain desired training data within limited interactions, and address them concurrently via i) explicit modeling and control of the capacity and diversity of behavior policy and ii) more fine-grained and adaptive control of selective/sampling distribution of the behavior policy using a monotonic data distribution optimization. Finally, we integrate this process into Generalized Policy Iteration (GPI) and obtain a more general framework called Generalized Data Distribution Iteration (GDI). We use the GDI framework to introduce operator-based versions of well-known RL methods from DQN to Agent57. We also establish a theoretical guarantee of the superiority of GDI over GPI. We also demonstrate our state-of-the-art (SOTA) performance on Arcade Learning Environment (ALE), wherein our algorithm has achieved 9620.33% mean human normalized score (HNS), 1146.39% median HNS and surpassed 22 human world records using only 200M training frames. Our performance is comparable to Agent57's while we consume 500 times less data. We argue that there is still a long way to go before obtaining real superhuman agents in ALE.
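    For context on the reported HNS figures, the sketch below uses the definition of human normalized score that is common in the Atari literature: the agent's score rescaled so that a random policy maps to 0% and the human reference to 100%. The per-game numbers in the example are made up for illustration and are not results from the paper.

```python
import statistics


def human_normalized_score(agent_score, random_score, human_score):
    """Return the agent score as a percentage of the human-over-random gap."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)


if __name__ == "__main__":
    # Hypothetical (agent, random, human) raw scores for three games.
    games = [(110.0, 10.0, 60.0), (35.0, 10.0, 60.0), (510.0, 10.0, 60.0)]
    hns = [human_normalized_score(*game) for game in games]
    print(hns)
    # The mean is dominated by outlier games, the median is not.
    print(statistics.fmean(hns), statistics.median(hns))
```

    This is why a mean HNS can sit far above the median: a few games with very large normalized scores pull the mean up.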
    Machine Learning Sensors. (arXiv:2206.03266v1 [cs.LG])
    Machine learning sensors represent a paradigm shift for the future of embedded machine learning applications. Current instantiations of embedded machine learning (ML) suffer from complex integration, lack of modularity, and privacy and security concerns from data movement. This article proposes a more data-centric paradigm for embedding sensor intelligence on edge devices to combat these challenges. Our vision for "sensor 2.0" entails segregating sensor input data and ML processing from the wider system at the hardware level and providing a thin interface that mimics traditional sensors in functionality. This separation leads to a modular and easy-to-use ML sensor device. We discuss challenges presented by the standard approach of building ML processing into the software stack of the controlling microprocessor on an embedded system and how the modularity of ML sensors alleviates these problems. ML sensors increase privacy and accuracy while making it easier for system builders to integrate ML into their products as a simple component. We provide examples of prospective ML sensors and an illustrative datasheet as a demonstration and hope that this will build a dialogue to progress us towards sensor 2.0.
    Integrating Random Effects in Deep Neural Networks. (arXiv:2206.03314v1 [stat.ML])
    Modern approaches to supervised learning like deep neural networks (DNNs) typically implicitly assume that observed responses are statistically independent. In contrast, correlated data are prevalent in real-life large-scale applications, with typical sources of correlation including spatial, temporal and clustering structures. These correlations are either ignored by DNNs, or ad-hoc solutions are developed for specific use cases. We propose to use the mixed models framework to handle correlated data in DNNs. By treating the effects underlying the correlation structure as random effects, mixed models are able to avoid overfitted parameter estimates and ultimately yield better predictive performance. The key to combining mixed models and DNNs is using the Gaussian negative log-likelihood (NLL) as a natural loss function that is minimized with DNN machinery including stochastic gradient descent (SGD). Since NLL does not decompose like standard DNN loss functions, the use of SGD with NLL presents some theoretical and implementation challenges, which we address. Our approach which we call LMMNN is demonstrated to improve performance over natural competitors in various correlation scenarios on diverse simulated and real datasets. Our focus is on a regression setting and tabular datasets, but we also show some results for classification. Our code is available at https://github.com/gsimchoni/lmmnn.
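    The Gaussian NLL mentioned above can be written out under a standard mixed-model formulation. The notation here is an assumption made for illustration and is not taken from the abstract: $f(X)$ is the DNN output, $Z$ the random-effects design matrix, $b \sim \mathcal{N}(0, D)$ the random effects and $\varepsilon \sim \mathcal{N}(0, \sigma_e^2 I)$ the noise.

```latex
\begin{align}
  y &= f(X) + Z b + \varepsilon, \\
  V(\theta) &= Z D Z^{\top} + \sigma_e^{2} I, \\
  \mathrm{NLL}(f, \theta \mid y)
    &= \tfrac{1}{2}\,\bigl(y - f(X)\bigr)^{\top} V(\theta)^{-1} \bigl(y - f(X)\bigr)
     + \tfrac{1}{2}\,\log\lvert V(\theta)\rvert
     + \tfrac{n}{2}\,\log 2\pi .
\end{align}
```

    When $V(\theta)$ is not diagonal, the quadratic form and the log-determinant couple all $n$ observations, so the loss is not a plain sum of per-sample terms. This is one way to see why mini-batch SGD needs extra care here.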
    Boosting Search Engines with Interactive Agents. (arXiv:2109.00527v3 [cs.CL] UPDATED)
    This paper presents first successful steps in designing search agents that learn meta-strategies for iterative query refinement in information-seeking tasks. Our approach uses machine reading to guide the selection of refinement terms from aggregated search results. Agents are then empowered with simple but effective search operators to exert fine-grained and transparent control over queries and search results. We develop a novel way of generating synthetic search sessions, which leverages the power of transformer-based language models through (self-)supervised learning. We also present a reinforcement learning agent with dynamically constrained actions that learns interactive search strategies from scratch. Our search agents obtain retrieval and answer quality performance comparable to recent neural methods, using only a traditional term-based BM25 ranking function and interpretable discrete reranking and filtering actions.
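    The term-based BM25 ranking function referred to above can be sketched in a few lines. This is a minimal version of the commonly used Okapi BM25 formula with whitespace tokenization; the parameter defaults `k1=1.5` and `b=0.75` and the toy documents are assumptions for illustration, not the configuration used in the paper.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    docs = ["the cat sat on the mat", "dogs chase the ball", "a cat and a dog"]
    print(bm25_scores("cat mat", docs))
```

    Refinement operators of the kind described in the abstract would act on the query string before it reaches a scorer like this one.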
    On the Convergence of Optimizing Persistent-Homology-Based Losses. (arXiv:2206.02946v1 [cs.LG])
    Topological loss based on persistent homology has shown promise in various applications. A topological loss forces the model to achieve a certain desired topological property. Despite its empirical success, less is known about the optimization behavior of the loss. In fact, the topological loss involves combinatorial configurations that may oscillate during optimization. In this paper, we introduce a general purpose regularized topology-aware loss. We propose a novel regularization term and also modify the existing topological loss. These contributions lead to a new loss function that not only forces the model to have the desired topological behavior, but also achieves satisfactory convergence behavior. Our main theoretical result guarantees that the loss can be optimized efficiently, under mild assumptions.
    Deep Learning Models of the Discrete Component of the Galactic Interstellar Gamma-Ray Emission. (arXiv:2206.02819v1 [astro-ph.HE])
    A significant point-like component from the small scale (or discrete) structure in the H2 interstellar gas might be present in the Fermi-LAT data, but modeling this emission relies on observations of rare gas tracers only available in limited regions of the sky. Identifying this contribution is important to discriminate gamma-ray point sources from interstellar gas, and to better characterize extended gamma-ray sources. We design and train convolutional neural networks to predict this emission where observations of these rare tracers do not exist and discuss the impact of this component on the analysis of the Fermi-LAT data. In particular, we evaluate prospects to exploit this methodology in the characterization of the Fermi-LAT Galactic center excess through accurate modeling of point-like structures in the data to help distinguish between a point-like or smooth nature for the excess. We show that deep learning may be effectively employed to model the gamma-ray emission traced by these rare H2 proxies within statistical significance in data-rich regions, supporting prospects to employ these methods in yet unobserved regions.
    How Far I'll Go: Offline Goal-Conditioned Reinforcement Learning via $f$-Advantage Regression. (arXiv:2206.03023v1 [cs.LG])
    Offline goal-conditioned reinforcement learning (GCRL) promises general-purpose skill learning in the form of reaching diverse goals from purely offline datasets. We propose $\textbf{Go}$al-conditioned $f$-$\textbf{A}$dvantage $\textbf{R}$egression (GoFAR), a novel regression-based offline GCRL algorithm derived from a state-occupancy matching perspective; the key intuition is that the goal-reaching task can be formulated as a state-occupancy matching problem between a dynamics-abiding imitator agent and an expert agent that directly teleports to the goal. In contrast to prior approaches, GoFAR does not require any hindsight relabeling and enjoys uninterleaved optimization for its value and policy networks. These distinct features give GoFAR much better offline performance and stability, as well as a statistical performance guarantee that is unattainable for prior methods. Furthermore, we demonstrate that GoFAR's training objectives can be re-purposed to learn an agent-independent goal-conditioned planner from purely offline source-domain data, which enables zero-shot transfer to new target domains. Through extensive experiments, we validate GoFAR's effectiveness in various problem settings and tasks, significantly outperforming the prior state of the art. Notably, on a real robotic dexterous manipulation task, while no other method makes meaningful progress, GoFAR acquires complex manipulation behavior that successfully accomplishes diverse goals.
    Simple Contrastive Graph Clustering. (arXiv:2205.07865v2 [cs.LG] UPDATED)
    Contrastive learning has recently attracted plenty of attention in deep graph clustering for its promising performance. However, complicated data augmentations and time-consuming graph convolutional operation undermine the efficiency of these methods. To solve this problem, we propose a Simple Contrastive Graph Clustering (SCGC) algorithm to improve the existing methods from the perspectives of network architecture, data augmentation, and objective function. As to the architecture, our network includes two main parts, i.e., pre-processing and network backbone. A simple low-pass denoising operation conducts neighbor information aggregation as an independent pre-processing, and only two multilayer perceptrons (MLPs) are included as the backbone. For data augmentation, instead of introducing complex operations over graphs, we construct two augmented views of the same vertex by designing parameter un-shared siamese encoders and corrupting the node embeddings directly. Finally, as to the objective function, to further improve the clustering performance, a novel cross-view structural consistency objective function is designed to enhance the discriminative capability of the learned network. Extensive experimental results on seven benchmark datasets validate our proposed algorithm's effectiveness and superiority. Significantly, our algorithm outperforms the recent contrastive deep clustering competitors with at least seven times speedup on average.
    A new Hyper-heuristic based on Adaptive Simulated Annealing and Reinforcement Learning for the Capacitated Electric Vehicle Routing Problem. (arXiv:2206.03185v1 [cs.AI])
    Electric vehicles (EVs) have been adopted in urban areas to reduce environmental pollution and global warming as a result of the increasing number of freight vehicles. However, there are still deficiencies in routing the trajectories of last-mile logistics that continue to impact social and economic sustainability. For that reason, in this paper, a hyper-heuristic (HH) approach called Hyper-heuristic Adaptive Simulated Annealing with Reinforcement Learning (HHASA$_{RL}$) is proposed. It is composed of a multi-armed bandit method and the self-adaptive Simulated Annealing (SA) metaheuristic algorithm for solving the Capacitated Electric Vehicle Routing Problem (CEVRP). Due to the limited number of charging stations and the travel range of EVs, the EVs must plan battery recharging stops in advance while reducing travel times and costs. The implemented HH improves multiple minimum best-known solutions and obtains the best mean values for some high-dimensional instances of the benchmark proposed for the IEEE WCCI2020 competition.
    SelfReformer: Self-Refined Network with Transformer for Salient Object Detection. (arXiv:2205.11283v2 [cs.CV] UPDATED)
    The global and local contexts significantly contribute to the integrity of predictions in Salient Object Detection (SOD). Unfortunately, existing methods still struggle to generate complete predictions with fine details. There are two major problems in conventional approaches: first, for global context, high-level CNN-based encoder features cannot effectively capture long-range dependencies, resulting in incomplete predictions. Second, downsampling the ground truth to fit the size of predictions introduces inaccuracy, as the ground truth details are lost during interpolation or pooling. Thus, in this work, we developed a Transformer-based network and framed a supervised task for a branch to learn the global context information explicitly. Besides, we adopt Pixel Shuffle from Super-Resolution (SR) to reshape the predictions back to the size of the ground truth instead of the reverse. Thus, details in the ground truth are untouched. In addition, we developed a two-stage Context Refinement Module (CRM) to fuse global context and automatically locate and refine the local details in the predictions. The proposed network can guide and correct itself based on the global and local context generated, and is thus named Self-Refined Transformer (SelfReformer). Extensive experiments and evaluation results on five benchmark datasets demonstrate the outstanding performance of the network, which achieves state-of-the-art results.
    8-bit Numerical Formats for Deep Neural Networks. (arXiv:2206.02915v1 [cs.LG])
    Given the current trend of increasing size and complexity of machine learning architectures, it has become of critical importance to identify new approaches to improve the computational efficiency of model training. In this context, we address the advantages of floating-point over fixed-point representation, and present an in-depth study on the use of 8-bit floating-point number formats for activations, weights, and gradients for both training and inference. We explore the effect of different bit-widths for exponents and significands and different exponent biases. The experimental results demonstrate that a suitable choice of these low-precision formats enables faster training and reduced power consumption without any degradation in accuracy for a range of deep learning models for image classification and language processing.
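    The trade-off between exponent bits, significand bits, and exponent bias mentioned in this abstract can be illustrated with a small decoder. This is a minimal sketch under stated assumptions: the function names `decode_fp8` and `max_finite` are hypothetical, the layout is one sign bit followed by exponent and significand fields, subnormals follow the IEEE 754 convention, and no bit patterns are reserved for infinities or NaN. The paper's actual formats may differ on any of these points.

```python
def decode_fp8(byte, exp_bits=4, man_bits=3, bias=7):
    """Decode an 8-bit pattern laid out as [sign | exponent | significand].

    Assumption: IEEE-style subnormals, and no codes reserved for inf/NaN,
    so every bit pattern maps to a finite value.
    """
    assert 1 + exp_bits + man_bits == 8, "fields must fill exactly 8 bits"
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    frac = man / (1 << man_bits)
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent of (1 - bias).
        return sign * frac * 2.0 ** (1 - bias)
    # Normal: implicit leading 1, exponent shifted by the bias.
    return sign * (1.0 + frac) * 2.0 ** (exp - bias)


def max_finite(exp_bits, man_bits, bias):
    """Largest positive value: sign bit 0, all other bits set."""
    return decode_fp8(0x7F, exp_bits, man_bits, bias)


if __name__ == "__main__":
    # More exponent bits buy range; more significand bits buy precision.
    print(max_finite(4, 3, 7))    # 4 exponent bits, 3 significand bits
    print(max_finite(5, 2, 15))   # 5 exponent bits, 2 significand bits
```

    Changing `bias` shifts the whole representable range up or down without changing the number of distinct values, which is why the bias is a useful tuning knob alongside the bit split.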
    Tight basis cycle representatives for persistent homology of large data sets. (arXiv:2206.02925v1 [cs.LG])
    Persistent homology (PH) is a popular tool for topological data analysis that has found applications across diverse areas of research. It provides a rigorous method to compute robust topological features in discrete experimental observations that often contain various sources of uncertainties. Although powerful in theory, PH suffers from high computation cost that precludes its application to large data sets. Additionally, most analyses using PH are limited to computing the existence of nontrivial features. Precise localization of these features is not generally attempted because, by definition, localized representations are not unique and because of even higher computation cost. For scientific applications, such a precise location is a sine qua non for determining functional significance. Here, we provide a strategy and algorithms to compute tight representative boundaries around nontrivial robust features in large data sets. To showcase the efficiency of our algorithms and the precision of computed boundaries, we analyze three data sets from different scientific fields. In the human genome, we found an unexpected effect on loops through chromosome 13 and the sex chromosomes, upon impairment of chromatin loop formation. In a distribution of galaxies in the universe, we found statistically significant voids. In protein homologs with significantly different topology, we found voids attributable to ligand-interaction, mutation, and differences between species.
    Risk Measures and Upper Probabilities: Coherence and Stratification. (arXiv:2206.03183v1 [cs.LG])
    Machine learning typically presupposes classical probability theory, which implies that aggregation is built upon expectation. There are now multiple reasons to motivate looking at richer alternatives to classical probability theory as a mathematical foundation for machine learning. We systematically examine a powerful and rich class of such alternatives, known variously as spectral risk measures, Choquet integrals or Lorentz norms. We present a range of characterization results, and demonstrate what makes this spectral family so special. In doing so we demonstrate a natural stratification of all coherent risk measures in terms of the upper probabilities that they induce by exploiting results from the theory of rearrangement invariant Banach spaces. We empirically demonstrate how this new approach to uncertainty helps tackle practical machine learning problems.
    From "Where" to "What": Towards Human-Understandable Explanations through Concept Relevance Propagation. (arXiv:2206.03208v1 [cs.LG])
    The emerging field of eXplainable Artificial Intelligence (XAI) aims to bring transparency to today's powerful but opaque deep learning models. While local XAI methods explain individual predictions in form of attribution maps, thereby identifying where important features occur (but not providing information about what they represent), global explanation techniques visualize what concepts a model has generally learned to encode. Both types of methods thus only provide partial insights and leave the burden of interpreting the model's reasoning to the user. Only few contemporary techniques aim at combining the principles behind both local and global XAI for obtaining more informative explanations. Those methods, however, are often limited to specific model architectures or impose additional requirements on training regimes or data and label availability, which renders the post-hoc application to arbitrarily pre-trained models practically impossible. In this work we introduce the Concept Relevance Propagation (CRP) approach, which combines the local and global perspectives of XAI and thus allows answering both the "where" and "what" questions for individual predictions, without additional constraints imposed. We further introduce the principle of Relevance Maximization for finding representative examples of encoded concepts based on their usefulness to the model. We thereby lift the dependency on the common practice of Activation Maximization and its limitations. We demonstrate the capabilities of our methods in various settings, showcasing that Concept Relevance Propagation and Relevance Maximization lead to more human interpretable explanations and provide deep insights into the model's representations and reasoning through concept atlases, concept composition analyses, and quantitative investigations of concept subspaces and their role in fine-grained decision making.
    Distributionally Invariant Learning: Rationalization and Practical Algorithms. (arXiv:2206.02990v1 [cs.LG])
    The invariance property across environments is at the heart of invariant learning methods for the Out-of-Distribution (OOD) Generalization problem. Although intuitively reasonable, strong assumptions on the availability and quality of environments have to be made for the learnability of the strict invariance property. Recently, to relax the requirements for environments empirically, some works propose to learn pseudo-environments for invariant learning. However, it could be misleading when pursuing strict invariance under latent heterogeneity, since the underlying invariance could have been violated during the pseudo-environment learning procedure. To this end, we come up with the distributional invariance property as a relaxed alternative to the strict invariance, which considers the invariance only among sub-populations down to a prescribed scale and allows a certain degree of variation. We reformulate the invariant learning problem under latent heterogeneity into a relaxed form that pursues the distributional invariance, based on which we propose our novel Distributionally Invariant Learning (DIL) framework as well as two implementations named DIL-MMD and DIL-KL. Theoretically, we provide the guarantees for the distributional invariance as well as bounds of the generalization error gap. Extensive experimental results validate the effectiveness of our proposed algorithms.
    Survey on Causal-based Machine Learning Fairness Notions. (arXiv:2010.09553v7 [cs.LG] UPDATED)
    Addressing the problem of fairness is crucial to safely use machine learning algorithms to support decisions with a critical impact on people's lives such as job hiring, child maltreatment, disease diagnosis, loan granting, etc. Several notions of fairness have been defined and examined in the past decade, such as statistical parity and equalized odds. The most recent fairness notions, however, are causal-based and reflect the now widely accepted idea that using causality is necessary to appropriately address the problem of fairness. This paper examines an exhaustive list of causal-based fairness notions and studies their applicability in real-world scenarios. As the majority of causal-based fairness notions are defined in terms of non-observable quantities (e.g., interventions and counterfactuals), their deployment in practice requires computing or estimating those quantities using observational data. This paper offers a comprehensive report of the different approaches to infer causal quantities from observational data including identifiability (Pearl's SCM framework) and estimation (potential outcome framework). The main contributions of this survey paper are (1) a guideline to help select a suitable fairness notion given a specific real-world scenario, and (2) a ranking of the fairness notions according to Pearl's causation ladder indicating how difficult it is to deploy each notion in practice.
    Confounder Analysis in Measuring Representation in Product Funnels. (arXiv:2206.02962v1 [stat.ML])
    This paper discusses an application of Shapley values in the causal inference field, specifically how to select the top confounder variables for the coarsened exact matching method in a scalable way. We use a dataset from an observational experiment involving LinkedIn members as a use case to test its applicability, and show that Shapley values are highly informative and can be leveraged for their robust importance-ranking capability.
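    For readers unfamiliar with Shapley values, the sketch below computes them exactly for a tiny cooperative game by averaging each player's marginal contribution over all orderings. The function name `shapley_values` and the toy value function are hypothetical, chosen for illustration; exact enumeration is only feasible for a handful of players, and the paper's scalable confounder-ranking procedure is not reproduced here.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values by enumerating every ordering of the players.

    `value` maps a frozenset of players to a number (the coalition's worth).
    Cost grows factorially with the number of players.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


if __name__ == "__main__":
    # Toy game: the coalition is worth 1 only if both "a" and "b" are present.
    def worth(coalition):
        return 1.0 if {"a", "b"} <= coalition else 0.0

    print(shapley_values(["a", "b", "c"], worth))
```

    In the toy game, "a" and "b" split the credit equally and "c" receives none, which is the kind of importance ranking the abstract refers to.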
    Driving in Real Life with Inverse Reinforcement Learning. (arXiv:2206.03004v1 [cs.RO])
    In this paper, we introduce the first learning-based planner to drive a car in dense, urban traffic using Inverse Reinforcement Learning (IRL). Our planner, DriveIRL, generates a diverse set of trajectory proposals, filters these trajectories with a lightweight and interpretable safety filter, and then uses a learned model to score each remaining trajectory. The best trajectory is then tracked by the low-level controller of our self-driving vehicle. We train our trajectory scoring model on a 500+ hour real-world dataset of expert driving demonstrations in Las Vegas within the maximum entropy IRL framework. DriveIRL's benefits include: a simple design due to only learning the trajectory scoring function, relatively interpretable features, and strong real-world performance. We validated DriveIRL on the Las Vegas Strip and demonstrated fully autonomous driving in heavy traffic, including scenarios involving cut-ins, abrupt braking by the lead vehicle, and hotel pickup/dropoff zones. Our dataset will be made public to help further research in this area.
    Does Crypto Kill? Relationship between Electricity Consumption Carbon Footprints and Bitcoin Transactions. (arXiv:2206.03227v1 [cs.CY])
    Cryptocurrencies are gaining more popularity due to their security, making counterfeits impossible. However, these digital currencies have been criticized for creating a large carbon footprint due to their algorithmic complexity and decentralized system design for proof of work and mining. We hypothesize that the carbon footprint of cryptocurrency transactions has a higher dependency on carbon-rich fuel sources than green or renewable fuel sources. We provide a machine learning framework to model such transactions and correlate them with the electricity generation patterns to estimate and analyze their carbon cost.
    Machine learning models for determination of weldbead shape parameters for gas metal arc welded T-joints -- A comparative study. (arXiv:2206.02794v1 [cs.LG])
    The shape of a weld bead is critical in assessing the quality of the welded joint. In particular, it has a major impact on the accuracy of the results obtained from a numerical analysis. This study focuses on statistical design techniques and artificial neural networks to predict the weld bead shape parameters of shielded Gas Metal Arc Welded (GMAW) fillet joints. Extensive testing was carried out on low carbon mild steel plates of thicknesses ranging from 3 mm to 10 mm. Welding voltage, welding current, and moving heat source speed were considered as the welding parameters. Three types of multiple linear regression (MLR) models were created to establish an empirical equation for defining GMAW bead shape parameters considering interactive and higher order terms. Additionally, artificial neural network (ANN) models were created based on a similar scheme, and the relevance of specific features was investigated using SHapley Additive exPlanations (SHAP). The results reveal that the MLR-based approach performs better than the ANN-based models in terms of predictability and error assessment. This study shows the usefulness of predictive tools in aiding numerical analysis of welding.
    Intelligent Circuit Design and Implementation with Machine Learning. (arXiv:2206.03032v1 [cs.LG])
    The stagnation of EDA technologies stems from insufficient knowledge reuse. In practice, very similar simulation or optimization results may need to be repeatedly constructed from scratch. This motivates my research on introducing more 'intelligence' to EDA with machine learning (ML), which explores complex correlations in design flows based on prior data. Besides design time, I also propose ML solutions to boost IC performance by assisting the circuit management at runtime. In this dissertation, I present multiple fast yet accurate ML models covering a wide range of chip design stages from the register-transfer level (RTL) to sign-off, solving primary chip-design problems about power, timing, interconnect, IR drop, routability, and design flow tuning. Targeting the RTL stage, I present APOLLO, a fully automated power modeling framework. It constructs an accurate per-cycle power model by extracting the most power-correlated signals. The model can be further implemented on chip for runtime power management with unprecedentedly low hardware costs. Targeting the gate-level netlist, I present Net2 for early estimation of post-placement wirelength. It further enables more accurate timing analysis without actual physical design information. Targeting circuit layout, I present RouteNet for early routability prediction. As it is the first deep learning-based routability estimator, some feature-extraction and model-design principles proposed in it have been widely adopted by later works. I also present PowerNet for fast IR drop estimation. It captures spatial and temporal information about power distribution with a customized CNN architecture. Last, besides targeting a single design step, I present FIST to efficiently tune design flow parameters during both logic synthesis and physical design.
    Beyond spectral gap: The role of the topology in decentralized learning. (arXiv:2206.03093v1 [cs.LG])
    In data-parallel optimization of machine learning models, workers collaborate to improve their estimates of the model: more accurate gradients allow them to use larger learning rates and optimize faster. We consider the setting in which all workers sample from the same dataset, and communicate over a sparse graph (decentralized). In this setting, current theory fails to capture important aspects of real-world behavior. First, the 'spectral gap' of the communication graph is not predictive of its empirical performance in (deep) learning. Second, current theory does not explain that collaboration enables larger learning rates than training alone. In fact, it prescribes smaller learning rates, which further decrease as graphs become larger, failing to explain convergence in infinite graphs. This paper aims to paint an accurate picture of sparsely-connected distributed optimization when workers share the same data distribution. We quantify how the graph topology influences convergence in a quadratic toy problem and provide theoretical results for general smooth and (strongly) convex objectives. Our theory matches empirical observations in deep learning, and accurately describes the relative merits of different graph topologies.
    Shedding a PAC-Bayesian Light on Adaptive Sliced-Wasserstein Distances. (arXiv:2206.03230v1 [stat.ML])
    The Sliced-Wasserstein distance (SW) is a computationally efficient and theoretically grounded alternative to the Wasserstein distance. Yet, the literature on its statistical properties with respect to the distribution of slices, beyond the uniform measure, is scarce. To bring new contributions to this line of research, we leverage the PAC-Bayesian theory and the central observation that SW actually hinges on a slice-distribution-dependent Gibbs risk, the kind of quantity PAC-Bayesian bounds have been designed to characterize. We provide four types of results: i) PAC-Bayesian generalization bounds that hold on what we refer to as adaptive Sliced-Wasserstein distances, i.e. distances defined with respect to any distribution of slices, ii) a procedure to learn the distribution of slices that yields a maximally discriminative SW, by optimizing our PAC-Bayesian bounds, iii) an insight on how the performance of the so-called distributional Sliced-Wasserstein distance may be explained through our theory, and iv) empirical illustrations of our findings.
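    The dependence of SW on the choice of slices can be seen in a few lines. The sketch below is an illustration under stated assumptions, not the paper's method: both samples have the same size, the slice distribution is represented by an explicit finite list of directions, and the names `wasserstein_1d` and `sliced_wasserstein` are hypothetical. Each direction projects both samples to one dimension, where the Wasserstein distance reduces to matching sorted values.

```python
import math


def wasserstein_1d(u, v, p=2):
    """p-th power of the p-Wasserstein distance between equal-size 1D samples."""
    assert len(u) == len(v), "this sketch assumes equal sample sizes"
    return sum(abs(a - b) ** p for a, b in zip(sorted(u), sorted(v))) / len(u)


def sliced_wasserstein(xs, ys, directions, p=2):
    """Average the 1D distances over the given slices, then take the p-th root.

    `directions` plays the role of the slice distribution; each direction is
    normalized to unit length before projecting.
    """
    total = 0.0
    for theta in directions:
        norm = math.sqrt(sum(t * t for t in theta))
        unit = [t / norm for t in theta]
        proj_x = [sum(a * b for a, b in zip(x, unit)) for x in xs]
        proj_y = [sum(a * b for a, b in zip(y, unit)) for y in ys]
        total += wasserstein_1d(proj_x, proj_y, p)
    return (total / len(directions)) ** (1.0 / p)


if __name__ == "__main__":
    xs = [(0.0, 0.0), (1.0, 0.0)]
    ys = [(0.0, 1.0), (1.0, 1.0)]  # xs shifted along the second axis
    print(sliced_wasserstein(xs, ys, [(1.0, 0.0)]))  # slice blind to the shift
    print(sliced_wasserstein(xs, ys, [(0.0, 1.0)]))  # slice aligned with the shift
```

    The two calls differ only in the slice, yet one reports no difference between the samples and the other reports the full shift, which is the intuition behind learning a discriminative slice distribution.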
    Federated Hetero-Task Learning. (arXiv:2206.03436v1 [cs.LG])
    To investigate the heterogeneity of federated learning in real-world scenarios, we generalize the classical federated learning to federated hetero-task learning, which emphasizes the inconsistency across the participants in federated learning in terms of both data distribution and learning tasks. We also present B-FHTL, a federated hetero-task learning benchmark consisting of a simulation dataset, FL protocols and a unified evaluation mechanism. The B-FHTL dataset contains three well-designed federated learning tasks with increasing heterogeneity. Each task simulates the clients with different data distributions and learning tasks. To ensure fair comparison among different FL algorithms, B-FHTL builds in a full suite of FL protocols by providing high-level APIs to avoid privacy leakage, and presets the most common evaluation metrics spanning different learning tasks, such as regression, classification, text generation, etc. Furthermore, we compare the FL algorithms in the fields of federated multi-task learning, federated personalization and federated meta learning within B-FHTL, and highlight the influence of heterogeneity and the difficulties of federated hetero-task learning. Our benchmark, including the federated dataset, protocols, the evaluation mechanism and the preliminary experiment, is open-sourced at https://github.com/alibaba/FederatedScope/tree/contest/v1.0.
    Per-Instance Privacy Accounting for Differentially Private Stochastic Gradient Descent. (arXiv:2206.02617v2 [cs.LG] UPDATED)
    Differentially private stochastic gradient descent (DP-SGD) is the workhorse algorithm for recent advances in private deep learning. It provides a single privacy guarantee to all datapoints in the dataset. We propose an efficient algorithm to compute per-instance privacy guarantees for individual examples when running DP-SGD. We use our algorithm to investigate per-instance privacy losses across a number of datasets. We find that most examples enjoy stronger privacy guarantees than the worst-case bounds. We further discover that the loss and the privacy loss on an example are well-correlated. This implies groups that are underserved in terms of model utility are simultaneously underserved in terms of privacy loss. For example, on CIFAR-10, the average $\epsilon$ of the class with the highest loss (Cat) is 32% higher than that of the class with the lowest loss (Ship). We also run membership inference attacks to show this reflects disparate empirical privacy risks.
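    As background for this abstract, the sketch below shows one aggregation step of standard DP-SGD: clip each per-example gradient to a fixed norm, sum, add Gaussian noise scaled to the clipping norm, and average. It is a minimal illustration using plain Python lists; the names `clip_gradient` and `dp_sgd_step` and the parameter names are hypothetical, and the paper's per-instance accounting algorithm is not shown.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(per_example_grads, max_norm, noise_multiplier, rng):
    """Return the noisy average gradient for one DP-SGD step.

    Noise standard deviation is noise_multiplier * max_norm per coordinate,
    added to the sum of clipped gradients before averaging.
    """
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]  # the first gradient exceeds the clip norm
    print(dp_sgd_step(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng))
```

    Clipping bounds every example's influence by the same constant, which is why the usual analysis gives one worst-case guarantee for all datapoints; examples whose gradients are already small are affected less by their own presence, which is the gap that per-instance accounting aims to measure.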
    PyTSK: A Python Toolbox for TSK Fuzzy Systems. (arXiv:2206.03310v1 [cs.LG])
    This paper presents PyTSK, a Python toolbox for developing Takagi-Sugeno-Kang (TSK) fuzzy systems. Based on scikit-learn and PyTorch, PyTSK allows users to optimize TSK fuzzy systems using fuzzy clustering or mini-batch gradient descent (MBGD) based algorithms. Several state-of-the-art MBGD-based optimization algorithms are implemented in the toolbox, which can improve the generalization performance of TSK fuzzy systems, especially for big data applications. PyTSK can also be easily extended and customized for more complicated algorithms, such as modifying the structure of TSK fuzzy systems, developing more sophisticated training algorithms, and combining TSK fuzzy systems with neural networks. The code of PyTSK can be found at https://github.com/YuqiCui/pytsk.
    Label-Free Explainability for Unsupervised Models. (arXiv:2203.01928v2 [cs.LG] UPDATED)
    Unsupervised black-box models are challenging to interpret. Indeed, most existing explainability methods require labels to select which component(s) of the black-box's output to interpret. In the absence of labels, black-box outputs often are representation vectors whose components do not correspond to any meaningful quantity. Hence, choosing which component(s) to interpret in a label-free unsupervised/self-supervised setting is an important, yet unsolved problem. To bridge this gap in the literature, we introduce two crucial extensions of post-hoc explanation techniques: (1) label-free feature importance and (2) label-free example importance that respectively highlight influential features and training examples for a black-box to construct representations at inference time. We demonstrate that our extensions can be successfully implemented as simple wrappers around many existing feature and example importance methods. We illustrate the utility of our label-free explainability paradigm through a qualitative and quantitative comparison of representation spaces learned by various autoencoders trained on distinct unsupervised tasks.
    Spatial-Temporal Adaptive Graph Convolution with Attention Network for Traffic Forecasting. (arXiv:2206.03128v1 [cs.LG])
    Traffic forecasting is a canonical example of a spatial-temporal learning task in Intelligent Traffic Systems. Existing approaches capture spatial dependency with a pre-determined matrix in graph convolution neural operators. However, the explicit graph structure loses some hidden representations of relationships among nodes. Furthermore, traditional graph convolution neural operators cannot aggregate long-range nodes on the graph. To overcome these limits, we propose a novel network, Spatial-Temporal Adaptive graph convolution with Attention Network (STAAN), for traffic forecasting. Firstly, we adopt an adaptive dependency matrix instead of using a pre-defined matrix during GCN processing to infer the inter-dependencies among nodes. Secondly, we integrate PW-attention, which is based on the graph attention network and designed for global dependency, with GCN as the spatial block. Moreover, a stacked dilated 1D convolution, which is efficient in long-term prediction, is adopted in our temporal block for capturing the different time series. We evaluate our STAAN on two real-world datasets, and experiments validate that our model outperforms state-of-the-art baselines.
    Stratified Rule-Aware Network for Abstract Visual Reasoning. (arXiv:2002.06838v3 [cs.CV] UPDATED)
    Abstract reasoning refers to the ability to analyze information, discover rules at an intangible level, and solve problems in innovative ways. Raven's Progressive Matrices (RPM) test is typically used to examine the capability of abstract reasoning. The subject is asked to identify the correct choice from the answer set to fill the missing panel at the bottom right of RPM (e.g., a 3$\times$3 matrix), following the underlying rules inside the matrix. Recent studies, taking advantage of Convolutional Neural Networks (CNNs), have achieved encouraging progress on the RPM test. However, they partly ignore necessary inductive biases of an RPM solver, such as order sensitivity within each row/column and incremental rule induction. To address this problem, in this paper we propose a Stratified Rule-Aware Network (SRAN) to generate the rule embeddings for two input sequences. Our SRAN learns multi-granularity rule embeddings at different levels, and incrementally integrates the stratified embedding flows through a gated fusion module. With the help of embeddings, a rule similarity metric is applied to guarantee that SRAN can not only be trained using a tuplet loss but also infer the best answer efficiently. We further point out severe defects in the popular RAVEN dataset for the RPM test, which prevent a fair evaluation of the abstract reasoning ability. To fix the defects, we propose an answer set generation algorithm called Attribute Bisection Tree (ABT), forming an improved dataset named Impartial-RAVEN (I-RAVEN for short). Extensive experiments are conducted on both PGM and I-RAVEN datasets, showing that our SRAN outperforms the state-of-the-art models by a considerable margin.
    Improving Model Understanding and Trust with Counterfactual Explanations of Model Confidence. (arXiv:2206.02790v1 [cs.LG])
    In this paper, we show that counterfactual explanations of confidence scores help users better understand and better trust an AI model's prediction in human-subject studies. Showing confidence scores in human-agent interaction systems can help build trust between humans and AI systems. However, most existing research only used the confidence score as a form of communication, and we still lack ways to explain why the algorithm is confident. This paper also presents two methods for understanding model confidence using counterfactual explanation: (1) based on counterfactual examples; and (2) based on visualisation of the counterfactual space.
    A Bird's-Eye Tutorial of Graph Attention Architectures. (arXiv:2206.02849v1 [cs.LG])
    Graph Neural Networks (GNNs) have shown tremendous strides in performance for graph-structured problems especially in the domains of natural language processing, computer vision and recommender systems. Inspired by the success of the transformer architecture, there has been an ever-growing body of work on attention variants of GNNs attempting to advance the state of the art in many of these problems. Incorporating "attention" into graph mining has been viewed as a way to overcome the noisiness, heterogeneity and complexity associated with graph-structured data as well as to encode soft-inductive bias. It is hence crucial and advantageous to study these variants from a bird's-eye view to assess their strengths and weaknesses. We provide a systematic and focused tutorial centered around attention based GNNs in the hope of benefiting researchers dealing with graph-structured problems. Our tutorial looks at GNN variants from the point of view of the attention function and iteratively builds the reader's understanding of different graph attention variants.
    On the Effectiveness of Fine-tuning Versus Meta-reinforcement Learning. (arXiv:2206.03271v1 [cs.LG])
    Intelligent agents should have the ability to leverage knowledge from previously learned tasks in order to learn new ones quickly and efficiently. Meta-learning approaches have emerged as a popular solution to achieve this. However, meta-reinforcement learning (meta-RL) algorithms have thus far been restricted to simple environments with narrow task distributions. Moreover, the paradigm of pretraining followed by fine-tuning to adapt to new tasks has emerged as a simple yet effective solution in supervised and self-supervised learning. This calls into question the benefits of meta-learning approaches also in reinforcement learning, which typically come at the cost of high complexity. We hence investigate meta-RL approaches in a variety of vision-based benchmarks, including Procgen, RLBench, and Atari, where evaluations are made on completely novel tasks. Our findings show that when meta-learning approaches are evaluated on different tasks (rather than different variations of the same task), multi-task pretraining with fine-tuning on new tasks performs as well as, or better than, meta-pretraining with meta test-time adaptation. This is encouraging for future research, as multi-task pretraining tends to be simpler and computationally cheaper than meta-RL. From these findings, we advocate for evaluating future meta-RL methods on more challenging tasks and including multi-task pretraining with fine-tuning as a simple, yet strong baseline.
    SHRED: 3D Shape Region Decomposition with Learned Local Operations. (arXiv:2206.03480v1 [cs.CV])
    We present SHRED, a method for 3D SHape REgion Decomposition. SHRED takes a 3D point cloud as input and uses learned local operations to produce a segmentation that approximates fine-grained part instances. We endow SHRED with three decomposition operations: splitting regions, fixing the boundaries between regions, and merging regions together. Modules are trained independently and locally, allowing SHRED to generate high-quality segmentations for categories not seen during training. We train and evaluate SHRED with fine-grained segmentations from PartNet; using its merge-threshold hyperparameter, we show that SHRED produces segmentations that better respect ground-truth annotations compared with baseline methods, at any desired decomposition granularity. Finally, we demonstrate that SHRED is useful for downstream applications, out-performing all baselines on zero-shot fine-grained part instance segmentation and few-shot fine-grained semantic segmentation when combined with methods that learn to label shape regions.
    Intra-agent speech permits zero-shot task acquisition. (arXiv:2206.03139v1 [cs.LG])
    Human language learners are exposed to a trickle of informative, context-sensitive language, but a flood of raw sensory data. Through both social language use and internal processes of rehearsal and practice, language learners are able to build high-level, semantic representations that explain their perceptions. Here, we take inspiration from such processes of "inner speech" in humans (Vygotsky, 1934) to better understand the role of intra-agent speech in embodied behavior. First, we formally pose intra-agent speech as a semi-supervised problem and develop two algorithms that enable visually grounded captioning with little labeled language data. We then experimentally compute scaling curves over different amounts of labeled data and compare the data efficiency against a supervised learning baseline. Finally, we incorporate intra-agent speech into an embodied, mobile manipulator agent operating in a 3D virtual world, and show that with as few as 150 additional image captions, intra-agent speech endows the agent with the ability to manipulate and answer questions about a new object without any related task-directed experience (zero-shot). Taken together, our experiments suggest that modelling intra-agent speech is effective in enabling embodied agents to learn new tasks efficiently and without direct interaction experience.
    Robust Sparse Mean Estimation via Sum of Squares. (arXiv:2206.03441v1 [cs.DS])
    We study the problem of high-dimensional sparse mean estimation in the presence of an $\epsilon$-fraction of adversarial outliers. Prior work obtained sample and computationally efficient algorithms for this task for identity-covariance subgaussian distributions. In this work, we develop the first efficient algorithms for robust sparse mean estimation without a priori knowledge of the covariance. For distributions on $\mathbb R^d$ with "certifiably bounded" $t$-th moments and sufficiently light tails, our algorithm achieves error of $O(\epsilon^{1-1/t})$ with sample complexity $m = (k\log(d))^{O(t)}/\epsilon^{2-2/t}$. For the special case of the Gaussian distribution, our algorithm achieves near-optimal error of $\tilde O(\epsilon)$ with sample complexity $m = O(k^4 \mathrm{polylog}(d))/\epsilon^2$. Our algorithms follow the Sum-of-Squares based proofs-to-algorithms approach. We complement our upper bounds with Statistical Query and low-degree polynomial testing lower bounds, providing evidence that the sample-time-error tradeoffs achieved by our algorithms are qualitatively the best possible.
    Selection in the Presence of Implicit Bias: The Advantage of Intersectional Constraints. (arXiv:2202.01661v2 [cs.CY] UPDATED)
    In selection processes such as hiring, promotion, and college admissions, implicit bias toward socially-salient attributes such as race, gender, or sexual orientation of candidates is known to produce persistent inequality and reduce aggregate utility for the decision maker. Interventions such as the Rooney Rule and its generalizations, which require the decision maker to select at least a specified number of individuals from each affected group, have been proposed to mitigate the adverse effects of implicit bias in selection. Recent works have established that such lower-bound constraints can be very effective in improving aggregate utility in the case when each individual belongs to at most one affected group. However, in several settings, individuals may belong to multiple affected groups and, consequently, face more extreme implicit bias due to this intersectionality. We consider independently drawn utilities and show that, in the intersectional case, the aforementioned non-intersectional constraints can only recover part of the total utility achievable in the absence of implicit bias. On the other hand, we show that if one includes appropriate lower-bound constraints on the intersections, almost all the utility achievable in the absence of implicit bias can be recovered. Thus, intersectional constraints can offer a significant advantage over a reductionist dimension-by-dimension non-intersectional approach to reducing inequality.
    Distributive Justice as the Foundational Premise of Fair ML: Unification, Extension, and Interpretation of Group Fairness Metrics. (arXiv:2206.02897v1 [cs.CY])
    Group fairness metrics are an established way of assessing the fairness of prediction-based decision-making systems. However, these metrics are still insufficiently linked to philosophical theories, and their moral meaning is often unclear. We propose a general framework for analyzing the fairness of decision systems based on theories of distributive justice, encompassing different established "patterns of justice" that correspond to different normative positions. We show that the most popular group fairness metrics can be interpreted as special cases of our approach. Thus, we provide a unifying and interpretative framework for group fairness metrics that reveals the normative choices associated with each of them and that allows understanding their moral substance. At the same time, we provide an extension of the space of possible fairness metrics beyond the ones currently discussed in the fair ML literature. Our framework also allows overcoming several limitations of group fairness metrics that have been criticized in the literature, most notably (1) that they are parity-based, i.e., that they demand some form of equality between groups, which may sometimes be harmful to marginalized groups, (2) that they only compare decisions across groups, but not the resulting consequences for these groups, and (3) that the full breadth of the distributive justice literature is not sufficiently represented.
    Invertible Sharpening Network for MRI Reconstruction Enhancement. (arXiv:2206.02838v1 [eess.IV])
    High-quality MRI reconstruction plays a critical role in clinical applications. Deep learning-based methods have achieved promising results on MRI reconstruction. However, most state-of-the-art methods were designed to optimize the evaluation metrics commonly used for natural images, such as PSNR and SSIM, whereas the visual quality is not primarily pursued. Compared to the fully-sampled images, the reconstructed images are often blurry, where high-frequency features might not be sharp enough for confident clinical diagnosis. To this end, we propose an invertible sharpening network (InvSharpNet) to improve the visual quality of MRI reconstructions. During training, unlike the traditional methods that learn to map the input data to the ground truth, InvSharpNet adopts a backward training strategy that learns a blurring transform from the ground truth (fully-sampled image) to the input data (blurry reconstruction). During inference, the learned blurring transform can be inverted to a sharpening transform leveraging the network's invertibility. The experiments on various MRI datasets demonstrate that InvSharpNet can improve reconstruction sharpness with few artifacts. The results were also evaluated by radiologists, indicating better visual quality and diagnostic confidence of our proposed method.
    Interpolation-based Correlation Reduction Network for Semi-Supervised Graph Learning. (arXiv:2206.02796v1 [cs.LG])
    Graph Neural Networks (GNNs) have achieved promising performance in semi-supervised node classification in recent years. However, the problem of insufficient supervision, together with representation collapse, largely limits the performance of the GNNs in this field. To alleviate the collapse of node representations in semi-supervised scenario, we propose a novel graph contrastive learning method, termed Interpolation-based Correlation Reduction Network (ICRN). In our method, we improve the discriminative capability of the latent feature by enlarging the margin of decision boundaries and improving the cross-view consistency of the latent representation. Specifically, we first adopt an interpolation-based strategy to conduct data augmentation in the latent space and then force the prediction model to change linearly between samples. Second, we enable the learned network to tell apart samples across two interpolation-perturbed views through forcing the correlation matrix across views to approximate an identity matrix. By combining the two settings, we extract rich supervision information from both the abundant unlabeled nodes and the rare yet valuable labeled nodes for discriminative representation learning. Extensive experimental results on six datasets demonstrate the effectiveness and the generality of ICRN compared to the existing state-of-the-art methods.
    FedNST: Federated Noisy Student Training for Automatic Speech Recognition. (arXiv:2206.02797v1 [eess.AS])
    Federated Learning (FL) enables training state-of-the-art Automatic Speech Recognition (ASR) models on user devices (clients) in distributed systems, hence preventing transmission of raw user data to a central server. A key challenge facing practical adoption of FL for ASR is obtaining ground-truth labels on the clients. Existing approaches rely on clients to manually transcribe their speech, which is impractical for obtaining large training corpora. A promising alternative is using semi-/self-supervised learning approaches to leverage unlabelled user data. To this end, we propose a new Federated ASR method called FedNST for noisy student training of distributed ASR models with private unlabelled user data. We explore various facets of FedNST, such as training models with different proportions of unlabelled and labelled data, and evaluate the proposed approach on 1173 simulated clients. Evaluating FedNST on LibriSpeech, where 960 hours of speech data is split equally into server (labelled) and client (unlabelled) data, showed a 22.5% relative word error rate reduction (WERR) over a supervised baseline trained only on server data.
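    To make the "relative word error rate reduction" figure above concrete, here is a minimal sketch of how such a relative reduction is conventionally computed. The function name and the WER values in the usage note are hypothetical illustrations and are not taken from the FedNST paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Return the relative WER reduction of new_wer with respect to baseline_wer.

    Both arguments are word error rates on the same scale (e.g. percent).
    A result of 0.225 corresponds to a 22.5% relative reduction.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers chosen only for illustration (not from the paper):
# a baseline WER of 10.0% improved to 7.75%.
example = relative_wer_reduction(10.0, 7.75)
```

    With these made-up inputs the relative reduction is 0.225; note that a relative reduction says nothing about the absolute WER values unless the baseline is also reported.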
    Parametric Chordal Sparsity for SDP-based Neural Network Verification. (arXiv:2206.03482v1 [cs.LG])
    Many future technologies rely on neural networks, but verifying the correctness of their behavior remains a major challenge. It is known that neural networks can be fragile in the presence of even small input perturbations, yielding unpredictable outputs. The verification of neural networks is therefore vital to their adoption, and a number of approaches have been proposed in recent years. In this paper we focus on semidefinite programming (SDP) based techniques for neural network verification, which are particularly attractive because they can encode expressive behaviors while ensuring a polynomial time decision. Our starting point is the DeepSDP framework proposed by Fazlyab et al., which uses quadratic constraints to abstract the verification problem into a large-scale SDP. When the size of the neural network grows, however, solving this SDP quickly becomes intractable. Our key observation is that by leveraging chordal sparsity and specific parametrizations of DeepSDP, we can decompose the primary computational bottleneck of DeepSDP -- a large linear matrix inequality (LMI) -- into an equivalent collection of smaller LMIs. Our parametrization admits a tunable parameter, allowing us to trade off efficiency and accuracy in the verification procedure. We call our formulation Chordal-DeepSDP, and provide experimental evaluation to show that it can: (1) effectively increase accuracy with the tunable parameter and (2) outperform DeepSDP on deeper networks.
    A Justice-Based Framework for the Analysis of Algorithmic Fairness-Utility Trade-Offs. (arXiv:2206.02891v1 [cs.CY])
    In prediction-based decision-making systems, different perspectives can be at odds: The short-term business goals of the decision makers are often in conflict with the decision subjects' wish to be treated fairly. Balancing these two perspectives is a question of values. We provide a framework to make these value-laden choices clearly visible. For this, we assume that we are given a trained model and want to find decision rules that balance the perspective of the decision maker and of the decision subjects. We provide an approach to formalize both perspectives, i.e., to assess the utility of the decision maker and the fairness towards the decision subjects. In both cases, the idea is to elicit values from decision makers and decision subjects that are then turned into something measurable. For the fairness evaluation, we build on the literature on welfare-based fairness and ask what a fair distribution of utility (or welfare) looks like. In this step, we build on well-known theories of distributive justice. This allows us to derive a fairness score that we then compare to the decision maker's utility for many different decision rules. This way, we provide an approach for balancing the utility of the decision maker and the fairness towards the decision subjects for a prediction-based decision-making system.
    Towards Job-Transition-Tag Graph for a Better Job Title Representation Learning. (arXiv:2206.02782v1 [cs.LG])
    Works on learning job title representation are mainly based on the Job-Transition Graph, built from the working history of talents. However, since these records are usually messy, this graph is very sparse, which affects the quality of the learned representation and hinders further analysis. To address this specific issue, we propose to enrich the graph with additional nodes that improve the quality of job title representation. Specifically, we construct the Job-Transition-Tag Graph, a heterogeneous graph containing two types of nodes, i.e., job titles and tags (i.e., words related to job responsibilities or functionalities). Along this line, we reformulate job title representation learning as the task of learning node embedding on the Job-Transition-Tag Graph. Experiments on two datasets demonstrate the value of our approach.
    FIFA: Making Fairness More Generalizable in Classifiers Trained on Imbalanced Data. (arXiv:2206.02792v1 [cs.LG])
    Algorithmic fairness plays an important role in machine learning and imposing fairness constraints during learning is a common approach. However, many datasets are imbalanced in certain label classes (e.g. "healthy") and sensitive subgroups (e.g. "older patients"). Empirically, this imbalance leads to a lack of generalizability not only of classification, but also of fairness properties, especially in over-parameterized models. For example, fairness-aware training may ensure equalized odds (EO) on the training data, but EO is far from being satisfied on new users. In this paper, we propose a theoretically-principled, yet Flexible approach that is Imbalance-Fairness-Aware (FIFA). Specifically, FIFA encourages both classification and fairness generalization and can be flexibly combined with many existing fair learning methods with logits-based losses. While our main focus is on EO, FIFA can be directly applied to achieve equalized opportunity (EqOpt); and under certain conditions, it can also be applied to other fairness notions. We demonstrate the power of FIFA by combining it with a popular fair classification algorithm, and the resulting algorithm achieves significantly better fairness generalization on several real-world datasets.
    Robust Time Series Dissimilarity Measure for Outlier Detection and Periodicity Detection. (arXiv:2206.02956v1 [cs.LG])
    Dynamic time warping (DTW) is an effective dissimilarity measure in many time series applications. Despite its popularity, it is prone to noise and outliers, which leads to the singularity problem and bias in the measurement. The time complexity of DTW is quadratic in the length of the time series, making it inapplicable in real-time applications. In this paper, we propose a novel time series dissimilarity measure named RobustDTW to reduce the effects of noise and outliers. Specifically, RobustDTW estimates the trend and optimizes the time warp in an alternating manner by utilizing our designed temporal graph trend filtering. To improve efficiency, we propose a multi-level framework that estimates the trend and the warp function at a lower resolution, and then repeatedly refines them at a higher resolution. Based on the proposed RobustDTW, we further extend it to periodicity detection and outlier time series detection. Experiments on real-world datasets demonstrate the superior performance of RobustDTW compared to DTW variants in both outlier time series detection and periodicity detection.
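    For readers unfamiliar with the baseline being improved upon, the following is a minimal sketch of the classic DTW recurrence, written to show where the quadratic cost mentioned above comes from. It is the textbook algorithm with an absolute-difference local cost (an assumption made here for simplicity), not the RobustDTW method of the paper; the function name is illustrative.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences.

    Fills an (len(a)+1) x (len(b)+1) table, so time and memory grow with the
    product of the two lengths (quadratic for equal-length series).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local cost: |a_i - b_j|
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b[j-1] is reused
                cost[i][j - 1],      # b advances, a[i-1] is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

    Because every point must be matched to something, a single outlier contributes its full local cost to the distance, which is one way to see the sensitivity to outliers that the abstract describes.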
    DynaMaR: Dynamic Prompt with Mask Token Representation. (arXiv:2206.02982v1 [cs.CL])
    Recent research has shown that large language models pretrained using unsupervised approaches can achieve significant performance improvement on many downstream tasks. Typically when adapting these language models to downstream tasks, like a classification or regression task, we employ a fine-tuning paradigm in which the sentence representation from the language model is input to a task-specific head; the model is then fine-tuned end-to-end. However, with the emergence of models like GPT-3, prompt-based fine-tuning has been proven to be a successful approach for few-shot tasks. Inspired by this work, we study discrete prompt technologies in practice. There are two issues that arise with the standard prompt approach. First, it can overfit on the prompt template. Second, it requires manual effort to formulate the downstream task as a language model problem. In this paper, we propose an improvement to prompt-based fine-tuning that addresses these two issues. We refer to our approach as DynaMaR -- Dynamic Prompt with Mask Token Representation. Results show that DynaMaR can achieve an average improvement of 10% in few-shot settings and improvement of 3.7% in data-rich settings over the standard fine-tuning approach on four e-commerce applications.
    Collaborative Intelligence Orchestration: Inconsistency-Based Fusion of Semi-Supervised Learning and Active Learning. (arXiv:2206.03288v1 [cs.LG])
    Annotating decent amounts of data to satisfy sophisticated learning models can be cost-prohibitive for many real-world applications. Active learning (AL) and semi-supervised learning (SSL) are two effective, but often isolated, means to alleviate the data-hungry problem. Some recent studies explored the potential of combining AL and SSL to better probe the unlabeled data. However, almost all these contemporary SSL-AL works use a simple combination strategy, ignoring SSL and AL's inherent relation. Further, other methods suffer from high computational costs when dealing with large-scale, high-dimensional datasets. Motivated by the industry practice of labeling data, we propose an innovative Inconsistency-based virtual aDvErsarial Active Learning (IDEAL) algorithm to further investigate SSL-AL's potential superiority and achieve mutual enhancement of AL and SSL, i.e., SSL propagates label information to unlabeled samples and provides smoothed embeddings for AL, while AL excludes samples with inconsistent predictions and considerable uncertainty for SSL. We estimate unlabeled samples' inconsistency by augmentation strategies of different granularities, including fine-grained continuous perturbation exploration and coarse-grained data transformations. Extensive experiments, in both text and image domains, validate the effectiveness of the proposed algorithm, comparing it against state-of-the-art baselines. Two real-world case studies visualize the practical industrial value of applying and deploying the proposed data sampling algorithm.
    Boundary informed inverse PDE problems on discrete Riemann surfaces. (arXiv:2206.02911v1 [math.NA])
    We employ neural networks to tackle inverse partial differential equations on discretized Riemann surfaces with boundary. To this end, we introduce the concept of a graph with boundary which models these surfaces in a natural way. Our method uses a message passing technique to keep track of an unknown differential operator while using neural ODE solvers through the method of lines to capture the evolution in time. As training data, we use noisy and incomplete observations of sheaves on graphs at various timestamps. The novelty of this approach is in working with manifolds with nontrivial topology and utilizing the data on the graph boundary through a teacher forcing technique. Despite the increasing interest in learning dynamical systems from finite observations, many current methods are limited in two general ways: first, they work with topologically trivial spaces, and second, they fail to handle the boundary data on the ground space in a systematic way. The present work is an attempt at addressing these limitations. We run experiments with synthetic data of linear and nonlinear diffusion systems on orientable surfaces with positive genus and boundary, and moreover, provide evidence of improvements upon the existing paradigms.
    Fooling Explanations in Text Classifiers. (arXiv:2206.03178v1 [cs.LG])
    State-of-the-art text classification models are becoming increasingly reliant on deep neural networks (DNNs). Due to their black-box nature, faithful and robust explanation methods need to accompany classifiers for deployment in real-life scenarios. However, it has been shown in vision applications that explanation methods are susceptible to local, imperceptible perturbations that can significantly alter the explanations without changing the predicted classes. We show here that the existence of such perturbations extends to text classifiers as well. Specifically, we introduce TextExplanationFooler (TEF), a novel explanation attack algorithm that alters text input samples imperceptibly so that the outcome of widely-used explanation methods changes considerably while leaving classifier predictions unchanged. We evaluate the attribution robustness estimation performance of TEF on five sequence classification datasets, utilizing three DNN architectures and three transformer architectures for each dataset. TEF can significantly decrease the correlation between unchanged and perturbed input attributions, which shows that all models and explanation methods are susceptible to TEF perturbations. Moreover, we evaluate how the perturbations transfer to other model architectures and attribution methods, and show that TEF perturbations are also effective in scenarios where the target model and explanation method are unknown. Finally, we introduce a semi-universal attack that is able to compute fast, computationally light perturbations with no knowledge of the attacked classifier or explanation method. Overall, our work shows that explanations in text classifiers are very fragile and users need to carefully address their robustness before relying on them in critical applications.
    Impossibility of Collective Intelligence. (arXiv:2206.02786v1 [cs.LG])
    Democratization of AI involves training and deploying machine learning models across heterogeneous and potentially massive environments. Diversity of data opens up a number of possibilities to advance AI systems, but also introduces pressing concerns such as privacy, security, and equity that require special attention. This work shows that it is theoretically impossible to design a rational learning algorithm that has the ability to successfully learn across heterogeneous environments, which we decoratively call collective intelligence (CI). By representing learning algorithms as choice correspondences over a hypothesis space, we are able to axiomatize them with essential properties. Unfortunately, the only feasible algorithm compatible with all of the axioms is the standard empirical risk minimization (ERM) which learns arbitrarily from a single environment. Our impossibility result reveals informational incomparability between environments as one of the foremost obstacles for researchers who design novel algorithms that learn from multiple environments, which sheds light on prerequisites for success in critical areas of machine learning such as out-of-distribution generalization, federated learning, algorithmic fairness, and multi-modal learning.
    Graph Rationalization with Environment-based Augmentations. (arXiv:2206.02886v1 [cs.LG])
    Rationale is defined as a subset of input features that best explains or supports the prediction by machine learning models. Rationale identification has improved the generalizability and interpretability of neural networks on vision and language data. In graph applications such as molecule and polymer property prediction, identifying representative subgraph structures, called graph rationales, plays an essential role in the performance of graph neural networks. Existing graph pooling and/or distribution intervention methods suffer from a lack of examples to learn to identify optimal graph rationales. In this work, we introduce a new augmentation operation called environment replacement that automatically creates virtual data examples to improve rationale identification. We propose an efficient framework that performs rationale-environment separation and representation learning on the real and augmented examples in latent spaces to avoid the high complexity of explicit graph decoding and encoding. Comparing against recent techniques, experiments on seven molecular and four polymer real datasets demonstrate the effectiveness and efficiency of the proposed augmentation-based graph rationalization framework.
    Collaborative Linear Bandits with Adversarial Agents: Near-Optimal Regret Bounds. (arXiv:2206.02834v1 [cs.LG])
    We consider a linear stochastic bandit problem involving $M$ agents that can collaborate via a central server to minimize regret. A fraction $\alpha$ of these agents are adversarial and can act arbitrarily, leading to the following tension: while collaboration can potentially reduce regret, it can also disrupt the process of learning due to adversaries. In this work, we provide a fundamental understanding of this tension by designing new algorithms that balance the exploration-exploitation trade-off via carefully constructed robust confidence intervals. We also complement our algorithms with tight analyses. First, we develop a robust collaborative phased elimination algorithm that achieves $\tilde{O}\left(\alpha+ 1/\sqrt{M}\right) \sqrt{dT}$ regret for each good agent; here, $d$ is the model-dimension and $T$ is the horizon. For small $\alpha$, our result thus reveals a clear benefit of collaboration despite adversaries. Using an information-theoretic argument, we then prove a matching lower bound, thereby providing the first set of tight, near-optimal regret bounds for collaborative linear bandits with adversaries. Furthermore, by leveraging recent advances in high-dimensional robust statistics, we significantly extend our algorithmic ideas and results to (i) the generalized linear bandit model that allows for non-linear observation maps; and (ii) the contextual bandit setting that allows for time-varying feature vectors.
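    As a rough worked reading of the regret expression above, the multiplier on $\sqrt{dT}$ can be evaluated directly. The values of $M$ and $\alpha$ below are illustrative assumptions, not figures from the paper.

```latex
\[
\left(\alpha + \frac{1}{\sqrt{M}}\right)\sqrt{dT}
\quad\text{with } M = 100,\ \alpha = 0.05:
\qquad
\left(0.05 + \frac{1}{\sqrt{100}}\right)\sqrt{dT} = 0.15\,\sqrt{dT}.
\]
```

    Setting $M = 1$ and $\alpha = 0$ in the same expression gives a multiplier of 1, so in this illustration the leading factor shrinks by roughly 6.7x (ignoring the logarithmic factors hidden by $\tilde{O}$). As $M$ grows, the multiplier in this expression approaches $\alpha$, so the adversarial fraction acts as a floor that adding more agents does not remove.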
    Self-supervised Learning for Human Activity Recognition Using 700,000 Person-days of Wearable Data. (arXiv:2206.02909v1 [eess.SP])
    Advances in deep learning for human activity recognition have been relatively limited due to the lack of large labelled datasets. In this study, we leverage self-supervised learning techniques on the UK-Biobank activity tracker dataset--the largest of its kind to date--containing more than 700,000 person-days of unlabelled wearable sensor data. Our resulting activity recognition model consistently outperformed strong baselines across seven benchmark datasets, with an F1 relative improvement of 2.5%-100% (median 18.4%), the largest improvements occurring in the smaller datasets. In contrast to previous studies, our results generalise across external datasets, devices, and environments. Our open-source model will help researchers and developers to build customisable and generalisable activity classifiers with high performance.
    A Simple and Optimal Policy Design for Online Learning with Safety against Heavy-tailed Risk. (arXiv:2206.02969v1 [stat.ML])
    We design simple and optimal policies that ensure safety against heavy-tailed risk in the classical multi-armed bandit problem. We start by showing that some widely used policies such as the standard Upper Confidence Bound policy and the Thompson Sampling policy incur heavy-tailed risk; that is, the worst-case probability of incurring a linear regret slowly decays at a polynomial rate of $1/T$, where $T$ is the time horizon. We further show that this heavy-tailed risk exists for all "instance-dependent consistent" policies. To ensure safety against such heavy-tailed risk, for the two-armed bandit setting, we provide a simple policy design that (i) has the worst-case optimality for the expected regret at order $\tilde O(\sqrt{T})$ and (ii) has the worst-case tail probability of incurring a linear regret decay at an exponential rate $\exp(-\Omega(\sqrt{T}))$. We further prove that this exponentially decaying rate of the tail probability is optimal across all policies that have worst-case optimality for the expected regret. Finally, we improve the policy design and analysis to the general $K$-armed bandit setting. We provide a detailed characterization of the tail probability bound for any regret threshold under our policy design. Namely, the worst-case probability of incurring a regret larger than $x$ is upper bounded by $\exp(-\Omega(x/\sqrt{KT}))$. Numerical experiments are conducted to illustrate the theoretical findings. Our results reveal insights on the incompatibility between consistency and light-tailed risk, while indicating that worst-case optimality on expected regret and light-tailed risk are compatible.
    Universal Speech Enhancement with Score-based Diffusion. (arXiv:2206.03065v1 [cs.SD])
    Removing background noise from speech audio has been the subject of considerable research and effort, especially in recent years due to the rise of virtual communication and amateur sound recording. Yet background noise is not the only unpleasant disturbance that can prevent intelligibility: reverb, clipping, codec artifacts, problematic equalization, limited bandwidth, or inconsistent loudness are equally disturbing and ubiquitous. In this work, we propose to consider the task of speech enhancement as a holistic endeavor, and present a universal speech enhancement system that tackles 55 different distortions at the same time. Our approach consists of a generative model that employs score-based diffusion, together with a multi-resolution conditioning network that performs enhancement with mixture density networks. We show that this approach significantly outperforms the state of the art in a subjective test performed by expert listeners. We also show that it achieves competitive objective scores with just 4-8 diffusion steps, despite not considering any particular strategy for fast sampling. We hope that both our methodology and technical contributions encourage researchers and practitioners to adopt a universal approach to speech enhancement, possibly framing it as a generative task.
    Preconditioned Gradient Descent for Overparameterized Nonconvex Burer--Monteiro Factorization with Global Optimality Certification. (arXiv:2206.03345v1 [math.OC])
    We consider using gradient descent to minimize the nonconvex function $f(X)=\phi(XX^{T})$ over an $n\times r$ factor matrix $X$, in which $\phi$ is an underlying smooth convex cost function defined over $n\times n$ matrices. While only a second-order stationary point $X$ can be provably found in reasonable time, if $X$ is additionally rank deficient, then its rank deficiency certifies it as being globally optimal. This way of certifying global optimality necessarily requires the search rank $r$ of the current iterate $X$ to be overparameterized with respect to the rank $r^{\star}$ of the global minimizer $X^{\star}$. Unfortunately, overparameterization significantly slows down the convergence of gradient descent, from a linear rate with $r=r^{\star}$ to a sublinear rate when $r>r^{\star}$, even when $\phi$ is strongly convex. In this paper, we propose an inexpensive preconditioner that restores the convergence rate of gradient descent back to linear in the overparameterized case, while also making it agnostic to possible ill-conditioning in the global minimizer $X^{\star}$.
    Better Best of Both Worlds Bounds for Bandits with Switching Costs. (arXiv:2206.03098v1 [cs.LG])
    We study best-of-both-worlds algorithms for bandits with switching cost, recently addressed by Rouyer, Seldin and Cesa-Bianchi, 2021. We introduce a surprisingly simple and effective algorithm that simultaneously achieves a minimax optimal regret bound of $\mathcal{O}(T^{2/3})$ in the oblivious adversarial setting and a bound of $\mathcal{O}(\min\{\log (T)/\Delta^2,T^{2/3}\})$ in the stochastically-constrained regime, both with (unit) switching costs, where $\Delta$ is the gap between the arms. In the stochastically-constrained case, our bound improves over previous results due to Rouyer et al., which achieved regret of $\mathcal{O}(T^{1/3}/\Delta)$. We accompany our results with a lower bound showing that, in general, $\tilde{\Omega}(\min\{1/\Delta^2,T^{2/3}\})$ regret is unavoidable in the stochastically-constrained case for algorithms with $\mathcal{O}(T^{2/3})$ worst-case regret.
    Conditional Seq2Seq model for the time-dependent two-level system. (arXiv:2206.02889v1 [quant-ph])
    We apply a deep learning neural network architecture to the two-level system in quantum optics to solve the time-dependent Schrödinger equation. By carefully designing the network structure and tuning parameters, above 90 percent accuracy in super long-term predictions can be achieved in the case of random electric fields, which indicates a promising new method to solve the time-dependent equation for two-level systems. By slightly modifying this network, we think that this method can solve the two- or three-dimensional time-dependent Schrödinger equation more efficiently than traditional approaches.
    Group Meritocratic Fairness in Linear Contextual Bandits. (arXiv:2206.03150v1 [stat.ML])
    We study the linear contextual bandit problem where an agent has to select one candidate from a pool and each candidate belongs to a sensitive group. In this setting, candidates' rewards may not be directly comparable between groups, for example when the agent is an employer hiring candidates from different ethnic groups and some groups have a lower reward due to discriminatory bias and/or social injustice. We propose a notion of fairness that states that the agent's policy is fair when it selects a candidate with highest relative rank, which measures how good the reward is when compared to candidates from the same group. This is a very strong notion of fairness, since the relative rank is not directly observed by the agent and depends on the underlying reward model and on the distribution of rewards. Thus we study the problem of learning a policy which approximates a fair policy under the condition that the contexts are independent between groups and the distribution of rewards of each group is absolutely continuous. In particular, we design a greedy policy which at each round constructs a ridge regression estimator from the observed context-reward pairs, and then computes an estimate of the relative rank of each candidate using the empirical cumulative distribution function. We prove that the greedy policy achieves, after $T$ rounds, up to log factors and with high probability, a fair pseudo-regret of order $\sqrt{dT}$, where $d$ is the dimension of the context vectors. The policy also satisfies demographic parity at each round when averaged over all possible information available before the selection. We finally show with a proof of concept simulation that our policy achieves sub-linear fair pseudo-regret also in practice.
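    The greedy selection step described in the abstract above (ridge estimate, then relative rank via an empirical CDF) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a scalar context, a single shared parameter `theta`, and a per-group history of past estimated rewards. The names `ridge_1d`, `relative_rank`, and `greedy_fair_pick` are hypothetical.

```python
from bisect import bisect_right


def ridge_1d(contexts, rewards, lam=1.0):
    """Ridge estimate for a scalar context (simplifying assumption):
    argmin_theta sum_i (r_i - theta * x_i)^2 + lam * theta^2."""
    sxx = sum(x * x for x in contexts)
    sxr = sum(x * r for x, r in zip(contexts, rewards))
    return sxr / (sxx + lam)


def relative_rank(value, group_history):
    """Empirical CDF of `value` among past estimated rewards of the same group."""
    ordered = sorted(group_history)
    return bisect_right(ordered, value) / len(ordered)


def greedy_fair_pick(candidates, theta, history_by_group):
    """candidates: list of (group, scalar_context) pairs.
    Returns (index of the candidate with the highest estimated relative rank, all ranks)."""
    ranks = []
    for group, x in candidates:
        estimated_reward = theta * x
        ranks.append(relative_rank(estimated_reward, history_by_group[group]))
    best = max(range(len(candidates)), key=ranks.__getitem__)
    return best, ranks
```

    Because each candidate is ranked only against its own group's history, a candidate with a lower raw estimated reward can still be selected when it stands out within its group.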
    Neuro-Nav: A Library for Neurally-Plausible Reinforcement Learning. (arXiv:2206.03312v1 [cs.NE])
    In this work we propose Neuro-Nav, an open-source library for neurally plausible reinforcement learning (RL). RL is among the most common modeling frameworks for studying decision making, learning, and navigation in biological organisms. In utilizing RL, cognitive scientists often handcraft environments and agents to meet the needs of their particular studies. On the other hand, artificial intelligence researchers often struggle to find benchmarks for neurally and biologically plausible representation and behavior (e.g., in decision making or navigation). In order to streamline this process across both fields with transparency and reproducibility, Neuro-Nav offers a set of standardized environments and RL algorithms drawn from canonical behavioral and neural studies in rodents and humans. We demonstrate that the toolkit replicates relevant findings from a number of studies across both cognitive science and RL literatures. We furthermore describe ways in which the library can be extended with novel algorithms (including deep RL) and environments to address future research needs of the field.
    Group privacy for personalized federated learning. (arXiv:2206.03396v1 [cs.LG])
    Federated learning is a type of collaborative machine learning, where participating clients process their data locally, sharing only updates to the collaborative model. This enables, among other things, building privacy-aware distributed machine learning models. The goal is the optimization of a statistical model's parameters by minimizing a cost function of a collection of datasets which are stored locally by a set of clients. This process exposes the clients to two issues: leakage of private information and lack of personalization of the model. On the other hand, with the recent advancements in techniques to analyze data, there is a surge of concern for the privacy violation of the participating clients. To mitigate this, differential privacy and its variants serve as a standard for providing formal privacy guarantees. Often the clients represent very heterogeneous communities and hold data which are very diverse. Therefore, aligned with the recent focus of the FL community to build a framework of personalized models for the users representing their diversity, it is also of utmost importance to protect against potential threats against the sensitive and personal information of the clients. $d$-privacy, which is a generalization of geo-indistinguishability, the lately popularized paradigm of location privacy, uses a metric-based obfuscation technique that preserves the spatial distribution of the original data. To address the issue of protecting the privacy of the clients and allowing for personalized model training to enhance the fairness and utility of the system, we propose a method to provide group privacy guarantees exploiting some key properties of $d$-privacy which enable personalized models under the framework of FL. We provide theoretical justifications for its applicability and experimental validation on real-world datasets to illustrate how the proposed method works.
    Imitating Past Successes can be Very Suboptimal. (arXiv:2206.03378v1 [cs.LG])
    Prior work has proposed a simple strategy for reinforcement learning (RL): label experience with the outcomes achieved in that experience, and then imitate the relabeled experience. These outcome-conditioned imitation learning methods are appealing because of their simplicity, strong performance, and close ties with supervised learning. However, it remains unclear how these methods relate to the standard RL objective, reward maximization. In this paper, we prove that existing outcome-conditioned imitation learning methods do not necessarily improve the policy; rather, in some settings they can decrease the expected reward. Nonetheless, we show that a simple modification results in a method that does guarantee policy improvement, under some assumptions. Our aim is not to develop an entirely new method, but rather to explain how a variant of outcome-conditioned imitation learning can be used to maximize rewards.
    Efficient decentralized multi-agent learning in asymmetric queuing systems. (arXiv:2206.03324v1 [cs.LG])
    We study decentralized multi-agent learning in bipartite queuing systems, a standard model for service systems. In particular, $N$ agents request service from $K$ servers in a fully decentralized way, i.e., by running the same algorithm without communication. Previous decentralized algorithms are restricted to symmetric systems, have performance that degrades exponentially in the number of servers, require communication through shared randomness and unique agent identities, and are computationally demanding. In contrast, we provide a simple learning algorithm that, when run decentrally by each agent, leads the queuing system to have efficient performance in general asymmetric bipartite queuing systems while also having additional robustness properties. Along the way, we provide the first UCB-based algorithm for the centralized case of the problem, which resolves an open question by Krishnasamy et al. (2016, 2021).
    Marvolo: Programmatic Data Augmentation for Practical ML-Driven Malware Detection. (arXiv:2206.03265v1 [cs.CR])
    Data augmentation has been rare in the cyber security domain due to technical difficulties in altering data in a manner that is semantically consistent with the original data. This shortfall is particularly onerous given the unique difficulty of acquiring benign and malicious training data that runs into copyright restrictions, and that institutions like banks and governments receive targeted malware that will never exist in large quantities. We present MARVOLO, a binary mutator that programmatically grows malware (and benign) datasets in a manner that boosts the accuracy of ML-driven malware detectors. MARVOLO employs semantics-preserving code transformations that mimic the alterations that malware authors and defensive benign developers routinely make in practice, allowing us to generate meaningful augmented data. Crucially, semantics-preserving transformations also enable MARVOLO to safely propagate labels from original to newly-generated data samples without mandating expensive reverse engineering of binaries. Further, MARVOLO embeds several key optimizations that keep costs low for practitioners by maximizing the density of diverse data samples generated within a given time (or resource) budget. Experiments using wide-ranging commercial malware datasets and a recent ML-driven malware detector show that MARVOLO boosts accuracies by up to 5%, while operating on only a small fraction (15%) of the potential input binaries.
    Generalization Error Bounds for Deep Neural Networks Trained by SGD. (arXiv:2206.03299v1 [cs.LG])
    Generalization error bounds for deep neural networks trained by stochastic gradient descent (SGD) are derived by combining a dynamical control of an appropriate parameter norm and the Rademacher complexity estimate based on parameter norms. The bounds explicitly depend on the loss along the training trajectory, and work for a wide range of network architectures including multilayer perceptron (MLP) and convolutional neural networks (CNN). Compared with other algorithm-dependent generalization estimates such as uniform stability-based bounds, our bounds do not require $L$-smoothness of the nonconvex loss function, and apply directly to SGD instead of Stochastic Langevin gradient descent (SGLD). Numerical results show that our bounds are non-vacuous and robust to changes in the optimizer and network hyperparameters.
    AS2T: Arbitrary Source-To-Target Adversarial Attack on Speaker Recognition Systems. (arXiv:2206.03351v1 [cs.SD])
    Recent work has illuminated the vulnerability of speaker recognition systems (SRSs) against adversarial attacks, raising significant security concerns in deploying SRSs. However, prior studies considered only a few settings (e.g., some combinations of source and target speakers), leaving many interesting and important settings in real-world attack scenarios unexplored. In this work, we present AS2T, the first attack in this domain which covers all the settings, thus allowing the adversary to craft adversarial voices using arbitrary source and target speakers for any of three main recognition tasks. Since none of the existing loss functions can be applied to all the settings, we explore many candidate loss functions for each setting including the existing and newly designed ones. We thoroughly evaluate their efficacy and find that some existing loss functions are suboptimal. Then, to improve the robustness of AS2T for practical over-the-air attacks, we study the possible distortions that occur in over-the-air transmission, utilize different transformation functions with different parameters to model those distortions, and incorporate them into the generation of adversarial voices. Our simulated over-the-air evaluation validates the effectiveness of our solution in producing robust adversarial voices which remain effective under various hardware devices and various acoustic environments with different reverberation, ambient noises, and noise levels. Finally, we leverage AS2T to perform the largest-scale evaluation thus far to understand transferability among 14 diverse SRSs. The transferability analysis provides many interesting and useful insights which challenge several findings and conclusions drawn in previous works in the image domain. Our study also sheds light on future directions of adversarial attacks in the speaker recognition domain.
    Decentralized Low-Latency Collaborative Inference via Ensembles on the Edge. (arXiv:2206.03165v1 [cs.LG])
    The success of deep neural networks (DNNs) is heavily dependent on computational resources. While DNNs are often employed on cloud servers, there is a growing need to operate DNNs on edge devices. Edge devices are typically limited in their computational resources, yet multiple edge devices are often deployed in the same environment and can reliably communicate with each other. In this work we propose to facilitate the application of DNNs on the edge by allowing multiple users to collaborate during inference to improve their accuracy. Our mechanism, coined edge ensembles, is based on having diverse predictors at each device, which form an ensemble of models during inference. To mitigate the communication overhead, the users share quantized features, and we propose a method for aggregating multiple decisions into a single inference rule. We analyze the latency induced by edge ensembles, showing that its performance improvement comes at the cost of a minor additional delay under common assumptions on the communication network. Our experiments demonstrate that collaborative inference via edge ensembles equipped with compact DNNs substantially improves the accuracy over having each user infer locally, and can outperform using a single centralized DNN larger than all the networks in the ensemble together.
    Subject Membership Inference Attacks in Federated Learning. (arXiv:2206.03317v1 [cs.LG])
    Privacy in Federated Learning (FL) is studied at two different granularities: item-level, which protects individual data points, and user-level, which protects each user (participant) in the federation. Nearly all of the private FL literature is dedicated to studying privacy attacks and defenses at these two granularities. Recently, subject-level privacy has emerged as an alternative privacy granularity to protect the privacy of individuals (data subjects) whose data is spread across multiple (organizational) users in cross-silo FL settings. An adversary might be interested in recovering private information about these individuals (a.k.a. data subjects) by attacking the trained model. A systematic study of these patterns requires complete control over the federation, which is impossible with real-world datasets. We design a simulator for generating various synthetic federation configurations, enabling us to study how properties of the data, model design and training, and the federation itself impact subject privacy risk. We propose three attacks for subject membership inference and examine the interplay between all factors within a federation that affect the attacks' efficacy. We also investigate the effectiveness of Differential Privacy in mitigating this threat. Our takeaways generalize to real-world datasets like FEMNIST, giving credence to our findings.
    GRETEL: A unified framework for Graph Counterfactual Explanation Evaluation. (arXiv:2206.02957v1 [cs.LG])
    Machine Learning (ML) systems are a building block of the modern tools that affect our daily life in several application domains. Due to their black-box nature, those systems are rarely adopted in application domains (e.g. health, finance) where understanding the decision process is of paramount importance. Explanation methods were developed to explain how the ML model has taken a specific decision for a given case/instance. Graph Counterfactual Explanations (GCE) is one of the explanation techniques adopted in the Graph Learning domain. The existing works on Graph Counterfactual Explanations diverge mostly in the problem definition, application domain, test data, and evaluation metrics, and most existing works do not compare exhaustively against other counterfactual explanation techniques present in the literature. We present GRETEL, a unified framework to develop and test GCE methods in several settings. GRETEL is a highly extensible evaluation framework which promotes Open Science and the reproducibility of evaluations by providing a set of well-defined mechanisms to easily integrate and manage both real and synthetic datasets, ML models, state-of-the-art explanation techniques, and evaluation measures. To present GRETEL, we show the experiments conducted to integrate and test several synthetic and real datasets with several existing explanation techniques and base ML models.
    Recent Advances in Bayesian Optimization. (arXiv:2206.03301v1 [cs.LG])
    Bayesian optimization has emerged at the forefront of expensive black-box optimization due to its data efficiency. Recent years have witnessed a proliferation of studies on the development of new Bayesian optimization algorithms and their applications. Hence, this paper attempts to provide a comprehensive and updated survey of recent advances in Bayesian optimization and identify interesting open problems. We categorize the existing work on Bayesian optimization into nine main groups according to the motivations and focus of the proposed algorithms. For each category, we present the main advances with respect to the construction of surrogate models and adaptation of the acquisition functions. Finally, we discuss the open questions and suggest promising future research directions, in particular with regard to heterogeneity, privacy preservation, and fairness in distributed and federated optimization systems.
    Self-Knowledge Distillation based Self-Supervised Learning for Covid-19 Detection from Chest X-Ray Images. (arXiv:2206.03009v1 [eess.IV])
    The global outbreak of the Coronavirus 2019 (COVID-19) has overloaded worldwide healthcare systems. Computer-aided diagnosis for fast COVID-19 detection and patient triage is becoming critical. This paper proposes a novel self-knowledge distillation based self-supervised learning method for COVID-19 detection from chest X-ray images. Our method can use self-knowledge of images based on similarities of their visual features for self-supervised learning. Experimental results show that our method achieved an HM score of 0.988, an AUC of 0.999, and an accuracy of 0.957 on the largest open COVID-19 chest X-ray dataset.
    Decomposed Linear Dynamical Systems (dLDS) for learning the latent components of neural dynamics. (arXiv:2206.02972v1 [stat.ML])
    Learning interpretable representations of neural dynamics at a population level is a crucial first step to understanding how neural activity relates to perception and behavior. Models of neural dynamics often focus on either low-dimensional projections of neural activity, or on learning dynamical systems that explicitly relate to the neural state over time. We discuss how these two approaches are interrelated by considering dynamical systems as representative of flows on a low-dimensional manifold. Building on this concept, we propose a new decomposed dynamical system model that represents complex non-stationary and nonlinear dynamics of time-series data as a sparse combination of simpler, more interpretable components. The decomposed nature of the dynamics generalizes over previous switched approaches and enables modeling of overlapping and non-stationary drifts in the dynamics. We further present a dictionary learning-driven approach to model fitting, where we leverage recent results in tracking sparse vectors over time. We demonstrate that our model can learn efficient representations and smooth transitions between dynamical modes in both continuous-time and discrete-time examples. We show results on low-dimensional linear and nonlinear attractors to demonstrate that our decomposed dynamical systems model can well approximate nonlinear dynamics. Additionally, we apply our model to C. elegans data, illustrating a diversity of dynamics that is obscured when classified into discrete states.
    Recall Distortion in Neural Network Pruning and the Undecayed Pruning Algorithm. (arXiv:2206.02976v1 [cs.LG])
    Pruning techniques have been successfully used in neural networks to trade accuracy for sparsity. However, the impact of network pruning is not uniform: prior work has shown that the recall for underrepresented classes in a dataset may be more negatively affected. In this work, we study such relative distortions in recall by hypothesizing an intensification effect that is inherent to the model. Namely, that pruning makes recall relatively worse for a class with recall below accuracy and, conversely, that it makes recall relatively better for a class with recall above accuracy. In addition, we propose a new pruning algorithm aimed at attenuating such effect. Through statistical analysis, we have observed that intensification is less severe with our algorithm but nevertheless more pronounced with relatively more difficult tasks, less complex models, and higher pruning ratios. More surprisingly, we conversely observe a de-intensification effect with lower pruning ratios.
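    One rough way to observe the effect hypothesized above is to compare each class's recall with overall accuracy before and after pruning. The sketch below is only an illustration under assumptions: the function names and the specific shift measure (change in recall minus accuracy) are hypothetical and are not taken from the paper.

```python
def recall_and_accuracy(y_true, y_pred):
    """Return overall accuracy and a dict of per-class recall."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    recall = {}
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recall[c] = sum(y_pred[i] == c for i in idx) / len(idx)
    return accuracy, recall


def relative_recall_shift(y_true, pred_dense, pred_pruned):
    """Per-class change in (recall - accuracy) from the dense to the pruned model.
    A negative value means the class's recall got worse relative to accuracy."""
    acc_d, rec_d = recall_and_accuracy(y_true, pred_dense)
    acc_p, rec_p = recall_and_accuracy(y_true, pred_pruned)
    return {c: (rec_p[c] - acc_p) - (rec_d[c] - acc_d) for c in rec_d}
```

    Under the intensification hypothesis, classes whose dense-model recall is below accuracy would tend to show a negative shift, and classes above accuracy a positive one.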
    An Empirical Study of IoT Security Aspects at Sentence-Level in Developer Textual Discussions. (arXiv:2206.03079v1 [cs.CR])
    IoT is a rapidly emerging paradigm that now encompasses almost every aspect of our modern life. As such, ensuring the security of IoT devices is crucial. IoT devices can differ from traditional computing devices, so designing and implementing proper security measures for them can be challenging. We observed that IoT developers discuss their security-related challenges in developer forums like Stack Overflow (SO). However, we find that IoT security discussions can also be buried inside non-security discussions in SO. In this paper, we aim to understand the challenges IoT developers face while applying security practices and techniques to IoT devices. We have two goals: (1) Develop a model that can automatically find security-related IoT discussions in SO, and (2) Study the model output to learn about IoT developer security-related challenges. First, we download 53K posts from SO that contain discussions about IoT. Second, we manually label 5,919 sentences from the 53K posts as 1 or 0. Third, we use this benchmark to investigate a suite of deep learning transformer models. The best performing model is called SecBot. Fourth, we apply SecBot to all of the posts and find around 30K security-related sentences. Fifth, we apply topic modeling to the security-related sentences. Then we label and categorize the topics. Sixth, we analyze the evolution of the topics in SO. We found that (1) SecBot is based on the retraining of the deep learning model RoBERTa. SecBot offers the best F1-Score of 0.935, (2) there are six error categories in misclassified samples by SecBot. SecBot was mostly wrong when the keywords/contexts were ambiguous (e.g., gateway can be a security gateway or a simple gateway), (3) there are 9 security topics grouped into three categories: Software, Hardware, and Network, and (4) the highest number of topics belongs to software security, followed by network security.
    Histogram Estimation under User-level Privacy with Heterogeneous Data. (arXiv:2206.03008v1 [cs.LG])
    We study the problem of histogram estimation under user-level differential privacy, where the goal is to preserve the privacy of all entries of any single user. While there is abundant literature on this classical problem under the item-level privacy setup where each user contributes only one data point, little has been known for the user-level counterpart. We consider the heterogeneous scenario where both the quantity and distribution of data can be different for each user. We propose an algorithm based on a clipping strategy that almost achieves a two-approximation with respect to the best clipping threshold in hindsight. This result holds without any distribution assumptions on the data. We also prove that the clipping bias can be significantly reduced when the counts are from non-i.i.d. Poisson distributions and show empirically that our debiasing method provides improvements even without such constraints. Experiments on both real and synthetic datasets verify our theoretical findings and demonstrate the effectiveness of our algorithms.
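    A generic clip-then-add-noise histogram, in the spirit of the clipping strategy mentioned above, might look like the following. This is a hedged sketch rather than the paper's algorithm: it takes the clipping threshold `tau` as given (choosing it well is the paper's contribution), assumes non-negative counts over a fixed known set of bins, assumes add/remove-one-user neighboring datasets, and uses the standard Laplace mechanism. All function names are hypothetical.

```python
import random


def clip_user(counts, tau):
    """Scale one user's count vector down so that its L1 norm is at most tau."""
    total = sum(counts.values())
    if total <= tau:
        return dict(counts)
    scale = tau / total
    return {k: v * scale for k, v in counts.items()}


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_histogram(users, bins, tau, epsilon, rng=None):
    """users: list of dicts mapping bin -> count (all keys assumed to be in `bins`).
    After clipping, adding or removing one user changes the summed histogram by
    at most tau in L1 norm, so Laplace noise with scale tau / epsilon is added per bin."""
    if rng is None:
        rng = random.Random()
    hist = {b: 0.0 for b in bins}
    for counts in users:
        for k, v in clip_user(counts, tau).items():
            hist[k] += v
    scale = tau / epsilon
    return {b: hist[b] + laplace_noise(scale, rng) for b in bins}
```

    A smaller `tau` lowers the noise scale but biases the counts of heavy contributors downward, which is the bias-variance tension that a data-dependent threshold tries to balance.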
    RORL: Robust Offline Reinforcement Learning via Conservative Smoothing. (arXiv:2206.02829v1 [cs.LG])
    Offline reinforcement learning (RL) provides a promising direction to exploit the massive amount of offline data for complex decision-making tasks. Due to the distribution shift issue, current offline RL algorithms are generally designed to be conservative for value estimation and action selection. However, such conservatism impairs the robustness of learned policies, leading to a significant change even for a small perturbation on observations. To trade off robustness and conservatism, we propose Robust Offline Reinforcement Learning (RORL) with a novel conservative smoothing technique. In RORL, we explicitly introduce regularization on the policy and the value function for states near the dataset and additional conservative value estimation on these OOD states. Theoretically, we show RORL enjoys a tighter suboptimality bound than recent theoretical results in linear MDPs. We demonstrate that RORL can achieve the state-of-the-art performance on the general offline RL benchmark and is considerably robust to adversarial observation perturbation.
    CitySpec: An Intelligent Assistant System for Requirement Specification in Smart Cities. (arXiv:2206.03132v1 [cs.AI])
    An increasing number of monitoring systems have been developed in smart cities to ensure that real-time operations of a city satisfy safety and performance requirements. However, many existing city requirements are written in English with missing, inaccurate, or ambiguous information. There is a high demand for assisting city policy makers in converting human-specified requirements to machine-understandable formal specifications for monitoring systems. To tackle this limitation, we build CitySpec, the first intelligent assistant system for requirement specification in smart cities. To create CitySpec, we first collect over 1,500 real-world city requirements across different domains from over 100 cities and extract city-specific knowledge to generate a dataset of city vocabulary with 3,061 words. We also build a translation model and enhance it through requirement synthesis and develop a novel online learning framework with validation under uncertainty. The evaluation results on real-world city requirements show that CitySpec increases the sentence-level accuracy of requirement specification from 59.02% to 86.64%, and has strong adaptability to a new city and a new domain (e.g., F1 score for requirements in Seattle increases from 77.6% to 93.75% with online learning).
    Instance-Dependent Label-Noise Learning with Manifold-Regularized Transition Matrix Estimation. (arXiv:2206.02791v1 [cs.LG])
    In label-noise learning, estimating the transition matrix has attracted increasing attention because the matrix plays an important role in building statistically consistent classifiers. However, it is very challenging to estimate the transition matrix T(x), where x denotes the instance, because it is unidentifiable under instance-dependent noise (IDN). To address this problem, we note that there is psychological and physiological evidence showing that humans are more likely to annotate instances of similar appearance with the same classes, and thus poor-quality or ambiguous instances of similar appearance are more easily mislabeled as the correlated or same noisy classes. Therefore, we propose an assumption on the geometry of T(x): "the closer two instances are, the more similar their corresponding transition matrices should be". More specifically, we formulate this assumption as a manifold embedding, to effectively reduce the degrees of freedom of T(x) and make it stably estimable in practice. The proposed manifold-regularized technique works by directly reducing the estimation error without hurting the approximation error of the estimation problem for T(x). Experimental evaluations on four synthetic and two real-world datasets demonstrate that our method is superior to state-of-the-art approaches for label-noise learning under the challenging IDN.
    Predicting Electricity Infrastructure Induced Wildfire Risk in California. (arXiv:2206.02930v1 [eess.SY])
    This paper examines the use of risk models to predict the timing and location of wildfires caused by electricity infrastructure. Our data include historical ignition and wire-down points triggered by grid infrastructure collected between 2015 and 2019 in Pacific Gas & Electricity territory along with various weather, vegetation, and very high resolution data on grid infrastructure including location, age, and materials. With these data we explore a range of machine learning methods and strategies to manage training data imbalance. The best area under the receiver operating characteristic we obtain is 0.776 for distribution feeder ignitions and 0.824 for transmission line wire-down events, both using the histogram-based gradient boosting tree algorithm (HGB) with under-sampling. We then use these models to identify which information provides the most predictive value. After line length, we find that weather and vegetation features dominate the list of top important features for ignition or wire-down risk. Distribution ignition models show more dependence on slow-varying vegetation variables such as burn index, energy release content, and tree height, whereas transmission wire-down models rely more on primary weather variables such as wind speed and precipitation. These results point to the importance of improved vegetation modeling for feeder ignition risk models, and improved weather forecasting for transmission wire-down models. We observe that infrastructure features make small but meaningful improvements to risk model predictive power.
    Sampling without Replacement Leads to Faster Rates in Finite-Sum Minimax Optimization. (arXiv:2206.02953v1 [math.OC])
    We analyze the convergence rates of stochastic gradient algorithms for smooth finite-sum minimax optimization and show that, for many such algorithms, sampling the data points without replacement leads to faster convergence compared to sampling with replacement. For the smooth and strongly convex-strongly concave setting, we consider gradient descent ascent and the proximal point method, and present a unified analysis of two popular without-replacement sampling strategies, namely Random Reshuffling (RR), which shuffles the data every epoch, and Single Shuffling or Shuffle Once (SO), which shuffles only at the beginning. We obtain tight convergence rates for RR and SO and demonstrate that these strategies lead to faster convergence than uniform sampling. Moving beyond convexity, we obtain similar results for smooth nonconvex-nonconcave objectives satisfying a two-sided Polyak-Łojasiewicz inequality. Finally, we demonstrate that our techniques are general enough to analyze the effect of data-ordering attacks, where an adversary manipulates the order in which data points are supplied to the optimizer. Our analysis also recovers tight rates for the incremental gradient method, where the data points are not shuffled at all.
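    The three sampling orders compared in the abstract above can be illustrated with a minimal sketch. It only generates index orders and does not implement the paper's minimax algorithms; the function names are hypothetical, chosen for this illustration.

```python
import random

def with_replacement(n, epochs, rng):
    """Uniform sampling: every step draws an index independently."""
    return [[rng.randrange(n) for _ in range(n)] for _ in range(epochs)]

def random_reshuffling(n, epochs, rng):
    """RR: a fresh permutation of all n indices in every epoch."""
    orders = []
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)
        orders.append(perm)
    return orders

def shuffle_once(n, epochs, rng):
    """SO: one permutation drawn up front and reused in every epoch."""
    perm = list(range(n))
    rng.shuffle(perm)
    return [list(perm) for _ in range(epochs)]
```

    Under RR and SO every data point is visited exactly once per epoch, whereas with-replacement sampling may repeat some indices and skip others within an epoch.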
    Zeroth-Order SciML: Non-intrusive Integration of Scientific Software with Deep Learning. (arXiv:2206.02785v1 [cs.LG])
    Using deep learning (DL) to accelerate and/or improve scientific workflows can yield discoveries that are otherwise impossible. Unfortunately, DL models have yielded limited success in complex scientific domains due to large data requirements. In this work, we propose to overcome this issue by integrating the abundance of scientific knowledge sources (SKS) with the DL training process. Existing knowledge integration approaches are limited to differentiable knowledge sources, which are needed for compatibility with the first-order DL training paradigm. In contrast, our proposed approach treats the knowledge source as a black box, which in turn allows virtually any knowledge source to be integrated. To enable end-to-end training of SKS-coupled DL, we propose zeroth-order optimization (ZOO) based gradient-free training schemes, which are non-intrusive, i.e., they do not require any changes to the SKS. We evaluate the performance of our ZOO training scheme on two real-world material science applications. We show that the proposed scheme effectively integrates scientific knowledge with DL training and outperforms a purely data-driven model in data-limited scientific applications. We also discuss some limitations of the proposed method and mention potentially worthwhile future directions.
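    As a rough illustration of gradient-free training against a black box, the sketch below uses a generic two-point random-direction estimator, a common zeroth-order technique. It is not necessarily the scheme used in the paper; `black_box`, `zo_gradient`, `zo_descent`, and all parameter values are assumptions made for this example.

```python
import random

def zo_gradient(f, x, rng, mu=1e-4, num_dirs=20):
    """Two-point zeroth-order gradient estimate of f at x, using only evaluations of f."""
    d = len(x)
    grad = [0.0] * d
    for _ in range(num_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]  # random probe direction
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)  # directional derivative estimate
        for i in range(d):
            grad[i] += slope * u[i] / num_dirs
    return grad

def black_box(x):
    # Hypothetical stand-in for a scientific code: only input -> output is visible.
    return sum((xi - 1.0) ** 2 for xi in x)

def zo_descent(f, x0, steps=300, lr=0.05, seed=0):
    """Plain descent driven by the zeroth-order estimate instead of backpropagation."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = zo_gradient(f, x, rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x
```

    Calling `zo_descent(black_box, [3.0, -2.0])` drives both coordinates toward 1.0 without ever differentiating `black_box`, which is the sense in which such schemes are non-intrusive.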
    Remember the Past: Distilling Datasets into Addressable Memories for Neural Networks. (arXiv:2206.02916v1 [cs.LG])
    We propose an algorithm that compresses the critical information of a large dataset into compact addressable memories. These memories can then be recalled to quickly re-train a neural network and recover the performance (instead of storing and re-training on the full original dataset). Building upon the dataset distillation framework, we make a key observation that a shared common representation allows for more efficient and effective distillation. Concretely, we learn a set of bases (aka "memories") which are shared between classes and combined through learned flexible addressing functions to generate a diverse set of training examples. This leads to several benefits: 1) the size of compressed data does not necessarily grow linearly with the number of classes; 2) an overall higher compression rate with more effective distillation is achieved; and 3) more generalized queries are allowed beyond recalling the original classes. We demonstrate state-of-the-art results on the dataset distillation task across five benchmarks, including up to 16.5% and 9.7% in retained accuracy improvement when distilling CIFAR10 and CIFAR100 respectively. We then leverage our framework to perform continual learning, achieving state-of-the-art results on four benchmarks, with 23.2% accuracy improvement on MANY.
    Flexible Group Fairness Metrics for Survival Analysis. (arXiv:2206.03256v1 [cs.CY])
    Algorithmic fairness is an increasingly important field concerned with detecting and mitigating biases in machine learning models. There is a wealth of literature on algorithmic fairness in regression and classification; however, there has been little exploration of the field for survival analysis. Survival analysis is the prediction task in which one attempts to predict the probability of an event occurring over time. Survival predictions are particularly important in sensitive settings such as when utilising machine learning for diagnosis and prognosis of patients. In this paper we explore how to utilise existing survival metrics to measure bias with group fairness metrics. We explore this in an empirical experiment with 29 survival datasets and 8 measures. We find that measures of discrimination are able to capture bias well whereas there is less clarity with measures of calibration and scoring rules. We suggest further areas for research including prediction-based fairness metrics for distribution predictions.
    Discrete State-Action Abstraction via the Successor Representation. (arXiv:2206.03467v1 [cs.AI])
    When reinforcement learning is applied with sparse rewards, agents must spend a prohibitively long time exploring the unknown environment without any learning signal. Abstraction is one approach that provides the agent with an intrinsic reward for transitioning in a latent space. Prior work focuses on dense continuous latent spaces, or requires the user to manually provide the representation. Our approach is the first for automatically learning a discrete abstraction of the underlying environment. Moreover, our method works on arbitrary input spaces, using an end-to-end trainable regularized successor representation model. For transitions between abstract states, we train a set of temporally extended actions in the form of options, i.e., an action abstraction. Our proposed algorithm, Discrete State-Action Abstraction (DSAA), iteratively swaps between training these options and using them to efficiently explore more of the environment to improve the state abstraction. As a result, our model is not only useful for transfer learning but also in the online learning setting. We empirically show that our agent is able to explore the environment and solve provided tasks more efficiently than baseline reinforcement learning algorithms. Our code is publicly available at https://github.com/amnonattali/dsaa.
    Goal-Space Planning with Subgoal Models. (arXiv:2206.02902v1 [cs.LG])
    This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning and avoids learning the transition dynamics entirely. We show that our GSP algorithm can learn significantly faster than a Double DQN baseline in a variety of situations.
    A Human-Centric Take on Model Monitoring. (arXiv:2206.02868v1 [cs.LG])
    Predictive models are increasingly used to make various consequential decisions in high-stakes domains such as healthcare, finance, and policy. It becomes critical to ensure that these models make accurate predictions, are robust to shifts in the data, do not rely on spurious features, and do not unduly discriminate against minority groups. To this end, several approaches spanning various areas such as explainability, fairness, and robustness have been proposed in recent literature. Such approaches need to be human-centered as they cater to the understanding of the models to their users. However, there is a research gap in understanding the human-centric needs and challenges of monitoring machine learning (ML) models once they are deployed. To fill this gap, we conducted an interview study with 13 practitioners who have experience at the intersection of deploying ML models and engaging with customers spanning domains such as financial services, healthcare, hiring, online retail, computational advertising, and conversational assistants. We identified various human-centric challenges and requirements for model monitoring in real-world applications. Specifically, we found the need and the challenge for the model monitoring systems to clarify the impact of the monitoring observations on outcomes. Further, such insights must be actionable, robust, customizable for domain-specific use cases, and cognitively considerate to avoid information overload.
    Training Subset Selection for Weak Supervision. (arXiv:2206.02914v1 [stat.ML])
    Existing weak supervision approaches use all the data covered by weak signals to train a classifier. We show both theoretically and empirically that this is not always optimal. Intuitively, there is a tradeoff between the amount of weakly-labeled data and the precision of the weak labels. We explore this tradeoff by combining pretrained data representations with the cut statistic (Muhlenbach et al., 2004) to select (hopefully) high-quality subsets of the weakly-labeled training data. Subset selection applies to any label model and classifier and is very simple to plug in to existing weak supervision pipelines, requiring just a few lines of code. We show our subset selection method improves the performance of weak supervision for a wide range of label models, classifiers, and datasets. Using less weakly-labeled data improves the accuracy of weak supervision pipelines by up to 19% (absolute) on benchmark tasks.
    Shuffled Check-in: Privacy Amplification towards Practical Distributed Learning. (arXiv:2206.03151v1 [cs.LG])
    Recent studies of distributed computation with formal privacy guarantees, such as differentially private (DP) federated learning, leverage random sampling of clients in each round (privacy amplification by subsampling) to achieve satisfactory levels of privacy. Achieving this, however, requires strong assumptions which may not hold in practice, including precise and uniform subsampling of clients, and a highly trusted aggregator to process clients' data. In this paper, we explore a more practical protocol, shuffled check-in, to resolve the aforementioned issues. The protocol relies on clients making independent and random decisions to participate in the computation, removing the requirement of server-initiated subsampling and enabling robust modelling of client dropouts. Moreover, a weaker trust model known as the shuffle model is employed instead of using a trusted aggregator. To this end, we introduce new tools to characterize the Rényi differential privacy (RDP) of shuffled check-in. We show that our new techniques improve the privacy guarantee by at least a factor of three over those using approximate DP's strong composition in various parameter regimes. Furthermore, we provide a numerical approach to track the privacy of generic shuffled check-in mechanisms, including distributed stochastic gradient descent (SGD) with the Gaussian mechanism. To the best of our knowledge, this is also the first evaluation of the Gaussian mechanism within the local/shuffle model under the distributed setting in the literature, which can be of independent interest.
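    The participation step described above can be sketched as follows. This is a simplified illustration only: the local randomizer that each client would apply to its report is omitted, and the function name and the check-in probability `gamma` are assumptions made for this example rather than the paper's notation.

```python
import random

def shuffled_check_in(client_reports, gamma, rng):
    """Each client independently checks in with probability gamma;
    the checked-in reports are shuffled before the server sees them."""
    # Client-side decision: no server-initiated subsampling is needed.
    checked_in = [r for r in client_reports if rng.random() < gamma]
    # Shuffler: removes the link between a report and its position/sender.
    rng.shuffle(checked_in)
    return checked_in
```

    Because every client decides on its own, the number of participants in a round is random, which is also how client dropouts can be modelled.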
    Improving Mini-batch Optimal Transport via Partial Transportation. (arXiv:2108.09645v4 [stat.ML] UPDATED)
    Mini-batch optimal transport (m-OT) has been widely used recently to deal with the memory issue of OT in large-scale applications. Despite its practicality, m-OT suffers from misspecified mappings, namely, mappings that are optimal on the mini-batch level but are partially wrong in comparison with the optimal transportation plan between the original measures. Motivated by the misspecified mappings issue, we propose a novel mini-batch method by using partial optimal transport (POT) between mini-batch empirical measures, which we refer to as mini-batch partial optimal transport (m-POT). Leveraging the insight from the partial transportation, we explain the source of misspecified mappings from the m-OT and motivate why limiting the amount of transported masses among mini-batches via POT can alleviate the incorrect mappings. Finally, we carry out extensive experiments on various applications such as deep domain adaptation, partial domain adaptation, deep generative model, color transfer, and gradient flow to demonstrate the favorable performance of m-POT compared to current mini-batch methods.
    Unbiased estimators for random design regression. (arXiv:1907.03411v2 [stat.ML] UPDATED)
    In linear regression we wish to estimate the optimum linear least squares predictor for a distribution over $d$-dimensional input points and real-valued responses, based on a small sample. Under standard random design analysis, where the sample is drawn i.i.d. from the input distribution, the least squares solution for that sample can be viewed as the natural estimator of the optimum. Unfortunately, this estimator almost always incurs an undesirable bias coming from the randomness of the input points, which is a significant bottleneck in model averaging. In this paper we show that it is possible to draw a non-i.i.d. sample of input points such that, regardless of the response model, the least squares solution is an unbiased estimator of the optimum. Moreover, this sample can be produced efficiently by augmenting a previously drawn i.i.d. sample with an additional set of $d$ points, drawn jointly according to a certain determinantal point process constructed from the input distribution rescaled by the squared volume spanned by the points. Motivated by this, we develop a theoretical framework for studying volume-rescaled sampling, and in the process prove a number of new matrix expectation identities. We use them to show that for any input distribution and $\epsilon>0$ there is a random design consisting of $O(d\log d+ d/\epsilon)$ points from which an unbiased estimator can be constructed whose expected square loss over the entire distribution is bounded by $1+\epsilon$ times the loss of the optimum. We provide efficient algorithms for generating such unbiased estimators in a number of practical settings and support our claims experimentally.
    Demystifying the Global Convergence Puzzle of Learning Over-parameterized ReLU Nets in Very High Dimensions. (arXiv:2206.03254v1 [cs.LG])
    This theoretical paper is devoted to developing a rigorous theory for demystifying the global convergence phenomenon in a challenging scenario: learning over-parameterized Rectified Linear Unit (ReLU) nets for very high-dimensional datasets under very mild assumptions. A major ingredient of our analysis is a fine-grained analysis of random activation matrices. The essential virtue of dissecting activation matrices is that it bridges the dynamics of optimization and angular distribution in high-dimensional data space. This angle-based detailed analysis leads to asymptotic characterizations of gradient norm and directional curvature of objective function at each gradient descent iteration, revealing that the empirical loss function enjoys nice geometrical properties in the overparameterized setting. Along the way, we significantly improve existing theoretical bounds on both over-parameterization condition and learning rate with very mild assumptions for learning very high dimensional data. Moreover, we uncover the role of the geometrical and spectral properties of the input data in determining desired over-parameterization size and global convergence rate. All these clues allow us to discover a novel geometric picture of nonconvex optimization in deep learning: angular distribution in high-dimensional data space $\mapsto$ spectrums of overparameterized activation matrices $\mapsto$ favorable geometrical properties of empirical loss landscape $\mapsto$ global convergence phenomenon. Furthermore, our theoretical results imply that gradient-based nonconvex optimization algorithms have much stronger statistical guarantees with much milder over-parameterization condition than existing theory states for learning very high dimensional data, which is rarely explored so far.
    Preconditioned Gradient Descent for Overparameterized Nonconvex Burer--Monteiro Factorization with Global Optimality Certification. (arXiv:2206.03345v1 [math.OC])
    We consider using gradient descent to minimize the nonconvex function $f(X)=\phi(XX^{T})$ over an $n\times r$ factor matrix $X$, in which $\phi$ is an underlying smooth convex cost function defined over $n\times n$ matrices. While only a second-order stationary point $X$ can be provably found in reasonable time, if $X$ is additionally rank deficient, then its rank deficiency certifies it as being globally optimal. This way of certifying global optimality necessarily requires the search rank $r$ of the current iterate $X$ to be overparameterized with respect to the rank $r^{\star}$ of the global minimizer $X^{\star}$. Unfortunately, overparameterization significantly slows down the convergence of gradient descent, from a linear rate with $r=r^{\star}$ to a sublinear rate when $r>r^{\star}$, even when $\phi$ is strongly convex. In this paper, we propose an inexpensive preconditioner that restores the convergence rate of gradient descent back to linear in the overparameterized case, while also making it agnostic to possible ill-conditioning in the global minimizer $X^{\star}$.
    Beyond spectral gap: The role of the topology in decentralized learning. (arXiv:2206.03093v1 [cs.LG])
    In data-parallel optimization of machine learning models, workers collaborate to improve their estimates of the model: more accurate gradients allow them to use larger learning rates and optimize faster. We consider the setting in which all workers sample from the same dataset, and communicate over a sparse graph (decentralized). In this setting, current theory fails to capture important aspects of real-world behavior. First, the 'spectral gap' of the communication graph is not predictive of its empirical performance in (deep) learning. Second, current theory does not explain that collaboration enables larger learning rates than training alone. In fact, it prescribes smaller learning rates, which further decrease as graphs become larger, failing to explain convergence in infinite graphs. This paper aims to paint an accurate picture of sparsely-connected distributed optimization when workers share the same data distribution. We quantify how the graph topology influences convergence in a quadratic toy problem and provide theoretical results for general smooth and (strongly) convex objectives. Our theory matches empirical observations in deep learning, and accurately describes the relative merits of different graph topologies.
    Decomposed Linear Dynamical Systems (dLDS) for learning the latent components of neural dynamics. (arXiv:2206.02972v1 [stat.ML])
    Learning interpretable representations of neural dynamics at a population level is a crucial first step to understanding how neural activity relates to perception and behavior. Models of neural dynamics often focus on either low-dimensional projections of neural activity, or on learning dynamical systems that explicitly relate to the neural state over time. We discuss how these two approaches are interrelated by considering dynamical systems as representative of flows on a low-dimensional manifold. Building on this concept, we propose a new decomposed dynamical system model that represents complex non-stationary and nonlinear dynamics of time-series data as a sparse combination of simpler, more interpretable components. The decomposed nature of the dynamics generalizes over previous switched approaches and enables modeling of overlapping and non-stationary drifts in the dynamics. We further present a dictionary learning-driven approach to model fitting, where we leverage recent results in tracking sparse vectors over time. We demonstrate that our model can learn efficient representations and smooth transitions between dynamical modes in both continuous-time and discrete-time examples. We show results on low-dimensional linear and nonlinear attractors to demonstrate that our decomposed dynamical systems model can well approximate nonlinear dynamics. Additionally, we apply our model to C. elegans data, illustrating a diversity of dynamics that is obscured when classified into discrete states.
    Learning Backward Compatible Embeddings. (arXiv:2206.03040v1 [stat.ML])
    Embeddings, low-dimensional vector representation of objects, are fundamental in building modern machine learning systems. In industrial settings, there is usually an embedding team that trains an embedding model to solve intended tasks (e.g., product recommendation). The produced embeddings are then widely consumed by consumer teams to solve their unintended tasks (e.g., fraud detection). However, as the embedding model gets updated and retrained to improve performance on the intended task, the newly-generated embeddings are no longer compatible with the existing consumer models. This means that historical versions of the embeddings can never be retired or all consumer teams have to retrain their models to make them compatible with the latest version of the embeddings, both of which are extremely costly in practice. Here we study the problem of embedding version updates and their backward compatibility. We formalize the problem where the goal is for the embedding team to keep updating the embedding version, while the consumer teams do not have to retrain their models. We develop a solution based on learning backward compatible embeddings, which allows the embedding model version to be updated frequently, while also allowing the latest version of the embedding to be quickly transformed into any backward compatible historical version of it, so that consumer teams do not have to retrain their models. Under our framework, we explore six methods and systematically evaluate them on a real-world recommender system application. We show that the best method, which we call BC-Aligner, maintains backward compatibility with existing unintended tasks even after multiple model version updates. Simultaneously, BC-Aligner achieves the intended task performance similar to the embedding model that is solely optimized for the intended task.
    FIFA: Making Fairness More Generalizable in Classifiers Trained on Imbalanced Data. (arXiv:2206.02792v1 [cs.LG])
    Algorithmic fairness plays an important role in machine learning and imposing fairness constraints during learning is a common approach. However, many datasets are imbalanced in certain label classes (e.g. "healthy") and sensitive subgroups (e.g. "older patients"). Empirically, this imbalance leads to a lack of generalizability not only of classification, but also of fairness properties, especially in over-parameterized models. For example, fairness-aware training may ensure equalized odds (EO) on the training data, but EO is far from being satisfied on new users. In this paper, we propose a theoretically-principled, yet Flexible approach that is Imbalance-Fairness-Aware (FIFA). Specifically, FIFA encourages both classification and fairness generalization and can be flexibly combined with many existing fair learning methods with logits-based losses. While our main focus is on EO, FIFA can be directly applied to achieve equalized opportunity (EqOpt); and under certain conditions, it can also be applied to other fairness notions. We demonstrate the power of FIFA by combining it with a popular fair classification algorithm, and the resulting algorithm achieves significantly better fairness generalization on several real-world datasets.
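    To make the equalized odds (EO) notion above concrete, the sketch below measures how far a binary classifier is from EO across two groups. It is a plain metric computation under assumed conventions (labels, predictions, and groups encoded as 0/1), not the FIFA method itself; the function names are hypothetical.

```python
def positive_rate(y_true, y_pred, group, label, g):
    """Fraction of positive predictions among examples with true class `label` in group `g`."""
    rows = [p for t, p, a in zip(y_true, y_pred, group) if t == label and a == g]
    return sum(rows) / len(rows)  # assumes every (label, group) cell is non-empty

def equalized_odds_gap(y_true, y_pred, group):
    """Largest between-group difference in false-positive rate (label 0) or true-positive rate (label 1)."""
    gaps = []
    for label in (0, 1):
        r0 = positive_rate(y_true, y_pred, group, label, 0)
        r1 = positive_rate(y_true, y_pred, group, label, 1)
        gaps.append(abs(r0 - r1))
    return max(gaps)
```

    A gap of 0 means EO holds exactly on the evaluated data. Computing the same quantity on training data and on held-out data is one way to observe the fairness generalization gap that the abstract describes.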
    Inferring Unfairness and Error from Population Statistics in Binary and Multiclass Classification. (arXiv:2206.03234v1 [cs.LG])
    We propose methods for making inferences on the fairness and accuracy of a given classifier, using only aggregate population statistics. This is necessary when it is impossible to obtain individual classification data, for instance when there is no access to the classifier or to a representative individual-level validation set. We study fairness with respect to the equalized odds criterion, which we generalize to multiclass classification. We propose a measure of unfairness with respect to this criterion, which quantifies the fraction of the population that is treated unfairly. We then show how inferences on the unfairness and error of a given classifier can be obtained using only aggregate label statistics such as the rate of prediction of each label in each sub-population, as well as the true rate of each label. We derive inference procedures for binary classifiers and for multiclass classifiers, for the case where confusion matrices in each sub-population are known, and for the significantly more challenging case where they are unknown. We report experiments on data sets representing diverse applications, which demonstrate the effectiveness and the wide range of possible uses of the proposed methodology.
    Plant 'n' Seek: Can You Find the Winning Ticket?. (arXiv:2111.11153v2 [cs.LG] UPDATED)
    The lottery ticket hypothesis has sparked the rapid development of pruning algorithms that aim to reduce the computational costs associated with deep learning during training and model deployment. Currently, such algorithms are primarily evaluated on imaging data, for which we lack ground truth information and thus the understanding of how sparse lottery tickets could be. To fill this gap, we develop a framework that allows us to plant and hide winning tickets with desirable properties in randomly initialized neural networks. To analyze the ability of state-of-the-art pruning to identify tickets of extreme sparsity, we design and hide such tickets solving four challenging tasks. In extensive experiments, we observe similar trends as in imaging studies, indicating that our framework can provide transferable insights into realistic problems. Additionally, we can now see beyond such relative trends and highlight limitations of current pruning methods. Based on our results, we conclude that the current limitations in ticket sparsity are likely of algorithmic rather than fundamental nature. We anticipate that comparisons to planted tickets will facilitate future developments of efficient pruning algorithms.
    A Simple and Optimal Policy Design for Online Learning with Safety against Heavy-tailed Risk. (arXiv:2206.02969v1 [stat.ML])
    We design simple and optimal policies that ensure safety against heavy-tailed risk in the classical multi-armed bandit problem. We start by showing that some widely used policies such as the standard Upper Confidence Bound policy and the Thompson Sampling policy incur heavy-tailed risk; that is, the worst-case probability of incurring a linear regret slowly decays at a polynomial rate of $1/T$, where $T$ is the time horizon. We further show that this heavy-tailed risk exists for all "instance-dependent consistent" policies. To ensure safety against such heavy-tailed risk, for the two-armed bandit setting, we provide a simple policy design that (i) has the worst-case optimality for the expected regret at order $\tilde O(\sqrt{T})$ and (ii) has the worst-case tail probability of incurring a linear regret decay at an exponential rate $\exp(-\Omega(\sqrt{T}))$. We further prove that this exponential decaying rate of the tail probability is optimal across all policies that have worst-case optimality for the expected regret. Finally, we improve the policy design and analysis to the general $K$-armed bandit setting. We provide detailed characterization of the tail probability bound for any regret threshold under our policy design. Namely, the worst-case probability of incurring a regret larger than $x$ is upper bounded by $\exp(-\Omega(x/\sqrt{KT}))$. Numerical experiments are conducted to illustrate the theoretical findings. Our results reveal the incompatibility between consistency and light-tailed risk, while indicating that worst-case optimality on expected regret and light-tailed risk are compatible.
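    For reference, the kind of baseline the abstract calls the standard Upper Confidence Bound policy can be sketched as below. This assumes the common UCB1 index (empirical mean plus $\sqrt{2\ln t / n_i}$) and Bernoulli rewards; it is the baseline being critiqued, not the paper's proposed policy, and the function name and arm means are illustrative assumptions.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run the UCB1 index policy on Bernoulli arms and return the pull count of each arm."""
    rng = random.Random(seed)
    k = len(arm_means)
    pulls = [0] * k
    totals = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialise its estimate
        else:
            # Index = empirical mean + exploration bonus that shrinks with more pulls.
            arm = max(
                range(k),
                key=lambda i: totals[i] / pulls[i] + math.sqrt(2.0 * math.log(t) / pulls[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward
    return pulls
```

    Repeating `ucb1([0.2, 0.8], 2000, seed=s)` over many seeds and recording the regret of each run gives an empirical view of the regret distribution whose tail the paper analyzes.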
    Sample Complexity of Nonparametric Off-Policy Evaluation on Low-Dimensional Manifolds using Deep Networks. (arXiv:2206.02887v1 [cs.LG])
    We consider the off-policy evaluation problem of reinforcement learning using deep neural networks. We analyze the deep fitted Q-evaluation method for estimating the expected cumulative reward of a target policy, when the data are generated from an unknown behavior policy. We show that, by choosing network size appropriately, one can leverage the low-dimensional manifold structure in the Markov decision process and obtain a sample-efficient estimator without suffering from the curse of high representation dimensionality. Specifically, we establish a sharp error bound for the fitted Q-evaluation that depends on the intrinsic low dimension, the smoothness of the state-action space, and a function class-restricted $\chi^2$-divergence. It is noteworthy that the restricted $\chi^2$-divergence measures the behavior and target policies' $\textit{mismatch in the function space}$, which can be small even if the two policies are not close to each other in their tabular forms. Numerical experiments are provided to support our theoretical analysis.
    Robust Sparse Mean Estimation via Sum of Squares. (arXiv:2206.03441v1 [cs.DS])
    We study the problem of high-dimensional sparse mean estimation in the presence of an $\epsilon$-fraction of adversarial outliers. Prior work obtained sample and computationally efficient algorithms for this task for identity-covariance subgaussian distributions. In this work, we develop the first efficient algorithms for robust sparse mean estimation without a priori knowledge of the covariance. For distributions on $\mathbb R^d$ with "certifiably bounded" $t$-th moments and sufficiently light tails, our algorithm achieves error of $O(\epsilon^{1-1/t})$ with sample complexity $m = (k\log(d))^{O(t)}/\epsilon^{2-2/t}$. For the special case of the Gaussian distribution, our algorithm achieves near-optimal error of $\tilde O(\epsilon)$ with sample complexity $m = O(k^4 \mathrm{polylog}(d))/\epsilon^2$. Our algorithms follow the Sum-of-Squares based proofs-to-algorithms approach. We complement our upper bounds with Statistical Query and low-degree polynomial testing lower bounds, providing evidence that the sample-time-error tradeoffs achieved by our algorithms are qualitatively the best possible.
    Confounder Analysis in Measuring Representation in Product Funnels. (arXiv:2206.02962v1 [stat.ML])
    This paper discusses an application of Shapley values in the causal inference field, specifically how to select the top confounder variables for the coarsened exact matching method in a scalable way. We use a dataset from an observational experiment involving LinkedIn members as a use case to test the applicability of this approach, and show that Shapley values are highly informative and can be leveraged for their robust importance-ranking capability.
    Training Subset Selection for Weak Supervision. (arXiv:2206.02914v1 [stat.ML])
    Existing weak supervision approaches use all the data covered by weak signals to train a classifier. We show both theoretically and empirically that this is not always optimal. Intuitively, there is a tradeoff between the amount of weakly-labeled data and the precision of the weak labels. We explore this tradeoff by combining pretrained data representations with the cut statistic (Muhlenbach et al., 2004) to select (hopefully) high-quality subsets of the weakly-labeled training data. Subset selection applies to any label model and classifier and is very simple to plug in to existing weak supervision pipelines, requiring just a few lines of code. We show our subset selection method improves the performance of weak supervision for a wide range of label models, classifiers, and datasets. Using less weakly-labeled data improves the accuracy of weak supervision pipelines by up to 19% (absolute) on benchmark tasks.
    Shedding a PAC-Bayesian Light on Adaptive Sliced-Wasserstein Distances. (arXiv:2206.03230v1 [stat.ML])
    The Sliced-Wasserstein distance (SW) is a computationally efficient and theoretically grounded alternative to the Wasserstein distance. Yet, the literature on its statistical properties with respect to the distribution of slices, beyond the uniform measure, is scarce. To bring new contributions to this line of research, we leverage the PAC-Bayesian theory and the central observation that SW actually hinges on a slice-distribution-dependent Gibbs risk, the kind of quantity PAC-Bayesian bounds have been designed to characterize. We provide four types of results: i) PAC-Bayesian generalization bounds that hold on what we refer to as adaptive Sliced-Wasserstein distances, i.e. distances defined with respect to any distribution of slices, ii) a procedure to learn the distribution of slices that yields a maximally discriminative SW, by optimizing our PAC-Bayesian bounds, iii) an insight on how the performance of the so-called distributional Sliced-Wasserstein distance may be explained through our theory, and iv) empirical illustrations of our findings.
    Concentration analysis of multivariate elliptic diffusion processes. (arXiv:2206.03329v1 [math.PR])
    We prove concentration inequalities and associated PAC bounds for continuous- and discrete-time additive functionals for possibly unbounded functions of multivariate, nonreversible diffusion processes. Our analysis relies on an approach via the Poisson equation allowing us to consider a very broad class of subexponentially ergodic processes. These results add to existing concentration inequalities for additive functionals of diffusion processes which have so far been only available for either bounded functions or for unbounded functions of processes from a significantly smaller class. We demonstrate the power of these exponential inequalities with two examples from very different areas. Considering a possibly high-dimensional parametric nonlinear drift model under sparsity constraints, we apply the continuous-time concentration results to validate the restricted eigenvalue condition for Lasso estimation, which is fundamental for the derivation of oracle inequalities. The results for discrete additive functionals are used to investigate the unadjusted Langevin MCMC algorithm for sampling of moderately heavy-tailed densities $\pi$. In particular, we provide PAC bounds for the sample Monte Carlo estimator of integrals $\pi(f)$ for polynomially growing functions $f$ that quantify sufficient sample and step sizes for approximation within a prescribed margin with high probability.
    Adaptive Regularization for Adversarial Training. (arXiv:2206.03353v1 [stat.ML])
    Adversarial training, which aims to enhance robustness against adversarial attacks, has received much attention because it is easy to generate human-imperceptible perturbations of data to deceive a given deep neural network. In this paper, we propose a new adversarial training algorithm that is theoretically well motivated and empirically superior to other existing algorithms. A novel feature of the proposed algorithm is the use of a data-adaptive regularization for robustifying a prediction model. We apply more regularization to data which are more vulnerable to adversarial attacks and vice versa. Even though the idea of data-adaptive regularization is not new, our data-adaptive regularization has a firm theoretical base of reducing an upper bound of the robust risk. Numerical experiments illustrate that our proposed algorithm improves the generalization (accuracy on clean samples) and robustness (accuracy on adversarial attacks) simultaneously to achieve state-of-the-art performance.
    On the Convergence of Optimizing Persistent-Homology-Based Losses. (arXiv:2206.02946v1 [cs.LG])
    Topological loss based on persistent homology has shown promise in various applications. A topological loss forces the model to achieve certain desired topological properties. Despite its empirical success, less is known about the optimization behavior of the loss. In fact, the topological loss involves combinatorial configurations that may oscillate during optimization. In this paper, we introduce a general purpose regularized topology-aware loss. We propose a novel regularization term and also modify the existing topological loss. These contributions lead to a new loss function that not only forces the model to have the desired topological behavior, but also achieves satisfactory convergence behavior. Our main theoretical result guarantees that the loss can be optimized efficiently, under mild assumptions.
    Selection in the Presence of Implicit Bias: The Advantage of Intersectional Constraints. (arXiv:2202.01661v2 [cs.CY] UPDATED)
    In selection processes such as hiring, promotion, and college admissions, implicit bias toward socially-salient attributes such as race, gender, or sexual orientation of candidates is known to produce persistent inequality and reduce aggregate utility for the decision maker. Interventions such as the Rooney Rule and its generalizations, which require the decision maker to select at least a specified number of individuals from each affected group, have been proposed to mitigate the adverse effects of implicit bias in selection. Recent works have established that such lower-bound constraints can be very effective in improving aggregate utility in the case when each individual belongs to at most one affected group. However, in several settings, individuals may belong to multiple affected groups and, consequently, face more extreme implicit bias due to this intersectionality. We consider independently drawn utilities and show that, in the intersectional case, the aforementioned non-intersectional constraints can only recover part of the total utility achievable in the absence of implicit bias. On the other hand, we show that if one includes appropriate lower-bound constraints on the intersections, almost all the utility achievable in the absence of implicit bias can be recovered. Thus, intersectional constraints can offer a significant advantage over a reductionist dimension-by-dimension non-intersectional approach to reducing inequality.
    The Pareto Frontier of Instance-Dependent Guarantees in Multi-Player Multi-Armed Bandits with no Communication. (arXiv:2202.09653v2 [cs.LG] UPDATED)
    We study the stochastic multi-player multi-armed bandit problem. In this problem, $m$ players cooperate to maximize their total reward from $K > m$ arms. However the players cannot communicate and are penalized (e.g. receive no reward) if they pull the same arm at the same time. We ask whether it is possible to obtain optimal instance-dependent regret $\tilde{O}(1/\Delta)$ where $\Delta$ is the gap between the $m$-th and $m+1$-st best arms. Such guarantees were recently achieved in a model allowing the players to implicitly communicate through intentional collisions. Surprisingly, we show that with no communication at all, such guarantees are not achievable. In fact, obtaining the optimal $\tilde{O}(1/\Delta)$ regret for some values of $\Delta$ necessarily implies strictly sub-optimal regret in other regimes. Our main result is a complete characterization of the Pareto optimal instance-dependent trade-offs that are possible with no communication. Our algorithm generalizes that of Bubeck, Budzinski, and the second author. As there, our algorithm succeeds even when feedback upon collision can be corrupted by an adaptive adversary, thanks to a strong no-collision property. Our lower bound is based on topological obstructions at multiple scales and is completely new.
    Adversarial Bandits Robust to $S$-Switch Regret. (arXiv:2205.14839v2 [cs.LG] UPDATED)
    We study the adversarial bandit problem under $S$ number of switching best arms for unknown $S$. For handling this problem, we adopt the master-base framework using the online mirror descent method (OMD). We first provide a master-base algorithm with basic OMD, achieving $\tilde{O}(S^{1/2}K^{1/3}T^{2/3})$. For improving the regret bound with respect to $T$, we propose to use adaptive learning rates for OMD to control variance of loss estimators, and achieve $\tilde{O}(\min\{\mathbb{E}[\sqrt{SKT\rho_T(h^\dagger)}],S\sqrt{KT}\})$, where $\rho_T(h^\dagger)$ is a variance term for loss estimators.
    Machine learning fairness notions: Bridging the gap with real-world applications. (arXiv:2006.16745v5 [cs.LG] UPDATED)
    Fairness emerged as an important requirement to guarantee that Machine Learning (ML) predictive systems do not discriminate against specific individuals or entire sub-populations, in particular, minorities. Given the inherent subjectivity of viewing the concept of fairness, several notions of fairness have been introduced in the literature. This paper is a survey that illustrates the subtleties between fairness notions through a large number of examples and scenarios. In addition, unlike other surveys in the literature, it addresses the question of: which notion of fairness is most suited to a given real-world scenario and why? Our attempt to answer this question consists in (1) identifying the set of fairness-related characteristics of the real-world scenario at hand, (2) analyzing the behavior of each fairness notion, and then (3) fitting these two elements to recommend the most suitable fairness notion in every specific setup. The results are summarized in a decision diagram that can be used by practitioners and policymakers to navigate the relatively large catalog of ML fairness notions.
    Sampling without Replacement Leads to Faster Rates in Finite-Sum Minimax Optimization. (arXiv:2206.02953v1 [math.OC])
    We analyze the convergence rates of stochastic gradient algorithms for smooth finite-sum minimax optimization and show that, for many such algorithms, sampling the data points without replacement leads to faster convergence compared to sampling with replacement. For the smooth and strongly convex-strongly concave setting, we consider gradient descent ascent and the proximal point method, and present a unified analysis of two popular without-replacement sampling strategies, namely Random Reshuffling (RR), which shuffles the data every epoch, and Single Shuffling or Shuffle Once (SO), which shuffles only at the beginning. We obtain tight convergence rates for RR and SO and demonstrate that these strategies lead to faster convergence than uniform sampling. Moving beyond convexity, we obtain similar results for smooth nonconvex-nonconcave objectives satisfying a two-sided Polyak-Łojasiewicz inequality. Finally, we demonstrate that our techniques are general enough to analyze the effect of data-ordering attacks, where an adversary manipulates the order in which data points are supplied to the optimizer. Our analysis also recovers tight rates for the incremental gradient method, where the data points are not shuffled at all.
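    The sampling strategies named in this abstract differ only in how the per-epoch index order is produced. The following is a minimal sketch of that difference, not the authors' code; the function name `index_stream`, the scheme labels, and the seed handling are assumptions made for illustration.

```python
import random

def index_stream(n, epochs, scheme, seed=0):
    """Yield one list of n data indices per epoch (hypothetical helper)."""
    rng = random.Random(seed)
    fixed = list(range(n))
    if scheme == "SO":
        rng.shuffle(fixed)              # Shuffle Once: permute a single time
    for _ in range(epochs):
        if scheme == "RR":
            perm = list(range(n))
            rng.shuffle(perm)           # Random Reshuffling: new permutation each epoch
            yield perm
        elif scheme in ("SO", "IG"):
            yield list(fixed)           # IG: incremental gradient, natural order
        elif scheme == "uniform":
            # with-replacement sampling: indices may repeat within an epoch
            yield [rng.randrange(n) for _ in range(n)]
        else:
            raise ValueError(f"unknown scheme: {scheme}")

rr_epochs = list(index_stream(5, 3, "RR"))
so_epochs = list(index_stream(5, 3, "SO"))
ig_epochs = list(index_stream(5, 3, "IG"))
```

    Under RR and SO every data point is visited exactly once per epoch, which is the property the without-replacement analysis relies on; uniform sampling gives no such guarantee.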
    Spectral Bias Outside the Training Set for Deep Networks in the Kernel Regime. (arXiv:2206.02927v1 [stat.ML])
    We provide quantitative bounds measuring the $L^2$ difference in function space between the trajectory of a finite-width network trained on finitely many samples from the idealized kernel dynamics of infinite width and infinite data. An implication of the bounds is that the network is biased to learn the top eigenfunctions of the Neural Tangent Kernel not just on the training set but over the entire input space. This bias depends on the model architecture and input distribution alone and thus does not depend on the target function which does not need to be in the RKHS of the kernel. The result is valid for deep architectures with fully connected, convolutional, and residual layers. Furthermore the width does not need to grow polynomially with the number of samples in order to obtain high probability bounds up to a stopping time. The proof exploits the low-effective-rank property of the Fisher Information Matrix at initialization, which implies a low effective dimension of the model (far smaller than the number of parameters). We conclude that local capacity control from the low effective rank of the Fisher Information Matrix is still underexplored theoretically.
    Deconstructing Distributions: A Pointwise Framework of Learning. (arXiv:2202.09931v2 [cs.LG] UPDATED)
    In machine learning, we traditionally evaluate the performance of a single model, averaged over a collection of test inputs. In this work, we propose a new approach: we measure the performance of a collection of models when evaluated on a $\textit{single input point}$. Specifically, we study a point's $\textit{profile}$: the relationship between models' average performance on the test distribution and their pointwise performance on this individual point. We find that profiles can yield new insights into the structure of both models and data -- in and out-of-distribution. For example, we empirically show that real data distributions consist of points with qualitatively different profiles. On one hand, there are "compatible" points with strong correlation between the pointwise and average performance. On the other hand, there are points with weak and even $\textit{negative}$ correlation: cases where improving overall model accuracy actually $\textit{hurts}$ performance on these inputs. We prove that these experimental observations are inconsistent with the predictions of several simplified models of learning proposed in prior work. As an application, we use profiles to construct a dataset we call CIFAR-10-NEG: a subset of CINIC-10 such that for standard models, accuracy on CIFAR-10-NEG is $\textit{negatively correlated}$ with accuracy on CIFAR-10 test. This illustrates, for the first time, an OOD dataset that completely inverts "accuracy-on-the-line" (Miller, Taori, Raghunathan, Sagawa, Koh, Shankar, Liang, Carmon, and Schmidt 2021).
    Progressive Distillation for Fast Sampling of Diffusion Models. (arXiv:2202.00512v2 [cs.LG] UPDATED)
    Diffusion models have recently shown great promise for generative modeling, outperforming GANs on perceptual quality and autoregressive models at density estimation. A remaining downside is their slow sampling time: generating high quality samples takes many hundreds or thousands of model evaluations. Here we make two contributions to help eliminate this downside: First, we present new parameterizations of diffusion models that provide increased stability when using few sampling steps. Second, we present a method to distill a trained deterministic diffusion sampler, using many steps, into a new diffusion model that takes half as many sampling steps. We then keep progressively applying this distillation procedure to our model, halving the number of required sampling steps each time. On standard image generation benchmarks like CIFAR-10, ImageNet, and LSUN, we start out with state-of-the-art samplers taking as many as 8192 steps, and are able to distill down to models taking as few as 4 steps without losing much perceptual quality; achieving, for example, a FID of 3.0 on CIFAR-10 in 4 steps. Finally, we show that the full progressive distillation procedure does not take more time than it takes to train the original model, thus representing an efficient solution for generative modeling using diffusion at both train and test time.  ( 2 min )
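    The step counts quoted in this abstract imply a fixed number of distillation rounds, since each round halves the sampler's step count. The sketch below only works through that arithmetic; `halving_schedule` is a hypothetical helper, not part of the paper's method.

```python
def halving_schedule(start_steps, target_steps):
    """List the sampler step counts visited when each round halves the count."""
    steps = [start_steps]
    while steps[-1] > target_steps:
        steps.append(steps[-1] // 2)
    return steps

schedule = halving_schedule(8192, 4)
rounds = len(schedule) - 1   # number of distillation rounds
print(schedule)
print(rounds)                # 11, because 8192 / 2**11 == 4
```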
    Impossibility of Collective Intelligence. (arXiv:2206.02786v1 [cs.LG])
    Democratization of AI involves training and deploying machine learning models across heterogeneous and potentially massive environments. Diversity of data opens up a number of possibilities to advance AI systems, but also introduces pressing concerns such as privacy, security, and equity that require special attention. This work shows that it is theoretically impossible to design a rational learning algorithm that has the ability to successfully learn across heterogeneous environments, which we decoratively call collective intelligence (CI). By representing learning algorithms as choice correspondences over a hypothesis space, we are able to axiomatize them with essential properties. Unfortunately, the only feasible algorithm compatible with all of the axioms is the standard empirical risk minimization (ERM) which learns arbitrarily from a single environment. Our impossibility result reveals informational incomparability between environments as one of the foremost obstacles for researchers who design novel algorithms that learn from multiple environments, which sheds light on prerequisites for success in critical areas of machine learning such as out-of-distribution generalization, federated learning, algorithmic fairness, and multi-modal learning.  ( 2 min )
    Building Robust Ensembles via Margin Boosting. (arXiv:2206.03362v1 [cs.LG])
    In the context of adversarial robustness, a single model does not usually have enough power to defend against all possible adversarial attacks, and as a result, has sub-optimal robustness. Consequently, an emerging line of work has focused on learning an ensemble of neural networks to defend against adversarial attacks. In this work, we take a principled approach towards building robust ensembles. We view this problem from the perspective of margin-boosting and develop an algorithm for learning an ensemble with maximum margin. Through extensive empirical evaluation on benchmark datasets, we show that our algorithm not only outperforms existing ensembling techniques, but also large models trained in an end-to-end fashion. An important byproduct of our work is a margin-maximizing cross-entropy (MCE) loss, which is a better alternative to the standard cross-entropy (CE) loss. Empirically, we show that replacing the CE loss in state-of-the-art adversarial training techniques with our MCE loss leads to significant performance improvement.  ( 2 min )
    Benign Underfitting of Stochastic Gradient Descent. (arXiv:2202.13361v3 [cs.LG] UPDATED)
    We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex optimization framework, where (one pass, without-replacement) SGD is classically known to minimize the population risk at rate $O(1/\sqrt n)$, and prove that, surprisingly, there exist problem instances where the SGD solution exhibits both empirical risk and generalization gap of $\Omega(1)$. Consequently, it turns out that SGD is not algorithmically stable in any sense, and its generalization ability cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis). We then continue to analyze the closely related with-replacement SGD, for which we show that an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate. Finally, we interpret our main results in the context of without-replacement SGD for finite-sum convex optimization problems, and derive upper and lower bounds for the multi-epoch regime that significantly improve upon previously known results.  ( 2 min )
    Computational Doob's $h$-transforms for Online Filtering of Discretely Observed Diffusions. (arXiv:2206.03369v1 [stat.ML])
    This paper is concerned with online filtering of discretely observed nonlinear diffusion processes. Our approach is based on the fully adapted auxiliary particle filter, which involves Doob's $h$-transforms that are typically intractable. We propose a computational framework to approximate these $h$-transforms by solving the underlying backward Kolmogorov equations using nonlinear Feynman-Kac formulas and neural networks. The methodology allows one to train a locally optimal particle filter prior to the data-assimilation procedure. Numerical experiments illustrate that the proposed approach can be orders of magnitude more efficient than the bootstrap particle filter in the regime of highly informative observations, when the observations are extreme under the model, and if the state dimension is large.  ( 2 min )
    Unsupervised tree boosting for learning probability distributions. (arXiv:2101.11083v5 [stat.ME] UPDATED)
    We propose an unsupervised tree boosting algorithm for inferring the underlying sampling distribution of an i.i.d. sample based on fitting additive tree ensembles in a fashion analogous to supervised tree boosting. Integral to the algorithm is a new notion of "addition" on probability distributions that leads to a coherent notion of "residualization", i.e., subtracting a probability distribution from an observation to remove the distributional structure from the sampling distribution of the latter. We show that these notions arise naturally for univariate distributions through cumulative distribution function (CDF) transforms and compositions due to several "group-like" properties of univariate CDFs. While the traditional multivariate CDF does not preserve these properties, a new definition of multivariate CDF can restore these properties, thereby allowing the notions of "addition" and "residualization" to be formulated for multivariate settings as well. This then gives rise to the unsupervised boosting algorithm based on forward-stagewise fitting of an additive tree ensemble, which sequentially reduces the Kullback-Leibler divergence from the truth. The algorithm allows analytic evaluation of the fitted density and outputs a generative model that can be readily sampled from. We enhance the algorithm with scale-dependent shrinkage and a two-stage strategy that separately fits the marginals and the copula. The algorithm then performs competitively to state-of-the-art deep-learning approaches in multivariate density estimation on multiple benchmark datasets.  ( 2 min )
    Beyond Lipschitz: Sharp Generalization and Excess Risk Bounds for Full-Batch GD. (arXiv:2204.12446v3 [stat.ML] UPDATED)
    We provide sharp path-dependent generalization and excess risk guarantees for the full-batch Gradient Descent (GD) algorithm on smooth losses (possibly non-Lipschitz, possibly nonconvex), under an interpolation regime. At the heart of our analysis is a new generalization error bound for deterministic symmetric algorithms, which implies that average output stability and a bounded expected optimization error at termination lead to generalization. This result shows that small generalization error occurs along the optimization path, and allows us to bypass Lipschitz or sub-Gaussian assumptions on the loss prevalent in previous works. For nonconvex, Polyak-Lojasiewicz (PL), convex and strongly convex losses, we show the explicit dependence of the generalization error in terms of the accumulated path-dependent optimization error, terminal optimization error, number of samples, and number of iterations. For nonconvex smooth losses, we prove that full-batch GD efficiently generalizes close to any stationary point at termination, under the proper choice of a decreasing step size. Further, if the loss is nonconvex but the objective is PL, we derive quadratically vanishing bounds on the generalization error and the corresponding excess risk, for a choice of a large constant step size. For (resp. strongly-) convex smooth losses, we prove that full-batch GD also generalizes for large constant step sizes, and achieves (resp. quadratically) small excess risk while training fast. In all cases, we close the generalization error gap, by showing matching generalization and optimization error rates. Our full-batch GD generalization error and excess risk bounds are strictly tighter than existing bounds for (stochastic) GD, when the loss is smooth (but possibly non-Lipschitz).  ( 2 min )
    A Robust Classification-autoencoder to Defend Outliers and Adversaries. (arXiv:2106.15927v2 [cs.LG] UPDATED)
    In this paper, a robust classification-autoencoder (CAE) is proposed, which has a strong ability to recognize outliers and defend against adversaries. The main idea is to change the autoencoder from an unsupervised learning model into a classifier, where the encoder is used to compress samples with different labels into disjoint compression spaces and the decoder is used to recover samples from their compression spaces. The encoder is used both as a compressed feature learner and as a classifier, and the decoder is used to decide whether the classification given by the encoder is correct by comparing the input sample with the output. Since adversarial samples are seemingly inevitable for the current DNN framework, a list classifier to defend against adversaries is introduced based on CAE, which outputs several labels and the corresponding samples recovered by the CAE. Extensive experimental results show that the CAE achieves state-of-the-art outlier recognition, finding almost all outliers; the list classifier gives near lossless classification in the sense that the output list contains the correct label for almost all adversaries and the size of the output list is reasonably small.  ( 2 min )
    On Transportation of Mini-batches: A Hierarchical Approach. (arXiv:2102.05912v5 [stat.ML] UPDATED)
    Mini-batch optimal transport (m-OT) has been successfully used in practical applications that involve probability measures with a very high number of supports. The m-OT solves several smaller optimal transport problems and then returns the average of their costs and transportation plans. Despite its scalability advantage, the m-OT does not consider the relationship between mini-batches, which leads to undesirable estimation. Moreover, the m-OT does not approximate a proper metric between probability measures since the identity property is not satisfied. To address these problems, we propose a novel mini-batch scheme for optimal transport, named Batch of Mini-batches Optimal Transport (BoMb-OT), that finds the optimal coupling between mini-batches and can be seen as an approximation to a well-defined distance on the space of probability measures. Furthermore, we show that the m-OT is a limit of the entropic regularized version of the BoMb-OT when the regularization parameter goes to infinity. Finally, we carry out experiments on various applications including deep generative models, deep domain adaptation, approximate Bayesian computation, color transfer, and gradient flow to show that the BoMb-OT can be widely applied and performs well in various applications.  ( 2 min )
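    The claim that m-OT violates the identity property can be seen on a toy one-dimensional case. The sketch below is an illustration under stated assumptions rather than the paper's implementation: it uses the 1-Wasserstein cost between equal-size uniform empirical measures on the line (where the optimal plan matches sorted points), and the function names and the particular pairing of mini-batches are hypothetical.

```python
def ot_cost_1d(xs, ys):
    """W1 cost between two equal-size uniform empirical measures on the line.

    In one dimension the optimal plan pairs the sorted points in order.
    """
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def m_ot(batch_pairs):
    """Average the OT cost over pairs of mini-batches (m-OT style estimate)."""
    return sum(ot_cost_1d(bx, by) for bx, by in batch_pairs) / len(batch_pairs)

data = [0.0, 1.0, 2.0, 3.0]
# Both measures are the same data set, but the mini-batches drawn from each differ.
pairs = [([0.0, 1.0], [2.0, 3.0]), ([2.0, 3.0], [0.0, 1.0])]

full_cost = ot_cost_1d(data, data)   # 0.0: full OT between identical measures
mini_cost = m_ot(pairs)              # 2.0: the mini-batch average is positive
```

    The full problem returns zero for identical measures, while the mini-batch average does not, which is the failure of the identity property that the abstract describes.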
    Concentration bounds for SSP Q-learning for average cost MDPs. (arXiv:2206.03328v1 [cs.LG])
    We derive a concentration bound for a Q-learning algorithm for average cost Markov decision processes based on an equivalent shortest path problem, and compare it numerically with the alternative scheme based on relative value iteration.  ( 2 min )
    Group Meritocratic Fairness in Linear Contextual Bandits. (arXiv:2206.03150v1 [stat.ML])
    We study the linear contextual bandit problem where an agent has to select one candidate from a pool and each candidate belongs to a sensitive group. In this setting, candidates' rewards may not be directly comparable between groups, for example when the agent is an employer hiring candidates from different ethnic groups and some groups have a lower reward due to discriminatory bias and/or social injustice. We propose a notion of fairness that states that the agent's policy is fair when it selects a candidate with highest relative rank, which measures how good the reward is when compared to candidates from the same group. This is a very strong notion of fairness, since the relative rank is not directly observed by the agent and depends on the underlying reward model and on the distribution of rewards. Thus we study the problem of learning a policy which approximates a fair policy under the condition that the contexts are independent between groups and the distribution of rewards of each group is absolutely continuous. In particular, we design a greedy policy which at each round constructs a ridge regression estimator from the observed context-reward pairs, and then computes an estimate of the relative rank of each candidate using the empirical cumulative distribution function. We prove that the greedy policy achieves, after $T$ rounds, up to log factors and with high probability, a fair pseudo-regret of order $\sqrt{dT}$, where $d$ is the dimension of the context vectors. The policy also satisfies demographic parity at each round when averaged over all possible information available before the selection. We finally show with a proof of concept simulation that our policy achieves sub-linear fair pseudo-regret also in practice.  ( 2 min )
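    The selection rule described in this abstract compares candidates by relative rank within their own group rather than by raw reward. The sketch below illustrates only that step, assuming the estimated rewards are already available (in the abstract they come from a ridge regression estimator, which is omitted here); `relative_rank`, `select`, and the toy data are hypothetical.

```python
from bisect import bisect_right

def relative_rank(reward, group_history):
    """Empirical CDF of a group's past estimated rewards, evaluated at `reward`."""
    ordered = sorted(group_history)
    return bisect_right(ordered, reward) / len(ordered)

def select(candidates, history):
    """Return the index of the candidate with the highest relative rank.

    candidates: list of (group, estimated_reward) pairs
    history:    dict mapping group -> list of past estimated rewards
    """
    return max(
        range(len(candidates)),
        key=lambda i: relative_rank(candidates[i][1], history[candidates[i][0]]),
    )

history = {"A": [0.1, 0.2, 0.3, 0.4], "B": [1.0, 2.0, 3.0, 4.0]}
candidates = [("A", 0.35), ("B", 2.5)]
chosen = select(candidates, history)
# Candidate 0 has the lower raw reward but the higher within-group rank (0.75 vs 0.5).
```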
    Truncated Diffusion Probabilistic Models. (arXiv:2202.09671v2 [stat.ML] UPDATED)
    Employing a forward Markov diffusion chain to gradually map the data to a noise distribution, diffusion probabilistic models learn how to generate the data by inferring a reverse Markov diffusion chain to invert the forward diffusion process. To achieve competitive data generation performance, they demand a long diffusion chain that makes them computationally intensive in not only training but also generation. To significantly improve the computation efficiency, we propose to truncate the forward diffusion chain by abolishing the requirement of diffusing the data to random noise. Consequently, we start the inverse diffusion chain from an implicit generative distribution, rather than random noise, and learn its parameters by matching it to the distribution of the data corrupted by the truncated forward diffusion chain. Experimental results show our truncated diffusion probabilistic models provide consistent improvements over the non-truncated ones in terms of the generation performance and the number of required inverse diffusion steps.  ( 2 min )
    Relaxed Gaussian process interpolation: a goal-oriented approach to Bayesian optimization. (arXiv:2206.03034v1 [stat.CO])
    This work presents a new procedure for obtaining predictive distributions in the context of Gaussian process (GP) modeling, with a relaxation of the interpolation constraints outside some ranges of interest: the mean of the predictive distributions no longer necessarily interpolates the observed values when they are outside ranges of interest, but is simply constrained to remain outside. This method, called relaxed Gaussian process (reGP) interpolation, provides better predictive distributions in ranges of interest, especially in cases where a stationarity assumption for the GP model is not appropriate. It can be viewed as a goal-oriented method and becomes particularly interesting in Bayesian optimization, for example, for the minimization of an objective function, where good predictive distributions for low function values are important. When the expected improvement criterion and reGP are used for sequentially choosing evaluation points, the convergence of the resulting optimization algorithm is theoretically guaranteed (provided that the function to be optimized lies in the reproducing kernel Hilbert space attached to the known covariance of the underlying Gaussian process). Experiments indicate that using reGP instead of stationary GP models in Bayesian optimization is beneficial.  ( 2 min )
    Per-Instance Privacy Accounting for Differentially Private Stochastic Gradient Descent. (arXiv:2206.02617v2 [cs.LG] UPDATED)
    Differentially private stochastic gradient descent (DP-SGD) is the workhorse algorithm for recent advances in private deep learning. It provides a single privacy guarantee to all datapoints in the dataset. We propose an efficient algorithm to compute per-instance privacy guarantees for individual examples when running DP-SGD. We use our algorithm to investigate per-instance privacy losses across a number of datasets. We find that most examples enjoy stronger privacy guarantees than the worst-case bounds. We further discover that the loss and the privacy loss on an example are well-correlated. This implies groups that are underserved in terms of model utility are simultaneously underserved in terms of privacy loss. For example, on CIFAR-10, the average $\epsilon$ of the class with the highest loss (Cat) is 32% higher than that of the class with the lowest loss (Ship). We also run membership inference attacks to show this reflects disparate empirical privacy risks.  ( 2 min )
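    For readers unfamiliar with DP-SGD, the standard update clips each per-example gradient and adds Gaussian noise scaled to the clipping norm. The sketch below shows only that generic step; it is not the per-instance accounting algorithm proposed in the paper, and the function and parameter names are assumptions chosen for illustration.

```python
import numpy as np

def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """One noisy, clipped gradient estimate as used in DP-SGD.

    per_example_grads: array of shape (batch, dim), one gradient per example.
    Returns the noisy mean gradient and the unclipped per-example norms.
    """
    norms = np.linalg.norm(per_example_grads, axis=1)
    # Scale each gradient down so that its L2 norm is at most clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale[:, None]
    # Gaussian noise with standard deviation noise_multiplier * clip_norm.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped.shape[1])
    noisy_mean = (clipped.sum(axis=0) + noise) / len(clipped)
    return noisy_mean, norms
```

    The worst-case guarantee assumes every example contributes a gradient of norm `clip_norm`. Examples whose norms stay below that bound contribute less, which gives some intuition for why individual guarantees could be stronger than the worst case.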
    Generalization Error Bounds for Deep Neural Networks Trained by SGD. (arXiv:2206.03299v1 [cs.LG])
    Generalization error bounds for deep neural networks trained by stochastic gradient descent (SGD) are derived by combining a dynamical control of an appropriate parameter norm and the Rademacher complexity estimate based on parameter norms. The bounds explicitly depend on the loss along the training trajectory, and work for a wide range of network architectures including multilayer perceptron (MLP) and convolutional neural networks (CNN). Compared with other algorithm-dependent generalization estimates such as uniform stability-based bounds, our bounds do not require $L$-smoothness of the nonconvex loss function, and apply directly to SGD instead of stochastic gradient Langevin dynamics (SGLD). Numerical results show that our bounds are non-vacuous and robust to changes in the optimizer and network hyperparameters.  ( 2 min )
    Learning in Observable POMDPs, without Computationally Intractable Oracles. (arXiv:2206.03446v1 [cs.LG])
    Much of reinforcement learning theory is built on top of oracles that are computationally hard to implement. Specifically for learning near-optimal policies in Partially Observable Markov Decision Processes (POMDPs), existing algorithms either need to make strong assumptions about the model dynamics (e.g. deterministic transitions) or assume access to an oracle for solving a hard optimistic planning or estimation problem as a subroutine. In this work we develop the first oracle-free learning algorithm for POMDPs under reasonable assumptions. Specifically, we give a quasipolynomial-time end-to-end algorithm for learning in "observable" POMDPs, where observability is the assumption that well-separated distributions over states induce well-separated distributions over observations. Our techniques circumvent the more traditional approach of using the principle of optimism under uncertainty to promote exploration, and instead give a novel application of barycentric spanners to constructing policy covers.  ( 2 min )
    Integrating Random Effects in Deep Neural Networks. (arXiv:2206.03314v1 [stat.ML])
    Modern approaches to supervised learning like deep neural networks (DNNs) typically implicitly assume that observed responses are statistically independent. In contrast, correlated data are prevalent in real-life large-scale applications, with typical sources of correlation including spatial, temporal and clustering structures. These correlations are either ignored by DNNs, or ad-hoc solutions are developed for specific use cases. We propose to use the mixed models framework to handle correlated data in DNNs. By treating the effects underlying the correlation structure as random effects, mixed models are able to avoid overfitted parameter estimates and ultimately yield better predictive performance. The key to combining mixed models and DNNs is using the Gaussian negative log-likelihood (NLL) as a natural loss function that is minimized with DNN machinery including stochastic gradient descent (SGD). Since NLL does not decompose like standard DNN loss functions, the use of SGD with NLL presents some theoretical and implementation challenges, which we address. Our approach which we call LMMNN is demonstrated to improve performance over natural competitors in various correlation scenarios on diverse simulated and real datasets. Our focus is on a regression setting and tabular datasets, but we also show some results for classification. Our code is available at https://github.com/gsimchoni/lmmnn.  ( 2 min )
    Reweighting samples under covariate shift using a Wasserstein distance criterion. (arXiv:2010.09267v2 [math.ST] UPDATED)
    Considering two random variables with different laws to which we only have access through finite size iid samples, we address how to reweight the first sample so that its empirical distribution converges towards the true law of the second sample as the size of both samples goes to infinity. We study an optimal reweighting that minimizes the Wasserstein distance between the empirical measures of the two samples, and leads to an expression of the weights in terms of Nearest Neighbors. The consistency and some asymptotic convergence rates in terms of expected Wasserstein distance are derived, and do not need the assumption of absolute continuity of one random variable with respect to the other. These results have some application in Uncertainty Quantification for decoupled estimation and in the bound of the generalization error for the Nearest Neighbor Regression under covariate shift.  ( 2 min )
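    The nearest-neighbor form of the weights can be illustrated with a short sketch. It assumes the weights are free to vary over the probability simplex, in which case each point of the second sample is transported to its nearest point in the first sample. The function name and the use of Euclidean distance are assumptions for the example, and ties are broken by index.

```python
import numpy as np

def nearest_neighbor_weights(source, target):
    """Weight each source point by the share of target points nearest to it."""
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    # Pairwise squared Euclidean distances, shape (n_target, n_source).
    diffs = target[:, None, :] - source[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Index of the nearest source point for every target point.
    nearest = np.argmin(dists, axis=1)
    counts = np.bincount(nearest, minlength=len(source))
    return counts / len(target)
```

    The weights are nonnegative and sum to one, and source points that are not the nearest neighbor of any target point receive zero weight.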
    Collaborative Linear Bandits with Adversarial Agents: Near-Optimal Regret Bounds. (arXiv:2206.02834v1 [cs.LG])
    We consider a linear stochastic bandit problem involving $M$ agents that can collaborate via a central server to minimize regret. A fraction $\alpha$ of these agents are adversarial and can act arbitrarily, leading to the following tension: while collaboration can potentially reduce regret, it can also disrupt the process of learning due to adversaries. In this work, we provide a fundamental understanding of this tension by designing new algorithms that balance the exploration-exploitation trade-off via carefully constructed robust confidence intervals. We also complement our algorithms with tight analyses. First, we develop a robust collaborative phased elimination algorithm that achieves $\tilde{O}\left(\alpha+ 1/\sqrt{M}\right) \sqrt{dT}$ regret for each good agent; here, $d$ is the model-dimension and $T$ is the horizon. For small $\alpha$, our result thus reveals a clear benefit of collaboration despite adversaries. Using an information-theoretic argument, we then prove a matching lower bound, thereby providing the first set of tight, near-optimal regret bounds for collaborative linear bandits with adversaries. Furthermore, by leveraging recent advances in high-dimensional robust statistics, we significantly extend our algorithmic ideas and results to (i) the generalized linear bandit model that allows for non-linear observation maps; and (ii) the contextual bandit setting that allows for time-varying feature vectors.  ( 2 min )
    RORL: Robust Offline Reinforcement Learning via Conservative Smoothing. (arXiv:2206.02829v1 [cs.LG])
    Offline reinforcement learning (RL) provides a promising direction to exploit the massive amount of offline data for complex decision-making tasks. Due to the distribution shift issue, current offline RL algorithms are generally designed to be conservative for value estimation and action selection. However, such conservatism impairs the robustness of learned policies, leading to a significant change even for a small perturbation on observations. To trade off robustness and conservatism, we propose Robust Offline Reinforcement Learning (RORL) with a novel conservative smoothing technique. In RORL, we explicitly introduce regularization on the policy and the value function for states near the dataset and additional conservative value estimation on these OOD states. Theoretically, we show RORL enjoys a tighter suboptimality bound than recent theoretical results in linear MDPs. We demonstrate that RORL can achieve the state-of-the-art performance on the general offline RL benchmark and is considerably robust to adversarial observation perturbation.  ( 2 min )

  • Open

    “Conscious AI”
    submitted by /u/DANGERD0OM [link] [comments]
    Quick question for all those who are trying to build stuff with AI/ML
    Why do you care or not care about reproducible, usable code and models? I know it's a basic question, but I'm trying to dive deeper and understand the underlying reasons why it matters or doesn't matter to you (basically a 5 whys analysis of this question). submitted by /u/MLtinkerer [link] [comments]  ( 1 min )
    Sparse Neural Networks Optimize Efficiency with Neuroscience
    submitted by /u/aidev2040 [link] [comments]
    MELODIES POSITIVE: An Artificial Waterfall.
    submitted by /u/cookingandcraft [link] [comments]
    DISCO DIFFUSION 3D AI ART ANIMATION | NIDAVELIR’S MAGNIFICENCE
    submitted by /u/Available_Tadpole829 [link] [comments]
    White Walkers - The Silent Death? - AI Art Experiment in 4K 60 FPS w/ GPT-3
    submitted by /u/MLInsights [link] [comments]
  • Open

    [D] Quick question for all those who are trying to build stuff with AI/ML
    Why do you care or not care about reproducible, usable code and models? I know it's a basic question, but I'm trying to dive deeper and understand the underlying reasons why it matters or doesn't matter to you (basically a 5 whys analysis of this question). submitted by /u/MLtinkerer [link] [comments]  ( 1 min )
    [D] Neural Network Layers as Operations on Data Collections Types
    I had an observation recently that I wanted to share and get feedback on. Many of the canonical deep learning layer types can be viewed as an operation on one of the basic data collection types used by Python (and other languages): Dense Layers -> Tuples; Recurrent Layers -> Lists; Attention Layers -> Sets; Graph Neural Network Layers -> Dictionaries. Am I missing any? submitted by /u/emuccino [link] [comments]  ( 1 min )
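    One way to probe the "attention layers act on sets" part of this observation is to check permutation equivariance numerically. The sketch below is an illustration under assumptions, not something from the post: it uses single-head self-attention without positional encoding, and a dense layer applied to the flattened input as the position-dependent contrast.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no positional encoding."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    return softmax(scores) @ V

def dense(X, W):
    """Dense layer applied to the flattened input, so element order matters."""
    return X.reshape(-1) @ W
```

    Reordering the rows of `X` reorders the rows of the attention output in the same way, so the layer treats its input as an unordered collection. The dense layer's output generally changes under the same reordering, because each position has its own weights.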
    [D] Masking out loss values
    Hey, I would like to start a discussion about the following topic. I have a GAN with a generator and a discriminator. Suppose I randomly mask out some loss values, say by setting 10% of them to zero. How does this affect training? How does the optimizer handle such masking? Does such random masking of the losses create spikes in the loss surface, or am I completely wrong? submitted by /u/SeucheAchat9115 [link] [comments]  ( 2 min )
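    A small numerical check may help frame the question. The sketch below assumes the mask is applied to per-sample losses before averaging and uses a linear model with squared error instead of a GAN, purely to keep the example short. The function name is made up for illustration.

```python
import numpy as np

def masked_mean_loss_and_grad(w, X, y, keep_mask):
    """Mean squared-error loss with some per-sample losses set to zero.

    keep_mask holds 1.0 for samples that count and 0.0 for masked samples.
    Returns the masked loss and its gradient with respect to w.
    """
    residual = X @ w - y
    per_sample_loss = residual ** 2
    per_sample_grad = 2.0 * residual[:, None] * X
    # Masked samples contribute zero loss and therefore zero gradient.
    loss = np.mean(per_sample_loss * keep_mask)
    grad = np.mean(per_sample_grad * keep_mask[:, None], axis=0)
    return loss, grad
```

    Under these assumptions the masked gradient equals the gradient computed on the kept samples alone, multiplied by the kept fraction. In other words, masking behaves like training on a slightly smaller random mini-batch with a slightly smaller step, which adds some noise but does not reshape the underlying loss surface.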
    [D] Can we explain the deep prior regularisation by the differentiation step rather than architecture?
    As the post title says, is it possible to attribute the ability of deep prior networks to perform tasks such as image inpainting to the implicit differentiation in backpropagation rather than to the architecture of the network? submitted by /u/vash_stampede08 [link] [comments]  ( 1 min )
    [D] How to balance production and research in a project, especially doing it alone.
    I need some advice as I want to deliver better results. I'm doing a project under a professor's supervision, but most of the time I work alone as he does not have much spare time. He wants me to produce some results in image processing, like object detection, and publish some research at conferences, particularly in ML and CV. But after working for a while, I have not produced any meaningful results and haven't published any paper. Basically I'm struggling with both objectives, so I hope I can get some advice here. Should I lean more towards production or research? Or should I quit after all? submitted by /u/IcySnowy [link] [comments]  ( 1 min )
    [R] What is the best summary of neural tangent kernel research thus far?
    Do folks here have good references for a summary of what progress has been made in neural tangent kernel (NTK) research? There's an excellent and approachable blog post about the state of the field in 2018-2019 (https://rajatvd.github.io/NTK/), but I assume that there's been a lot of follow-on work since. Thanks! submitted by /u/Yukiomo [link] [comments]  ( 1 min )
    [Discussion] Tracking, running and managing experiments in sandbox environment
    Hi everyone, I'm looking for a system for collecting and sharing KPIs and for managing and running experiments on local and remote nodes. My requirements are: (1) a sandbox environment, where the server and nodes run in a private network with no internet access; (2) nice graphs and easy comparisons; (3) an easy way to share datasets; (4) support for local and remote nodes. Preferred, but not mandatory: open source; support for distributed training. I've heard a lot of good recommendations for ClearML and Weights & Biases. I tested both to see if they work in a sandbox environment, but when the servers tried to validate the free trial license, they failed because of connectivity issues. Does anyone know of another experiment management system that can work without internet access and has capabilities similar to ClearML and Weights & Biases? submitted by /u/Intelligent_Gene_283 [link] [comments]  ( 1 min )
    [P] A shared arxiv-PDF-viewer
    What if you could read a paper and at the same time have a scientific discussion about certain paragraphs or figures? Just mark the sentence or picture and create a new thread about it, ask a question, explain something in greater detail, or link to a blog post that explains a concept better. I think that would be awesome and a win-win for the authors and the readers. I am kind of a scientist myself and I would love to see something like it. Does something like that exist? If not, I would like to make that a (shared) project. Looking forward to your suggestions! submitted by /u/mingaflo [link] [comments]  ( 2 min )
    [P] Several of my past and current projects / “Amateur” programmer fought cancer with 50 Nvidia Geforce 1080Ti
    Since this whole series of projects and reports kind of started from this subreddit, I feel like it's appropriate to have a thread here to organize them. First, the English translation of the news article: https://howardchen.substack.com/p/this-amateur-programmer-fought-cancer?s=w The original Chinese version: https://www.toutiao.com/article/7094940100450107935/ The video: https://www.bilibili.com/video/BV1x3411V7tL?spm_id_from=333.337.search-card.all.click (In Chinese, more than 2M views at the moment) YouTube version: https://www.youtube.com/watch?v=-t-a6l8a2N0&t=3s The Hacker News discussion a couple weeks ago: https://news.ycombinator.com/item?id=31449147…  ( 1 min )
  • Open

    Help Support Women in AI
    Editor’s Note: I do not normally treat social media posts as articles, but this one is a bit special. Author Andrew Jones contacted me about helping to promote a new scholarship fund of $50,000 to help promote Women in Data Science, in conjunction with Women in AI. It is a worthwhile endeavor, and I hope… Read More »Help Support Women in AI The post Help Support Women in AI appeared first on Data Science Central.  ( 2 min )
    Can Starting with Waterfall Lead to Better Agile? Part II
    In my previous article, I discussed Waterfall, WaterScrumFall, big-A Agile, and business agility.  Dissonance abounds among organizations struggling to transition their approaches to building solutions and maintaining their existing legacy infrastructures while remaking how they evolve themselves.  In attempting to navigate this they often adopt approaches that are destined not to get them where they…  ( 7 min )
    Can Starting with Waterfall Lead to Better Agile?  Part I
    Waterfall, WaterScrumFall, Agile and Agility We should all be aware that business agility is the primary enabler for companies seeking sustainability.  In the past, companies would evolve in chunks, a project at a time.  The costs and risks of implementing change, whether system-related or otherwise, were large, so designing and planning upfront (the Waterfall approach)…  ( 5 min )
    Functional Testing in Agile Environment: All You Need to Know
    With today’s customers becoming more tech-savvy and sophisticated, the software businesses are becoming extremely competitive where quality is a critical factor for software projects/products and customers. Today, projects run behind schedule because of multiple factors that include requirements changing so rapidly. The quality of the project/product is determined by other factors like the skill set…  ( 3 min )
    Load Testing: Top Tools
    A subset of performance testing, load testing is just the concept of testing a given software’s ability to withstand the load, i.e., concurrent users. It refers to a kind of performance testing that determines the performance of the systems under real-life load conditions. And, this testing helps determine how the application behaves when accessed by…  ( 3 min )
  • Open

    Collin Stultz named co-director and MIT lead of the Harvard-MIT Program in Health Sciences and Technology
    MIT professor will leverage his research into machine learning and computer science, as well as his role as a practicing cardiologist, toward educating clinician-scientists and engineers.  ( 6 min )
  • Open

    I trained a NN to play a match 3 type of game
    submitted by /u/blazarious [link] [comments]  ( 1 min )
    Any experts with A2C graphs?
    I'm trying to improve my model, but I need to understand these graphs first. Does anyone understand what they mean, please? I know what the rewards graph indicates, but I'm confused by the rest. Any help would be appreciated. [Images: the rewards graph; other training graphs I don't understand] submitted by /u/pssword123 [link] [comments]  ( 1 min )
    Procedure cloning
    submitted by /u/dwightschrute1905 [link] [comments]
    Pokemon Showdown AI - Policy Iteration Approach
    Hi Everyone, I have mocked together a self-play Pokemon Showdown AI that utilises many of the techniques employed in AlphaStar. These include: a Transformer for team/moveset embeddings; encoding of field / terrain / weather; a Layer Norm LSTM; an action type head (move or switch) as well as move and switch heads; VTrace and UPGO (unique to AlphaStar, cannot find much on it) for the policy loss and TD Lambda for the value loss. However, I am confused about how to design a reward function for a Pokemon battle. The simple answer is to reward -1 for losing and +1 for winning, but this is too sparse and does not converge fast. I have a reward for fainting and HP % as well as for whether a Pokemon uses a move that the other is immune to or that fails. What other rewards could/should I consider? In the AlphaStar pseudocode, they calculate the loss on the policy and value networks separately for each reward signal. Is this also the right approach here? How should I weigh these rewards such that the agent does not simply favor fainting and lose sight of winning the game? In AlphaStar, they use a discount factor of 1. My understanding is that the longer the episode, the closer the discount factor should be to 1. This makes sense for a game like StarCraft, though what should it be for Pokemon (20-40 steps per battle)? My current parameters are very similar to AlphaStar but adjusted to run on my personal computer: 12 actors, CPU for rollout, GPU for learning; trajectory length = 32; batch size = 128; learning rate = 3e-5; discount factor = 0.9; entropy discount = 1e-2. Any advice/literature on the issues above would be greatly appreciated. submitted by /u/atomicburn125 [link] [comments]  ( 2 min )
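    Two of the questions above lend themselves to a quick calculation. The sketch below is an illustration under assumptions, not advice taken from the post: the helper names are made up, and potential-based reward shaping is a technique the post does not mention. It is included as one known way to add dense signals such as HP differences without changing which policies are optimal.

```python
def terminal_reward_weight(gamma, steps_to_end):
    """Factor by which a reward received `steps_to_end` steps later is discounted."""
    return gamma ** steps_to_end

def effective_horizon(gamma):
    """Rule-of-thumb horizon 1 / (1 - gamma) over which rewards still matter."""
    return 1.0 / (1.0 - gamma)

def shaped_reward(reward, potential_now, potential_next, gamma):
    """Potential-based shaping: add gamma * phi(s') - phi(s) to the raw reward.

    The potential phi is a hypothetical state score, for example the HP
    fraction difference between the two teams, with phi = 0 at terminal states.
    """
    return reward + gamma * potential_next - potential_now
```

    With a discount factor of 0.9, a win 30 steps away is weighted by roughly 0.04 at the first step, compared with roughly 0.74 at 0.99, so the current setting makes the final outcome nearly invisible early in a 20-40 step battle. With shaping and a discount of 1, the extra terms telescope over an episode, so the total shaped return differs from the win/loss return only by the starting potential.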
  • Open

    Sparse Neural Networks Optimize Efficiency with Neuroscience
    submitted by /u/aidev2040 [link] [comments]
  • Open

    End-to-end Generative Pre-training for Multimodal Video Captioning
    Posted by Paul Hongsuck Seo and Arsha Nagrani, Research Scientists, Google Research, Perception Team Multimodal video captioning systems utilize both the video frames and speech to generate natural language descriptions (captions) of videos. Such systems are stepping stones towards the longstanding goal of building multimodal conversational systems that effortlessly communicate with users while perceiving environments through multimodal input streams. Unlike video understanding tasks (e.g., video classification and retrieval) where the key challenge lies in processing and understanding multimodal input videos, the task of multimodal video captioning includes the additional challenge of generating grounded captions. The most widely adopted approach for this task is to train an encoder-…  ( 7 min )
  • Open

    Create train, test, and validation splits on your data for machine learning with Amazon SageMaker Data Wrangler
    In this post, we talk about how to split a machine learning (ML) dataset into train, test, and validation datasets with Amazon SageMaker Data Wrangler so you can easily split your datasets with minimal to no code. Data used for ML is typically split into the following datasets: Training – Used to train an algorithm […]  ( 7 min )
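    For readers who want to see what such a split amounts to outside of Data Wrangler, a plain-Python sketch follows. It does not use any SageMaker API. The function name, the 80/10/10 default fractions, and the fixed seed are assumptions chosen for the example.

```python
import random

def train_test_validation_split(rows, train_frac=0.8, test_frac=0.1, seed=0):
    """Shuffle rows and cut them into train, test, and validation lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    # Whatever remains after the train and test cuts becomes validation data.
    validation = shuffled[n_train + n_test:]
    return train, test, validation
```

    A purely random split like this is only appropriate when rows are independent. Time-ordered or grouped data usually needs an ordered or key-based split instead.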
    How InfoJobs (Adevinta) improves NLP model prediction performance with AWS Inferentia and Amazon SageMaker
    This is a guest post co-written by Juan Francisco Fernandez, ML Engineer in Adevinta Spain, and AWS AI/ML Specialist Solutions Architects Antonio Rodriguez and João Moura. InfoJobs, a subsidiary company of the Adevinta group, provides the perfect match between candidates looking for their next job position and employers looking for the best hire for the […]  ( 8 min )
  • Open

    Festo Develops With Isaac Sim to Drive Its Industrial Automation
    Dionysios Satikidis was playing FIFA 19 when he realized the simulated soccer game’s realism offered a glimpse into the future for training robots. An expert in AI and autonomous systems at Festo, a German industrial control and automation company, he believed the worlds of gaming and robotics would intersect. “I’ve always been passionate about technology Read article > The post Festo Develops With Isaac Sim to Drive Its Industrial Automation appeared first on NVIDIA Blog.  ( 3 min )
    What Is Zero Trust?
    For all its sophistication, the Internet age has brought on a digital plague of security breaches. The steady drumbeat of data and identity thefts spawned a new movement and a modern mantra that’s even been the subject of a U.S. presidential mandate — zero trust. So, What Is Zero Trust? Zero trust is a cybersecurity  ( 6 min )
  • Open

    Will Artificial Intelligence take over humanity?
    An AI wrote this article. The last words are very frightening!  ( 3 min )
    Mitigating AI Bias, with …Bias
    This article is part of my Data Trust series of talks. The purpose of these articles is to break down complex but important… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 7 min )
2022-07-07T01:05:15.897Z osmosfeed 1.15.1